Re: shared-memory based stats collector

From Kyotaro Horiguchi
Subject Re: shared-memory based stats collector
Date
Msg-id 20201001.090722.322196928507744460.horikyota.ntt@gmail.com
In response to Re: shared-memory based stats collector  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: shared-memory based stats collector  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
At Fri, 25 Sep 2020 09:27:26 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> Thanks for reviewing!
> 
> At Mon, 21 Sep 2020 19:47:04 -0700, Andres Freund <andres@anarazel.de> wrote in 
> > Hi,
> > 
> > On 2020-09-08 17:55:57 +0900, Kyotaro Horiguchi wrote:
> > > Locks on the shared statistics is acquired by the units of such like
> > > tables, functions so the expected chance of collision are not so high.
> > 
> > I can't really parse that...
> 
> Mmm... Is the following readable?
> 
> Shared statistics locks are acquired by units such as tables,
> functions, etc., so the chances of an expected collision are not so
> high.
> 
> Anyway, this is found to be wrong, so I removed it.

01: (Fixed?)
> > > Furthermore, until 1 second has elapsed since the last flushing to
> > > shared stats, lock failure postpones stats flushing so that lock
> > > contention doesn't slow down transactions.
> > 
> > I think I commented on that before, but to me 1s seems way too low to
> > switch to blocking lock acquisition. What's the reason for such a low
> > limit?
> 
> It was 0.5 seconds previously.  I don't have a clear idea of a
> reasonable value for it. One possible rationale might be to have 1000
> clients each have a writing time slot of 10ms.. So, 10s as the minimum
> interval. I set maximum interval to 60, and retry interval to
> 1s. (Fixed?)
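
One possible reading of the intervals above, as a standalone sketch. The
names below are hypothetical and not taken from the patch; only the
10s/60s/1s values come from the reply, and the exact policy in the
patch may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constants (milliseconds), mirroring the values in the reply. */
#define MIN_INTERVAL_MS    10000    /* do not flush more often than this */
#define MAX_INTERVAL_MS    60000    /* after this, wait for the lock */
#define RETRY_INTERVAL_MS   1000    /* delay after a failed nowait flush */

typedef enum FlushMode
{
    FLUSH_SKIP,                 /* too early, do not flush yet */
    FLUSH_NOWAIT,               /* flush, but give up on lock contention */
    FLUSH_BLOCKING              /* flush, waiting for locks if necessary */
} FlushMode;

/* Decide how to flush given the time elapsed since the last flush. */
FlushMode
choose_flush_mode(int64_t elapsed_ms)
{
    assert(elapsed_ms >= 0);

    if (elapsed_ms < MIN_INTERVAL_MS)
        return FLUSH_SKIP;
    if (elapsed_ms < MAX_INTERVAL_MS)
        return FLUSH_NOWAIT;
    return FLUSH_BLOCKING;
}

/* Delay until the next attempt; lock_failed matters only after a nowait try. */
int64_t
next_attempt_delay_ms(int64_t elapsed_ms, bool lock_failed)
{
    assert(elapsed_ms >= 0);

    if (elapsed_ms < MIN_INTERVAL_MS)
        return MIN_INTERVAL_MS - elapsed_ms;
    if (lock_failed)
        return RETRY_INTERVAL_MS;
    return MIN_INTERVAL_MS;
}
```

With these assumptions a backend never blocks on a stats lock until a
full minute has passed without a successful flush.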

02: (I'd appreciate it if you could suggest the appropriate one.)
> > >      /*
> > > -     * Clean up any dead statistics collector entries for this DB. We always
> > > +     * Clean up any dead activity statistics entries for this DB. We always
> > >       * want to do this exactly once per DB-processing cycle, even if we find
> > >       * nothing worth vacuuming in the database.
> > >       */
> > 
> > What is "activity statistics"?
> 
> I don't get your point. It is formally the replacement word for
> "statistics collector". The "statistics collector (process)" no longer
> exists, so I had to invent a name for the successor mechanism that is
> distinguishable with data/column statistics.  If it is not the proper
> wording, I'd appreciate it if you could suggest the appropriate one.

03: (Fixed. Replaced with a far simpler cache implementation.)
> > > @@ -2816,8 +2774,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
> > >      }
> > >
> > >      /* fetch the pgstat table entry */
> > > -    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
> > > -                                         shared, dbentry);
> > > +    tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
> > > +                                                   relid);
> > 
> > Why do all of these places deal with a snapshot? For most it seems to
> > make much more sense to just look up the entry and then copy that into
> > local memory?  There may be some place that need some sort of snapshot
> > behaviour that's stable until commit / pgstat_clear_snapshot(). But I
> > can't reallly see many?
> 
> Ok, I reread this thread and agree that there's a (vague) consensus to
> remove the snapshot stuff. Backend-statistics (bestats) still are
> stable during a transaction.

If we removed the snapshot stuff completely, pgstatfuncs.c would need
many additional pfree()s since it calls pgstat_fetch* many times for
the same object.  I chose to make the pgstat_fetch_stat_*() functions
return a result stored in static variables. It no longer works in a
transactional way as before, but keeps the last result for a while; the
result is invalidated at transaction end at the latest.
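
A minimal sketch of that idea, under simplifying assumptions: TabEntry,
shared_entries and both function names are made up for illustration,
and a plain array stands in for the shared hash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a table stats entry. */
typedef struct TabEntry
{
    unsigned int tableid;
    long        numscans;
} TabEntry;

/* Stand-in for shared memory; a real lookup would use the shared hash. */
static TabEntry shared_entries[] = {{1, 10}, {2, 20}};

/* The last result lives in static storage, so callers never free it. */
static TabEntry cached_entry;
static bool cached_valid = false;

const TabEntry *
fetch_tabentry(unsigned int tableid)
{
    size_t      i;

    assert(tableid != 0);       /* 0 plays the role of InvalidOid here */

    /* Reuse the previous result when the same object is requested again. */
    if (cached_valid && cached_entry.tableid == tableid)
        return &cached_entry;

    for (i = 0; i < sizeof(shared_entries) / sizeof(shared_entries[0]); i++)
    {
        if (shared_entries[i].tableid == tableid)
        {
            memcpy(&cached_entry, &shared_entries[i], sizeof(TabEntry));
            cached_valid = true;
            return &cached_entry;
        }
    }
    return NULL;
}

/* Called at transaction end: the next fetch re-reads the shared value. */
void
invalidate_fetch_cache(void)
{
    cached_valid = false;
}
```

Repeated fetches of the same object return the same pointer, which is
why the callers need no extra pfree()s.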

04: (Fixed.)
> > > +#define PGSTAT_MIN_INTERVAL            1000    /* Minimum interval of stats data
> > >
> > > +#define PGSTAT_MAX_INTERVAL            10000    /* Longest interval of stats data
> > > +                                             * updates */
> > 
> > These don't really seem to be in line with the commit message...
> 
> Oops! Sorry. Fixed both of this value and the commit message (and the
> file comment).

05: (The struct is gone.)
> > > + * dshash pgStatSharedHash
> > > + *    -> PgStatHashEntry                (dshash entry)
> > > + *      (dsa_pointer)-> PgStatEnvelope    (dsa memory block)
> > 
> > I don't like 'Envelope' that much. If I understand you correctly that's
> > a common prefix that's used for all types of stat objects, correct? If
> > so, how about just naming it PgStatEntryBase or such? I think it'd also
> > be useful to indicate in the "are stored as" part that PgStatEnvelope is
> > just the common prefix for an allocation.
> 
> The name makes sense. Thanks! (But the struct is now gone..)

06: (Fixed.)
> > > -typedef struct TabStatHashEntry
> > > +static size_t pgstat_entsize[] =
> > 
> > > +/* Ditto for local statistics entries */
> > > +static size_t pgstat_localentsize[] =
> > > +{
> > > +    0,                            /* PGSTAT_TYPE_ALL: not an entry */
> > > +    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
> > > +    sizeof(PgStat_TableStatus), /* PGSTAT_TYPE_TABLE */
> > > +    sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */
> > > +};
> > 
> > These probably should be const as well.
> 
> Right. Fixed.

07: (Fixed.)
> > >  /*
> > > - * Backends store per-function info that's waiting to be sent to the collector
> > > - * in this hash table (indexed by function OID).
> > > + * Stats numbers that are waiting for flushing out to shared stats are held in
> > > + * pgStatLocalHash,
> > >   */
> > > -static HTAB *pgStatFunctions = NULL;
> > > +typedef struct PgStatHashEntry
> > > +{
> > > +    PgStatHashEntryKey key;        /* hash key */
> > > +    dsa_pointer env;            /* pointer to shared stats envelope in DSA */
> > > +}            PgStatHashEntry;
> > > +
> > > +/* struct for shared statistics entry pointed from shared hash entry. */
> > > +typedef struct PgStatEnvelope
> > > +{
> > > +    PgStatTypes type;            /* statistics entry type */
> > > +    Oid            databaseid;        /* databaseid */
> > > +    Oid            objectid;        /* objectid */
> > 
> > Do we need this information both here and in PgStatHashEntry? It's
> > possible that it's worthwhile, but I am not sure it is.
> 
> Same key values were stored in PgStatEnvelope, PgStat(Local)HashEntry,
> and PgStat_Stats*Entry. And I thought the same while developing. After
> some thoughts, I managed to remove the duplicate values other than
> PgStat(Local)HashEntry. Fixed.

08: (Fixed.)
> > > +    size_t        len;            /* length of body, fixed per type. */
> > 
> > Why do we need this? Isn't that something that can easily be looked up
> > using the type?
> 
> Not only they are virtually fixed values, but they were found to be
> write-only variables. Removed.

09: (Fixed. "Envelope" is embedded in stats entry structs.)
> > > +    LWLock        lock;            /* lightweight lock to protect body */
> > > +    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
> > > +}            PgStatEnvelope;
> > 
> > What you're doing here with 'body' doesn't provide enough guarantees
> > about proper alignment. E.g. if one of the entry types wants to store a
> > double, this won't portably work, because there's platforms that have 4
> > byte alignment for ints, but 8 byte alignment for doubles.
> > 
> > 
> > Wouldn't it be better to instead embed PgStatEnvelope into the struct
> > that's actually stored? E.g. something like
> > 
> > struct PgStat_TableStatus
> > {
> >     PgStatEnvelope header; /* I'd rename the type */
> >     TimestampTz vacuum_timestamp;    /* user initiated vacuum */
> >     ...
> > }
> > 
> > or if you don't want to do that because it'd require declaring
> > PgStatEnvelope in the header (not sure that'd really be worth avoiding),
> > you could just get rid of the body field and just do the calculation
> > using something like MAXALIGN((char *) envelope + sizeof(PgStatEnvelope))
> 
> As the result of the modification so far, there is only one member,
> lock, left in the PgStatEnvelope (or PgStatEntryBase) struct.  I chose
> to embed it to each PgStat_Stat*Entry structs as
> PgStat_StatEntryHeader.
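
To illustrate why embedding the header avoids the alignment problem,
here is a small sketch. StatEntryHeader and TableStatEntry are
hypothetical simplifications (the real header holds an LWLock); the
point is only that the compiler pads the struct so each member is
aligned, which an int-typed flexible array could not guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header; the real one holds an LWLock. */
typedef struct StatEntryHeader
{
    int         lock;
} StatEntryHeader;

/* Header embedded as the first member of a concrete entry type. */
typedef struct TableStatEntry
{
    StatEntryHeader header;
    double      last_vacuum;    /* may need 8-byte alignment */
    int64_t     numscans;
} TableStatEntry;

/* Recover the header from any entry; valid because it is the first member. */
StatEntryHeader *
entry_header(void *entry)
{
    assert(entry != NULL);
    return (StatEntryHeader *) entry;
}

/* Compile-time checks: header comes first, the double is aligned. */
_Static_assert(offsetof(TableStatEntry, header) == 0,
               "header must be the first member");
_Static_assert(offsetof(TableStatEntry, last_vacuum) % _Alignof(double) == 0,
               "double member must be properly aligned");
```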


10: (Fixed. Same as #03)
> > > + * Snapshot is stats entry that is locally copied to offset stable values for a
> > > + * transaction.
> ...
> > The amount of code needed for this snapshot stuff seems unreasonable to
> > me, especially because I don't see why we really need it. Is this just
> > so that there's no skew between all the columns of pg_stat_all_tables()
> > etc?
> > 
> > I think this needs a lot more comments explaining what it's trying to
> > achieve.
> 
> I don't insist on keeping the behavior.  Removed snapshot stuff only
> of pgstat stuff. (beentry snapshot is left alone.)

11: (Fixed. Per-entry-type initialize is gone.)
> > > +/*
> > > + * Newly created shared stats entries needs to be initialized before the other
> > > + * processes get access it. get_stat_entry() calls it for the purpose.
> > > + */
> > > +typedef void (*entry_initializer) (PgStatEnvelope * env);
> > 
> > I think we should try to not need it, instead declaring that all fields
> > are zero initialized. That fits well together with my suggestion to
> > avoid duplicating the database / object ids.
> 
> Now that entries don't have type-specific fields that need a special
> care, I removed that stuff altogether.

12: (Fixed. Global stats memories are merged.)
> > > +static void
> > > +attach_shared_stats(void)
> > > +{
> ...
> > > +        shared_globalStats = (PgStat_GlobalStats *)
> > > +            dsa_get_address(area, StatsShmem->global_stats);
> > > +        shared_archiverStats = (PgStat_ArchiverStats *)
> > > +            dsa_get_address(area, StatsShmem->archiver_stats);
> > > +
> > > +        shared_SLRUStats = (PgStatSharedSLRUStats *)
> > > +            dsa_get_address(area, StatsShmem->slru_stats);
> > > +        LWLockInitialize(&shared_SLRUStats->lock, LWTRANCHE_STATS);
> > 
> > I don't think it makes sense to use dsa allocations for any of the fixed
> > size stats (global_stats, archiver_stats, ...). They should just be
> > direct members of StatsShmem? Then we also don't need the shared_*
> > helper variables
> 
> I intended to reduce the amount of fixed-allocated shared memory, or
> make maximum use of DSA. However, you're right. Now they are members
> of StatsShmem.


13: (I couldn't address this fully..)
> > > +        /* Load saved data if any. */
> > > +        pgstat_read_statsfiles();
> > 
> > Hm. Is it a good idea to do this as part of the shmem init function?
> > That's a lot more work than we normally do in these.
> > 
> > > +/* ----------
> > > + * detach_shared_stats() -
> > > + *
> > > + *    Detach shared stats. Write out to file if we're the last process and told
> > > + *    to do so.
> > > + * ----------
> > >   */
> > >  static void
> > > -pgstat_reset_remove_files(const char *directory)
> > > +detach_shared_stats(bool write_stats)
> > 
> > I think it'd be better to have an explicit call in the shutdown sequence
> > somewhere to write out the data, instead of munging detach and writing
> > stats out together.
> 
> It is actually strange that attach_shared_stats reads file in a
> StatsLock section while it attaches existing shared memory area
> deliberately outside the same lock section. So I moved the call to
> pg_stat_read/write_statsfiles() out of StatsLock section as the first
> step. But I couldn't move pgstat_write_stats_files() out of (or,
> before or after) detach_shared_stats(), because I didn't find a way to
> reliably check if the exiting process is the last detacher by a
> separate function from detach_shared_stats().
> 
> (continued)
> =====

14: (I believe it is addressed.)
> > +    if (nowait)
> > +    {
> > +        /*
> > +         * Don't flush stats too frequently.  Return the time to the next
> > +         * flush.
> > +         */
> 
> I think it's confusing to use nowait in the if when you actually mean
> !force.

Agreed.  I was torn between passing !force as the "nowait" parameter
of flush_tabstat() and using the relabeled variable nowait.  I chose
to use nowait in the attached.

15: (Not addressed.)
> > -    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
> > +    if (pgStatLocalHash)
> >      {
> > -        for (i = 0; i < tsa->tsa_used; i++)
> > +        /* Step 1: flush out other than database stats */
...
> > +                case PGSTAT_TYPE_DB:
> > +                    if (ndbentries >= dbentlistlen)
> > +                    {
> > +                        dbentlistlen *= 2;
> > +                        dbentlist = repalloc(dbentlist,
> > +                                             sizeof(PgStatLocalHashEntry *) *
> > +                                             dbentlistlen);
> > +                    }
> > +                    dbentlist[ndbentries++] = lent;
> > +                    break;
> 
> Why do we need this special behaviour for database statistics?

Some of the table stats numbers are also counted as database stats
numbers. They are currently added at stats-sending time (in
pgstat_recv_tabstat()) and this follows that design.  If we added such
table stats numbers to database stats before flushing out table stats,
we would need to remember whether those numbers have already been added
to database stats or not.
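
The ordering argument can be sketched as follows. This is a toy model
with hypothetical names (Counts, flush_table, flush_database), not the
patch's code: a table's counts are folded into the pending database
entry only when the table flush succeeds, so nothing needs to remember
whether they were already added.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Counts
{
    long        tuples_inserted;
} Counts;

/*
 * Step 1: move a table's pending counts to its shared entry.  Only on
 * success are the same counts added to the pending database entry.
 * lock_acquired simulates the outcome of a conditional lock attempt.
 */
bool
flush_table(Counts *local_tab, Counts *shared_tab, Counts *local_db,
            bool lock_acquired)
{
    assert(local_tab != NULL && shared_tab != NULL && local_db != NULL);

    if (!lock_acquired)
        return false;           /* table stays pending, db untouched */

    shared_tab->tuples_inserted += local_tab->tuples_inserted;
    local_db->tuples_inserted += local_tab->tuples_inserted;
    local_tab->tuples_inserted = 0;
    return true;
}

/* Step 2: flush the database entry after all table entries were tried. */
void
flush_database(Counts *local_db, Counts *shared_db)
{
    assert(local_db != NULL && shared_db != NULL);

    shared_db->tuples_inserted += local_db->tuples_inserted;
    local_db->tuples_inserted = 0;
}
```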

16: (Fixed. Used List.)
> If we need it,it'd be better to just use List here rather than open
> coding a replacement (List these days basically has the same complexity
> as what you do here).

Agreed. (I noticed that lappend is faster than lcons now.) Fixed.

17: (Fixed. case-default is removed, and PGSTAT_TYPE_ALL is removed by #28)
> > +                case PGSTAT_TYPE_TABLE:
> > +                    if (flush_tabstat(lent->env, nowait))
> > +                        remove = true;
> > +                    break;
> > +                case PGSTAT_TYPE_FUNCTION:
> > +                    if (flush_funcstat(lent->env, nowait))
> > +                        remove = true;
> > +                    break;
> > +                default:
> > +                    Assert(false);
> 
> Adding a default here prevents the compiler from issuing a warning when
> new types of stats are added...

Agreed. Another instance of switch on the same enum doesn't have
default:. (Fixed.)
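
For reference, a small sketch of the pattern with a hypothetical enum.
Without a default label, GCC and Clang (via -Wswitch, which is part of
-Wall) warn when a newly added enum value has no case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical enum standing in for the stats entry types. */
typedef enum StatType
{
    STAT_TYPE_DB,
    STAT_TYPE_TABLE,
    STAT_TYPE_FUNCTION
} StatType;

const char *
stat_type_name(StatType type)
{
    /* No default label: a missing case is reported at compile time. */
    switch (type)
    {
        case STAT_TYPE_DB:
            return "database";
        case STAT_TYPE_TABLE:
            return "table";
        case STAT_TYPE_FUNCTION:
            return "function";
    }

    /* Reached only for values outside the enum. */
    assert(0);
    return NULL;
}
```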

18: (Fixed.)
> > +            /* Remove the successfully flushed entry */
> > +            pfree(lent->env);
> 
> Probably worth zeroing the pointer here, to make debugging a little
> easier.

Agreed. I did the same to another instance of freeing a memory chunk
pointed from non-block-local pointers.

19: (Fixed. LWLock is replaced with an atomic update.)
> > +    /* Publish the last flush time */
> > +    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
> > +    if (shared_globalStats->stats_timestamp < now)
> > +        shared_globalStats->stats_timestamp = now;
> > +    LWLockRelease(StatsLock);
> 
> Ugh, that seems like a fairly unnecessary global lock acquisition. What
> do we need this timestamp for? Not clear to me that it's still
> needed. If it is needed, it'd probably worth making this an atomic and
> doing a compare-exchange loop instead.

The value is exposed via a system view. I used pg_atomic but I didn't
find a clean way to store TimestampTz into pg_atomic_u64.
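
For illustration, the compare-exchange loop might look like the sketch
below. It uses C11 <stdatomic.h> rather than PostgreSQL's pg_atomic
API, and it assumes the timestamp is a non-negative signed 64-bit value
that is simply cast to unsigned for storage; the names are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for the shared timestamp, stored as an unsigned 64-bit atomic. */
static _Atomic uint64_t shared_stats_timestamp;

/* Advance the shared timestamp to "now" unless it is already newer. */
void
publish_flush_time(int64_t now)
{
    uint64_t    cur = atomic_load(&shared_stats_timestamp);

    assert(now >= 0);           /* keeps the signed/unsigned cast lossless */

    while ((int64_t) cur < now)
    {
        /* On failure, cur receives the current value and we recheck. */
        if (atomic_compare_exchange_weak(&shared_stats_timestamp, &cur,
                                         (uint64_t) now))
            break;
    }
}

int64_t
read_flush_time(void)
{
    return (int64_t) atomic_load(&shared_stats_timestamp);
}
```

The loop never moves the value backwards, which matches the intent of
the original "if (stats_timestamp < now)" check without taking a global
lock.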

20: (Wrote a comment to explain the reason.)
> >      /*
> > -     * Send partial messages.  Make sure that any pending xact commit/abort
> > -     * gets counted, even if there are no table stats to send.
> > +     * If we have pending local stats, let the caller know the retry interval.
> >       */
> > -    if (regular_msg.m_nentries > 0 ||
> > -        pgStatXactCommit > 0 || pgStatXactRollback > 0)
> > -        pgstat_send_tabstat(&regular_msg);
> > -    if (shared_msg.m_nentries > 0)
> > -        pgstat_send_tabstat(&shared_msg);
> > +    if (HAVE_ANY_PENDING_STATS())
> 
> I think this needs a comment explaining why we still may have pending
> stats.

Does the following work?

| * Some of the local stats may have not been flushed due to lock
| * contention.  If we have such pending local stats here, let the caller
| * know the retry interval.

21: (Fixed. Local cache of shared stats entry is added.)
> > + * flush_tabstat - flush out a local table stats entry
> > + *
...
> Could we cache the address of the shared entry in the local entry for a
> while? It seems we have a bunch of contention (that I think you're
> trying to address in a prototoype patch posted since) just because we
> will over and over look up the same address in the shared hash table.
> 
> If we instead kept the local hashtable alive for longer and stored a
> pointer to the shared entry in it, we could make this a lot
> cheaper. There would be some somewhat nasty edge cases probably. Imagine
> a table being dropped for which another backend still has pending
> stats. But that could e.g. be addressed with a refcount.

Yeah, I noticed that and did that in the previous version (with a
silly bug..)  The cache is based on the simple hash. All the entries
were dropped after a vacuum removed at least one shared stats entry in
the previous version. However, this version uses refcount and drops
only the entries actually needed to be dropped.

22: (vacuum/analyze immediately writes to shared stats according to #34)
> > +    /* retrieve the shared table stats entry from the envelope */
> > +    shtabstats = (PgStat_StatTabEntry *) &shenv->body;
> > +
> > +    /* lock the shared entry to protect the content, skip if failed */
> > +    if (!nowait)
> > +        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
> > +    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
> > +        return false;
> > +
> > +    /* add the values to the shared entry. */
> > +    shtabstats->numscans += lstats->t_counts.t_numscans;
> > +    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
> > +    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
> > +    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
> > +    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
> > +    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
> > +    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
> > +
> > +    /*
> > +     * If table was truncated or vacuum/analyze has ran, first reset the
> > +     * live/dead counters.
> > +     */
> > +    if (lstats->t_counts.t_truncated ||
> > +        lstats->t_counts.vacuum_count > 0 ||
> > +        lstats->t_counts.analyze_count > 0 ||
> > +        lstats->t_counts.autovac_vacuum_count > 0 ||
> > +        lstats->t_counts.autovac_analyze_count > 0)
> > +    {
> > +        shtabstats->n_live_tuples = 0;
> > +        shtabstats->n_dead_tuples = 0;
> > +    }
> 
> > +    /* clear the change counter if requested */
> > +    if (lstats->t_counts.reset_changed_tuples)
> > +        shtabstats->changes_since_analyze = 0;
> 
> I know this is largely old code, but it's not obvious to me that there's
> no race conditions here / that the race condition didn't get worse. What
> prevents other backends to since have done a lot of inserts into this
> table? Especially in case the flushes were delayed due to lock
> contention.

# I noticed that I carelessly dropped inserts_since_vacuum code.

Well, if a vacuum report is delayed until after a massive insert
commits, the massive insert would be omitted. It seems to me that your
suggestion in #34 below addresses the point.

> > +    /*
> > +     * Update vacuum/analyze timestamp and counters, so that the values won't
> > +     * goes back.
> > +     */
> > +    if (shtabstats->vacuum_timestamp < lstats->vacuum_timestamp)
> > +        shtabstats->vacuum_timestamp = lstats->vacuum_timestamp;
> 
> It seems to me that if these branches are indeed a necessary branches,
> my concerns above are well founded...

I'm not sure whether it is simply a talisman against evil or based on
an actual problem, but I don't believe it's possible for a vacuum to
end after another vacuum that started later ends...

23: (ids are no longer stored in duplicate.)
> > +init_tabentry(PgStatEnvelope * env)
> >  {
> > -    int            n;
> > -    int            len;
> > +    PgStat_StatTabEntry *tabent = (PgStat_StatTabEntry *) &env->body;
> > +
> > +    /*
> > +     * If it's a new table entry, initialize counters to the values we just
> > +     * got.
> > +     */
> > +    Assert(env->type == PGSTAT_TYPE_TABLE);
> > +    tabent->tableid = env->objectid;
> 
> It seems over the top to me to have the object id stored in yet another
> place. It's now in the hash entry, in the envelope, and the type
> specific part.

Agreed, and fixed. (See #11 above)

24: (Fixed. Don't check for all-zero of a function stats entry at flush.)
> > +/*
> > + * flush_funcstat - flush out a local function stats entry
> > + *
> > + * If nowait is true, this function returns false on lock failure. Otherwise
> > + * this function always returns true.
> > + *
> > + * Returns true if the entry is successfully flushed out.
> > + */
> > +static bool
> > +flush_funcstat(PgStatEnvelope * env, bool nowait)
> > +{
> > +    /* we assume this inits to all zeroes: */
> > +    static const PgStat_FunctionCounts all_zeroes;
> > +    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
> > +    PgStatEnvelope *shenv;        /* shared stats envelope */
> > +    PgStat_StatFuncEntry *sharedent = NULL; /* shared stats entry */
> > +    bool        found;
> > +
> > +    Assert(env->type == PGSTAT_TYPE_FUNCTION);
> > +    localent = (PgStat_BackendFunctionEntry *) &env->body;
> > +
> > +    /* Skip it if no counts accumulated for it so far */
> > +    if (memcmp(&localent->f_counts, &all_zeroes,
> > +               sizeof(PgStat_FunctionCounts)) == 0)
> > +        return true;
> 
> Why would we have an entry in this case?

Right. A function entry was zeroed out in master but the entry is not
created in that case with this patch. Removed it. (Fixed)

25: (Perhaps fixed. I'm not confident, though.)
> > +    /* find shared table stats entry corresponding to the local entry */
> > +    shenv = get_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, localent->f_id,
> > +                           nowait, init_funcentry, &found);
> > +    /* skip if dshash failed to acquire lock */
> > +    if (shenv == NULL)
> > +        return false;            /* failed to acquire lock, skip */
> > +
> > +    /* retrieve the shared table stats entry from the envelope */
> > +    sharedent = (PgStat_StatFuncEntry *) &shenv->body;
> > +
> > +    /* lock the shared entry to protect the content, skip if failed */
> > +    if (!nowait)
> > +        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
> > +    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
> > +        return false;            /* failed to acquire lock, skip */
> 
> It doesn't seem great that we have a separate copy of all of this logic
> again. It seems to me that most of the code here is (or should be)
> exactly the same as in table case. I think only the the below should be
> in here, rather than in common code.

I didn't get the last phrase, but I guess you are suggesting that I
should factor out the common code.
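
As an illustration of what factoring out could look like, the
blocking/conditional acquisition repeated in each flush_* function can
live in one helper. The sketch is standalone: EntryLock is a
hypothetical spin flag built on C11 atomic_flag, standing in for an
LWLock, and the function names are made up.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-entry LWLock. */
typedef struct EntryLock
{
    atomic_flag held;
} EntryLock;

/*
 * Acquire the entry lock.  With nowait, return false instead of waiting
 * when the lock is already held; otherwise always return true.
 */
bool
lock_entry(EntryLock *lock, bool nowait)
{
    assert(lock != NULL);

    if (nowait)
        return !atomic_flag_test_and_set(&lock->held);

    while (atomic_flag_test_and_set(&lock->held))
        ;                       /* spin; a real LWLock would sleep */
    return true;
}

void
unlock_entry(EntryLock *lock)
{
    assert(lock != NULL);
    atomic_flag_clear(&lock->held);
}
```

Each flush function would then start with
"if (!lock_entry(lock, nowait)) return false;" and keep only its
type-specific counter updates.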

> > +/*
> > + * flush_dbstat - flush out a local database stats entry
> > + *
> > + * If nowait is true, this function returns false on lock failure. Otherwise
...
> > +    /* lock the shared entry to protect the content, skip if failed */
> > +    if (!nowait)
> > +        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
> > +    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
> > +        return false;
> 
> Dito re duplicating all of this.


26: (Fixed. Now all stats are saved in one file.)
> > +/*
> > + * Create the filename for a DB stat file; filename is output parameter points
> > + * to a character buffer of length len.
> > + */
> > +static void
> > +get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
> > +{
> > +    int            printed;
> > +
> > +    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
> > +    printed = snprintf(filename, len, "%s/db_%u.%s",
> > +                       PGSTAT_STAT_PERMANENT_DIRECTORY,
> > +                       databaseid,
> > +                       tempname ? "tmp" : "stat");
> > +    if (printed >= len)
> > +        elog(ERROR, "overlength pgstat path");
> >  }
> 
> Do we really want database specific storage after all of these changes?
> Seems like there's no point anymore?

Sounds reasonable. Since we no longer keep the file format,
pgstat_read/write_statsfiles() gets far simpler. (Fixed)

27: (Fixed. added CFI to the same kind of loops.)
> > +    dshash_seq_init(&hstat, pgStatSharedHash, false);
> > +    while ((p = dshash_seq_next(&hstat)) != NULL)
> >      {
> > -        Oid            tabid = tabentry->tableid;
> > -
> > -        CHECK_FOR_INTERRUPTS();
> > -
> 
> Given that this could take a while on a database with a lot of objects
> it might worth keeping the CHECK_FOR_INTERRUPTS().

Agreed. It seems like a mistake. (Fixed pgstat_read/write_statsfile().)

28: (Fixed. collect_stat_entries is removed along with PGSTAT_TYPE_ALL.)
> >  /* ----------
> > - * pgstat_vacuum_stat() -
> > + * collect_stat_entries() -
> >   *
> > - *    Will tell the collector about objects he can get rid of.
> > + *    Collect the shared statistics entries specified by type and dbid. Returns a
> > + *  list of pointer to shared statistics in palloc'ed memory. If type is
> > + *  PGSTAT_TYPE_ALL, all types of statistics of the database is collected. If
> > + *  type is PGSTAT_TYPE_DB, the parameter dbid is ignored and collect all
> > + *  PGSTAT_TYPE_DB entries.
> >   * ----------
> >   */
> > -void
> > -pgstat_vacuum_stat(void)
> > +static PgStatEnvelope * *collect_stat_entries(PgStatTypes type, Oid dbid)
> >  {
> 
> > -        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
> > +        if ((type != PGSTAT_TYPE_ALL && p->key.type != type) ||
> > +            (type != PGSTAT_TYPE_DB && p->key.databaseid != dbid))
> >              continue;
> 
> I don't like this interface much. Particularly not that it requires
> adding a PGSTAT_TYPE_ALL that's otherwise not needed. And the thing
> where PGSTAT_TYPE_DB doesn't actually works as one would expect isn't
> nice either.

Sounds reasonable. It was annoying that dbid=InvalidOid is a valid
value for this interface. But now the function is called from only
two places, and it is simpler to use a dshash seqscan directly. The
function and the enum item PGSTAT_TYPE_ALL are gone.
(Fixed)

29: (Fixed. collect_stat_entries is gone.)
> > +        if (n >= listlen - 1)
> > +            listlen *= 2;
> > +            envlist = repalloc(envlist, listlen * sizeof(PgStatEnvelope * *));
> > +        envlist[n++] = dsa_get_address(area, p->env);
> >      }
> 
> I'd use List here as well.

So the function no longer exists. (Fixed)

30: (...)
> > +    dshash_seq_term(&hstat);
> 
> Hm, I didn't immediately see which locking makes this safe? Is it just
> that nobody should be attached at this point?

I'm not sure I get your point, but I'll try to elaborate.

All the callers of collect_stat_entries have been replaced with a bare
loop of dshash_seq_next.

There are two levels of lock here. One is dshash partition lock that
is needed to continue in-partition scan safely. Another is a lock of
stats entry that is pointed from a dshash entry.

---
((PgStatHashEntry) shent).body -(dsa_get_address)-+-> PgStat_StatEntryHeader
                                                  |
((PgStatLocalHashEntry) lent).body ---------------^
---

Dshash scans are used for dropping and resetting stats entries. Entry
dropping is performed in the following steps.

(delete_current_stats_entry())
- Drop the dshash entry (needs exlock of dshash partition).

- If refcount of the stats entry body is already zero, free the memory
  immediately.

- If not, set the "dropped" flag of the body. No lock is required
  because the "dropped" flag won't be even referred to by other
  backends until the next step is done.

- Increment deletion count of the shared hash. (This is used as the
  "age" of the local pointer cache hash (pgstat_cache).)

(get_stat_entry())

- If dshash deletion count is different from the local cache age, scan
  over the local cache hash to find "dropped" entries.

- Decrement refcount of the dropped entry and free the shared entry
  if it is no longer referenced. Apparently no lock is required.

pgstat_drop_database() and pgstat_vacuum_stat() have concurrent
backends so the locks above are required. pgstat_write_statsfile() is
guaranteed to run alone so it doesn't matter whether it takes locks or
not.

pgstat_reset_counters() doesn't drop or modify dshash entries so the
dshash scan requires only a shared lock. The stats entry body is
updated so it needs an exclusive lock.
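
The drop steps listed above can be modelled in a single process as
follows. This is only a sketch of the bookkeeping, with hypothetical
names; the "freed" flag stands in for freeing DSA memory, and the
sketch says nothing about the synchronization a real multi-backend
implementation needs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_SIZE 4

/* Hypothetical shared entry body. */
typedef struct SharedEntry
{
    int         refcount;       /* backends caching a pointer to this body */
    bool        dropped;        /* hash entry removed while referenced */
    bool        freed;          /* stands in for freeing the DSA memory */
} SharedEntry;

static unsigned long shared_deletion_count = 0;

/* Per-backend pointer cache plus the deletion count it was synced to. */
typedef struct LocalCache
{
    SharedEntry *slots[CACHE_SIZE];
    unsigned long age;
} LocalCache;

/* A backend starts caching a pointer to the entry. */
void
cache_entry(LocalCache *cache, int slot, SharedEntry *entry)
{
    assert(slot >= 0 && slot < CACHE_SIZE && !entry->dropped);
    entry->refcount++;
    cache->slots[slot] = entry;
}

/* Dropping side: called after the hash entry itself has been removed. */
void
drop_entry(SharedEntry *entry)
{
    if (entry->refcount == 0)
        entry->freed = true;    /* nobody references it: free now */
    else
        entry->dropped = true;  /* defer the free to the last referrer */
    shared_deletion_count++;    /* makes local caches look stale */
}

/* Lookup side: release dropped entries when the cache is out of date. */
void
sync_cache(LocalCache *cache)
{
    int         i;

    if (cache->age == shared_deletion_count)
        return;

    for (i = 0; i < CACHE_SIZE; i++)
    {
        SharedEntry *entry = cache->slots[i];

        if (entry == NULL || !entry->dropped)
            continue;
        if (--entry->refcount == 0)
            entry->freed = true;
        cache->slots[i] = NULL;
    }
    cache->age = shared_deletion_count;
}
```

In this model an entry dropped while two backends still cache it is
freed only after both have synced their caches.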


31: (Fixed. Use List instead of the open coding.)
> > +void
> > +pgstat_vacuum_stat(void)
> > +{
...
> > +    /* collect victims from shared stats */
> > +    arraylen = 16;
> > +    victims = palloc(sizeof(PgStatEnvelope * *) * arraylen);
> > +    nvictims = 0;
> 
> Same List comment as before.

The function uses a list now. (Fixed)

32: (Fixed.)
> >  void
> >  pgstat_reset_counters(void)
> >  {
> > -    PgStat_MsgResetcounter msg;
> > +    PgStatEnvelope **envlist;
> > +    PgStatEnvelope **p;
> >
> > -    if (pgStatSock == PGINVALID_SOCKET)
> > -        return;
> > +    /* Lookup the entries of the current database in the stats hash. */
> > +    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, MyDatabaseId);
> > +    for (p = envlist; *p != NULL; p++)
> > +    {
> > +        PgStatEnvelope *env = *p;
> > +        PgStat_StatDBEntry *dbstat;
> >
> > -    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
> > -    msg.m_databaseid = MyDatabaseId;
> > -    pgstat_send(&msg, sizeof(msg));
> > +        LWLockAcquire(&env->lock, LW_EXCLUSIVE);
> > +
> 
> What locking prevents this entry from being freed between the
> collect_stat_entries() and this LWLockAcquire?

Mmm. They're not protected.  The attached version no longer uses the
intermediate list and the fetched dshash entry is protected by dshash
partition lock.  (Fixed)


33: (Will keep the current code.)
> >  /* ----------
> > @@ -1440,48 +1684,63 @@ pgstat_reset_slru_counter(const char *name)
> >  void
> >  pgstat_report_autovac(Oid dboid)
> >  {
> > -    PgStat_MsgAutovacStart msg;
> > +    PgStat_StatDBEntry *dbentry;
> > +    TimestampTz ts;
> >
> > -    if (pgStatSock == PGINVALID_SOCKET)
> > +    /* return if activity stats is not active */
> > +    if (!area)
> >          return;
> >
> > -    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
> > -    msg.m_databaseid = dboid;
> > -    msg.m_start_time = GetCurrentTimestamp();
> > +    ts = GetCurrentTimestamp();
> >
> > -    pgstat_send(&msg, sizeof(msg));
> > +    /*
> > +     * Store the last autovacuum time in the database's hash table entry.
> > +     */
> > +    dbentry = get_local_dbstat_entry(dboid);
> > +    dbentry->last_autovac_time = ts;
> >  }
> 
> Why did you introduce the local ts variable here?

The function used to assign the timestamp within an LWLock section. In
the last version it wrote to a local entry so the lock was useless, but
the amendment following comment #34 just below introduces
LWLocks again.

34: (Fixed. Vacuum/analyze write shared stats instantly.)
> >  /* --------
> >   * pgstat_report_analyze() -
> >   *
> > - *    Tell the collector about the table we just analyzed.
> > + *    Report about the table we just analyzed.
> >   *
> >   * Caller must provide new live- and dead-tuples estimates, as well as a
> >   * flag indicating whether to reset the changes_since_analyze counter.
> > @@ -1492,9 +1751,10 @@ pgstat_report_analyze(Relation rel,
> >                        PgStat_Counter livetuples, PgStat_Counter deadtuples,
> >                        bool resetcounter)
> >  {
> >  }
> 
> It seems to me that the analyze / vacuum cases would be much better
> dealth with by synchronously operating on the shared entry, instead of
> going through the local hash table. ISTM that that'd make it a lot

Blocking at the beginning and end of such operations doesn't
matter. Sounds reasonable.

> going through the local hash table. ISTM that that'd make it a lot
> easier to avoid most of the ordering issues.

Agreed. That avoids at least the case of the delayed vacuum report (#22).


35: (Fixed, needing a change of how relcache uses local stats.)
> > +static PgStat_TableStatus *
> > +get_local_tabstat_entry(Oid rel_id, bool isshared)
> > +{
> > +    PgStatEnvelope *env;
> > +    PgStat_TableStatus *tabentry;
> > +    bool        found;
> >
> > -    /*
> > -     * Now we can fill the entry in pgStatTabHash.
> > -     */
> > -    hash_entry->tsa_entry = entry;
> > +    env = get_local_stat_entry(PGSTAT_TYPE_TABLE,
> > +                               isshared ? InvalidOid : MyDatabaseId,
> > +                               rel_id, true, &found);
> >
> > -    return entry;
> > +    tabentry = (PgStat_TableStatus *) &env->body;
> > +
> > +    if (!found)
> > +    {
> > +        tabentry->t_id = rel_id;
> > +        tabentry->t_shared = isshared;
> > +        tabentry->trans = NULL;
> > +        MemSet(&tabentry->t_counts, 0, sizeof(PgStat_TableCounts));
> > +        tabentry->vacuum_timestamp = 0;
> > +        tabentry->autovac_vacuum_timestamp = 0;
> > +        tabentry->analyze_timestamp = 0;
> > +        tabentry->autovac_analyze_timestamp = 0;
> > +    }
> > +
> 
> As with shared entries, I think this should just be zero initialized
> (and we should try to get rid of the duplication of t_id/t_shared).

Ah! Yeah, they are removable since we already converted them into the
key of the hash entry.  I removed the oids and the initialization code
from all types of local stats entries.

One annoyance in doing that was pgstat_initstats, which assumes the
pgstat_info linked from a relation won't be freed.  In the end I
tightened up the management of the pgstat_info link. The link between
relcache and the table stats entry is now bidirectional and is explicitly
de-linked by a new function pgstat_delinkstats().
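
To make the bidirectional link concrete, here is a minimal sketch, not the
patch code. RelEntry, TableStats, link_stats(), delink_stats() and
release_stats() are hypothetical stand-ins for the relcache entry, the local
table stats entry and pgstat_delinkstats(). It only shows that either side
can break the link without leaving a dangling pointer on the other side.

```c
#include <assert.h>
#include <stddef.h>

struct TableStats;

/* Hypothetical stand-in for a relcache entry. */
typedef struct RelEntry
{
    unsigned int        relid;
    struct TableStats  *pgstat_info;    /* forward link to the stats entry */
} RelEntry;

/* Hypothetical stand-in for a local table stats entry. */
typedef struct TableStats
{
    long        tuples_inserted;
    RelEntry   *relation;               /* back link to the relcache entry */
} TableStats;

/* Establish both directions of the link at once. */
static void
link_stats(RelEntry *rel, TableStats *stats)
{
    rel->pgstat_info = stats;
    stats->relation = rel;
}

/* Break the link starting from the relcache side. */
static void
delink_stats(RelEntry *rel)
{
    if (rel->pgstat_info != NULL)
    {
        rel->pgstat_info->relation = NULL;
        rel->pgstat_info = NULL;
    }
}

/* Break the link starting from the stats side, e.g. before freeing it. */
static void
release_stats(TableStats *stats)
{
    if (stats->relation != NULL)
    {
        stats->relation->pgstat_info = NULL;
        stats->relation = NULL;
    }
}
```

With only a one-way pointer, freeing the stats entry would leave the
relation pointing at freed memory; the back link is what lets the stats side
clear it.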


36: (Perhaps fixed. I'm not confident, though.)
> > +    return tabentry;
> >  }
> >
> > +
> >  /*
> >   * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
> >   *
> > - * If no entry, return NULL, don't create a new one
> > + *  Find any existing PgStat_TableStatus entry for rel from the current
> > + *  database then from shared tables.
> 
> What do you mean with "from the current database then from shared
> tables"?

I rewrote it as follows. Is this readable?

| *  Find any existing PgStat_TableStatus entry for rel_id in the current
| *  database. If not found, try finding from shared tables.

37: (Maybe fixed.)
> >  void
> > -pgstat_send_archiver(const char *xlog, bool failed)
> > +pgstat_report_archiver(const char *xlog, bool failed)
> >  {
..
> > +    if (failed)
> > +    {
> > +        /* Failed archival attempt */
> > +        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
> > +        ++shared_archiverStats->failed_count;
> > +        memcpy(shared_archiverStats->last_failed_wal, xlog,
> > +               sizeof(shared_archiverStats->last_failed_wal));
> > +        shared_archiverStats->last_failed_timestamp = now;
> > +        LWLockRelease(StatsLock);
> > +    }
> > +    else
> > +    {
> > +        /* Successful archival operation */
> > +        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
> > +        ++shared_archiverStats->archived_count;
> > +        memcpy(shared_archiverStats->last_archived_wal, xlog,
> > +               sizeof(shared_archiverStats->last_archived_wal));
> > +        shared_archiverStats->last_archived_timestamp = now;
> > +        LWLockRelease(StatsLock);
> > +    }
> >  }
> 
> Huh, why is this duplicating near equivalent code?

Either to avoid branches within a lock section, or simply because it was
expanded from master as it was. The counters can be reset by backends, so
I couldn't change it to use the changecount protocol. So it still uses an
LWLock, but the common code is factored out in the attached version.
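
A rough sketch of that kind of factoring, not the code in the attached
patch: the branch picks the target counter set before the lock is taken, so
the locked section itself stays branch-free. The ArchiverCounters layout,
WAL_NAME_LEN and report_archiver() are assumptions for illustration, and a
pthread mutex stands in for the LWLock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define WAL_NAME_LEN 41         /* hypothetical buffer size */

/* One set of counters; used once for successes and once for failures. */
typedef struct ArchiverCounters
{
    int64_t     count;
    char        last_wal[WAL_NAME_LEN];
    int64_t     last_timestamp;
} ArchiverCounters;

typedef struct ArchiverStats
{
    ArchiverCounters archived;
    ArchiverCounters failed;
} ArchiverStats;

/* Stand-ins for the shared stats area and the lock protecting it. */
static pthread_mutex_t stats_lock = PTHREAD_MUTEX_INITIALIZER;
static ArchiverStats shared_stats;

static void
report_archiver(const char *xlog, bool failed, int64_t now)
{
    /* Choose the target outside the lock; the locked part is common code. */
    ArchiverCounters *c = failed ? &shared_stats.failed
                                 : &shared_stats.archived;

    pthread_mutex_lock(&stats_lock);
    c->count++;
    strncpy(c->last_wal, xlog, sizeof(c->last_wal) - 1);
    c->last_wal[sizeof(c->last_wal) - 1] = '\0';
    c->last_timestamp = now;
    pthread_mutex_unlock(&stats_lock);
}
```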

In connection with this, while I was looking at bgwriter and
checkpointer to see if the statistics of the two could be split, I
found the following comment in checkpointer.c.

| * Send off activity statistics to the activity stats facility.  (The
| * reason why we re-use bgwriter-related code for this is that the
| * bgwriter and checkpointer used to be just one process.  It's
| * probably not worth the trouble to split the stats support into two
| * independent stats message types.)

So I split the two to try to get rid of the LWLock for the global stats,
but the need to reset counters prevented me from doing that. In the
attached version I left the split in place, since it was already done.


38: (Haven't addressed.)
> >  /* ----------
> >   * pgstat_write_statsfiles() -
> > - *        Write the global statistics file, as well as requested DB files.
> > - *
> > - *    'permanent' specifies writing to the permanent files not temporary ones.
> > - *    When true (happens only when the collector is shutting down), also remove
> > - *    the temporary files so that backends starting up under a new postmaster
> > - *    can't read old data before the new collector is ready.
> > - *
> > - *    When 'allDbs' is false, only the requested databases (listed in
> > - *    pending_write_requests) will be written; otherwise, all databases
> > - *    will be written.
> > + *        Write the global statistics file, as well as DB files.
> >   * ----------
> >   */
> > -static void
> > -pgstat_write_statsfiles(bool permanent, bool allDbs)
> > +void
> > +pgstat_write_statsfiles(void)
> >  {
> 
> Whats the locking around this?

No locking is used there. The code is (currently) guaranteed to run in
the only process that reads it.  Added a comment and an assertion.  I did
the same to pgstat_read_statsfile().


39: (Fixed.)
> > -pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
> > +pgstat_write_database_stats(PgStat_StatDBEntry *dbentry)
> >  {
> > -    HASH_SEQ_STATUS tstat;
> > -    HASH_SEQ_STATUS fstat;
> > -    PgStat_StatTabEntry *tabentry;
> > -    PgStat_StatFuncEntry *funcentry;
> > +    PgStatEnvelope **envlist;
> > +    PgStatEnvelope **penv;
> >      FILE       *fpout;
> >      int32        format_id;
> >      Oid            dbid = dbentry->databaseid;
> > @@ -5048,8 +4974,8 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
> >      char        tmpfile[MAXPGPATH];
> >      char        statfile[MAXPGPATH];
> >
> > -    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
> > -    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
> > +    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
> > +    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
> >
> >      elog(DEBUG2, "writing stats file \"%s\"", statfile);
> >
> > @@ -5076,24 +5002,31 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
> >      /*
> >       * Walk through the database's access stats per table.
> >       */
> > -    hash_seq_init(&tstat, dbentry->tables);
> > -    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
> > +    envlist = collect_stat_entries(PGSTAT_TYPE_TABLE, dbentry->databaseid);
> > +    for (penv = envlist; *penv != NULL; penv++)
> 
> In several of these collect_stat_entries() callers it really bothers me
> that we basically allocate an array as large as the number of objects
> in the database (That's fine for databases, but for tables...). Without
> much need as far as I can see.

collect_stat_entries() has been removed (#28) and the callers now handle
entries directly in the dshash_seq_next loop.
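
For readers who have not looked at 0001 yet, the shape of such a loop can be
sketched with a toy, single-process chained hash table. This is only an
illustration under loud assumptions: Table, Item, SeqStatus, seq_init(),
seq_next() and seq_delete_current() are made-up names, and there is no
partition locking at all, so nothing here corresponds to dshash_seq_term().
What it does show is why the scan state remembers the next item before
handing out the current one: the caller may delete the current item.

```c
#include <assert.h>
#include <stdlib.h>

#define NBUCKETS 4

typedef struct Item
{
    struct Item *next;
    unsigned int key;
} Item;

typedef struct Table
{
    Item       *buckets[NBUCKETS];
} Table;

/* Scan state; callers should treat it as opaque. */
typedef struct SeqStatus
{
    Table      *table;
    int         curbucket;  /* bucket holding the current item */
    Item       *curitem;    /* item returned by the last seq_next() */
    Item       *nextitem;   /* saved so that curitem may be deleted */
    int         started;
} SeqStatus;

static void
table_insert(Table *t, unsigned int key)
{
    Item       *it = malloc(sizeof(Item));

    assert(it != NULL);
    it->key = key;
    it->next = t->buckets[key % NBUCKETS];
    t->buckets[key % NBUCKETS] = it;
}

static void
seq_init(SeqStatus *s, Table *t)
{
    s->table = t;
    s->curbucket = 0;
    s->curitem = NULL;
    s->nextitem = NULL;
    s->started = 0;
}

/* Return the next item, or NULL when every bucket has been scanned. */
static Item *
seq_next(SeqStatus *s)
{
    Item       *next;

    if (!s->started)
    {
        s->started = 1;
        next = s->table->buckets[0];
    }
    else
        next = s->nextitem;

    /* Move on to the next bucket once the current chain is exhausted. */
    while (next == NULL)
    {
        if (++s->curbucket >= NBUCKETS)
        {
            s->curitem = NULL;
            return NULL;
        }
        next = s->table->buckets[s->curbucket];
    }

    s->curitem = next;
    s->nextitem = next->next;   /* remember it in case of deletion */
    return next;
}

/* Unlink and free the item most recently returned by seq_next(). */
static void
seq_delete_current(SeqStatus *s)
{
    Item      **pp = &s->table->buckets[s->curbucket];

    assert(s->curitem != NULL);
    while (*pp != s->curitem)
        pp = &(*pp)->next;
    *pp = s->curitem->next;
    free(s->curitem);
    s->curitem = NULL;
}

/* Visit every entry once, dropping odd keys; returns the entries seen. */
static int
scan_and_prune_odd(Table *t)
{
    SeqStatus   s;
    Item       *it;
    int         seen = 0;

    seq_init(&s, t);
    while ((it = seq_next(&s)) != NULL)
    {
        seen++;
        if (it->key % 2 != 0)
            seq_delete_current(&s);
    }
    return seen;
}
```

Compared with building an array of every entry first, the loop needs only
the constant-size scan state, which is the point of the review comment.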

40: (Fixed.)
> >      {
> > +        PgStat_StatTabEntry *tabentry = (PgStat_StatTabEntry *) &(*penv)->body;
> > +
> >          fputc('T', fpout);
> >          rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
> >          (void) rc;                /* we'll check for error with ferror */
> >      }
> > +    pfree(envlist);
> >
> >      /*
> >       * Walk through the database's function stats table.
> >       */
> > -    hash_seq_init(&fstat, dbentry->functions);
> > -    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
> > +    envlist = collect_stat_entries(PGSTAT_TYPE_FUNCTION, dbentry->databaseid);
> > +    for (penv = envlist; *penv != NULL; penv++)
> >      {
> > +        PgStat_StatFuncEntry *funcentry =
> > +        (PgStat_StatFuncEntry *) &(*penv)->body;
> > +
> >          fputc('F', fpout);
> >          rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
> >          (void) rc;                /* we'll check for error with ferror */
> >      }
> > +    pfree(envlist);
> 
> Why do we need separate loops for every type of object here?

Just to keep the file format. But we decided to change it (#26) and it
is now a jumble of all kinds of stats
entries. pgstat_write/read_statsfile() became far simpler.
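
To illustrate why a single stream of mixed entries makes the reader and
writer simpler, here is a small sketch. It is not the file format of the
patch: RecordHeader, the body structs, body_size[], write_entry() and
read_entry() are all assumptions. The idea is only that each record carries
its own type tag, so one loop can write or read entries of any kind.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef enum
{
    STAT_DB = 0,
    STAT_TABLE,
    STAT_FUNCTION,
    STAT_NTYPES
} StatType;

/* Hypothetical per-type bodies. */
typedef struct { long xact_commit; } DbBody;
typedef struct { long tuples_inserted; long tuples_deleted; } TableBody;
typedef struct { long calls; } FuncBody;

/* Body size per type, indexed by StatType. */
static const size_t body_size[STAT_NTYPES] = {
    sizeof(DbBody), sizeof(TableBody), sizeof(FuncBody)
};

/* Every record starts with the same header, whatever its type. */
typedef struct
{
    int             type;
    unsigned int    dbid;
    unsigned int    objid;
} RecordHeader;

/* Write one entry; returns 1 on success, 0 on failure. */
static int
write_entry(FILE *fp, StatType type, unsigned int dbid, unsigned int objid,
            const void *body)
{
    RecordHeader hdr;

    memset(&hdr, 0, sizeof(hdr));
    hdr.type = (int) type;
    hdr.dbid = dbid;
    hdr.objid = objid;

    if (fwrite(&hdr, sizeof(hdr), 1, fp) != 1)
        return 0;
    return fwrite(body, body_size[type], 1, fp) == 1;
}

/* Read one entry; returns 1 on success, 0 on EOF or a malformed record. */
static int
read_entry(FILE *fp, RecordHeader *hdr, void *body, size_t bodycap)
{
    if (fread(hdr, sizeof(*hdr), 1, fp) != 1)
        return 0;
    if (hdr->type < 0 || hdr->type >= STAT_NTYPES)
        return 0;
    if (body_size[hdr->type] > bodycap)
        return 0;
    return fread(body, body_size[hdr->type], 1, fp) == 1;
}
```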


41: (Fixed.)
> > +/* ----------
> > + * create_missing_dbentries() -
> > + *
> > + *  There may be the case where database entry is missing for the database
> > + *  where object stats are recorded. This function creates such missing
> > + *  dbentries so that so that all stats entries can be written out to files.
> > + * ----------
> > + */
> > +static void
> > +create_missing_dbentries(void)
> > +{
> 
> In which situation is this necessary?

It is because the old file format required those entries. It is no
longer needed and was removed in #26.


42: (Sorry, but I didn't get your point..)
> > +static PgStatEnvelope *
> > +get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
> > +               bool nowait, entry_initializer initfunc, bool *found)
> > +{
> 
> > +    bool        create = (initfunc != NULL);
> > +    PgStatHashEntry *shent;
> > +    PgStatEnvelope *shenv = NULL;
> > +    PgStatHashEntryKey key;
> > +    bool        myfound;
> > +
> > +    Assert(type != PGSTAT_TYPE_ALL);
> > +
> > +    key.type = type;
> > +    key.databaseid = dbid;
> > +    key.objectid = objid;
> > +    shent = dshash_find_extended(pgStatSharedHash, &key,
> > +                                 create, nowait, create, &myfound);
> > +    if (shent)
> >      {
> > -        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
> > +        if (create && !myfound)
> > +        {
> > +            /* Create new stats envelope. */
> > +            size_t        envsize = PgStatEnvelopeSize(pgstat_entsize[type]);
> > +            dsa_pointer chunk = dsa_allocate0(area, envsize);
> 
> > +            /*
> > +             * The lock on dshsh is released just after. Call initializer
> > +             * callback before it is exposed to other process.
> > +             */
> > +            if (initfunc)
> > +                initfunc(shenv);
> > +
> > +            /* Link the new entry from the hash entry. */
> > +            shent->env = chunk;
> > +        }
> > +        else
> > +            shenv = dsa_get_address(area, shent->env);
> > +
> > +        dshash_release_lock(pgStatSharedHash, shent);
> 
> Doesn't this mean that by this time the entry could already have been
> removed by a concurrent backend, and the dsa allocation freed?

Does "by this time" mean before the dshash_find_extended, or after it
and until dshash_release_lock?

We can create an entry for a just-dropped object, but it should be
removed again by the next vacuum.

The newly created entry (or its partition) is exclusively locked, so no
concurrent backend can find it until the dshash_release_lock.

The shenv could be removed before the caller accesses it. But since the
function is called for an existing object, the entry cannot be removed
until the first vacuum after the transaction ends. I added a comment
just before the dshash_release_lock in get_stat_entry().
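
The two properties discussed here (a new entry is initialized before anyone
else can see it, and a nowait caller gives up instead of blocking) can be
sketched as follows. This is a toy under stated assumptions, not
get_stat_entry(): a single pthread mutex stands in for the dshash partition
lock, and Entry, table[], get_entry() and init_counter() are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ENTRIES 8

typedef struct Entry
{
    unsigned int key;
    bool        used;
    long        counter;
} Entry;

typedef void (*entry_initializer) (Entry *);

/* Stand-ins for the shared hash and its (single) partition lock. */
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static Entry table[MAX_ENTRIES];

/*
 * Look up key.  With a non-NULL initfunc a missing entry is created and
 * initialized while the lock is still held.  With nowait, return NULL at
 * once if the lock is busy.
 */
static Entry *
get_entry(unsigned int key, bool nowait, entry_initializer initfunc,
          bool *found)
{
    Entry      *result = NULL;
    Entry      *freeslot = NULL;
    int         i;

    *found = false;

    if (nowait)
    {
        if (pthread_mutex_trylock(&table_lock) != 0)
            return NULL;        /* lock busy: the caller retries later */
    }
    else
        pthread_mutex_lock(&table_lock);

    for (i = 0; i < MAX_ENTRIES; i++)
    {
        if (table[i].used && table[i].key == key)
        {
            result = &table[i];
            break;
        }
        if (!table[i].used && freeslot == NULL)
            freeslot = &table[i];
    }

    *found = (result != NULL);

    if (result == NULL && initfunc != NULL && freeslot != NULL)
    {
        /* Initialize first, then publish, all before releasing the lock. */
        freeslot->key = key;
        initfunc(freeslot);
        freeslot->used = true;
        result = freeslot;
    }

    pthread_mutex_unlock(&table_lock);
    return result;
}

/* Example initializer callback. */
static void
init_counter(Entry *e)
{
    e->counter = 100;
}
```

Note that the pointer is returned after the lock is released, which is
exactly the window the review comment asks about. The toy table never
removes entries, so it cannot show the removal case itself.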


43: (Fixed. But has a side effect.)
> > Subject: [PATCH v36 7/7] Remove the GUC stats_temp_directory
> >
> > The GUC used to specify the directory to store temporary statistics
> > files. It is no longer needed by the stats collector but still used by
> > the programs in bin and contrib, and maybe other extensions. Thus this
> > patch removes the GUC but some backing variables and macro definitions
> > are left alone for backward compatibility.
> 
> I don't see what this achieves? Which use of those variables / macros
> would would be safe? I think it'd be better to just remove them.

pg_stat_statements used the PG_STAT_TMP directory to store a temporary
file. I just replaced it with PGSTAT_STAT_PERMANENT_DIRECTORY.  As a
result, basebackup copies the temporary file of pg_stat_statements.

By the way, basebackup excludes the pg_stat_tmp directory but sends the
pg_stat directory. On the other hand, when we start a server from a base
backup, it starts crash recovery first and removes stats files anyway.
Why does basebackup send the pg_stat directory then? (Added as 0007.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 1ba12492bec139dffff2b1aa61468af7f2eca8e8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v38 1/7] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds the
functionality. The interface is similar to, but a bit different from,
both that of dynahash and the simple dshash search functions. One of the
most significant differences is that the sequential scan interface of
dshash always needs a call to dshash_seq_term when a scan ends. Another
is locking. Dshash holds a partition lock when returning an entry;
dshash_seq_next() also holds the lock when returning an entry, but
callers shouldn't release it, since the lock is essential to continue
the scan. The seqscan interface allows entry deletion during a scan. The
in-scan deletion should be performed by dshash_delete_current().
---
 src/backend/lib/dshash.c | 160 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  22 ++++++
 2 files changed, 181 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 78ccf03217..b829167872 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,156 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should always be called when a scan finished.
+ * The caller may delete returned elements midst of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * Returned elements are locked and the caller must not explicitly release
+ * it. It is released at the next call to dshash_seq_next().
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(status->hash_table,
+                                               status->curpartition),
+                                status->exclusive ? LW_EXCLUSIVE : LW_SHARED));
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Check if move to the next partition */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Move to the next partition. Lock the next partition then release
+             * the current, not in the reverse order to avoid concurrent
+             * resizing.  Avoid dead lock by taking lock in the same order
+             * with resize().
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and release all locks.
+ *
+ * Should be always called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry while a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition    = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
+/* Get the current entry while a seq scan. */
+void *
+dshash_get_current(dshash_seq_status *status)
+{
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68e77..c337099061 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The detail is exposed to let users know the storage
+ * size but it should be considered as an opaque type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;    /* dshash table working on */
+    int                    curbucket;    /* bucket number we are at */
+    int                    nbuckets;    /* total number of buckets in the dshash */
+    dshash_table_item  *curitem;    /* item we are currently at */
+    dsa_pointer            pnextitem;    /* dsa-pointer to the next item */
+    int                    curpartition;    /* partition number we are at */
+    bool                exclusive;    /* locking mode */
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,13 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
+extern void *dshash_get_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.18.4

From 6225db225affd612a27e2c4dac95135ba1d7484e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v38 2/7] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally. That is inconvenient
when we want to avoid being blocked by other processes. This commit
adds dshash_find_extended, an alternative to dshash_find and
dshash_find_or_insert that allows immediate return on lock failure.
---
 src/backend/lib/dshash.c | 98 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 55 insertions(+), 46 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index b829167872..9c90096f3d 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,60 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "exclusive" is the lock mode in which the partition for the returned item
+ * is locked.  If "nowait" is true, the function immediately returns if
+ * required lock was not acquired.  "insert" indicates insert mode. In this
+ * mode new entry is inserted and set *found to false. *found is set to true if
+ * found. "found" must be non-null in this mode.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
     dshash_table_item *item;
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+    /* must be exclusive when insert allowed */
+    Assert(!insert || (exclusive && found != NULL));
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = true;
+    }
     else
     {
-        *found = false;
+        if (found)
+            *found = false;
+
+        if (!insert)
+        {
+            /* The caller didn't ask us to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
 
         /* Check if we are getting too full. */
         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +483,8 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
@@ -493,12 +498,13 @@ restart:
         ++partition->count;
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller will free the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index c337099061..493e974832 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.18.4

From 6db81f15b246ab3ff5bcb2f1855108e09d3b73be Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v38 3/7] Make archiver process an auxiliary process

This is a preliminary patch for shared-memory based stats collector.

The archiver process must be an auxiliary process since it uses shared
memory once stats data is moved into shared memory. Make the process
an auxiliary process in order to make it work.
---
 src/backend/access/transam/xlogarchive.c |   6 +-
 src/backend/bootstrap/bootstrap.c        |  22 ++--
 src/backend/postmaster/pgarch.c          | 130 +++--------------------
 src/backend/postmaster/postmaster.c      |  50 +++++----
 src/backend/storage/lmgr/proc.c          |   1 +
 src/include/access/xlog.h                |   3 +
 src/include/access/xlogarchive.h         |   1 +
 src/include/miscadmin.h                  |   2 +
 src/include/postmaster/pgarch.h          |   4 +-
 src/include/storage/pmsignal.h           |   1 -
 src/include/storage/proc.h               |   3 +
 11 files changed, 69 insertions(+), 154 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index cae93ab69d..6908bec2f9 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -29,7 +29,9 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/latch.h"
 #include "storage/pmsignal.h"
+#include "storage/proc.h"
 
 /*
  * Attempt to retrieve the specified file from off-line archival storage.
@@ -489,8 +491,8 @@ XLogArchiveNotify(const char *xlog)
     }
 
     /* Notify archiver that it's got something to do */
-    if (IsUnderPostmaster)
-        SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+    if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+        SetLatch(ProcGlobal->archiverLatch);
 }
 
 /*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 76b2f5066f..81bfaea869 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -317,6 +317,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
         case StartupProcess:
             MyBackendType = B_STARTUP;
             break;
+        case ArchiverProcess:
+            MyBackendType = B_ARCHIVER;
+            break;
         case BgWriterProcess:
             MyBackendType = B_BG_WRITER;
             break;
@@ -437,30 +440,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */
 
         case StartupProcess:
-            /* don't set signals, startup process has its own agenda */
             StartupProcessMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
+
+        case ArchiverProcess:
+            PgArchiverMain();
+            proc_exit(1);
 
         case BgWriterProcess:
-            /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case CheckpointerProcess:
-            /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalWriterProcess:
-            /* don't set signals, walwriter has its own agenda */
             InitXLOGAccess();
             WalWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalReceiverProcess:
-            /* don't set signals, walreceiver has its own agenda */
             WalReceiverMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index ed1b65358d..e3a520def9 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -78,13 +79,11 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
-static volatile sig_atomic_t wakened = false;
 static volatile sig_atomic_t ready_to_stop = false;
 
 /* ----------
@@ -95,8 +94,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -110,75 +107,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -212,14 +140,9 @@ pgarch_forkexec(void)
 #endif                            /* EXEC_BACKEND */
 
 
-/*
- * PgArchiverMain
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
- *    since we don't use 'em, it hardly matters...
- */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+/* Main entry point for archiver process */
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -231,33 +154,26 @@ PgArchiverMain(int argc, char *argv[])
     /* SIGQUIT handler was already set up by InitPostmasterChild */
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, pgarch_waken);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
     pqsignal(SIGUSR2, pgarch_waken_stop);
+
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);
+
+    /* Unblock signals (they were blocked when the postmaster forked us) */
     PG_SETMASK(&UnBlockSig);
 
-    MyBackendType = B_ARCHIVER;
-    init_ps_display(NULL);
+    /*
+     * Advertise our latch that backends can use to wake us up while we're
+     * sleeping.
+     */
+    ProcGlobal->archiverLatch = &MyProc->procLatch;
 
     pgarch_MainLoop();
 
     exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    /* set flag that there is work to be done */
-    wakened = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
@@ -282,14 +198,6 @@ pgarch_MainLoop(void)
     pg_time_t    last_copy_time = 0;
     bool        time_to_stop;
 
-    /*
-     * We run the copy loop immediately upon entry, in case there are
-     * unarchived files left over from a previous database run (or maybe the
-     * archiver died unexpectedly).  After that we wait for a signal or
-     * timeout before doing more.
-     */
-    wakened = true;
-
     /*
      * There shouldn't be anything for the archiver to do except to wait for a
      * signal ... however, the archiver exists to protect our data, so she
@@ -328,12 +236,8 @@ pgarch_MainLoop(void)
         }
 
         /* Do what we're here for */
-        if (wakened || time_to_stop)
-        {
-            wakened = false;
-            pgarch_ArchiverCopyLoop();
-            last_copy_time = time(NULL);
-        }
+        pgarch_ArchiverCopyLoop();
+        last_copy_time = time(NULL);
 
         /*
          * Sleep until a signal is received, or until a poll is forced by
@@ -354,13 +258,9 @@ pgarch_MainLoop(void)
                                WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                timeout * 1000L,
                                WAIT_EVENT_ARCHIVER_MAIN);
-                if (rc & WL_TIMEOUT)
-                    wakened = true;
                 if (rc & WL_POSTMASTER_DEATH)
                     time_to_stop = true;
             }
-            else
-                wakened = true;
         }
 
         /*
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 959e3b8873..b811c961a6 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -555,6 +555,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1800,7 +1801,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3054,7 +3055,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3189,20 +3190,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3450,7 +3447,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3655,6 +3652,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3951,6 +3960,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5230,7 +5240,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5275,16 +5285,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (StartWorkerNeeded || HaveCrashedWorker)
         maybe_start_bgworkers();
 
-    if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
-        PgArchPID != 0)
-    {
-        /*
-         * Send SIGUSR1 to archiver process, to wake it up and begin archiving
-         * next WAL file.
-         */
-        signal_child(PgArchPID, SIGUSR1);
-    }
-
     /* Tell syslogger to rotate logfile if requested */
     if (SysLoggerPID != 0)
     {
@@ -5526,6 +5526,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 88566bd9fa..746bed773e 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -181,6 +181,7 @@ InitProcGlobal(void)
     ProcGlobal->startupBufferPinWaitBufId = -1;
     ProcGlobal->walwriterLatch = NULL;
     ProcGlobal->checkpointerLatch = NULL;
+    ProcGlobal->archiverLatch = NULL;
     pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
     pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..65443063e8 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -339,6 +339,9 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
+extern void XLogArchiveWakeup(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/access/xlogarchive.h b/src/include/access/xlogarchive.h
index 1c67de2ede..54ce0b97d7 100644
--- a/src/include/access/xlogarchive.h
+++ b/src/include/access/xlogarchive.h
@@ -25,6 +25,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e3352398..bba002ad24 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -418,6 +418,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -430,6 +431,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b3200874ca..e3ffc63f14 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 56c5ec4481..c691acf8cd 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
-    PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
     PMSIGNAL_START_AUTOVAC_LAUNCHER,    /* start an autovacuum launcher */
     PMSIGNAL_START_AUTOVAC_WORKER,    /* start an autovacuum worker */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 9c9a50ae45..de20520b8c 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -345,6 +345,9 @@ typedef struct PROC_HDR
     int            startupProcPid;
     /* Buffer id of the buffer that Startup process waits for pin on, or -1 */
     int            startupBufferPinWaitBufId;
+    /* Archiver process's latch */
+    Latch       *archiverLatch;
+    /* Current shared estimate of appropriate spins_per_delay value */
 } PROC_HDR;
 
 extern PGDLLIMPORT PROC_HDR *ProcGlobal;
-- 
2.18.4

From 525bdc9aaf73795afbbea1dc64e80591a73fedbb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v38 4/7] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and shared
among backends through files written periodically. Such files can
reach tens of megabytes and are created as often as once per second;
all of that data is serialized by the stats collector and then
deserialized by every backend periodically. To avoid that cost, this
patch places activity statistics data in shared memory. Each backend
accumulates statistics locally and then tries to move them into the
shared statistics at every transaction end, but at intervals of no
less than 10 seconds. Until 60 seconds have elapsed since the last
flush to shared stats, a lock failure postpones the flush in order to
alleviate lock contention that would slow down transactions. After
that, the flush waits for locks so that the shared statistics do not
become stale.
---
 src/backend/access/heap/heapam_handler.c      |    4 +-
 src/backend/access/heap/vacuumlazy.c          |    2 +-
 src/backend/access/transam/xlog.c             |    4 +-
 src/backend/catalog/index.c                   |   24 +-
 src/backend/commands/analyze.c                |    8 +-
 src/backend/commands/dbcommands.c             |    2 +-
 src/backend/commands/matview.c                |    8 +-
 src/backend/commands/vacuum.c                 |    4 +-
 src/backend/postmaster/autovacuum.c           |   74 +-
 src/backend/postmaster/bgwriter.c             |    4 +-
 src/backend/postmaster/checkpointer.c         |   24 +-
 src/backend/postmaster/interrupt.c            |    5 +-
 src/backend/postmaster/pgarch.c               |   12 +-
 src/backend/postmaster/pgstat.c               | 5379 +++++++----------
 src/backend/postmaster/postmaster.c           |   88 +-
 src/backend/replication/basebackup.c          |    4 +-
 src/backend/storage/buffer/bufmgr.c           |    8 +-
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    4 +-
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/storage/smgr/smgr.c               |    4 +-
 src/backend/tcop/postgres.c                   |   37 +-
 src/backend/utils/adt/pgstatfuncs.c           |   83 +-
 src/backend/utils/cache/relcache.c            |    5 +
 src/backend/utils/init/globals.c              |    1 +
 src/backend/utils/init/miscinit.c             |    3 -
 src/backend/utils/init/postinit.c             |   11 +
 src/backend/utils/misc/guc.c                  |   14 +-
 src/backend/utils/misc/postgresql.conf.sample |    2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    2 +-
 src/include/miscadmin.h                       |    3 +-
 src/include/pgstat.h                          |  655 +-
 src/include/storage/lwlock.h                  |    1 +
 src/include/utils/guc_tables.h                |    2 +-
 src/include/utils/timeout.h                   |    1 +
 35 files changed, 2349 insertions(+), 4136 deletions(-)
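
The flush policy described in the commit message above can be sketched
as a small decision function. This is only an illustration of the
timing rules as stated there (10 second minimum interval, 60 second
limit before blocking); the names MIN_FLUSH_INTERVAL_MS,
MAX_FLUSH_INTERVAL_MS, FlushAction and choose_flush_action are
hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the values mirror the commit message. */
#define MIN_FLUSH_INTERVAL_MS 10000     /* do not flush more often than this */
#define MAX_FLUSH_INTERVAL_MS 60000     /* after this, wait for the locks */

typedef enum
{
    FLUSH_SKIP,                 /* too soon; keep accumulating locally */
    FLUSH_NOWAIT,               /* try the locks, postpone if any is busy */
    FLUSH_WAIT                  /* overdue; block on locks to avoid stale stats */
} FlushAction;

/*
 * Decide what to do at transaction end, given the current time and the
 * time of the last successful flush, both in milliseconds.
 */
static FlushAction
choose_flush_action(int64_t now_ms, int64_t last_flush_ms)
{
    int64_t     elapsed = now_ms - last_flush_ms;

    assert(elapsed >= 0);

    if (elapsed < MIN_FLUSH_INTERVAL_MS)
        return FLUSH_SKIP;
    if (elapsed < MAX_FLUSH_INTERVAL_MS)
        return FLUSH_NOWAIT;
    return FLUSH_WAIT;
}
```

A caller would treat FLUSH_NOWAIT as "attempt a conditional lock and
retry later on failure", and FLUSH_WAIT as "acquire the lock
unconditionally". How the retry itself is scheduled is left out of the
sketch.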

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dcaea7135f..49df584a9e 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1061,8 +1061,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
                  * our own.  In this case we should count and sample the row,
                  * to accommodate users who load a table and analyze it in one
                  * transaction.  (pgstat_report_analyze has to adjust the
-                 * numbers we send to the stats collector to make this come
-                 * out right.)
+                 * numbers we report to the activity stats facility to make
+                 * this come out right.)
                  */
                 if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
                 {
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 4f2f38168d..3cb6e20ed5 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -599,7 +599,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
                         new_min_multi,
                         false);
 
-    /* report results to the stats collector, too */
+    /* report results to the activity stats facility, too */
     pgstat_report_vacuum(RelationGetRelid(onerel),
                          onerel->rd_rel->relisshared,
                          Max(new_live_tuples, 0),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 79a77ebbfe..b40f85e635 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8577,9 +8577,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    CheckPointerStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    CheckPointerStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 0974f3e23a..9507fb8210 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1688,28 +1688,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     * The data will be flushed by the next pgstat_report_stat()
+     * call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Close relations */
     table_close(pg_class, RowExclusiveLock);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..a7e787d9d1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -644,10 +644,10 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
     }
 
     /*
-     * Report ANALYZE to the stats collector, too.  However, if doing
-     * inherited stats we shouldn't report, because the stats collector only
-     * tracks per-table stats.  Reset the changes_since_analyze counter only
-     * if we analyzed all columns; otherwise, there is still work for
+     * Report ANALYZE to the activity stats facility, too.  However, if doing
+     * inherited stats we shouldn't report, because the activity stats facility
+     * only tracks per-table stats.  Reset the changes_since_analyze counter
+     * only if we analyzed all columns; otherwise, there is still work for
      * auto-analyze to do.
      */
     if (!inh)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index f27c3fe8c1..de0c749570 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -971,7 +971,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
     DropDatabaseBuffers(db_id);
 
     /*
-     * Tell the stats collector to forget it immediately, too.
+     * Tell the activity stats facility to forget it immediately, too.
      */
     pgstat_drop_database(db_id);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index f80a9e96a9..e7ccc6eba7 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -336,10 +336,10 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
         refresh_by_heap_swap(matviewOid, OIDNewHeap, relpersistence);
 
         /*
-         * Inform stats collector about our activity: basically, we truncated
-         * the matview and inserted some new data.  (The concurrent code path
-         * above doesn't need to worry about this because the inserts and
-         * deletes it issues get counted by lower-level code.)
+         * Inform activity stats facility about our activity: basically, we
+         * truncated the matview and inserted some new data.  (The concurrent
+         * code path above doesn't need to worry about this because the inserts
+         * and deletes it issues get counted by lower-level code.)
          */
         pgstat_count_truncate(matviewRel);
         if (!stmt->skipData)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ddeec870d8..c5477ff567 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -319,8 +319,8 @@ vacuum(List *relations, VacuumParams *params,
                  errmsg("VACUUM option DISABLE_PAGE_SKIPPING cannot be used with FULL")));
 
     /*
-     * Send info about dead objects to the statistics collector, unless we are
-     * in autovacuum --- autovacuum.c does this for itself.
+     * Send info about dead objects to the activity statistics facility, unless
+     * we are in autovacuum --- autovacuum.c does this for itself.
      */
     if ((params->options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
         pgstat_vacuum_stat();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 2cef56f115..773b82be3b 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -338,9 +338,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
                                       BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
                                          TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-                                                      PgStat_StatDBEntry *shared,
-                                                      PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1680,12 +1677,12 @@ AutoVacWorkerMain(int argc, char *argv[])
         char        dbname[NAMEDATALEN];
 
         /*
-         * Report autovac startup to the stats collector.  We deliberately do
-         * this before InitPostgres, so that the last_autovac_time will get
-         * updated even if the connection attempt fails.  This is to prevent
-         * autovac from getting "stuck" repeatedly selecting an unopenable
-         * database, rather than making any progress on stuff it can connect
-         * to.
+         * Report autovac startup to the activity stats facility.  We
+         * deliberately do this before InitPostgres, so that the
+         * last_autovac_time will get updated even if the connection attempt
+         * fails.  This is to prevent autovac from getting "stuck" repeatedly
+         * selecting an unopenable database, rather than making any progress on
+         * stuff it can connect to.
          */
         pgstat_report_autovac(dbid);
 
@@ -1957,8 +1954,6 @@ do_autovacuum(void)
     HASHCTL        ctl;
     HTAB       *table_toast_map;
     ListCell   *volatile cell;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     BufferAccessStrategy bstrategy;
     ScanKeyData key;
     TupleDesc    pg_class_desc;
@@ -1977,17 +1972,11 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /*
-     * may be NULL if we couldn't find an entry (only happens if we are
-     * forcing a vacuum for anti-wrap purposes).
-     */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
 
     /*
-     * Clean up any dead statistics collector entries for this DB. We always
+     * Clean up any dead activity statistics entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
      * nothing worth vacuuming in the database.
      */
@@ -2030,9 +2019,6 @@ do_autovacuum(void)
     /* StartTransactionCommand changed elsewhere */
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
     classRel = table_open(RelationRelationId, AccessShareLock);
 
     /* create a copy so we can use it after closing pg_class */
@@ -2111,8 +2097,8 @@ do_autovacuum(void)
 
         /* Fetch reloptions and the pgstat entry for this table */
         relopts = extract_autovac_opts(tuple, pg_class_desc);
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                       relid);
 
         /* Check if it needs vacuum or analyze */
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2195,8 +2181,8 @@ do_autovacuum(void)
         }
 
         /* Fetch the pgstat entry for this table */
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                       relid);
 
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
@@ -2755,29 +2741,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
     return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-                          PgStat_StatDBEntry *dbentry)
-{
-    PgStat_StatTabEntry *tabentry = NULL;
-
-    if (isshared)
-    {
-        if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
-    }
-    else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
-
-    return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2798,17 +2761,12 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     bool        doanalyze;
     autovac_table *tab = NULL;
     PgStat_StatTabEntry *tabentry;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     bool        wraparound;
     AutoVacOpts *avopts;
 
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
     if (!HeapTupleIsValid(classTup))
@@ -2832,8 +2790,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     /* fetch the pgstat table entry */
-    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                         shared, dbentry);
+    tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                   relid);
 
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
@@ -2955,7 +2913,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * For analyze, the analysis done is that the number of tuples inserted,
  * deleted and updated since the last analyze exceeds a threshold calculated
- * in the same fashion as above.  Note that the collector actually stores
+ * in the same fashion as above.  Note that the activity statistics store
  * the number of tuples (both live and dead) that there were as of the last
  * analyze.  This is asymmetric to the VACUUM case.
  *
@@ -2965,8 +2923,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * A table whose autovacuum_enabled option is false is
  * automatically skipped (unless we have to vacuum it due to freeze_max_age).
- * Thus autovacuum can be disabled for specific tables. Also, when the stats
- * collector does not have data about a table, it will be skipped.
+ * Thus autovacuum can be disabled for specific tables. Also, when the activity
+ * statistics do not have data about a table, it will be skipped.
  *
  * A table whose vac_base_thresh value is < 0 takes the base value from the
  * autovacuum_vacuum_threshold GUC variable.  Similarly, a vac_scale_factor
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index a7afa758b6..b075e85839 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -244,9 +244,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Send off activity statistics to the activity stats facility
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 3e7dcd4f76..957537b6a2 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -360,7 +360,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            CheckPointerStats.requested_checkpoints++;
         }
 
         /*
@@ -374,7 +374,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                CheckPointerStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -495,14 +495,8 @@ CheckpointerMain(void)
         /* Check for archive_timeout and switch xlog files if necessary. */
         CheckArchiveTimeout();
 
-        /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
-         */
-        pgstat_send_bgwriter();
+        /* Send off activity statistics to the activity stats facility. */
+        pgstat_report_checkpointer();
 
         /*
          * If any checkpoint flags have been set, redo the loop to handle the
@@ -708,9 +702,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Report interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_checkpointer();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1255,8 +1249,10 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    CheckPointerStats.buf_written_backend
+        += CheckpointerShmem->num_backend_writes;
+    CheckPointerStats.buf_fsync_backend
+        += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c
index 3d02439b79..53844eb8bb 100644
--- a/src/backend/postmaster/interrupt.c
+++ b/src/backend/postmaster/interrupt.c
@@ -93,9 +93,8 @@ SignalHandlerForCrashExit(SIGNAL_ARGS)
  * shut down and exit.
  *
  * Typically, this handler would be used for SIGTERM, but some procesess use
- * other signals. In particular, the checkpointer exits on SIGUSR2, the
- * stats collector on SIGQUIT, and the WAL writer exits on either SIGINT
- * or SIGTERM.
+ * other signals. In particular, the checkpointer exits on SIGUSR2, and the WAL
+ * writer exits on either SIGINT or SIGTERM.
  *
  * ShutdownRequestPending should be checked at a convenient place within the
  * main loop, or else the main loop should call HandleMainLoopInterrupts.
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index e3a520def9..6d88c65d5f 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -372,20 +372,20 @@ pgarch_ArchiverCopyLoop(void)
                 pgarch_archiveDone(xlog);
 
                 /*
-                 * Tell the collector about the WAL file that we successfully
-                 * archived
+                 * Tell the activity statistics facility about the WAL file
+                 * that we successfully archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
             else
             {
                 /*
-                 * Tell the collector about the WAL file that we failed to
-                 * archive
+                 * Tell the activity statistics facility about the WAL file
+                 * that we failed to archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e6be2b7836..c7f0503d81 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, shared stats are updated no more
+ * often than once per PGSTAT_MIN_INTERVAL (10000ms). If some local numbers
+ * remain unflushed due to lock failure, retry at intervals that start at
+ * PGSTAT_RETRY_MIN_INTERVAL (1000ms) and double at every retry. Finally we
+ * force an update PGSTAT_MAX_INTERVAL (600000ms) after the first attempt.
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area then loads the stored stats file if any, and the last process at
+ *  shutdown writes the shared stats to the file then destroys the area.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,32 +35,25 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
+#include "common/hashfn.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/interrupt.h"
 #include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
+#include "storage/condition_variable.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
 #include "utils/timestamp.h"
 
@@ -73,35 +61,20 @@
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL            10000    /* Minimum interval of stats data
+                                             * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL    1000 /* Initial retry interval after
+                                          * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL            600000    /* Longest interval of stats data
+                                             * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -116,7 +89,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -131,16 +103,11 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used, but will be removed with GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
-/*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-
 /*
  * List of SLRU names that we keep stats for.  There is no central registry of
  * SLRUs, so we use this fixed list instead.  The "other" entry is used for
@@ -159,73 +126,216 @@ static const char *const slru_names[] = {
 
 #define SLRU_NUM_ELEMENTS    lengthof(slru_names)
 
+/* struct for shared SLRU stats */
+typedef struct PgStatSharedSLRUStats
+{
+    PgStat_SLRUStats        entry[SLRU_NUM_ELEMENTS];
+    LWLock                    lock;
+    pg_atomic_uint32        changecount;
+} PgStatSharedSLRUStats;
+
+StaticAssertDecl(sizeof(TimestampTz) == sizeof(pg_atomic_uint64),
+                 "size of pg_atomic_uint64 doesn't match TimestampTz");
+
+typedef struct StatsShmemStruct
+{
+    dsa_handle    stats_dsa_handle;    /* handle for stats data area */
+    dshash_table_handle hash_handle;    /* shared dbstat hash */
+    int            refcount;            /* # of processes attached to the
+                                     * shared stats memory */
+
+    /* stats */
+    PgStat_Archiver            archiver_stats;
+    PgStat_BgWriter            bgwriter_stats;
+    PgStat_CheckPointer        checkpointer_stats;
+    PgStatSharedSLRUStats    slru_stats;
+    pg_atomic_uint32        slru_changecount;
+    pg_atomic_uint64        stats_timestamp;
+
+    bool                    attach_holdoff;
+    ConditionVariable        holdoff_cv;
+
+    pg_atomic_uint64        gc_count; /* # of entries deleted. not
+                                            * protected by StatsLock */
+} StatsShmemStruct;
+
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
+
+/* CheckPointer global statistics counters */
+PgStat_CheckPointer CheckPointerStats = {0};
+
 /*
- * SLRU statistics counts waiting to be sent to the collector.  These are
- * stored directly in stats message format so they can be sent without needing
- * to copy things around.  We assume this variable inits to zeroes.  Entries
- * are one-to-one with slru_names[].
+ * SLRU statistics counts waiting to be written to the shared activity
+ * statistics.  We assume this variable inits to zeroes.  Entries are
+ * one-to-one with slru_names[].
+ * Changes of SLRU counters are reported within critical sections so we use
+ * static memory in order to avoid memory allocation.
  */
-static PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];
+static PgStat_SLRUStats local_SLRUStats[SLRU_NUM_ELEMENTS];
+static bool     have_slrustats = false;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storage */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
- *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * Types to define shared statistics structure.
+ *
+ * Per-object statistics are stored in a "shared stats" entry, a struct in
+ * DSA-allocated memory that begins with a header common to all object types.
+ * All shared stats are pointed to from a dshash via a dsa_pointer. This
+ * structure makes the shared stats immovable against dshash resizing, allows
+ * a backend to point to shared stats entries via a native pointer, and allows
+ * locking at the stats-entry level. The per-entry locking reduces lock
+ * contention compared to the partition lock of dshash. A backend accumulates
+ * stats numbers in a stats entry in local memory, then flushes the numbers to
+ * shared stats entries, basically at transaction end.
+ *
+ * Each stat entry type has a fixed member PgStat_StatEntryHeader as the first
+ * element.
+ *
+ * Shared stats are stored as:
+ *
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry                    (dshash entry)
+ *      (dsa_pointer)-> PgStat_Stat*Entry    (dsa memory block)
+ *
+ * Shared stats entries are directly pointed to from pgstat_localhash hash:
+ *
+ * pgstat_localhash pgStatEntHash
+ *    -> PgStatLocalHashEntry                (equivalent of PgStatHashEntry)
+ *      (native pointer)-> PgStat_Stat*Entry (dsa memory block)
+ *
+ * Local stats that are waiting to be flushed to shared stats are stored as:
+ *
+ * pgstat_localhash pgStatLocalHash
+ *    -> PgStatLocalHashEntry                 (local hash entry)
+ *      (native pointer)-> PgStat_Stat*Entry/TableStatus (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries */
+typedef enum PgStatTypes
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    PGSTAT_TYPE_DB,            /* database-wide statistics */
+    PGSTAT_TYPE_TABLE,        /* per-table statistics */
+    PGSTAT_TYPE_FUNCTION    /* per-function statistics */
+} PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * entry body size lookup table of shared statistics entries corresponding to
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static const size_t pgstat_sharedentsize[] =
 {
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_StatTabEntry),/* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_StatFuncEntry)/* PGSTAT_TYPE_FUNCTION */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static const size_t pgstat_localentsize[] =
+{
+    sizeof(PgStat_StatDBEntry),            /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_TableStatus),            /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */
+};
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * We should avoid overwriting the header part of a shared entry. Use these
+ * macros to know what portion of the struct is to be written or read.
+ * PGSTAT_SHENT_BODY returns a slightly smaller address than the actual
+ * address of the next member, but that doesn't matter.
  */
-static HTAB *pgStatFunctions = NULL;
+#define PGSTAT_SHENT_BODY(e) (((char *)(e)) + sizeof(PgStat_StatEntryHeader))
+#define PGSTAT_SHENT_BODY_LEN(t) \
+    (pgstat_sharedentsize[t] - sizeof(PgStat_StatEntryHeader))
+
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashKey
+{
+    PgStatTypes type;        /* statistics entry type */
+    Oid            databaseid;    /* database ID. InvalidOid for shared objects. */
+    Oid            objectid;    /* object ID, either table or function. */
+} PgStatHashKey;
+
+/* struct for shared statistics hash entry */
+typedef struct PgStatHashEntry
+{
+    PgStatHashKey    key; /* hash key */
+    dsa_pointer        body;/* pointer to shared stats in PgStat_StatEntryHeader */
+} PgStatHashEntry;
+
+/* struct for shared statistics local hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+    PgStatHashKey            key;    /* hash key */
+    char                    status;    /* for simplehash use */
+    PgStat_StatEntryHeader *body;    /* pointer to stats body in local heap */
+    dsa_pointer                dsapointer; /* dsa pointer of body */
+} PgStatLocalHashEntry;
+
+/* parameters for the shared hash */
+static const dshash_parameters dsh_params = {
+    sizeof(PgStatHashKey),
+    sizeof(PgStatHashEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+
+/* define hashtable for local hashes */
+#define SH_PREFIX pgstat_localhash
+#define SH_ELEMENT_TYPE PgStatLocalHashEntry
+#define SH_KEY_TYPE PgStatHashKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+    hash_bytes((unsigned char *)&key, sizeof(PgStatHashKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(PgStatHashKey)) == 0)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * The local cache to index shared stats entries.
+ *
+ * This is a local hash to store native pointers to shared hash
+ * entries. pgStatEntHashAge is copied from StatsShmem->gc_count at creation
+ * and garbage collection.
  */
-static bool have_function_stats = false;
+static pgstat_localhash_hash *pgStatEntHash = NULL;
+static int     pgStatEntHashAge = 0;        /* cache age of pgStatEntHash */
+
+/* Local stats numbers are stored here. */
+static pgstat_localhash_hash *pgStatLocalHash = NULL;
+
+/* entry type for oid hash */
+typedef struct pgstat_oident
+{
+    Oid oid;
+    char status;
+} pgstat_oident;
+
+/* Define hashtable for OID hashes. */
+StaticAssertDecl(sizeof(Oid) == 4, "oid is not compatible with uint32");
+#define SH_PREFIX pgstat_oid
+#define SH_ELEMENT_TYPE pgstat_oident
+#define SH_KEY_TYPE Oid
+#define SH_KEY oid
+#define SH_HASH_KEY(tb, key) hash_bytes_uint32(key)
+#define SH_EQUAL(tb, a, b) (a == b)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -262,11 +372,8 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -275,20 +382,9 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Make our own memory context to make it easy to track memory usage.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+MemoryContext    pgStatCacheContext = NULL;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -297,37 +393,52 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
+/* Simple caching feature for pgstatfuncs */
+static PgStatHashKey    stathashkey_zero = {0};
+static PgStatHashKey        cached_dbent_key = {0};
+static PgStat_StatDBEntry    cached_dbent;
+static PgStatHashKey        cached_tabent_key = {0};
+static PgStat_StatTabEntry    cached_tabent;
+static PgStatHashKey        cached_funcent_key = {0};
+static PgStat_StatFuncEntry    cached_funcent;
+
+static PgStat_Archiver        cached_archiverstats;
+static bool                    cached_archiverstats_is_valid = false;
+static PgStat_BgWriter        cached_bgwriterstats;
+static bool                    cached_bgwriterstats_is_valid = false;
+static PgStat_CheckPointer    cached_checkpointerstats;
+static bool                    cached_checkpointerstats_is_valid = false;
+static PgStat_SLRUStats        cached_slrustats;
+static bool                    cached_slrustats_is_valid = false;
 
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                                                 Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static void pgstat_write_statsfile(void);
+
+static void pgstat_read_statsfile(void);
 static void pgstat_read_current_status(void);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
+static PgStat_StatEntryHeader * get_stat_entry(PgStatTypes type, Oid dbid,
+                                               Oid objid, bool nowait,
+                                               bool create, bool *found);
 
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static void pgstat_send_slru(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
+static bool flush_tabstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_funcstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_dbstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_slrustat(bool nowait);
+static void delete_current_stats_entry(dshash_seq_status *hstat);
+static PgStat_StatEntryHeader * get_local_stat_entry(PgStatTypes type, Oid dbid,
+                                                     Oid objid, bool create,
+                                                     bool *found);
 
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
+static pgstat_oid_hash *collect_oids(Oid catalogid, AttrNumber anum_oid);
 
 static void pgstat_setup_memcxt(void);
 
@@ -337,486 +448,582 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
+    bool        found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(), &found);
+
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        ConditionVariableInit(&StatsShmem->holdoff_cv);
+        pg_atomic_init_u64(&StatsShmem->gc_count, 0);
+    }
+}
+
+/* ----------
+ * allow_next_attacher() -
+ *
+ *  Let other processes go ahead and attach the shared stats area.
+ * ----------
+ */
+static void
+allow_next_attacher(void)
+{
+    bool triggered = false;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->attach_holdoff)
+    {
+        StatsShmem->attach_holdoff = false;
+        triggered = true;
+    }
+    LWLockRelease(StatsLock);
+
+    if (triggered)
+        ConditionVariableBroadcast(&StatsShmem->holdoff_cv);
+}
+
+/* ----------
+ * attach_shared_stats() -
+ *
+ *    Attach or create shared stats memory. If we are the first process to use
+ *    the activity stats system, read the saved statistics file, if any.
+ * ---------
+ */
+static void
+attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
     /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
+     * Don't use dsm under postmaster, or when not tracking counts.
      */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    pgstat_setup_memcxt();
+
+    if (area)
+        return;
+
+    /* stats shared memory persists for the backend lifetime */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
     /*
-     * Create the UDP socket for sending and receiving statistic messages
+     * The first attacher backend may still be reading the stats file, or the
+     * last detacher may be writing it. Wait for the work to finish.
      */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
+    ConditionVariablePrepareToSleep(&StatsShmem->holdoff_cv);
+    for (;;)
     {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
+        bool hold_off;
+
+        LWLockAcquire(StatsLock, LW_SHARED);
+        hold_off = StatsShmem->attach_holdoff;
+        LWLockRelease(StatsLock);
+
+        if (!hold_off)
+            break;
+
+        ConditionVariableTimedSleep(&StatsShmem->holdoff_cv, 10,
+                                    WAIT_EVENT_READING_STATS_FILE);
     }
+    ConditionVariableCancelSleep();
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
     /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
+     * The last process is responsible for writing out stats files at exit.
+     * Maintain refcount so that an exiting process can tell whether it is the
+     * last one or not.
      */
-    for (addr = addrs; addr; addr = addr->ai_next)
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
     {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
+        /* We're the first process to attach the shared stats memory */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatSharedHash = dshash_create(area, &dsh_params, 0);
 
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
+        LWLockInitialize(&StatsShmem->slru_stats.lock, LWTRANCHE_STATS);
+        pg_atomic_init_u32(&StatsShmem->slru_stats.changecount, 0);
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        /* Block the next attacher for a while, see the comment above. */
+        StatsShmem->attach_holdoff = true;
 
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        StatsShmem->refcount = 1;
+    }
 
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    LWLockRelease(StatsLock);
 
+    if (area)
+    {
         /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
+         * We're the first attacher process, read stats file while blocking
+         * successors.
          */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
+        Assert(StatsShmem->attach_holdoff);
+        pgstat_read_statsfile();
+        allow_next_attacher();
     }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
+    else
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
+        /* We're not the first one, attach existing shared area. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatSharedHash = dshash_attach(area, &dsh_params,
+                                         StatsShmem->hash_handle, 0);
 
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
     }
 
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+    MemoryContextSwitchTo(oldcontext);
 
-    return;
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+}
 
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
+/* ----------
+ * cleanup_dropped_stats_entries() -
+ *              Clean up shared stats entries no longer used.
+ *
+ *  Shared stats entries for dropped objects may be left referenced. Clean up
+ *  our reference and drop the shared entry if needed.
+ * ----------
+ */
+static void
+cleanup_dropped_stats_entries(void)
+{
+    pgstat_localhash_iterator i;
+    PgStatLocalHashEntry   *ent;
 
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
+    if (pgStatEntHash == NULL)
+        return;
 
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
+    pgstat_localhash_start_iterate(pgStatEntHash, &i);
+    while ((ent = pgstat_localhash_iterate(pgStatEntHash, &i))
+           != NULL)
+    {
+        /*
+         * Free the shared memory chunk for the entry if we were the last
+         * referrer to a dropped entry.
+         */
+        if (pg_atomic_sub_fetch_u32(&ent->body->refcount, 1) < 1 &&
+            ent->body->dropped)
+            dsa_free(area, ent->dsapointer);
+    }
 
     /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
+     * This function is expected to be called during backend exit. So we don't
+     * bother destroying pgStatEntHash.
      */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    pgStatEntHash = NULL;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_file)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    bool is_last_detacher = false;
+
+    /* return immediately if there is nothing to detach */
+    if (!area || !IsUnderPostmaster)
+        return;
+
+    /* We shouldn't leave a reference to shared stats. */
+    Assert(pgStatEntHash == NULL);
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
+    /*
+     * If we are the last detacher, hold off the next attacher (if possible)
+     * until we finish writing stats file.
+     */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (--StatsShmem->refcount == 0)
     {
-        int            nchars;
-        Oid            tmp_oid;
+        StatsShmem->attach_holdoff = true;
+        is_last_detacher = true;
+    }
+    LWLockRelease(StatsLock);
 
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
-
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+    if (is_last_detacher)
+    {
+        if (write_file)
+            pgstat_write_statsfile();
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        /* allow the next attacher, if any */
+        allow_next_attacher();
     }
-    FreeDir(dir);
+
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatSharedHash = NULL;
+
+    /* We are going to exit. Don't bother destroying local hashes. */
+    pgStatLocalHash = NULL;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
+
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
+
+    /*
+     * We could directly remove files and recreate the shared memory area. But
+     * for simplicity, just discard the area and then create it anew.
+     */
+    detach_shared_stats(false);
+    attach_shared_stats();
 }
 
-#ifdef EXEC_BACKEND
 
 /*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
+ * fetch_lock_statentry - common helper function to fetch and lock a stats
+ * entry for flush_tabstat, flush_funcstat and flush_dbstat.
  */
-static pid_t
-pgstat_forkexec(void)
+static PgStat_StatEntryHeader *
+fetch_lock_statentry(PgStatTypes type, Oid dboid, Oid objid, bool nowait)
 {
-    char       *av[10];
-    int            ac = 0;
+    PgStat_StatEntryHeader *header;
+
+    /* find shared table stats entry corresponding to the local entry */
+    header = (PgStat_StatEntryHeader *)
+        get_stat_entry(type, dboid, objid, nowait, true, NULL);
 
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
+    /* skip if dshash failed to acquire lock */
+    if (header == NULL)
+        return NULL;
 
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&header->lock, LW_EXCLUSIVE))
+        return NULL;
 
-    return postmaster_forkexec(ac, av);
+    return header;
 }
-#endif                            /* EXEC_BACKEND */
 
 
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
+/* ----------
+ * get_stat_entry() -
  *
- *    Returns PID of child process, or 0 if fail.
+ *  Get the shared stats entry for the specified type, dbid and objid.
+ *  If nowait is true, returns NULL on lock failure.
  *
- *    Note: if fail, we will be called again from the postmaster main loop.
+ *  If create is true, a new entry is created if none exists yet. If found is
+ *  not NULL, it is set to true if an existing entry is found, or false if
+ *  not.
+ *  ----------
  */
-int
-pgstat_start(void)
+static PgStat_StatEntryHeader *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait, bool create,
+               bool *found)
 {
-    time_t        curtime;
-    pid_t        pgStatPid;
+    PgStatHashEntry           *shhashent;
+    PgStatLocalHashEntry   *lohashent;
+    PgStat_StatEntryHeader *shheader = NULL;
+    PgStatHashKey        key;
+    bool                    myfound;
 
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
+    key.type        = type;
+    key.databaseid     = dbid;
+    key.objectid    = objid;
 
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    if (pgStatEntHash)
+    {
+        uint64 currage;
 
-    /*
-     * Okay, fork off the collector.
-     */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
+        /*
+         * pgStatEntHashAge increments far more slowly than the following loop
+         * runs, so this is expected to iterate no more than twice.
+         */
+        while (unlikely
+               (pgStatEntHashAge !=
+                (currage = pg_atomic_read_u64(&StatsShmem->gc_count))))
+        {
+            pgstat_localhash_iterator i;
+
+            /*
+             * Some entries have been dropped. Invalidate cached pointers to
+             * them.
+             */
+            pgstat_localhash_start_iterate(pgStatEntHash, &i);
+            while ((lohashent = pgstat_localhash_iterate(pgStatEntHash, &i))
+                   != NULL)
+            {
+                PgStat_StatEntryHeader *header = lohashent->body;
+                if (header->dropped)
+                {
+                    pgstat_localhash_delete(pgStatEntHash, key);
+
+                    if (pg_atomic_sub_fetch_u32(&header->refcount, 1) < 1)
+                    {
+                        /*
+                         * We're the last referrer to this entry, drop the
+                         * shared entry.
+                         */
+                        dsa_free(area, lohashent->dsapointer);
+                    }
+                }
+            }
+
+            pgStatEntHashAge = currage;
+        }
+
+        lohashent = pgstat_localhash_lookup(pgStatEntHash, key);
+
+        if (lohashent)
+        {
+            if (found)
+                *found = true;
+            return lohashent->body;
+        }
+    }
+
+    shhashent = dshash_find_extended(pgStatSharedHash, &key,
+                                     create, nowait, create, &myfound);
+    if (shhashent)
     {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
+        if (create && !myfound)
+        {
+            /* Create new stats entry. */
+            dsa_pointer chunk = dsa_allocate0(area,
+                                              pgstat_sharedentsize[type]);
+
+            shheader = dsa_get_address(area, chunk);
+            LWLockInitialize(&shheader->lock, LWTRANCHE_STATS);
+            pg_atomic_init_u32(&shheader->refcount, 0);
+
+            /* Link the new entry from the hash entry. */
+            shhashent->body = chunk;
+        }
+        else
+            shheader = dsa_get_address(area, shhashent->body);
+
+        /*
+         * We expose this shared entry now.  You might think that the entry can
+         * be removed by a concurrent backend, but since we are creating a
+         * stats entry, the object actually exists and is used in the upper
+         * layer. Such an object cannot be dropped until the first vacuum after
+         * the current transaction ends.
+         */
+        dshash_release_lock(pgStatSharedHash, shhashent);
+
+        /* register to local hash if possible */
+        if (pgStatEntHash || pgStatCacheContext)
+        {
+            if (pgStatEntHash == NULL)
+            {
+                pgStatEntHash =
+                    pgstat_localhash_create(pgStatCacheContext,
+                                        PGSTAT_TABLE_HASH_SIZE, NULL);
+                pgStatEntHashAge =
+                    pg_atomic_read_u64(&StatsShmem->gc_count);
+            }
+
+            lohashent =
+                pgstat_localhash_insert(pgStatEntHash, key, &myfound);
+
+            Assert(!myfound);
+            lohashent->body = shheader;
+            lohashent->dsapointer = shhashent->body;
+
+            pg_atomic_add_fetch_u32(&shheader->refcount, 1);
+        }
+    }
+
+    if (found)
+        *found = myfound;
+
+    return shheader;
+}
 
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
+/*
+ * flush_slrustat - flush out local SLRU stats entries
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true. Writer processes are mutually excluded
+ * using an LWLock, but readers are expected to use the change-count protocol.
+ *
+ * Returns true if all local SLRU stats are successfully flushed out.
+ */
+static bool
+flush_slrustat(bool nowait)
+{
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
+    int i;
+
+    if (!have_slrustats)
+        return true;
 
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&StatsShmem->slru_stats.lock,
+                                       LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
 
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
+    assert_changecount =
+        pg_atomic_fetch_add_u32(&StatsShmem->slru_stats.changecount, 1);
+    Assert((assert_changecount & 1) == 0);
 
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
+    for (i = 0 ; i < SLRU_NUM_ELEMENTS ; i++)
+    {
+        PgStat_SLRUStats *sharedent = &StatsShmem->slru_stats.entry[i];
+        PgStat_SLRUStats *localent = &local_SLRUStats[i];
 
-        default:
-            return (int) pgStatPid;
+        sharedent->blocks_zeroed += localent->blocks_zeroed;
+        sharedent->blocks_hit += localent->blocks_hit;
+        sharedent->blocks_read += localent->blocks_read;
+        sharedent->blocks_written += localent->blocks_written;
+        sharedent->blocks_exists += localent->blocks_exists;
+        sharedent->flush += localent->flush;
+        sharedent->truncate += localent->truncate;
     }
 
-    /* shouldn't get here */
-    return 0;
+    /* done, clear the local entry */
+    MemSet(local_SLRUStats, 0,
+           sizeof(PgStat_SLRUStats) * SLRU_NUM_ELEMENTS);
+
+    pg_atomic_add_fetch_u32(&StatsShmem->slru_stats.changecount, 1);
+    LWLockRelease(&StatsShmem->slru_stats.lock);
+
+    have_slrustats = false;
+
+    return true;
 }
 
-void
-allow_immediate_pgstat_restart(void)
+/* ----------
+ * delete_current_stats_entry()
+ *
+ *  Deletes the given shared entry from shared stats hash. The entry must be
+ *  exclusively locked.
+ * ----------
+ */
+static void
+delete_current_stats_entry(dshash_seq_status *hstat)
 {
-    last_pgstat_start_time = 0;
+    dsa_pointer                pdsa;
+    PgStat_StatEntryHeader *header;
+    PgStatHashEntry *ent;
+
+    ent = dshash_get_current(hstat);
+    pdsa = ent->body;
+    header = dsa_get_address(area, pdsa);
+
+    /* No one can find this entry from now on. */
+    dshash_delete_current(hstat);
+
+    /*
+     * Let the referrers drop the entry if any.  Refcount won't be decremented
+     * until "dropped" is set true and StatsShmem->gc_count is incremented
+     * later. So we can check refcount to set dropped without holding a
+     * lock. If no one is referring to this entry, free it immediately.
+     */
+
+    if (pg_atomic_read_u32(&header->refcount) > 0)
+        header->dropped = true;
+    else
+        dsa_free(area, pdsa);
+
+    return;
+}
+
+
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns the local stats entry for the type, dbid and objid.
+ *
+ *  If create is true, a new entry is created if none exists yet; found must
+ *  be non-NULL in that case.
+ *
+ *  The caller is responsible for initializing the statsbody part of the
+ *  returned memory.
+ * ----------
+ */
+static PgStat_StatEntryHeader *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                     bool create, bool *found)
+{
+    PgStatHashKey key;
+    PgStatLocalHashEntry *entry;
+
+    if (pgStatLocalHash == NULL)
+        pgStatLocalHash = pgstat_localhash_create(pgStatCacheContext,
+                                                  PGSTAT_TABLE_HASH_SIZE, NULL);
+
+    /* Find an entry or create a new one. */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    if (create)
+        entry = pgstat_localhash_insert(pgStatLocalHash, key, found);
+    else
+        entry = pgstat_localhash_lookup(pgStatLocalHash, key);
+
+    if (!create && !entry)
+        return NULL;
+
+    if (create && !*found)
+        entry->body = MemoryContextAllocZero(TopMemoryContext,
+                                             pgstat_localentsize[type]);
+
+    return entry->body;
 }
 
 /* ------------------------------------------------------------
@@ -824,147 +1031,386 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the per-table and function
+ *    usage statistics collected so far to the shared statistics hashes.
+ *
+ *    Updates are applied not more frequent than the interval of
+ *    PGSTAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ *    failure if force is false and there's no pending updates longer than
+ *    PGSTAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *    succeeding calls of this function.
+ *
+ *    Returns the time in milliseconds until the next attempt to apply
+ *    updates if any updates remain pending after this call, or 0 if nothing
+ *    is left pending.
+ *
+ *    Note that this is called only outside a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ * ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
+    static long retry_interval = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
+    bool        nowait;
     int            i;
+    uint64        oldval;
+
+#define HAVE_ANY_PENDING_STATS()                        \
+    (pgStatLocalHash != NULL ||                            \
+     pgStatXactCommit != 0 || pgStatXactRollback != 0)
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (area == NULL || !HAVE_ANY_PENDING_STATS())
+        return 0;
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
-
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
-
-    /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
-     */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+
+    if (!force)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        /*
+         * Don't flush stats too frequently.  Return the time to the next
+         * flush.
+         */
+        if (now < next_flush)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
-
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
-
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
+            /* Remember when the updates first became pending. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            return (next_flush - now) / 1000;
+        }
+
+        /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+        if (pending_since > 0 &&
+            TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+            force = true;
+    }
+
+    /* don't wait for lock acquisition when !force */
+    nowait = !force;
+
+    if (pgStatLocalHash)
+    {
+        pgstat_localhash_iterator i;
+        List                   *dbentlist = NIL;
+        PgStatLocalHashEntry   *lent;
+        ListCell               *lc;
+        int                        remains = 0;
+
+        /* Step 1: flush out everything other than database stats */
+        pgstat_localhash_start_iterate(pgStatLocalHash, &i);
+        while ((lent = pgstat_localhash_iterate(pgStatLocalHash, &i)) != NULL)
+        {
+            bool        remove = false;
+
+            switch (lent->key.type)
+            {
+                case PGSTAT_TYPE_DB:
+                    /*
+                     * flush_tabstat adds some of the counters of flushed
+                     * entries to the local database stats. Just remember the
+                     * database entries for now and flush them out later.
+                     */
+                    dbentlist = lappend(dbentlist, lent);
+                    break;
+                case PGSTAT_TYPE_TABLE:
+                    if (flush_tabstat(lent, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_FUNCTION:
+                    if (flush_funcstat(lent, nowait))
+                        remove = true;
+                    break;
+            }
+
+            if (!remove)
+            {
+                remains++;
                 continue;
+            }
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Remove the successfully flushed entry */
+            pfree(lent->body);
+            lent->body = NULL;
+            pgstat_localhash_delete(pgStatLocalHash, lent->key);
+        }
+
+        /* Step 2: flush out database stats */
+        foreach(lc, dbentlist)
+        {
+            PgStatLocalHashEntry *lent = (PgStatLocalHashEntry *) lfirst(lc);
+
+            if (flush_dbstat(lent, nowait))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                remains--;
+                /* Remove the successfully flushed entry */
+                pfree(lent->body);
+                lent->body = NULL;
+                pgstat_localhash_delete(pgStatLocalHash, lent->key);
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+        list_free(dbentlist);
+        dbentlist = NIL;
+
+        if (remains <= 0)
+        {
+            pgstat_localhash_destroy(pgStatLocalHash);
+            pgStatLocalHash = NULL;
+        }
+    }
+
+    /* flush SLRU stats */
+    flush_slrustat(nowait);
+
+    /*
+     * Publish the time of the last flush, but don't notify anyone of the
+     * change itself. Readers will see a sufficiently recent timestamp.
+     * If we fail to update the value, a concurrent process must have
+     * updated it to a sufficiently recent time.
+     *
+     * XXX: The loop might be unnecessary for the reason above.
+     */
+    oldval = pg_atomic_read_u64(&StatsShmem->stats_timestamp);
+
+    for (i = 0; i < 10; i++)
+    {
+        if (oldval >= now ||
+            pg_atomic_compare_exchange_u64(&StatsShmem->stats_timestamp,
+                                           &oldval, (uint64) now))
+            break;
     }
 
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * Some of the local stats may not have been flushed due to lock
+     * contention.  If we have such pending local stats here, let the caller
+     * know the retry interval.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    if (HAVE_ANY_PENDING_STATS())
+    {
+        /* Remember when the updates first became pending */
+        if (pending_since == 0)
+            pending_since = now;
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+        /* The interval is doubled at every retry. */
+        if (retry_interval == 0)
+            retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+        else
+            retry_interval = retry_interval * 2;
 
-    /* Finally send SLRU statistics */
-    pgstat_send_slru();
+        /*
+         * Schedule the next retry, merging a too-short final interval into
+         * this one so that the wait ends exactly at PGSTAT_MAX_INTERVAL.
+         */
+        if (!TimestampDifferenceExceeds(pending_since,
+                                        now + 2 * retry_interval,
+                                        PGSTAT_MAX_INTERVAL))
+            next_flush = now + retry_interval;
+        else
+        {
+            next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+            retry_interval = next_flush - now;
+        }
+
+        return retry_interval / 1000;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+    retry_interval = 0;
+    pending_since = 0;
+
+    return 0;
 }
 
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the stats numbers are copied to local database stats entry after
+ * successful flush-out.
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
  */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+static bool
+flush_tabstat(PgStatLocalHashEntry *ent, bool nowait)
 {
-    int            n;
-    int            len;
+    static const PgStat_TableCounts all_zeroes;
+    Oid                    dboid;            /* database OID of the table */
+    PgStat_TableStatus *lstats;            /* local stats entry  */
+    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
+    PgStat_StatDBEntry *ldbstats;        /* local database entry */
+
+    Assert(ent->key.type == PGSTAT_TYPE_TABLE);
+    lstats = (PgStat_TableStatus *) ent->body;
+    dboid = ent->key.databaseid;
+
+    /*
+     * Ignore entries that didn't accumulate any actual counts, such as
+     * indexes that were opened by the planner but not used.
+     */
+    if (memcmp(&lstats->t_counts, &all_zeroes,
+               sizeof(PgStat_TableCounts)) == 0)
+    {
+        /* This local entry is going to be dropped, delink from relcache. */
+        pgstat_delinkstats(lstats->relation);
+        return true;
+    }
+
+    /* find shared table stats entry corresponding to the local entry */
+    shtabstats = (PgStat_StatTabEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_TABLE, dboid, ent->key.objectid,
+                             nowait);
+
+    if (shtabstats == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    /* add the values to the shared entry. */
+    shtabstats->numscans += lstats->t_counts.t_numscans;
+    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+    /*
+     * If the table was truncated, first reset the live/dead counters before
+     * applying the deltas.
+     */
+    if (lstats->t_counts.t_truncated)
+    {
+        shtabstats->n_live_tuples = 0;
+        shtabstats->n_dead_tuples = 0;
+    }
+
+    shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+    shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+    shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+    shtabstats->inserts_since_vacuum += lstats->t_counts.t_tuples_inserted;
+    shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+    LWLockRelease(&shtabstats->header.lock);
+
+    /* Successfully flushed; add the same counts to the database stats */
+    ldbstats = get_local_dbstat_entry(dboid);
+    ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+    ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+    ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /* This local entry is going to be dropped, delink from relcache. */
+    pgstat_delinkstats(lstats->relation);
+
+    return true;
+}
+
+/*
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
+    PgStat_StatFuncEntry *shfuncent = NULL; /* shared stats entry */
+
+    Assert(ent->key.type == PGSTAT_TYPE_FUNCTION);
+    localent = (PgStat_BackendFunctionEntry *) ent->body;
+
+    /* localent always has non-zero content */
+
+    /* find shared function stats entry corresponding to the local entry */
+    shfuncent = (PgStat_StatFuncEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             ent->key.objectid, nowait);
+    if (shfuncent == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    shfuncent->f_numcalls += localent->f_counts.f_numcalls;
+    shfuncent->f_total_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+    shfuncent->f_self_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
+
+    LWLockRelease(&shfuncent->header.lock);
+
+    return true;
+}
+
+
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_dbstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    PgStat_StatDBEntry *localent;
+    PgStat_StatDBEntry *sharedent;
+
+    Assert(ent->key.type == PGSTAT_TYPE_DB);
+
+    localent = (PgStat_StatDBEntry *) ent->body;
+
+    /* find shared database stats entry corresponding to the local entry */
+    sharedent = (PgStat_StatDBEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_DB, ent->key.databaseid, InvalidOid,
+                             nowait);
+
+    if (!sharedent)
+        return false;            /* failed to acquire lock, skip */
+
+    sharedent->counts.n_tuples_returned += localent->counts.n_tuples_returned;
+    sharedent->counts.n_tuples_fetched += localent->counts.n_tuples_fetched;
+    sharedent->counts.n_tuples_inserted += localent->counts.n_tuples_inserted;
+    sharedent->counts.n_tuples_updated += localent->counts.n_tuples_updated;
+    sharedent->counts.n_tuples_deleted += localent->counts.n_tuples_deleted;
+    sharedent->counts.n_blocks_fetched += localent->counts.n_blocks_fetched;
+    sharedent->counts.n_blocks_hit += localent->counts.n_blocks_hit;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    sharedent->counts.n_deadlocks += localent->counts.n_deadlocks;
+    sharedent->counts.n_temp_bytes += localent->counts.n_temp_bytes;
+    sharedent->counts.n_temp_files += localent->counts.n_temp_files;
+    sharedent->counts.n_checksum_failures += localent->counts.n_checksum_failures;
 
     /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
+     * Accumulate xact commit/rollback and I/O timings to stats entry of the
+     * current database.
      */
-    if (OidIsValid(tsmsg->m_databaseid))
+    if (OidIsValid(ent->key.databaseid))
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        sharedent->counts.n_xact_commit += pgStatXactCommit;
+        sharedent->counts.n_xact_rollback += pgStatXactRollback;
+        sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+        sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
@@ -972,282 +1418,130 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        sharedent->counts.n_xact_commit = 0;
+        sharedent->counts.n_xact_rollback = 0;
+        sharedent->counts.n_block_read_time = 0;
+        sharedent->counts.n_block_write_time = 0;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    LWLockRelease(&sharedent->header.lock);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return true;
 }
 
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
- */
-static void
-pgstat_send_funcstats(void)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
-
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
-}
-
-
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  To avoid holding exclusive lock on dshash for a long time, the process is
+ *  performed in three steps.
+ *
+ *   1: Collect existent oids of every kind of object.
+ *   2: Collect victim entries by scanning with shared lock.
+ *   3: Try removing every nominated entry without waiting for lock.
+ *
+ *  As a consequence of the last step, some entries may be left behind due to
+ *  lock failure, but they will be deleted by later invocations of this
+ *  function.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    pgstat_oid_hash       *dbids;    /* database ids */
+    pgstat_oid_hash       *relids;    /* relation ids in the current database */
+    pgstat_oid_hash       *funcids;    /* function ids in the current database */
+    int            nvictims = 0;    /* # of entries removed */
+    dshash_seq_status dshstat;
+    PgStatHashEntry *ent;
+
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* collect oids of existent objects */
+    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
 
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    nvictims = 0;
 
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    /* some of the dshash entries are to be removed, take exclusive lock. */
+    dshash_seq_init(&dshstat, pgStatSharedHash, true);
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
     {
-        Oid            dbid = dbentry->databaseid;
+        pgstat_oid_hash *oidhash;
+        Oid           key;
 
         CHECK_FOR_INTERRUPTS();
 
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
-        return;
-
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
-
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
-            continue;
-
         /*
-         * Not there, so add this table's Oid to the message
+         * Don't drop non-database entries that belong to databases other
+         * than the current one.
          */
-        msg.m_tableid[msg.m_nentries++] = tabid;
+        if (ent->key.type != PGSTAT_TYPE_DB &&
+            ent->key.databaseid != MyDatabaseId)
+            continue;
 
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
+        switch (ent->key.type)
         {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
+            case PGSTAT_TYPE_DB:
+                /* don't remove database entry for shared tables */
+                if (ent->key.databaseid == 0)
+                    continue;
+                oidhash = dbids;
+                key = ent->key.databaseid;
+                break;
+
+            case PGSTAT_TYPE_TABLE:
+                oidhash = relids;
+                key = ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                oidhash = funcids;
+                key = ent->key.objectid;
+                break;
         }
-    }
 
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
+        /* Skip existent objects. */
+        if (pgstat_oid_lookup(oidhash, key) != NULL)
+            continue;
 
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
+        /* drop this entry */
+        delete_current_stats_entry(&dshstat);
+        nvictims++;
     }
+    dshash_seq_term(&dshstat);
+    pgstat_oid_destroy(dbids);
+    pgstat_oid_destroy(relids);
+    pgstat_oid_destroy(funcids);
 
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
-    }
+    if (nvictims > 0)
+        pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);
 }
 
-
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
- *    into a temporary hash table.  Caller should hash_destroy the result
+ *    into a temporary hash table.  Caller should pgstat_oid_destroy the result
  *    when done with it.  (However, we make the table in CurrentMemoryContext
  *    so that it will be freed properly in event of an error.)
  * ----------
  */
-static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+static pgstat_oid_hash *
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
-    HTAB       *htab;
-    HASHCTL        hash_ctl;
+    pgstat_oid_hash *rethash;
     Relation    rel;
     TableScanDesc scan;
     HeapTuple    tup;
     Snapshot    snapshot;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(Oid);
-    hash_ctl.hcxt = CurrentMemoryContext;
-    htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
-                       &hash_ctl,
-                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    rethash = pgstat_oid_create(CurrentMemoryContext,
+                                PGSTAT_TABLE_HASH_SIZE, NULL);
 
     rel = table_open(catalogid, AccessShareLock);
     snapshot = RegisterSnapshot(GetLatestSnapshot());
@@ -1256,81 +1550,61 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     {
         Oid            thisoid;
         bool        isnull;
+        bool        found;
 
         thisoid = heap_getattr(tup, anum_oid, RelationGetDescr(rel), &isnull);
         Assert(!isnull);
 
         CHECK_FOR_INTERRUPTS();
 
-        (void) hash_search(htab, (void *) &thisoid, HASH_ENTER, NULL);
+        pgstat_oid_insert(rethash, thisoid, &found);
     }
     table_endscan(scan);
     UnregisterSnapshot(snapshot);
     table_close(rel, AccessShareLock);
 
-    return htab;
+    return rethash;
 }
 
-
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *    Remove entries for the database that we just dropped.
+ *
+ *    Some entries might be left behind due to lock failure, or some stats
+ *    might be flushed after this, but we will still clean the dead DB
+ *    eventually via future invocations of pgstat_vacuum_stat().
+ * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
 
+    Assert(OidIsValid(databaseid));
 
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
+    /* some of the dshash entries are to be removed, take exclusive lock. */
+    dshash_seq_init(&hstat, pgStatSharedHash, true);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
+        if (p->key.databaseid == databaseid)
+            delete_current_stats_entry(&hstat);
+    }
+    dshash_seq_term(&hstat);
+
+    /* Let readers run a garbage collection of local hashes */
+    pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);
 }
-#endif                            /* NOT_USED */
 
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1339,20 +1613,46 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /* dshash entry is not modified, take shared lock */
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
+        PgStat_StatEntryHeader *header;
+
+        if (p->key.databaseid != MyDatabaseId)
+            continue;
+
+        header = dsa_get_address(area, p->body);
+
+        LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+        memset(PGSTAT_SHENT_BODY(header), 0,
+               PGSTAT_SHENT_BODY_LEN(p->key.type));
+
+        if (p->key.type == PGSTAT_TYPE_DB)
+        {
+            PgStat_StatDBEntry *dbstat = (PgStat_StatDBEntry *) header;
+            dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+        }
+        LWLockRelease(&header->lock);
+    }
+    dshash_seq_term(&hstat);
+
+    /* Invalidate the simple cache keys */
+    cached_dbent_key = stathashkey_zero;
+    cached_tabent_key = stathashkey_zero;
+    cached_funcent_key = stathashkey_zero;
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1361,29 +1661,42 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
+    TimestampTz now = GetCurrentTimestamp();
+    bool        is_archiver;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    is_archiver = (strcmp(target, "archiver") == 0);
 
-    if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
-    else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
-    else
+    if (!is_archiver && strcmp(target, "bgwriter") != 0)
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset statistics for the cluster. */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    if (is_archiver)
+    {
+        MemSet(&StatsShmem->archiver_stats, 0, sizeof(PgStat_Archiver));
+        StatsShmem->archiver_stats.stat_reset_timestamp = now;
+        cached_archiverstats_is_valid = false;
+    }
+    else
+    {
+        MemSet(&StatsShmem->bgwriter_stats, 0, sizeof(PgStat_BgWriter));
+        MemSet(&StatsShmem->checkpointer_stats, 0, sizeof(PgStat_CheckPointer));
+        StatsShmem->bgwriter_stats.stat_reset_timestamp = now;
+        cached_bgwriterstats_is_valid = false;
+        cached_checkpointerstats_is_valid = false;
+    }
+
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1392,17 +1705,37 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatEntryHeader *header;
+    PgStat_StatDBEntry *dbentry;
+    PgStatTypes stattype;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid, false, false,
+                       NULL);
+    Assert(dbentry);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->header.lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset the object's counters if it exists, ignore if not */
+    switch (type)
+    {
+        case RESET_TABLE:
+            stattype = PGSTAT_TYPE_TABLE;
+            break;
+        case RESET_FUNCTION:
+            stattype = PGSTAT_TYPE_FUNCTION;
+            break;
+    }
+
+    header = get_stat_entry(stattype, MyDatabaseId, objoid, false, false, NULL);
+
+    /* nothing to reset if the entry doesn't exist */
+    if (header == NULL)
+        return;
+
+    LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    memset(PGSTAT_SHENT_BODY(header), 0, PGSTAT_SHENT_BODY_LEN(stattype));
+    LWLockRelease(&header->lock);
 }
 
 /* ----------
@@ -1418,15 +1751,40 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_reset_slru_counter(const char *name)
 {
-    PgStat_MsgResetslrucounter msg;
+    int i;
+    TimestampTz    ts = GetCurrentTimestamp();
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    if (name)
+    {
+        i = pgstat_slru_index(name);
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+        assert_changecount =
+            pg_atomic_fetch_add_u32(&StatsShmem->slru_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        MemSet(&StatsShmem->slru_stats.entry[i], 0,
+               sizeof(PgStat_SLRUStats));
+        StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        pg_atomic_add_fetch_u32(&StatsShmem->slru_changecount, 1);
+        LWLockRelease(&StatsShmem->slru_stats.lock);
+    }
+    else
+    {
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+        assert_changecount =
+            pg_atomic_fetch_add_u32(&StatsShmem->slru_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
+        {
+            MemSet(&StatsShmem->slru_stats.entry[i], 0,
+                   sizeof(PgStat_SLRUStats));
+            StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        }
+        pg_atomic_add_fetch_u32(&StatsShmem->slru_changecount, 1);
+        LWLockRelease(&StatsShmem->slru_stats.lock);
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSLRUCOUNTER);
-    msg.m_index = (name) ? pgstat_slru_index(name) : -1;
-
-    pgstat_send(&msg, sizeof(msg));
+    cached_slrustats_is_valid = false;
 }
 
 /* ----------
@@ -1440,48 +1798,93 @@ pgstat_reset_slru_counter(const char *name)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if activity stats is not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * End-of-vacuum is reported instantly. Report the start the same way for
+     * consistency. Vacuum doesn't run frequently and is a long-lasting
+     * operation so it doesn't matter if we get blocked here a little.
+     */
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dboid, InvalidOid, false, true, NULL);
 
-    pgstat_send(&msg, sizeof(msg));
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = ts;
+    LWLockRelease(&dbentry->header.lock);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    PgStat_StatTabEntry *tabentry;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    ts = GetCurrentTimestamp();
+
+    /*
+     * Unlike ordinary operations, maintenance commands take a long time, so
+     * getting blocked at the end of the work doesn't matter. Furthermore,
+     * this prevents the stats updates made by transactions that end after
+     * this vacuum from being canceled by a delayed vacuum report. For these
+     * reasons, update the shared stats entry directly.
+     */
+    tabentry = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, tableoid, false, true, NULL);
+
+    LWLockAcquire(&tabentry->header.lock, LW_EXCLUSIVE);
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * It is quite possible that a non-aggressive VACUUM ended up skipping
+     * various pages, however, we'll zero the insert counter here regardless.
+     * It's currently used only to track when we need to perform an "insert"
+     * autovacuum, which are mainly intended to freeze newly inserted tuples.
+     * Zeroing this may just mean we'll not try to vacuum the table again
+     * until enough tuples have been inserted to trigger another insert
+     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
+     * stragglers.
+     */
+    tabentry->inserts_since_vacuum = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->vacuum_count++;
+    }
+
+    LWLockRelease(&tabentry->header.lock);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1492,9 +1895,11 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    PgStat_StatTabEntry *tabentry;
+    Oid        dboid = (rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId);
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1502,10 +1907,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what is already written to shared stats now, else
+     * they'll be double-counted after commit.  (This approach also ensures
+     * that the shared stats end up with the right numbers if we abort
+     * instead of committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1523,154 +1928,176 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Unlike ordinary operations, maintenance commands take a long time, so
+     * getting blocked at the end of the work doesn't matter. Furthermore,
+     * this prevents the stats updates made by transactions that end after
+     * this analyze from being canceled by a delayed analyze report. For these
+     * reasons, update the shared stats entry directly.
+     */
+    tabentry = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, RelationGetRelid(rel),
+                       false, true, NULL);
+
+    LWLockAcquire(&tabentry->header.lock, LW_EXCLUSIVE);
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    LWLockRelease(&tabentry->header.lock);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbent->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbent->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbent->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbent->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbent->counts.n_conflict_startup_deadlock++;
+            break;
+    }
 }
 
+
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_deadlocks++;
 }
 
-
-
 /* --------
- * pgstat_report_checksum_failures_in_db() -
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- *    Tell the collector about one or more checksum failures.
+ *    Report one or more checksum failures.
  * --------
  */
 void
 pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgChecksumFailure msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
+    dbentry = get_local_dbstat_entry(dboid);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* add the parameter to the accumulated count */
+    dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Report a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
-    pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+    PgStat_StatDBEntry *dbent;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (filesize == 0)            /* Is there a case where filesize is really 0? */
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_temp_bytes += filesize; /* needs overflow check */
+    dbent->counts.n_temp_files++;
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- *    Notify collector that we need fresh data.
+ *  Initialize function call usage data.
+ *  Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
                            PgStat_FunctionCallUsage *fcu)
@@ -1685,25 +2112,9 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         return;
     }
 
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
-
-        memset(&hash_ctl, 0, sizeof(hash_ctl));
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
-    if (!found)
-        MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
+    htabent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             fcinfo->flinfo->fn_oid, true, &found);
 
     fcu->fs = &htabent->f_counts;
 
@@ -1717,31 +2128,37 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
     INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  find any existing PgStat_BackendFunctionEntry entry for specified function
+ *
+ *  If there is no entry, return NULL without creating a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
-    if (pgStatFunctions == NULL)
-        return NULL;
+    PgStat_BackendFunctionEntry *ent;
 
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
+    ent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             func_id, false, NULL);
+
+    return ent;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1782,9 +2199,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1796,8 +2210,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1813,7 +2226,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1824,121 +2238,60 @@ pgstat_initstats(Relation rel)
      * If we already set up this relation in the current transaction, nothing
      * to do.
      */
-    if (rel->pgstat_info != NULL &&
-        rel->pgstat_info->t_id == rel_id)
+    if (rel->pgstat_info != NULL)
         return;
 
     /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+
+    /* don't allow linking a stats entry to multiple relcache entries */
+    Assert(rel->pgstat_info->relation == NULL);
+    /* mark this relation as the owner */
+    rel->pgstat_info->relation = rel;
 }
 
 /*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+ * pgstat_delinkstats() -
+ *
+ *  Break the mutual link between a relcache entry and a local stats entry.
+ *  This must always be called when one end of the link is removed.
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+void
+pgstat_delinkstats(Relation rel)
 {
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
-
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
+    /* remove the link to stats info if any */
+    if (rel && rel->pgstat_info)
     {
-        HASHCTL        ctl;
-
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        /* link sanity check */
+        Assert(rel->pgstat_info->relation == rel);
+        rel->pgstat_info->relation = NULL;
+        rel->pgstat_info = NULL;
     }
-
-    /*
-     * Find an entry or create a new one.
-     */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
-    if (!found)
-    {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
-    }
-
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
-
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
-
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
-
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
-
-    return entry;
 }
 
 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_TableStatus entry for rel_id in the current
+ *  database. If not found, try finding from shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry is found, return NULL; don't create a new one.
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
+    PgStat_TableStatus *ent;
 
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
+    ent = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+                             false, NULL);
+    if (!ent)
+        ent = (PgStat_TableStatus *)
+            get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+                                 false, NULL);
 
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return ent;
 }
 
 /*
@@ -2353,8 +2706,6 @@ AtPrepare_PgStat(void)
             record.inserted_pre_trunc = trans->inserted_pre_trunc;
             record.updated_pre_trunc = trans->updated_pre_trunc;
             record.deleted_pre_trunc = trans->deleted_pre_trunc;
-            record.t_id = tabstat->t_id;
-            record.t_shared = tabstat->t_shared;
             record.t_truncated = trans->truncated;
 
             RegisterTwoPhaseRecord(TWOPHASE_RM_PGSTAT_ID, 0,
@@ -2369,8 +2720,8 @@ AtPrepare_PgStat(void)
  *
  * All we need do here is unlink the transaction stats state from the
  * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
+ * reported to the activity stats facility immediately, while the effects on
+ * live and dead tuple counts are preserved in the 2PC state file.
  *
  * Note: AtEOXact_PgStat is not called during PREPARE.
  */
@@ -2415,7 +2766,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, commit case */
     pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2451,7 +2802,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, abort case */
     if (rec->t_truncated)
@@ -2471,85 +2822,138 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    Find the database stats entry on backends.
+ *
+ *    The returned entry is stored in static memory, so its content is valid
+ *    until the next call of this function for a different database.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    PgStat_StatDBEntry *shent;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(dbid != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_dbent_key.databaseid == dbid)
+        return &cached_dbent;
+
+    shent = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_dbent, shent, sizeof(PgStat_StatDBEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_dbent_key.databaseid = dbid;
+
+    return &cached_dbent;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facilities, so the caller is better off to
+ *    report ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_extended(false, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
+    tabentry = pgstat_fetch_stat_tabentry_extended(true, relid);
+    return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_extended() -
+ *
+ *    Find the table stats entry on backends. The returned entry is stored
+ *    in static memory, so the content is valid until the next call of the
+ *    same function for a different table.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_extended(bool shared, Oid reloid)
+{
+    PgStat_StatTabEntry *shent;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(reloid != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_tabent_key.databaseid == dboid &&
+        cached_tabent_key.objectid == reloid)
+        return &cached_tabent;
+
+    shent = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, reloid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_tabent, shent, sizeof(PgStat_StatTabEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_tabent_key.databaseid = dboid;
+    cached_tabent_key.objectid = reloid;
+
+    return &cached_tabent;
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping. Copy a portion of the counters of
+ *    the relation to the specified place.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
 
 
@@ -2558,30 +2962,46 @@ pgstat_fetch_stat_tabentry(Oid relid)
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
  *    the collected statistics for one function or NULL.
+ *
+ *  The returned entry is stored in static memory, so the content is valid
+ *    until the next call of the same function for a different function id.
  * ----------
  */
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
-
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
-
-    return funcentry;
+    PgStat_StatFuncEntry *shent;
+    Oid    dboid = MyDatabaseId;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(func_id != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_funcent_key.databaseid == dboid &&
+        cached_funcent_key.objectid == func_id)
+        return &cached_funcent;
+
+    shent = (PgStat_StatFuncEntry *)
+        get_stat_entry(PGSTAT_TYPE_FUNCTION, dboid, func_id, true, false,
+                       NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_funcent, shent, sizeof(PgStat_StatFuncEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_funcent_key.databaseid = dboid;
+    cached_funcent_key.objectid = func_id;
+
+    return &cached_funcent;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_beentry() -
  *
@@ -2641,39 +3061,84 @@ pgstat_fetch_stat_numbackends(void)
     return localNumBackends;
 }
 
+/*
+ * ---------
+ * pgstat_get_stat_timestamp() -
+ *
+ *  Returns the last update timestamp of global statistics.
+ */
+TimestampTz
+pgstat_get_stat_timestamp(void)
+{
+    return (TimestampTz) pg_atomic_read_u64(&StatsShmem->stats_timestamp);
+}
+
 /*
  * ---------
  * pgstat_fetch_stat_archiver() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the archiver statistics struct.
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_ArchiverStats *
+PgStat_Archiver *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&cached_archiverstats, &StatsShmem->archiver_stats,
+           sizeof(PgStat_Archiver));
+    LWLockRelease(StatsLock);
 
-    return &archiverStats;
+    cached_archiverstats_is_valid = true;
+
+    return &cached_archiverstats;
 }
 
 
 /*
  * ---------
- * pgstat_fetch_global() -
+ * pgstat_fetch_stat_bgwriter() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the global statistics struct.
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_GlobalStats *
-pgstat_fetch_global(void)
+PgStat_BgWriter *
+pgstat_fetch_stat_bgwriter(void)
 {
-    backend_read_statsfile();
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&cached_bgwriterstats, &StatsShmem->bgwriter_stats,
+           sizeof(PgStat_BgWriter));
+    LWLockRelease(StatsLock);
+
+    cached_bgwriterstats_is_valid = true;
 
-    return &globalStats;
+    return &cached_bgwriterstats;
 }
 
+/*
+ * ---------
+ * pgstat_fetch_stat_checkpointer() -
+ *
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
+ * ---------
+ */
+PgStat_CheckPointer *
+pgstat_fetch_stat_checkpointer(void)
+{
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&cached_checkpointerstats, &StatsShmem->checkpointer_stats,
+           sizeof(PgStat_CheckPointer));
+    LWLockRelease(StatsLock);
+
+    cached_checkpointerstats_is_valid = true;
+
+    return &cached_checkpointerstats;
+}
 
 /*
  * ---------
@@ -2686,9 +3151,27 @@ pgstat_fetch_global(void)
 PgStat_SLRUStats *
 pgstat_fetch_slru(void)
 {
-    backend_read_statsfile();
+    size_t size = sizeof(PgStat_SLRUStats) * SLRU_NUM_ELEMENTS;
 
-    return slruStats;
+    for (;;)
+    {
+        uint32 before_count;
+        uint32 after_count;
+
+        pg_read_barrier();
+        before_count = pg_atomic_read_u32(&StatsShmem->slru_changecount);
+        memcpy(&cached_slrustats, &StatsShmem->slru_stats, size);
+        after_count = pg_atomic_read_u32(&StatsShmem->slru_changecount);
+
+        if (before_count == after_count && (before_count & 1) == 0)
+            break;
+
+        CHECK_FOR_INTERRUPTS();
+    }
+
+    cached_slrustats_is_valid = true;
+
+    return &cached_slrustats;
 }
 
 
@@ -2902,8 +3385,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before DSM shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3079,12 +3562,15 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+    /* attach shared database stats area */
+    attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3097,7 +3583,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
@@ -3114,6 +3600,10 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    cleanup_dropped_stats_entries();
+
+    detach_shared_stats(true);
 }
 
 
@@ -3374,7 +3864,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3669,8 +4160,8 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
+        case WAIT_EVENT_READING_STATS_FILE:
+            event_name = "ReadingStatsFile";
             break;
         case WAIT_EVENT_RECOVERY_WAL_STREAM:
             event_name = "RecoveryWalStream";
@@ -4324,94 +4815,62 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- *        Set common header fields in a statistics message
+ *        Report archiver statistics
  * ----------
  */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
+void
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    TimestampTz now = GetCurrentTimestamp();
 
-    ((PgStat_MsgHdr *) msg)->m_size = len;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
+    if (failed)
     {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
- * ----------
- */
-void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
+        ++StatsShmem->archiver_stats.failed_count;
+        memcpy(&StatsShmem->archiver_stats.last_failed_wal, xlog,
+               sizeof(StatsShmem->archiver_stats.last_failed_wal));
+        StatsShmem->archiver_stats.last_failed_timestamp = now;
+    }
+    else
+    {
+        ++StatsShmem->archiver_stats.archived_count;
+        memcpy(&StatsShmem->archiver_stats.last_archived_wal, xlog,
+               sizeof(StatsShmem->archiver_stats.last_archived_wal));
+        StatsShmem->archiver_stats.last_archived_timestamp = now;
+    }
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    strlcpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+    PgStat_BgWriter *s = &StatsShmem->bgwriter_stats;
+    PgStat_BgWriter *l = &BgWriterStats;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking the lock for completely empty stats.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    s->buf_written_clean += l->buf_written_clean;
+    s->maxwritten_clean += l->maxwritten_clean;
+    s->buf_alloc += l->buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4420,473 +4879,113 @@ pgstat_send_bgwriter(void)
 }
 
 /* ----------
- * pgstat_send_slru() -
+ * pgstat_report_checkpointer() -
  *
- *        Send SLRU statistics to the collector
+ *        Report checkpointer statistics
  * ----------
  */
-static void
-pgstat_send_slru(void)
+void
+pgstat_report_checkpointer(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgSLRU all_zeroes;
-
-    for (int i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /*
-         * This function can be called even if nothing at all has happened. In
-         * this case, avoid sending a completely empty message to the stats
-         * collector.
-         */
-        if (memcmp(&SLRUStats[i], &all_zeroes, sizeof(PgStat_MsgSLRU)) == 0)
-            continue;
-
-        /* set the SLRU type before each send */
-        SLRUStats[i].m_index = i;
-
-        /*
-         * Prepare and send the message
-         */
-        pgstat_setheader(&SLRUStats[i].m_hdr, PGSTAT_MTYPE_SLRU);
-        pgstat_send(&SLRUStats[i], sizeof(PgStat_MsgSLRU));
-
-        /*
-         * Clear out the statistics buffer, so it can be re-used.
-         */
-        MemSet(&SLRUStats[i], 0, sizeof(PgStat_MsgSLRU));
-    }
-}
-
-
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-    WaitEvent    event;
-    WaitEventSet *wes;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /* Prepare to wait for our latch or data in our socket. */
-    wes = CreateWaitEventSet(CurrentMemoryContext, 3);
-    AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
-    AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);
-    AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL);
+    static const PgStat_CheckPointer all_zeroes;
+    PgStat_CheckPointer *s = &StatsShmem->checkpointer_stats;
+    PgStat_CheckPointer *l = &CheckPointerStats;
 
     /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
+     * This function can be called even if nothing at all has happened. In
+     * this case, avoid taking the lock for completely empty stats.
      */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSLRUCOUNTER:
-                    pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_SLRU:
-                    pgstat_recv_slru(&msg.msg_slru, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitEventSetWait(wes, -1L, &event, 1, WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitEventSetWait(wes, 2 * 1000L /* msec */ , &event, 1,
-                              WAIT_EVENT_PGSTAT_MAIN);
-#endif
+    if (memcmp(&CheckPointerStats, &all_zeroes, sizeof(PgStat_CheckPointer)) == 0)
+        return;
 
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr == 1 && event.events == WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    s->timed_checkpoints += l->timed_checkpoints;
+    s->requested_checkpoints += l->requested_checkpoints;
+    s->checkpoint_write_time += l->checkpoint_write_time;
+    s->checkpoint_sync_time += l->checkpoint_sync_time;
+    s->buf_written_checkpoints += l->buf_written_checkpoints;
+    s->buf_written_backend += l->buf_written_backend;
+    s->buf_fsync_backend += l->buf_fsync_backend;
+    LWLockRelease(StatsLock);
 
     /*
-     * Save the final stats to reuse at next startup.
+     * Clear out the statistics buffer, so it can be re-used.
      */
-    pgstat_write_statsfiles(true, true);
-
-    FreeWaitEventSet(wes);
-
-    exit(0);
+    MemSet(&CheckPointerStats, 0, sizeof(CheckPointerStats));
 }
 
-/*
- * Subroutine to clear stats in a database entry
+/* ----------
+ * get_local_dbstat_entry() -
  *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+ *  Find or create a local PgStat_StatDBEntry entry for dbid.  A new entry is
+ *  created and initialized if none exists.
  */
 static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+get_local_dbstat_entry(Oid dbid)
 {
-    PgStat_StatDBEntry *result;
+    PgStat_StatDBEntry *dbentry;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
 
     /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
+     * Find an entry or create a new one.
      */
-    if (!found)
-        reset_dbentry_counters(result);
+    dbentry = (PgStat_StatDBEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                             true, &found);
 
-    return result;
+    return dbentry;
 }
 
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel.  A new entry is created
+ *  and initialized if none exists.
+ * ----------
  */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
 {
-    PgStat_StatTabEntry *result;
+    PgStat_TableStatus *tabentry;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    tabentry = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE,
+                             isshared ? InvalidOid : MyDatabaseId,
+                             rel_id, true, &found);
 
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->inserts_since_vacuum = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
+    return tabentry;
 }
 
-
 /* ----------
- * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
+ * pgstat_write_statsfile() -
+ *        Write the global statistics file, as well as DB files.
  *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ * This function is called in the last process that is accessing the shared
+ * stats so locking is not required.
  * ----------
  */
 static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+pgstat_write_statsfile(void)
 {
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
+    dshash_seq_status hstat;
+    PgStatHashEntry *ps;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
+
+    /* this is the last process that is accessing the shared stats */
+#ifdef USE_ASSERT_CHECKING
+    LWLockAcquire(StatsLock, LW_SHARED);
+    Assert(StatsShmem->refcount == 0);
+    LWLockRelease(StatsLock);
+#endif
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -4906,7 +5005,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    pg_atomic_write_u64(&StatsShmem->stats_timestamp, GetCurrentTimestamp());
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4918,182 +5017,61 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(&StatsShmem->bgwriter_stats, sizeof(PgStat_BgWriter), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write checkpointer global stats struct
+     */
+    rc = fwrite(&StatsShmem->checkpointer_stats, sizeof(PgStat_CheckPointer), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(&StatsShmem->archiver_stats, sizeof(PgStat_Archiver), 1,
+                fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write SLRU stats struct
      */
-    rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
+    rc = fwrite(&StatsShmem->slru_stats, sizeof(PgStatSharedSLRUStats), 1,
+                fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Walk through the database table.
+     * Walk through the stats entries
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((ps = dshash_seq_next(&hstat)) != NULL)
     {
-        /*
-         * Write out the table and function stats for this DB into the
-         * appropriate per-DB stat file, if required.
-         */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
+        PgStat_StatDBEntry *dbentry;
+        void               *pent;
+        size_t                len;
+
+        CHECK_FOR_INTERRUPTS();
+
+        pent = dsa_get_address(area, ps->body);
+
+        /* Make DB's timestamp consistent with the global stats */
+        if (ps->key.type == PGSTAT_TYPE_DB)
         {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+            dbentry = (PgStat_StatDBEntry *) pent;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
+            dbentry->stats_timestamp =
+                (TimestampTz) pg_atomic_read_u64(&StatsShmem->stats_timestamp);
         }
 
-        /*
-         * Write out the DB entry. We don't write the tables or functions
-         * pointers, since they're of no use to any other process.
-         */
-        fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
+        fputc('S', fpout);
+        rc = fwrite(&ps->key, sizeof(PgStatHashKey), 1, fpout);
 
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
-
-/* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
- * ----------
- */
-static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
-{
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpout;
-    int32        format_id;
-    Oid            dbid = dbentry->databaseid;
-    int            rc;
-    char        tmpfile[MAXPGPATH];
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database's access stats per table.
-     */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-    {
-        fputc('T', fpout);
-        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * Walk through the database's function stats table.
-     */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+        /* Exclude the header part of the entry */
+        len = PGSTAT_SHENT_BODY_LEN(ps->key.type);
+        rc = fwrite(PGSTAT_SHENT_BODY(pent), len, 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_seq_term(&hstat);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -5127,102 +5105,63 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
 /* ----------
- * pgstat_read_statsfiles() -
- *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ * pgstat_read_statsfile() -
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
+ *    Reads the existing activity statistics file into the shared stats hash.
  *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
+ * This function is called in the only process that is accessing the shared
+ * stats so locking is not required.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+static void
+pgstat_read_statsfile(void)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            i;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    char        tag;
 
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    /* this is the only process that is accessing the shared stats */
+#ifdef USE_ASSERT_CHECKING
+    LWLockAcquire(StatsLock, LW_SHARED);
+    Assert(StatsShmem->refcount == 1);
+    LWLockRelease(StatsLock);
+#endif
 
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-    memset(&slruStats, 0, sizeof(slruStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all SLRU items too.
-     */
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-        slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    StatsShmem->bgwriter_stats.stat_reset_timestamp = GetCurrentTimestamp();
+    StatsShmem->archiver_stats.stat_reset_timestamp =
+        StatsShmem->bgwriter_stats.stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the stats file has not yet been written for
+     * the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5231,624 +5170,137 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
 
     /*
-     * Read global stats struct
+     * Read bgwriter stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(&StatsShmem->bgwriter_stats, 1, sizeof(PgStat_BgWriter), fpin) !=
+        sizeof(PgStat_BgWriter))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        MemSet(&StatsShmem->bgwriter_stats, 0, sizeof(PgStat_BgWriter));
         goto done;
     }
 
     /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
+     * Read checkpointer stats struct
      */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(&StatsShmem->checkpointer_stats, 1, sizeof(PgStat_CheckPointer), fpin) !=
+        sizeof(PgStat_CheckPointer))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        MemSet(&StatsShmem->checkpointer_stats, 0, sizeof(PgStat_CheckPointer));
         goto done;
     }
 
-    /*
-     * Read SLRU stats struct
-     */
-    if (fread(slruStats, 1, sizeof(slruStats), fpin) != sizeof(slruStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&slruStats, 0, sizeof(slruStats));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Add to the DB hash
-                 */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-
-    return dbhash;
-}
-
-
-/* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
- * ----------
- */
-static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
-{
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatTabEntry tabbuf;
-    PgStat_StatFuncEntry funcbuf;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'T'    A PgStat_StatTabEntry follows.
-                 */
-            case 'T':
-                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
-                          fpin) != sizeof(PgStat_StatTabEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
-                break;
-
-                /*
-                 * 'F'    A PgStat_StatFuncEntry follows.
-                 */
-            case 'F':
-                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
-                          fpin) != sizeof(PgStat_StatFuncEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
-                break;
-
-                /*
-                 * 'E'    The EOF marker of a complete stats file.
-                 */
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts. The caller must
- *    rely on timestamp stored in *ts iff the function returns true.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
+    if (fread(&StatsShmem->archiver_stats, 1, sizeof(PgStat_Archiver),
+              fpin) != sizeof(PgStat_Archiver))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
+        MemSet(&StatsShmem->archiver_stats, 0, sizeof(PgStat_Archiver));
+        goto done;
     }
 
     /*
      * Read SLRU stats struct
      */
-    if (fread(mySLRUStats, 1, sizeof(mySLRUStats), fpin) != sizeof(mySLRUStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
+    if (fread(&StatsShmem->slru_stats, 1, sizeof(PgStatSharedSLRUStats),
+              fpin) != sizeof(PgStatSharedSLRUStats))
     {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-        }
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
     }
 
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
     /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
+    while ((tag = fgetc(fpin)) == 'S')
     {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
+        PgStatHashKey        key;
+        PgStat_StatEntryHeader *p;
+        size_t                len;
 
         CHECK_FOR_INTERRUPTS();
 
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
+        if (fread(&key, 1, sizeof(key), fpin) != sizeof(key))
         {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"", statfile)));
+            goto done;
         }
 
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
+        p = get_stat_entry(key.type, key.databaseid, key.objectid,
+                           false, true, &found);
 
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
+        /* don't allow duplicate entries */
+        if (found)
+        {
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"",
+                            statfile)));
+            goto done;
         }
 
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
+        /* Avoid overwriting header part */
+        len = PGSTAT_SHENT_BODY_LEN(key.type);
+        if (fread(PGSTAT_SHENT_BODY(p), 1, len, fpin) != len)
+        {
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"", statfile)));
+            goto done;
+        }
     }
 
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
+    if (tag != 'E')
+    {
         ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
+                (errmsg("corrupted statistics file \"%s\"",
+                        statfile)));
+        goto done;
+    }
 
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+done:
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+
+    return;
 }
 
-
 /* ----------
  * pgstat_setup_memcxt() -
  *
- *    Create pgStatLocalContext, if not already done.
+ *    Create pgStatLocalContext and pgStatCacheContext if not already done.
  * ----------
  */
 static void
 pgstat_setup_memcxt(void)
 {
     if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
 
+    if (!pgStatCacheContext)
+        pgStatCacheContext =
+            AllocSetContextCreate(CacheMemoryContext,
+                                  "Activity statistics",
+                                  ALLOCSET_SMALL_SIZES);
+}
 
 /* ----------
  * pgstat_clear_snapshot() -
@@ -5865,795 +5317,23 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-                tabentry->inserts_since_vacuum = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_resetslrucounter() -
- *
- *    Reset some SLRU statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len)
-{
-    int            i;
-    TimestampTz ts = GetCurrentTimestamp();
-
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /* reset entry with the given index, or all entries (index is -1) */
-        if ((msg->m_index == -1) || (msg->m_index == i))
-        {
-            memset(&slruStats[i], 0, sizeof(slruStats[i]));
-            slruStats[i].stat_reset_timestamp = ts;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signaling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * It is quite possible that a non-aggressive VACUUM ended up skipping
-     * various pages, however, we'll zero the insert counter here regardless.
-     * It's currently used only to track when we need to perform an "insert"
-     * autovacuum, which are mainly intended to freeze newly inserted tuples.
-     * Zeroing this may just mean we'll not try to vacuum the table again
-     * until enough tuples have been inserted to trigger another insert
-     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
-     * stragglers.
-     */
-    tabentry->inserts_since_vacuum = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_slru() -
- *
- *    Process a SLRU message.
- * ----------
- */
-static void
-pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
-{
-    slruStats[msg->m_index].blocks_zeroed += msg->m_blocks_zeroed;
-    slruStats[msg->m_index].blocks_hit += msg->m_blocks_hit;
-    slruStats[msg->m_index].blocks_read += msg->m_blocks_read;
-    slruStats[msg->m_index].blocks_written += msg->m_blocks_written;
-    slruStats[msg->m_index].blocks_exists += msg->m_blocks_exists;
-    slruStats[msg->m_index].flush += msg->m_flush;
-    slruStats[msg->m_index].truncate += msg->m_truncate;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
+    }
+
+    /* Invalidate the simple cache keys */
+    cached_dbent_key = stathashkey_zero;
+    cached_tabent_key = stathashkey_zero;
+    cached_funcent_key = stathashkey_zero;
+    cached_archiverstats_is_valid = false;
+    cached_bgwriterstats_is_valid = false;
+    cached_checkpointerstats_is_valid = false;
+    cached_slrustats_is_valid = false;
 }
 
 /*
@@ -6744,7 +5424,7 @@ pgstat_slru_name(int slru_idx)
  * Returns pointer to entry with counters for given SLRU (based on the name
  * stored in SlruCtl as lwlock tranche name).
  */
-static inline PgStat_MsgSLRU *
+static PgStat_SLRUStats *
 slru_entry(int slru_idx)
 {
     /*
@@ -6755,7 +5435,7 @@ slru_entry(int slru_idx)
 
     Assert((slru_idx >= 0) && (slru_idx < SLRU_NUM_ELEMENTS));
 
-    return &SLRUStats[slru_idx];
+    return &local_SLRUStats[slru_idx];
 }
 
 /*
@@ -6765,41 +5445,48 @@ slru_entry(int slru_idx)
 void
 pgstat_count_slru_page_zeroed(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_zeroed += 1;
+    slru_entry(slru_idx)->blocks_zeroed += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_hit(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_hit += 1;
+    slru_entry(slru_idx)->blocks_hit += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_exists(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_exists += 1;
+    slru_entry(slru_idx)->blocks_exists += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_read(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_read += 1;
+    slru_entry(slru_idx)->blocks_read += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_written(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_written += 1;
+    slru_entry(slru_idx)->blocks_written += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_flush(int slru_idx)
 {
-    slru_entry(slru_idx)->m_flush += 1;
+    slru_entry(slru_idx)->flush += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_truncate(int slru_idx)
 {
-    slru_entry(slru_idx)->m_truncate += 1;
+    slru_entry(slru_idx)->truncate += 1;
+    have_slrustats = true;
 }
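
(Reading note on the SLRU hunks above.) Each counter function now bumps a backend-local PgStat_SLRUStats entry and sets have_slrustats, so a later flush can skip the work when nothing changed. The following is a minimal standalone sketch of that "local counters plus dirty flag" pattern, not code from the patch: every sketch_* name is hypothetical, the shared area is a plain static array, and the locking a real flush would need is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for SLRU_NUM_ELEMENTS. */
#define SKETCH_NUM_SLRUS 3

/* Hypothetical, cut-down counterpart of a per-SLRU counter struct. */
typedef struct SketchSLRUStats
{
    int64_t blocks_zeroed;
    int64_t blocks_hit;
} SketchSLRUStats;

/* Backend-local counters; cheap to update, no locking needed. */
static SketchSLRUStats sketch_local[SKETCH_NUM_SLRUS];

/* Stand-in for the shared-memory area (assumption: a real one needs a lock). */
static SketchSLRUStats sketch_shared[SKETCH_NUM_SLRUS];

/* Dirty flag: true when sketch_local holds unflushed counts. */
static bool sketch_have_stats = false;

static SketchSLRUStats *
sketch_entry(int idx)
{
    assert(idx >= 0 && idx < SKETCH_NUM_SLRUS);
    return &sketch_local[idx];
}

static void
sketch_count_hit(int idx)
{
    sketch_entry(idx)->blocks_hit += 1;
    sketch_have_stats = true;
}

static void
sketch_count_zeroed(int idx)
{
    sketch_entry(idx)->blocks_zeroed += 1;
    sketch_have_stats = true;
}

/*
 * Add local counts to the shared area and reset the local side.
 * Returns true if anything was flushed, false if there was nothing to do.
 */
static bool
sketch_flush(void)
{
    int         i;

    if (!sketch_have_stats)
        return false;           /* fast path: no counter was touched */

    for (i = 0; i < SKETCH_NUM_SLRUS; i++)
    {
        sketch_shared[i].blocks_zeroed += sketch_local[i].blocks_zeroed;
        sketch_shared[i].blocks_hit += sketch_local[i].blocks_hit;
    }

    memset(sketch_local, 0, sizeof(sketch_local));
    sketch_have_stats = false;
    return true;
}
```

The flag keeps the no-activity flush to a single test instead of a scan over all entries; whether the patch's flush routine is structured this way is not visible in this part of the diff.
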
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b811c961a6..526021def2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -257,7 +257,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -518,7 +517,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1340,12 +1338,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1794,11 +1786,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2728,8 +2715,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3056,8 +3041,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3124,13 +3107,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3203,22 +3179,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3679,22 +3639,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3914,8 +3858,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3939,8 +3881,7 @@ PostmasterStateMachine(void)
     {
         /*
          * PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
-         * (ie, no dead_end children remain), and the archiver and stats
-         * collector are gone too.
+         * (ie, no dead_end children remain), and the archiver is gone too.
          *
          * The reason we wait for those two is to protect them against a new
          * postmaster starting conflicting subprocesses; this isn't an
@@ -3950,8 +3891,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4152,8 +4092,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5130,18 +5068,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5260,12 +5186,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6170,7 +6090,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6226,8 +6145,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6462,7 +6379,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index b89df01fa7..57531d7d48 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1556,8 +1556,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
  *
  * If 'missing_ok' is true, will not throw an error if the file is not found.
  *
- * If dboid is anything other than InvalidOid then any checksum failures detected
- * will get reported to the stats collector.
+ * If dboid is anything other than InvalidOid then any checksum failures
+ * detected will get reported to the activity stats facility.
  *
  * Returns true if the file was successfully sent, false if 'missing_ok',
  * and the file did not exist.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e549fa1d30..5ee7110444 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2045,7 +2045,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                CheckPointerStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2155,7 +2155,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2345,7 +2345,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2353,7 +2353,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd..55693cfa3a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -149,6 +149,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -267,6 +268,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 2fa90cc095..17a46db74b 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -177,7 +177,9 @@ static const char *const BuiltinTrancheNames[] = {
     /* LWTRANCHE_PARALLEL_APPEND: */
     "ParallelAppend",
     /* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
-    "PerXactPredicateList"
+    "PerXactPredicateList",
+    /* LWTRANCHE_STATS: */
+    "ActivityStatistics"
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 774292fd94..91bf8d5b5d 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -53,3 +53,4 @@ XactTruncationLock                    44
 # 45 was XactTruncationLock until removal of BackendRandomLock
 WrapLimitsVacuumLock                46
 NotifyQueueTailLock                    47
+StatsLock                            48
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dcc09df0c7..0dec4b9145 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -415,8 +415,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
     DropRelFileNodesAllBuffers(rnodes, nrels);
 
     /*
-     * It'd be nice to tell the stats collector to forget them immediately,
-     * too. But we can't because we don't know the OIDs.
+     * It'd be nice to tell the activity stats facility to forget them
+     * immediately, too. But we can't because we don't know the OIDs.
      */
 
     /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 411cfadbff..5043736f1f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3209,6 +3209,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3783,6 +3789,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4179,11 +4186,12 @@ PostgresMain(int argc, char *argv[],
          * Note: this includes fflush()'ing the last of the prior output.
          *
          * This is also a good time to send collected statistics to the
-         * collector, and to update the PS stats display.  We avoid doing
-         * those every time through the message loop because it'd slow down
-         * processing of batched messages, and because we don't want to report
-         * uncommitted updates (that confuses autovacuum).  The notification
-         * processor wants a call too, if we are not in a transaction block.
+         * activity statistics facility, and to update the PS stats display.
+         * We avoid doing those every time through the message loop because
+         * it'd slow down processing of batched messages, and because we
+         * don't want to report uncommitted updates (that confuses
+         * autovacuum).  The notification processor wants a call too, if we
+         * are not in a transaction block.
          */
         if (send_ready_for_query)
         {
@@ -4215,6 +4223,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4227,8 +4237,13 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4263,7 +4278,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4271,6 +4286,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
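
(Reading note on the postgres.c hunks above.) pgstat_report_stat(false) now returns a value that, when positive, is used to arm IDLE_STATS_UPDATE_TIMEOUT; ProcessInterrupts() then calls pgstat_report_stat(true) once the timeout fires, and the timeout is disabled when the next command arrives. The sketch below models that control flow with a simulated clock. It is an illustration under assumptions, not the patch's logic: all sketch_* names are hypothetical, and the 10-second minimum interval is taken from the discussion earlier in this thread, where it is still tentative.

```c
#include <stdbool.h>

/* Assumed minimum interval between flushes (tentative value from the thread). */
#define SKETCH_MIN_INTERVAL_MS 10000

static long sketch_last_flush_ms = 0;   /* simulated time of the last flush */
static int  sketch_pending = 0;         /* unflushed local counts */
static int  sketch_flushed_total = 0;   /* what has reached "shared" stats */

static bool sketch_timeout_armed = false;
static long sketch_timeout_at_ms = 0;

/*
 * Flush pending counts if forced or if the minimum interval has elapsed.
 * Returns 0 when nothing is left to flush, otherwise the number of
 * milliseconds the caller should wait before trying again.
 */
static long
sketch_report_stat(bool force, long now_ms)
{
    long        elapsed = now_ms - sketch_last_flush_ms;

    if (sketch_pending == 0)
        return 0;

    if (!force && elapsed < SKETCH_MIN_INTERVAL_MS)
        return SKETCH_MIN_INTERVAL_MS - elapsed;

    sketch_flushed_total += sketch_pending;
    sketch_pending = 0;
    sketch_last_flush_ms = now_ms;
    return 0;
}

/* Backend is about to wait for the next command. */
static void
sketch_go_idle(long now_ms)
{
    long        wait_ms = sketch_report_stat(false, now_ms);

    if (wait_ms > 0)
    {
        /* Flush was postponed: make sure it happens even if we stay idle. */
        sketch_timeout_armed = true;
        sketch_timeout_at_ms = now_ms + wait_ms;
    }
}

/* The idle timeout fired while waiting: flush unconditionally. */
static void
sketch_timeout_fired(long now_ms)
{
    sketch_timeout_armed = false;
    (void) sketch_report_stat(true, now_ms);
}

/* A command arrived before the timeout: the next idle period re-arms it. */
static void
sketch_command_arrived(void)
{
    sketch_timeout_armed = false;
}
```

The point of the timer is that an idle backend does not sit on unflushed counts indefinitely just because its last flush attempt came too soon after the previous one.
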
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 95738a4e34..f6dc875a25 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pgstatfuncs.c
- *      Functions for accessing the statistics collector data
+ *      Functions for accessing the activity statistics data
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -35,9 +35,6 @@
 
 #define HAS_PGSTAT_PERMISSIONS(role)     (is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role))
 
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1267,7 +1264,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1283,7 +1280,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1299,7 +1296,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1315,7 +1312,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1331,7 +1328,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1347,7 +1344,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1363,7 +1360,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1379,7 +1376,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1395,7 +1392,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1428,7 +1425,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1444,7 +1441,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1459,7 +1456,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1474,7 +1471,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1489,7 +1486,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1504,7 +1501,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1519,7 +1516,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1534,11 +1531,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1553,7 +1550,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1571,7 +1568,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1608,7 +1605,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1624,7 +1621,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1632,69 +1629,71 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_bgwriter_timed_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->timed_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->timed_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->requested_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_clean);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
 }
 
 Datum
 pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->maxwritten_clean);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->maxwritten_clean);
 }
 
 Datum
 pg_stat_get_checkpoint_write_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_write_time);
+    PG_RETURN_FLOAT8((double)
+                     pgstat_fetch_stat_checkpointer()->checkpoint_write_time);
 }
 
 Datum
 pg_stat_get_checkpoint_sync_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_sync_time);
+    PG_RETURN_FLOAT8((double)
+                     pgstat_fetch_stat_checkpointer()->checkpoint_sync_time);
 }
 
 Datum
 pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stat_reset_timestamp);
+    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
 }
 
 Datum
 pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_backend);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
 }
 
 Datum
 pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_fsync_backend);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
 }
 
 Datum
 pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
 }
 
 /*
@@ -1965,7 +1964,7 @@ pg_stat_get_xact_function_self_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_snapshot_timestamp(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stats_timestamp);
+    PG_RETURN_TIMESTAMPTZ(pgstat_get_stat_timestamp());
 }
 
 /* Discard the active statistics snapshot */
@@ -2039,7 +2038,7 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     TupleDesc    tupdesc;
     Datum        values[7];
     bool        nulls[7];
-    PgStat_ArchiverStats *archiver_stats;
+    PgStat_Archiver *archiver_stats;
 
     /* Initialise values and NULL flags arrays */
     MemSet(values, 0, sizeof(values));
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 9061af81a3..d23cc2d0a9 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -71,6 +71,7 @@
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/optimizer.h"
+#include "pgstat.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
 #include "storage/lmgr.h"
@@ -2353,6 +2354,9 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
      */
     RelationCloseSmgr(relation);
 
+    /* break mutual link with stats entry */
+    pgstat_delinkstats(relation);
+
     /*
      * Free all the subsidiary data structures of the relcache entry, then the
      * entry itself.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ab8216839..883c7f8802 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index ed2ab4b5b2..74fb22f216 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -269,9 +269,6 @@ GetBackendTypeDesc(BackendType backendType)
         case B_ARCHIVER:
             backendDesc = "archiver";
             break;
-        case B_STATS_COLLECTOR:
-            backendDesc = "stats collector";
-            break;
         case B_LOGGER:
             backendDesc = "logger";
             break;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index d4ab4c7e23..4ff4cc33d9 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1245,6 +1248,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 596bcb7b84..29eb459e35 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -743,8 +743,8 @@ const char *const config_group_names[] =
     gettext_noop("Statistics"),
     /* STATS_MONITORING */
     gettext_noop("Statistics / Monitoring"),
-    /* STATS_COLLECTOR */
-    gettext_noop("Statistics / Query and Index Statistics Collector"),
+    /* STATS_ACTIVITY */
+    gettext_noop("Statistics / Query and Index Activity Statistics"),
     /* AUTOVACUUM */
     gettext_noop("Autovacuum"),
     /* CLIENT_CONN */
@@ -1454,7 +1454,7 @@ static struct config_bool ConfigureNamesBool[] =
 #endif
 
     {
-        {"track_activities", PGC_SUSET, STATS_COLLECTOR,
+        {"track_activities", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects information about executing commands."),
             gettext_noop("Enables the collection of information on the currently "
                          "executing command of each session, along with "
@@ -1465,7 +1465,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_counts", PGC_SUSET, STATS_COLLECTOR,
+        {"track_counts", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects statistics on database activity."),
             NULL
         },
@@ -1474,7 +1474,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_io_timing", PGC_SUSET, STATS_COLLECTOR,
+        {"track_io_timing", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects timing statistics for database I/O activity."),
             NULL
         },
@@ -4310,7 +4310,7 @@ static struct config_string ConfigureNamesString[] =
     },
 
     {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
+        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
             gettext_noop("Writes temporary statistics files to the specified directory."),
             NULL,
             GUC_SUPERUSER_ONLY
@@ -4646,7 +4646,7 @@ static struct config_enum ConfigureNamesEnum[] =
     },
 
     {
-        {"track_functions", PGC_SUSET, STATS_COLLECTOR,
+        {"track_functions", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects function-level statistics on database activity."),
             NULL
         },
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9cb571f7cc..668a2d033a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -579,7 +579,7 @@
 # STATISTICS
 #------------------------------------------------------------------------------
 
-# - Query and Index Statistics Collector -
+# - Query and Index Activity Statistics -
 
 #track_activities = on
 #track_counts = on
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index f674a7c94e..26603e95e4 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -124,7 +124,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index bba002ad24..b70c495f2a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
@@ -321,7 +323,6 @@ typedef enum BackendType
     B_WAL_SENDER,
     B_WAL_WRITER,
     B_ARCHIVER,
-    B_STATS_COLLECTOR,
     B_LOGGER,
 } BackendType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0dfbac46b4..046bf21485 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -15,9 +15,11 @@
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -27,8 +29,8 @@
  * ----------
  */
 #define PGSTAT_STAT_PERMANENT_DIRECTORY        "pg_stat"
-#define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
-#define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
+#define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/saved_stats"
+#define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/saved_stats.tmp"
 
 /* Default directory to store temporary statistics data in */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
@@ -41,35 +43,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_RESETSLRUCOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_SLRU,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -80,9 +53,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -118,13 +90,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -155,10 +120,13 @@ typedef enum PgStat_Single_Reset_Type
  */
 typedef struct PgStat_TableStatus
 {
-    Oid            t_id;            /* table's OID */
-    bool        t_shared;        /* is it a shared catalog? */
     struct PgStat_TableXactStatus *trans;    /* lowest subxact's counts */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_TableCounts t_counts;    /* event counts to be sent */
+    Relation    relation;            /* rel that is using this entry */
 } PgStat_TableStatus;
 
 /* ----------
@@ -183,308 +151,57 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
+/*
+ * Archiver statistics kept in the shared stats
  */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgResetslrucounter Sent by the backend to tell the collector
- *                                to reset a SLRU counter
- * ----------
- */
-typedef struct PgStat_MsgResetslrucounter
-{
-    PgStat_MsgHdr m_hdr;
-    int            m_index;
-} PgStat_MsgResetslrucounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
+typedef struct PgStat_Archiver
 {
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
+    PgStat_Counter archived_count;    /* archival successes */
+    char        last_archived_wal[MAX_XFN_CHARS + 1];    /* last WAL file
+                                                         * archived */
+    TimestampTz last_archived_timestamp;    /* last archival success time */
+    PgStat_Counter failed_count;    /* failed archival attempts */
+    char        last_failed_wal[MAX_XFN_CHARS + 1]; /* WAL file involved in
+                                                     * last failure */
+    TimestampTz last_failed_timestamp;    /* last archival failure time */
+    TimestampTz stat_reset_timestamp;
+} PgStat_Archiver;
 
 /* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgBgWriter
+typedef struct PgStat_BgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_alloc;
+    TimestampTz stat_reset_timestamp;
+} PgStat_BgWriter;
 
 /* ----------
- * PgStat_MsgSLRU            Sent by a backend to update SLRU statistics.
+ * PgStat_CheckPointer        checkpointer statistics
  * ----------
  */
-typedef struct PgStat_MsgSLRU
+typedef struct PgStat_CheckPointer
 {
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_index;
-    PgStat_Counter m_blocks_zeroed;
-    PgStat_Counter m_blocks_hit;
-    PgStat_Counter m_blocks_read;
-    PgStat_Counter m_blocks_written;
-    PgStat_Counter m_blocks_exists;
-    PgStat_Counter m_flush;
-    PgStat_Counter m_truncate;
-} PgStat_MsgSLRU;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_CheckPointer;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -500,7 +217,6 @@ typedef struct PgStat_FunctionCounts
  */
 typedef struct PgStat_BackendFunctionEntry
 {
-    Oid            f_id;
     PgStat_FunctionCounts f_counts;
 } PgStat_BackendFunctionEntry;
 
@@ -516,98 +232,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgResetslrucounter msg_resetslrucounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgSLRU msg_slru;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -616,13 +242,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9D
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -632,7 +254,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -642,29 +263,87 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatEntryHeader        common header struct for PgStat_Stat*Entry
+ * ----------
+ */
+typedef struct PgStat_StatEntryHeader
+{
+    LWLock        lock;
+    bool        dropped;            /* This entry is being dropped and should
+                                     * be removed when refcount goes to
+                                     * zero. */
+    pg_atomic_uint32  refcount;        /* number of backends referencing this entry */
+} PgStat_StatEntryHeader;
+
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following must be last in the struct, because we don't write them
+     * out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
 } PgStat_StatDBEntry;
 
 
+/*
+ * SLRU statistics kept in the shared stats
+ */
+typedef struct PgStat_SLRUStats
+{
+    PgStat_Counter blocks_zeroed;
+    PgStat_Counter blocks_hit;
+    PgStat_Counter blocks_read;
+    PgStat_Counter blocks_written;
+    PgStat_Counter blocks_exists;
+    PgStat_Counter flush;
+    PgStat_Counter truncate;
+    TimestampTz stat_reset_timestamp;
+} PgStat_SLRUStats;
+
+
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_HashMountInfo  dshash mount information
+ * ----------
+ */
+typedef struct PgStat_HashMountInfo
+{
+    HTAB       *snapshot_tables;    /* table entry snapshot */
+    HTAB       *snapshot_functions; /* function entry snapshot */
+    dshash_table *dshash_tables;    /* attached tables dshash */
+    dshash_table *dshash_functions; /* attached functions dshash */
+} PgStat_HashMountInfo;
+
+/* ----------
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
-    Oid            tableid;
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -684,25 +363,21 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
 {
-    Oid            functionid;
-
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
     PgStat_Counter f_numcalls;
 
     PgStat_Counter f_total_time;    /* times in microseconds */
@@ -710,57 +385,6 @@ typedef struct PgStat_StatFuncEntry
 } PgStat_StatFuncEntry;
 
 
-/*
- * Archiver statistics kept in the stats collector
- */
-typedef struct PgStat_ArchiverStats
-{
-    PgStat_Counter archived_count;    /* archival successes */
-    char        last_archived_wal[MAX_XFN_CHARS + 1];    /* last WAL file
-                                                         * archived */
-    TimestampTz last_archived_timestamp;    /* last archival success time */
-    PgStat_Counter failed_count;    /* failed archival attempts */
-    char        last_failed_wal[MAX_XFN_CHARS + 1]; /* WAL file involved in
-                                                     * last failure */
-    TimestampTz last_failed_timestamp;    /* last archival failure time */
-    TimestampTz stat_reset_timestamp;
-} PgStat_ArchiverStats;
-
-/*
- * Global statistics kept in the stats collector
- */
-typedef struct PgStat_GlobalStats
-{
-    TimestampTz stats_timestamp;    /* time of stats file update */
-    PgStat_Counter timed_checkpoints;
-    PgStat_Counter requested_checkpoints;
-    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
-    PgStat_Counter checkpoint_sync_time;
-    PgStat_Counter buf_written_checkpoints;
-    PgStat_Counter buf_written_clean;
-    PgStat_Counter maxwritten_clean;
-    PgStat_Counter buf_written_backend;
-    PgStat_Counter buf_fsync_backend;
-    PgStat_Counter buf_alloc;
-    TimestampTz stat_reset_timestamp;
-} PgStat_GlobalStats;
-
-/*
- * SLRU statistics kept in the stats collector
- */
-typedef struct PgStat_SLRUStats
-{
-    PgStat_Counter blocks_zeroed;
-    PgStat_Counter blocks_hit;
-    PgStat_Counter blocks_read;
-    PgStat_Counter blocks_written;
-    PgStat_Counter blocks_exists;
-    PgStat_Counter flush;
-    PgStat_Counter truncate;
-    TimestampTz stat_reset_timestamp;
-} PgStat_SLRUStats;
-
-
 /* ----------
  * Backend states
  * ----------
@@ -808,7 +432,7 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
+    WAIT_EVENT_READING_STATS_FILE,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
     WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1060,7 +684,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing code
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1257,13 +881,21 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used; will be removed together with the GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
+
+/*
+ * CheckPointer statistics counters are updated directly by checkpointer and
+ * bufmgr
+ */
+extern PgStat_CheckPointer CheckPointerStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1278,29 +910,22 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
-
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 extern void pgstat_reset_slru_counter(const char *);
 
@@ -1340,6 +965,7 @@ extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
 extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 
 extern void pgstat_initstats(Relation rel);
+extern void pgstat_delinkstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
 
@@ -1462,8 +1088,9 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
+extern void pgstat_report_checkpointer(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1472,12 +1099,17 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_extended(bool shared,
+                                                                Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
-extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
-extern PgStat_GlobalStats *pgstat_fetch_global(void);
+extern TimestampTz pgstat_get_stat_timestamp(void);
+extern PgStat_Archiver *pgstat_fetch_stat_archiver(void);
+extern PgStat_BgWriter *pgstat_fetch_stat_bgwriter(void);
+extern PgStat_CheckPointer *pgstat_fetch_stat_checkpointer(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
@@ -1489,5 +1121,6 @@ extern void pgstat_count_slru_flush(int slru_idx);
 extern void pgstat_count_slru_truncate(int slru_idx);
 extern const char *pgstat_slru_name(int slru_idx);
 extern int    pgstat_slru_index(const char *name);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index af9b41795d..621e074111 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -218,6 +218,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TIDBITMAP,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_PER_XACT_PREDICATE_LIST,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 04431d0eb2..3b03464a1a 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -88,7 +88,7 @@ enum config_group
     PROCESS_TITLE,
     STATS,
     STATS_MONITORING,
-    STATS_COLLECTOR,
+    STATS_ACTIVITY,
     AUTOVACUUM,
     CLIENT_CONN,
     CLIENT_CONN_STATEMENT,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..77d1572a99 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.18.4

From 0fa7e49eddda07ef0a1f6d3903ff16081fc2d4fd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v38 5/7] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  34 ++++----
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 127 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 90 insertions(+), 99 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index de9bacd34f..69db5afc94 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9209,9 +9209,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8eabf93834..cc5dc1173f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7192,11 +7192,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity statistics are enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7212,14 +7212,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables activity tracking on the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7250,9 +7249,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8313,7 +8312,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previously collected statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8325,9 +8324,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the index contains deleted pages that can be recycled during cleanup.
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
-        fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
-        the index meta-page. Note that the meta-page does not include this data
+
+        fraction of the total number of heap tuples in the previously
+        collected statistics. The total number of heap tuples is stored in the
+        index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
         subsequent <command>VACUUM</command> cycles detect no dead tuples.
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 42f01c515f..ec02e72dc0 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2367,12 +2367,13 @@ LOG:  database system is ready to accept read only connections
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    Activity statistics are collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on primary, so
+    replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  Activity statistics are reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4e0193a967..7a04d58a1a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    primary server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   primary process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   primary process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> is a subsystem that supports tracking and reporting
+   of information about server activity.  Presently, the subsystem counts
+   accesses to tables and indexes in both disk-block and
+   individual-row terms.  It also tracks the total number of rows in each
+   table, and information about vacuum and analyze actions for each table.  It
+   can also track calls to user-defined functions and the total time spent in
+   each one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   table and index accesses are tracked.
   </para>
 
   <para>
@@ -196,18 +195,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g., after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   When the server shuts down cleanly, a permanent copy of the statistics
+   data is stored in the <filename>pg_stat</filename> subdirectory, so that
+   statistics can be retained across server restarts.  When recovery is
+   performed at server start (e.g., after immediate shutdown, server crash,
+   and point-in-time recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +212,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each individual server
+   process writes out new statistical counts just before going idle, but not
+   more often than once per <varname>PGSTAT_STAT_INTERVAL</varname>
+   milliseconds (1 second unless altered while building the server); so a
+   query or transaction still in progress does not affect the displayed
+   totals.  However, current-query information tracked by
+   <varname>track_activities</varname> is always up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is captured
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+  
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -620,7 +610,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -1057,10 +1047,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>LogicalLauncherMain</literal></entry>
       <entry>Waiting in main loop of logical replication launcher process.</entry>
      </row>
-     <row>
-      <entry><literal>PgStatMain</literal></entry>
-      <entry>Waiting in main loop of statistics collector process.</entry>
-     </row>
      <row>
       <entry><literal>RecoveryWalStream</literal></entry>
       <entry>Waiting in main loop of startup process for WAL to arrive, during
@@ -1815,6 +1801,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
     </thead>
 
     <tbody>
+     <row>
+      <entry><literal>ActivityStatistics</literal></entry>
+      <entry>Waiting to write out activity statistics to shared memory.</entry>
+     </row>
      <row>
       <entry><literal>AddinShmemInit</literal></entry>
       <entry>Waiting to manage an extension's space allocation in shared
@@ -5738,9 +5728,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed,
+       <command>VACUUM</command> will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index e09ed0a4c3..71bb24accf 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1290,11 +1290,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   tracked in the activity statistics.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
-- 
2.18.4

From 9e3c966f9221b724d805b125481064be46b564d4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 29 Sep 2020 22:59:33 +0900
Subject: [PATCH v38 6/7] Remove the GUC stats_temp_directory

The new stats collection system doesn't need a temporary directory, so
just remove it. pg_stat_statements is modified to use the pg_stat directory
to store its temporary files.  As a result, basebackup copies
pg_stat_statements' temporary file if it exists.
---
 .../pg_stat_statements/pg_stat_statements.c   | 13 +++---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 --------
 doc/src/sgml/storage.sgml                     |  6 ---
 src/backend/postmaster/pgstat.c               | 10 -----
 src/backend/replication/basebackup.c          | 36 ----------------
 src/backend/utils/misc/guc.c                  | 43 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/bin/initdb/initdb.c                       |  1 -
 src/bin/pg_rewind/filemap.c                   |  7 ---
 src/include/pgstat.h                          |  3 --
 src/test/perl/PostgresNode.pm                 |  4 --
 12 files changed, 6 insertions(+), 139 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 1eac9edaee..5eaceb60a7 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -88,14 +88,13 @@ PG_MODULE_MAGIC;
 #define PGSS_DUMP_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pg_stat_statements.stat"
 
 /*
- * Location of external query text file.  We don't keep it in the core
- * system's stats_temp_directory.  The core system can safely use that GUC
- * setting, because the statistics collector temp file paths are set only once
- * as part of changing the GUC, but pg_stat_statements has no way of avoiding
- * race conditions.  Besides, we only expect modest, infrequent I/O for query
- * strings, so placing the file on a faster filesystem is not compelling.
+ * Location of external query text file.  We keep it in the core system's
+ * permanent stats directory (pg_stat).  pg_stat_statements has no way of
+ * avoiding race conditions even if the directory were specified by a GUC.
+ * Besides, we only expect modest, infrequent I/O for query strings, so
+ * placing the file on a faster filesystem is not compelling.
  */
-#define PGSS_TEXT_FILE    PG_STAT_TMP_DIR "/pgss_query_texts.stat"
+#define PGSS_TEXT_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pgss_query_texts.stat"
 
 /* Magic number identifying the stats file format */
 static const uint32 PGSS_FILE_HEADER = 0x20171004;
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index b9331830f7..5096963234 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index cc5dc1173f..d8d99bb546 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7305,25 +7305,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 3234adb639..6bac5e075e 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -120,12 +120,6 @@ Item
   subsystem</entry>
 </row>
 
-<row>
- <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
-</row>
-
 <row>
  <entry><filename>pg_subtrans</filename></entry>
  <entry>Subdirectory containing subtransaction status data</entry>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c7f0503d81..6e2053e73d 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -98,16 +98,6 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
- */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
-
 /*
  * List of SLRU names that we keep stats for.  There is no central registry of
  * SLRUs, so we use this fixed list instead.  The "other" entry is used for
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 57531d7d48..25eabbb1ad 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -87,9 +87,6 @@ static int    basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
 /* Was the backup currently in-progress initiated in recovery mode? */
 static bool backup_started_in_recovery = false;
 
-/* Relative path of temporary statistics directory */
-static char *statrelpath = NULL;
-
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -152,13 +149,6 @@ struct exclude_list_item
  */
 static const char *const excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    PG_STAT_TMP_DIR,
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
@@ -261,7 +251,6 @@ perform_base_backup(basebackup_options *opt)
     StringInfo    labelfile;
     StringInfo    tblspc_map_file;
     backup_manifest_info manifest;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
@@ -284,8 +273,6 @@ perform_base_backup(basebackup_options *opt)
     Assert(CurrentResourceOwner == NULL);
     CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -314,18 +301,6 @@ perform_base_backup(basebackup_options *opt)
         tablespaceinfo *ti;
         int            tblspc_streamed = 0;
 
-        /*
-         * Calculate the relative path of temporary statistics directory in
-         * order to skip the files which are located in that directory later.
-         */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
-
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
         ti->size = -1;
@@ -1365,17 +1340,6 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
         if (excludeFound)
             continue;
 
-        /*
-         * Exclude contents of directory specified by statrelpath if not set
-         * to the default (pg_stat_tmp) which is caught in the loop above.
-         */
-        if (statrelpath != NULL && strcmp(pathbuf, statrelpath) == 0)
-        {
-            elog(DEBUG1, "contents of directory \"%s\" excluded from backup", statrelpath);
-            size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
-            continue;
-        }
-
         /*
          * We can skip pg_wal, the WAL segments need to be fetched from the
          * WAL archive anyway. But include it as an empty directory anyway, so
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 29eb459e35..87296bf2aa 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -202,7 +202,6 @@ static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource sourc
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -558,8 +557,6 @@ char       *HbaFileName;
 char       *IdentFileName;
 char       *external_pid_file;
 
-char       *pgstat_temp_directory;
-
 char       *application_name;
 
 int            tcp_keepalives_idle;
@@ -4309,17 +4306,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_PRIMARY,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11608,35 +11594,6 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
     return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 668a2d033a..7183c08305 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -586,7 +586,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 118b282d1c..9e5a3a01ed 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -217,7 +217,6 @@ static const char *const subdirs[] = {
     "pg_replslot",
     "pg_tblspc",
     "pg_stat",
-    "pg_stat_tmp",
     "pg_xact",
     "pg_logical",
     "pg_logical/snapshots",
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 1abc257177..d2192429bc 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -53,13 +53,6 @@ struct exclude_list_item
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    "pg_stat_tmp",                /* defined as PG_STAT_TMP_DIR */
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 046bf21485..44738d4aed 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -32,9 +32,6 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/saved_stats"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/saved_stats.tmp"
 
-/* Default directory to store temporary statistics data in */
-#define PG_STAT_TMP_DIR        "pg_stat_tmp"
-
 /* Values for track_functions GUC variable --- order is significant! */
 typedef enum TrackFunctionsLevel
 {
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1488bffa2b..bb5474b878 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.18.4

From b9092abc52bc0839f0a12e03f05581178aa08cef Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 29 Sep 2020 23:19:58 +0900
Subject: [PATCH v38 7/7] Exclude pg_stat directory from base backup

basebackup sends the contents of the pg_stat directory, which are
removed anyway when a server starts up from the backup. Now that
pg_stat_statements also saves a temporary file in that directory,
exclude the pg_stat directory from base backups.
---
 src/backend/replication/basebackup.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 25eabbb1ad..dd35920e82 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -149,6 +149,13 @@ struct exclude_list_item
  */
 static const char *const excludeDirContents[] =
 {
+    /*
+     * Skip statistics files. PGSTAT_STAT_PERMANENT_DIRECTORY must be skipped
+     * because the files in that directory are removed when a server starts
+     * up from the backup.
+     */
+    PGSTAT_STAT_PERMANENT_DIRECTORY,
+
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
-- 
2.18.4

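To illustrate the combined effect of patches 6/7 and 7/7, here is a minimal standalone sketch. It is not code from the patch. It assumes PGSTAT_STAT_PERMANENT_DIRECTORY keeps its value "pg_stat" from src/include/pgstat.h, uses a shortened stand-in for the excludeDirContents list, and introduces a hypothetical helper, dir_contents_excluded(), that mimics the exact-name strcmp() check done in sendDir().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed value, as defined in src/include/pgstat.h. */
#define PGSTAT_STAT_PERMANENT_DIRECTORY "pg_stat"

/* As in patch 6/7: adjacent string literals are joined at compile time. */
#define PGSS_TEXT_FILE PGSTAT_STAT_PERMANENT_DIRECTORY "/pgss_query_texts.stat"

/* Shortened, NULL-terminated stand-in for the list in basebackup.c. */
static const char *const excludeDirContents[] = {
    PGSTAT_STAT_PERMANENT_DIRECTORY,    /* added by patch 7/7 */
    "pg_subtrans",
    NULL
};

/*
 * Hypothetical helper, not part of the patch: returns true when the
 * contents of the directory entry "name" would be left out of the backup.
 * Only an exact match on the entry name counts.
 */
static bool
dir_contents_excluded(const char *name)
{
    assert(name != NULL);

    for (size_t i = 0; excludeDirContents[i] != NULL; i++)
    {
        if (strcmp(name, excludeDirContents[i]) == 0)
            return true;
    }
    return false;
}
```

Under these assumptions, PGSS_TEXT_FILE expands to "pg_stat/pgss_query_texts.stat", so the query text file falls under the newly excluded pg_stat directory, while "pg_stat_tmp" no longer matches any entry. As backup.sgml notes for the other excluded directories, only the contents are skipped, not the directory itself.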