Thread: [HACKERS] Protect syscache from bloating with negative cache entries

[HACKERS] Protect syscache from bloating with negative cache entries

From:
Kyotaro HORIGUCHI
Date:
Hello, recently one of my customers stumbled over immoderate
catcache bloat.

This is a known issue living on the Todo page in the PostgreSQL
wiki.

https://wiki.postgresql.org/wiki/Todo#Cache_Usage
> Fix memory leak caused by negative catcache entries

https://www.postgresql.org/message-id/51C0A1FF.2050404@vmware.com

This patch addresses two cases of syscache bloat by using the
invalidation callback mechanism.


Overview of the patch

The bloat is caused by negative cache entries in catcaches. They
are crucial for performance, but the problem is that there is no
way to remove them; they last for the backend's lifetime.

The first patch provides a means to flush catcache negative
entries, then defines a relcache invalidation callback to flush
negative entries in the syscaches for pg_statistic (STATRELATTINH)
and pg_attribute (ATTNAME, ATTNUM).  The second patch implements a
syscache invalidation callback so that deletion of a schema
causes a flush for pg_class (RELNAMENSP).

Neither of the above is hard-coded; both are defined in cacheinfo
using four additional members.
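
For illustration, a minimal sketch of how the relcache side of this
could be wired up with the existing inval.h API. The helper
CatCacheCleanupNegativeEntries() is hypothetical here, standing in for
the facility the first patch adds inside catcache.c:

/*
 * Sketch only: register a relcache invalidation callback that drops
 * negative syscache entries keyed by the invalidated relation's OID.
 */
#include "postgres.h"
#include "utils/inval.h"
#include "utils/syscache.h"

/* hypothetical helper: drop negative entries of 'cacheid' matching relid */
extern void CatCacheCleanupNegativeEntries(int cacheid, Oid relid);

static void
NegCacheRelcacheCallback(Datum arg, Oid relid)
{
    /* fired for every relcache invalidation, including relation drop */
    CatCacheCleanupNegativeEntries(STATRELATTINH, relid);
    CatCacheCleanupNegativeEntries(ATTNAME, relid);
    CatCacheCleanupNegativeEntries(ATTNUM, relid);
}

void
RegisterNegCacheCleanup(void)
{
    CacheRegisterRelcacheCallback(NegCacheRelcacheCallback, (Datum) 0);
}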


Remaining problems

Still, the catcache can bloat if non-existent tables with unique
names are repeatedly accessed in a long-lived schema, but that
seems a bit too artificial (or malicious). Since such negative
entries have no event to trigger their removal, caps would be
needed to keep them from bloating the syscaches, but reasonable
limits hardly seem determinable.


Defects or disadvantages

This patch scans the whole target catcache to find negative
entries to remove, which might take a (comparatively) long time on
a catcache with many entries. With the second patch, unrelated
negative entries may also get flushed, since they are keyed by
hash value rather than by the exact key values.



The attached files are the following.

1. 0001-Cleanup-negative-cache-of-pg_statistic-when-dropping.patch
   Negative entry flushing using a relcache invalidation callback.

2. 0002-Cleanup-negative-cache-of-pg_class-when-dropping-a-s.patch
   Negative entry flushing using a catcache invalidation callback.

3. gen.pl  a test script for STATRELATTINH bloating.

4. gen2.pl  a test script for RELNAMENSP bloating.

3 and 4 are used as follows:

./gen.pl | psql postgres > /dev/null



regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Robert Haas
Date:
On Mon, Dec 19, 2016 at 6:15 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello, recently one of my customers stumbled over immoderate
> catcache bloat.

This isn't only an issue for negative catcache entries.  A long time
ago, there was a limit on the size of the relcache, which was removed
because if you have a workload where the working set of relations is
just larger than the limit, performance is terrible.  But the problem
now is that backend memory usage can grow without bound, and that's
also bad, especially on systems with hundreds of long-lived backends.
In connection-pooling environments, the problem is worse, because
every connection in the pool eventually caches references to
everything of interest to any client.

Your patches seem to me to have some merit, but I wonder if we should
also consider having a time-based threshold of some kind.  If, say, a
backend hasn't accessed a catcache or relcache entry for many minutes,
it becomes eligible to be flushed.  We could implement this by having
some process, like the background writer,
SendProcSignal(PROCSIG_HOUSEKEEPING) to every process in the system
every 10 minutes or so.  When a process receives this signal, it sets
a flag that is checked before going idle.  When it sees the flag set,
it makes a pass over every catcache and relcache entry.  All the ones
that are unmarked get marked, and all of the ones that are marked get
removed.  Access to an entry clears any mark.  So anything that's not
touched for more than 10 minutes starts dropping out of backend
caches.
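
As a self-contained sketch of that marking scheme (simplified
structures, not the real CatCTup/relcache entries): every access
clears the mark, the periodic housekeeping pass evicts whatever is
still marked from the previous pass and marks the survivors, so only
entries untouched for a full interval get dropped.

#include <stdbool.h>
#include <stddef.h>

typedef struct CacheEntry
{
    struct CacheEntry *next;
    bool        marked;         /* set by the sweep, cleared on access */
    /* ... key, cached tuple, etc. ... */
} CacheEntry;

/* called from the lookup path whenever an entry is used */
static inline void
cache_entry_touch(CacheEntry *e)
{
    e->marked = false;
}

/* called once per housekeeping interval, before going idle */
static void
cache_housekeeping_sweep(CacheEntry **head)
{
    CacheEntry **prev = head;
    CacheEntry *e;

    while ((e = *prev) != NULL)
    {
        if (e->marked)
        {
            /* untouched since the last sweep: unlink it */
            *prev = e->next;
            /* free_entry(e);  -- cache-specific teardown goes here */
        }
        else
        {
            /* survived this round; candidate for eviction next time */
            e->marked = true;
            prev = &e->next;
        }
    }
}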

Anyway, that would be a much bigger change from what you are proposing
here, and what you are proposing here seems reasonable so I guess I
shouldn't distract from it.  Your email just made me think of it,
because I agree that catcache/relcache bloat is a serious issue.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Craig Ringer
Date:
On 20 December 2016 at 21:59, Robert Haas <robertmhaas@gmail.com> wrote:

> We could implement this by having
> some process, like the background writer,
> SendProcSignal(PROCSIG_HOUSEKEEPING) to every process in the system
> every 10 minutes or so.

... on a rolling basis.

Otherwise that'll be no fun at all, especially with some of those
lovely "we kept getting errors so we raised max_connections to 5000"
systems out there. But also on more sensibly configured ones that're
busy and want nice smooth performance without stalls.

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Tom Lane
Date:
Craig Ringer <craig@2ndquadrant.com> writes:
> On 20 December 2016 at 21:59, Robert Haas <robertmhaas@gmail.com> wrote:
>> We could implement this by having
>> some process, like the background writer,
>> SendProcSignal(PROCSIG_HOUSEKEEPING) to every process in the system
>> every 10 minutes or so.

> ... on a rolling basis.

I don't understand why we'd make that a system-wide behavior at all,
rather than expecting each process to manage its own cache.
        regards, tom lane



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Robert Haas
Date:
On Tue, Dec 20, 2016 at 10:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Craig Ringer <craig@2ndquadrant.com> writes:
>> On 20 December 2016 at 21:59, Robert Haas <robertmhaas@gmail.com> wrote:
>>> We could implement this by having
>>> some process, like the background writer,
>>> SendProcSignal(PROCSIG_HOUSEKEEPING) to every process in the system
>>> every 10 minutes or so.
>
>> ... on a rolling basis.
>
> I don't understand why we'd make that a system-wide behavior at all,
> rather than expecting each process to manage its own cache.

Individual backends don't have a really great way to do time-based
stuff, do they?  I mean, yes, there is enable_timeout() and friends,
but I think that requires quite a bit of bookkeeping.
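
For reference, a minimal sketch of what that bookkeeping might look
like with the existing timeout.h API (RegisterTimeout() and
enable_timeout_after()); the handler only sets a flag, the flag is
checked at a safe point, and the actual pruning routine is left
hypothetical:

#include "postgres.h"
#include <signal.h>
#include "utils/timeout.h"

#define CACHE_PRUNE_INTERVAL_MS     (10 * 60 * 1000)    /* 10 minutes */

static TimeoutId cache_prune_timeout_id;
static volatile sig_atomic_t cache_prune_pending = false;

static void
cache_prune_timeout_handler(void)
{
    cache_prune_pending = true;     /* signal context: only set a flag */
}

void
SetupCachePruneTimeout(void)
{
    cache_prune_timeout_id = RegisterTimeout(USER_TIMEOUT,
                                             cache_prune_timeout_handler);
    enable_timeout_after(cache_prune_timeout_id, CACHE_PRUNE_INTERVAL_MS);
}

/* call from a safe spot, e.g. just before waiting for the next query */
void
MaybePruneCaches(void)
{
    if (!cache_prune_pending)
        return;
    cache_prune_pending = false;
    /* PruneIdleCacheEntries();  -- hypothetical sweep over the caches */
    enable_timeout_after(cache_prune_timeout_id, CACHE_PRUNE_INTERVAL_MS);
}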

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Dec 20, 2016 at 10:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I don't understand why we'd make that a system-wide behavior at all,
>> rather than expecting each process to manage its own cache.

> Individual backends don't have a really great way to do time-based
> stuff, do they?  I mean, yes, there is enable_timeout() and friends,
> but I think that requires quite a bit of bookkeeping.

If I thought that "every ten minutes" was an ideal way to manage this,
I might worry about that, but it doesn't really sound promising at all.
Every so many queries would likely work better, or better yet make it
self-adaptive depending on how much is in the local syscache.

The bigger picture here though is that we used to have limits on syscache
size, and we got rid of them (commit 8b9bc234a, see also
https://www.postgresql.org/message-id/flat/5141.1150327541%40sss.pgh.pa.us)
not only because of the problem you mentioned about performance falling
off a cliff once the working-set size exceeded the arbitrary limit, but
also because enforcing the limit added significant overhead --- and did so
whether or not you got any benefit from it, ie even if the limit is never
reached.  Maybe the present patch avoids imposing a pile of overhead in
situations where no pruning is needed, but it doesn't really look very
promising from that angle in a quick once-over.

BTW, I don't see the point of the second patch at all?  Surely, if
an object is deleted or updated, we already have code that flushes
related catcache entries.  Otherwise the caches would deliver wrong
data.
        regards, tom lane



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Robert Haas
Date:
On Tue, Dec 20, 2016 at 3:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Dec 20, 2016 at 10:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I don't understand why we'd make that a system-wide behavior at all,
>>> rather than expecting each process to manage its own cache.
>
>> Individual backends don't have a really great way to do time-based
>> stuff, do they?  I mean, yes, there is enable_timeout() and friends,
>> but I think that requires quite a bit of bookkeeping.
>
> If I thought that "every ten minutes" was an ideal way to manage this,
> I might worry about that, but it doesn't really sound promising at all.
> Every so many queries would likely work better, or better yet make it
> self-adaptive depending on how much is in the local syscache.

I don't think "every so many queries" is very promising at all.
First, it has the same problem as a fixed cap on the number of
entries: if you're doing a round-robin just slightly bigger than that
value, performance will be poor.  Second, what's really important here
is to keep the percentage of wall-clock time spent populating the
system caches small.  If a backend is doing 4000 queries/second and
each of those 4000 queries touches a different table, it really needs
a cache of at least 4000 entries or it will thrash and slow way down.
But if it's doing a query every 10 minutes and those queries
round-robin between 4000 different tables, it doesn't really need a
4000-entry cache.  If those queries are long-running, the time to
repopulate the cache will only be a tiny fraction of runtime.  If the
queries are short-running, then the effect is, percentage-wise, just
the same as for the high-volume system, but in practice it isn't
likely to be felt as much.  I mean, if we keep a bunch of old cache
entries around on a mostly-idle backend, they are going to be pushed
out of CPU caches and maybe even paged out.  One can't expect a
backend that is woken up after a long sleep to be quite as snappy as
one that's continuously active.

Which gets to my third point: anything that's based on number of
queries won't do anything to help the case where backends sometimes go
idle and sit there for long periods.  Reducing resource utilization in
that case would be beneficial.  Ideally I'd like to get rid of not
only the backend-local cache contents but the backend itself, but
that's a much harder project.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Kyotaro HORIGUCHI
Date:
Thank you for the discussion.

At Tue, 20 Dec 2016 15:10:21 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <23492.1482264621@sss.pgh.pa.us>
> The bigger picture here though is that we used to have limits on syscache
> size, and we got rid of them (commit 8b9bc234a, see also
> https://www.postgresql.org/message-id/flat/5141.1150327541%40sss.pgh.pa.us)
> not only because of the problem you mentioned about performance falling
> off a cliff once the working-set size exceeded the arbitrary limit, but
> also because enforcing the limit added significant overhead --- and did so
> whether or not you got any benefit from it, ie even if the limit is never
> reached.  Maybe the present patch avoids imposing a pile of overhead in
> situations where no pruning is needed, but it doesn't really look very
> promising from that angle in a quick once-over.

Indeed. As mentioned in the mail at the beginning of this thread,
it falls back to scanning the whole cache if at least one negative
entry exists, even one unrelated to the target relid, and that can
take significantly long on a fat cache.

Keeping lists of negative entries, like CatCacheList, would help
but needs additional memory.

> BTW, I don't see the point of the second patch at all?  Surely, if
> an object is deleted or updated, we already have code that flushes
> related catcache entries.  Otherwise the caches would deliver wrong
> data.

Maybe you have misread the patch. Negative entries won't be
flushed by any existing means. Deletion of a namespace causes
cascaded object deletion according to dependencies, which finally
leads to invalidation of non-negative cache entries. But removal
of *negative entries* in RELNAMENSP won't happen.

The test script for the case (gen2.pl) does the following:

CREATE SCHEMA foo;
SELECT * FROM foo.invalid;
DROP SCHEMA foo;

Removing the schema foo leaves a negative cache entry for
'foo.invalid' in RELNAMENSP.

However, I'm not sure the above situation happens frequently
enough to be worth addressing.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Kyotaro HORIGUCHI
Date:
At Wed, 21 Dec 2016 10:21:09 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20161221.102109.51106943.horiguchi.kyotaro@lab.ntt.co.jp>
> At Tue, 20 Dec 2016 15:10:21 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <23492.1482264621@sss.pgh.pa.us>
> > The bigger picture here though is that we used to have limits on syscache
> > size, and we got rid of them (commit 8b9bc234a, see also
> > https://www.postgresql.org/message-id/flat/5141.1150327541%40sss.pgh.pa.us)
> > not only because of the problem you mentioned about performance falling
> > off a cliff once the working-set size exceeded the arbitrary limit, but
> > also because enforcing the limit added significant overhead --- and did so
> > whether or not you got any benefit from it, ie even if the limit is never
> > reached.  Maybe the present patch avoids imposing a pile of overhead in
> > situations where no pruning is needed, but it doesn't really look very
> > promising from that angle in a quick once-over.
> 
> Indeed. As mentioned in the mail at the beginning of this thread,
> it falls back to scanning the whole cache if at least one negative
> entry exists, even one unrelated to the target relid, and that can
> take significantly long on a fat cache.
> 
> Keeping lists of negative entries, like CatCacheList, would help
> but needs additional memory.
> 
> > BTW, I don't see the point of the second patch at all?  Surely, if
> > an object is deleted or updated, we already have code that flushes
> > related catcache entries.  Otherwise the caches would deliver wrong
> > data.
> 
> Maybe you have misread the patch. Negative entries won't be
> flushed by any existing means. Deletion of a namespace causes
> cascaded object deletion according to dependencies, which finally
> leads to invalidation of non-negative cache entries. But removal
> of *negative entries* in RELNAMENSP won't happen.
> 
> The test script for the case (gen2.pl) does the following:
> 
> CREATE SCHEMA foo;
> SELECT * FROM foo.invalid;
> DROP SCHEMA foo;
> 
> Removing the schema foo leaves a negative cache entry for
> 'foo.invalid' in RELNAMENSP.
> 
> However, I'm not sure the above situation happens frequently
> enough to be worth addressing.

Since commit 1753b1b conflicts with this patch, I have rebased it
onto the current master HEAD. I'll register this to the next CF.

The points of discussion are the following, I think.

1. The first patch seems to be working well. It costs the time to
   scan the whole of a catcache that has negative entries for other
   reloids. However, such negative entries are created by rather
   unusual usage: accessing undefined columns, and accessing columns
   on which no statistics have been created. The whole-catcache scan
   occurs on ATTNAME, ATTNUM and STATRELATTINH for every
   invalidation of a relcache entry.

2. The second patch also works, but flushing negative entries by
   hash value is inefficient. It scans the bucket corresponding to
   the given hash value for OIDs, then flushes negative entries
   iterating over all the collected OIDs. So this costs more time
   than 1 and flushes entries that do not need to be removed. If
   this feature is valuable but such side effects are not
   acceptable, a new invalidation category based on a cacheid-oid
   pair would be needed.
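
For illustration, a rough sketch of the shape of the second patch's
callback. Both helpers below are hypothetical stand-ins for what the
patch introduces; since the callback only receives a hash value,
several namespace OIDs can map to it, which is exactly why unrelated
negative entries can get flushed:

#include "postgres.h"
#include "nodes/pg_list.h"
#include "utils/inval.h"
#include "utils/syscache.h"

/* hypothetical helpers, roughly what the patch adds to catcache.c */
extern List *CollectNamespaceOidsForHashValue(uint32 hashValue);
extern void CleanupNegativeEntriesByKeyOid(int cacheid, int keyno, Oid keyoid);

static void
NamespaceSyscacheCallback(Datum arg, int cacheid, uint32 hashValue)
{
    List       *nspoids = CollectNamespaceOidsForHashValue(hashValue);
    ListCell   *lc;

    foreach(lc, nspoids)
    {
        /* drop negative "relname not found in this schema" entries */
        CleanupNegativeEntriesByKeyOid(RELNAMENSP, 2, lfirst_oid(lc));
    }
    list_free(nspoids);
}

void
RegisterNamespaceNegCacheCleanup(void)
{
    /* fires whenever a pg_namespace syscache entry is invalidated */
    CacheRegisterSyscacheCallback(NAMESPACEOID,
                                  NamespaceSyscacheCallback, (Datum) 0);
}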

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Michael Paquier
Date:
On Wed, Dec 21, 2016 at 5:10 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If I thought that "every ten minutes" was an ideal way to manage this,
> I might worry about that, but it doesn't really sound promising at all.
> Every so many queries would likely work better, or better yet make it
> self-adaptive depending on how much is in the local syscache.
>
> The bigger picture here though is that we used to have limits on syscache
> size, and we got rid of them (commit 8b9bc234a, see also
> https://www.postgresql.org/message-id/flat/5141.1150327541%40sss.pgh.pa.us)
> not only because of the problem you mentioned about performance falling
> off a cliff once the working-set size exceeded the arbitrary limit, but
> also because enforcing the limit added significant overhead --- and did so
> whether or not you got any benefit from it, ie even if the limit is never
> reached.  Maybe the present patch avoids imposing a pile of overhead in
> situations where no pruning is needed, but it doesn't really look very
> promising from that angle in a quick once-over.

Have there ever been discussions about keeping catcache entries in
a shared memory area? This may not sound great performance-wise; I
am just wondering about the concept and I cannot find references to
such discussions.
-- 
Michael



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> Have there ever been discussions about keeping catcache entries in
> a shared memory area? This may not sound great performance-wise; I
> am just wondering about the concept and I cannot find references to
> such discussions.

I'm sure it's been discussed.  Offhand I remember the following issues:

* A shared cache would create locking and contention overhead.

* A shared cache would have a very hard size limit, at least if it's
in SysV-style shared memory (perhaps DSM would let us relax that).

* Transactions that are doing DDL have a requirement for the catcache
to reflect changes that they've made locally but not yet committed,
so said changes mustn't be visible globally.

You could possibly get around the third point with a local catcache that's
searched before the shared one, but tuning that to be performant sounds
like a mess.  Also, I'm not sure how such a structure could cope with
uncommitted deletions: delete A -> remove A from local catcache, but not
the shared one -> search for A in local catcache -> not found -> search
for A in shared catcache -> found -> oops.
        regards, tom lane



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Robert Haas
Date:
On Fri, Jan 13, 2017 at 8:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> Have there ever been discussions about keeping catcache entries in a
>> shared memory area? This may not sound great performance-wise; I am
>> just wondering about the concept and I cannot find references to such
>> discussions.
>
> I'm sure it's been discussed.  Offhand I remember the following issues:
>
> * A shared cache would create locking and contention overhead.
>
> * A shared cache would have a very hard size limit, at least if it's
> in SysV-style shared memory (perhaps DSM would let us relax that).
>
> * Transactions that are doing DDL have a requirement for the catcache
> to reflect changes that they've made locally but not yet committed,
> so said changes mustn't be visible globally.
>
> You could possibly get around the third point with a local catcache that's
> searched before the shared one, but tuning that to be performant sounds
> like a mess.  Also, I'm not sure how such a structure could cope with
> uncommitted deletions: delete A -> remove A from local catcache, but not
> the shared one -> search for A in local catcache -> not found -> search
> for A in shared catcache -> found -> oops.

I think the first of those concerns is the key one.  If searching the
system catalogs costs $100 and searching the private catcache costs
$1, what's the cost of searching a hypothetical shared catcache?  If
the answer is $80, it's not worth doing.  If the answer is $5, it's
probably still not worth doing.  If the answer is $1.25, then it's
probably worth investing some energy into trying to solve the other
problems you list.  For some users, the memory cost of catcache and
syscache entries multiplied by N backends are a very serious problem,
so it would be nice to have some other options.  But we do so many
syscache lookups that a shared cache won't be viable unless it's
almost as fast as a backend-private cache, or at least that's my
hunch.

I think it would be interesting for somebody to build a prototype here
that ignores all the problems but the first and uses some
straightforward, relatively unoptimized locking strategy for the first
problem.  Then benchmark it.  If the results show that the idea has
legs, then we can try to figure out what a real implementation would
look like.

(One possible approach: use Thomas Munro's DHT stuff to build the shared cache.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Michael Paquier
Date:
On Sat, Jan 14, 2017 at 12:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 13, 2017 at 8:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Michael Paquier <michael.paquier@gmail.com> writes:
>>> Have there ever been discussions about keeping catcache entries in a
>>> shared memory area? This may not sound great performance-wise; I am
>>> just wondering about the concept and I cannot find references to such
>>> discussions.
>>
>> I'm sure it's been discussed.  Offhand I remember the following issues:
>>
>> * A shared cache would create locking and contention overhead.
>>
>> * A shared cache would have a very hard size limit, at least if it's
>> in SysV-style shared memory (perhaps DSM would let us relax that).
>>
>> * Transactions that are doing DDL have a requirement for the catcache
>> to reflect changes that they've made locally but not yet committed,
>> so said changes mustn't be visible globally.
>>
>> You could possibly get around the third point with a local catcache that's
>> searched before the shared one, but tuning that to be performant sounds
>> like a mess.  Also, I'm not sure how such a structure could cope with
>> uncommitted deletions: delete A -> remove A from local catcache, but not
>> the shared one -> search for A in local catcache -> not found -> search
>> for A in shared catcache -> found -> oops.
>
> I think the first of those concerns is the key one.  If searching the
> system catalogs costs $100 and searching the private catcache costs
> $1, what's the cost of searching a hypothetical shared catcache?  If
> the answer is $80, it's not worth doing.  If the answer is $5, it's
> probably still not worth doing.  If the answer is $1.25, then it's
> probably worth investing some energy into trying to solve the other
> problems you list.  For some users, the memory cost of catcache and
> syscache entries multiplied by N backends are a very serious problem,
> so it would be nice to have some other options.  But we do so many
> syscache lookups that a shared cache won't be viable unless it's
> almost as fast as a backend-private cache, or at least that's my
> hunch.

Being able to switch from one mode to another would be interesting.
Applications issuing extensive DDL that requires changing the
catcache under an exclusive lock would clearly pay the lock
contention cost, but do you think that would really be the case
with a shared lock? A bunch of applications that I work with deploy
Postgres once and then never change the schema except when an
upgrade happens, so this would be beneficial for them. There are
even some apps that do not use pgbouncer, but drop sessions after a
timeout of inactivity to avoid the memory bloat caused by the
problem of this thread. That won't solve the problem of local
catcache bloat, but some users issuing few DDLs may be fine paying
some extra concurrency cost if session handling gets easier.

> I think it would be interesting for somebody to build a prototype here
> that ignores all the problems but the first and uses some
> straightforward, relatively unoptimized locking strategy for the first
> problem.  Then benchmark it.  If the results show that the idea has
> legs, then we can try to figure out what a real implementation would
> look like.
> (One possible approach: use Thomas Munro's DHT stuff to build the shared cache.)

Yeah, I'd bet on a couple of days of focus to sort that out.
-- 
Michael



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> ... There are even some apps that do not use pgbouncer, but drop
> sessions after a timeout of inactivity to avoid the memory bloat
> caused by the problem of this thread.

Yeah, a certain company I used to work for had to do that, though their
problem had more to do with bloat in plpgsql's compiled-functions cache
(and ensuing bloat in the plancache), I believe.

Still, I'm pretty suspicious of anything that will add overhead to
catcache lookups.  If you think the performance of those is not absolutely
critical, turning off the caches via -DCLOBBER_CACHE_ALWAYS will soon
disabuse you of the error.

I'm inclined to think that a more profitable direction to look in is
finding a way to limit the cache size.  I know we got rid of exactly that
years ago, but the problems with it were (a) the mechanism was itself
pretty expensive --- a global-to-all-caches LRU list IIRC, and (b) there
wasn't a way to tune the limit.  Possibly somebody can think of some
cheaper, perhaps less precise way of aging out old entries.  As for
(b), this is the sort of problem we made GUCs for.

But, again, the catcache isn't the only source of per-process bloat
and I'm not even sure it's the main one.  A more holistic approach
might be called for.
        regards, tom lane



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Andres Freund
Date:
Hi,


On 2017-01-13 17:58:41 -0500, Tom Lane wrote:
> But, again, the catcache isn't the only source of per-process bloat
> and I'm not even sure it's the main one.  A more holistic approach
> might be called for.

It'd be helpful if we'd find a way to make it easy to get statistics
about the size of various caches in production systems. Right now that's
kinda hard, resulting in us having to make a lot of guesses...

Andres



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Tomas Vondra
Date:
On 01/14/2017 12:06 AM, Andres Freund wrote:
> Hi,
>
>
> On 2017-01-13 17:58:41 -0500, Tom Lane wrote:
>> But, again, the catcache isn't the only source of per-process bloat
>> and I'm not even sure it's the main one.  A more holistic approach
>> might be called for.
>
> It'd be helpful if we'd find a way to make it easy to get statistics
> about the size of various caches in production systems. Right now
> that's kinda hard, resulting in us having to make a lot of
> guesses...
>

What about a simple C extension that could inspect those caches?
Assuming it could be loaded into a single backend, that should be a
relatively acceptable way (compared to loading it into all backends
using shared_preload_libraries).


-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Michael Paquier
Date:
On Sat, Jan 14, 2017 at 9:36 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 01/14/2017 12:06 AM, Andres Freund wrote:
>> On 2017-01-13 17:58:41 -0500, Tom Lane wrote:
>>>
>>> But, again, the catcache isn't the only source of per-process bloat
>>> and I'm not even sure it's the main one.  A more holistic approach
>>> might be called for.
>>
>> It'd be helpful if we'd find a way to make it easy to get statistics
>> about the size of various caches in production systems. Right now
>> that's kinda hard, resulting in us having to make a lot of
>> guesses...
>
> What about a simple C extension that could inspect those caches? Assuming
> it could be loaded into a single backend, that should be a relatively
> acceptable way (compared to loading it into all backends using
> shared_preload_libraries).

This extension could do a small amount of work on a portion of the
syscache entries at each query loop. Still, I am wondering whether
it would not be nicer to have that in-core and configurable, which
is basically the approach proposed by Horiguchi-san. At least it
seems to me that it has some merit, and if we could make that
behavior switchable, disabled by default, that would be a win for
some class of applications. What do others think?
-- 
Michael



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Jim Nasby
Date:
On 12/26/16 2:31 AM, Kyotaro HORIGUCHI wrote:
> The points of discussion are the following, I think.
>
> 1. The first patch seems to be working well. It costs the time to
>    scan the whole of a catcache that has negative entries for other
>    reloids. However, such negative entries are created by rather
>    unusual usage: accessing undefined columns, and accessing columns
>    on which no statistics have been created. The whole-catcache scan
>    occurs on ATTNAME, ATTNUM and STATRELATTINH for every
>    invalidation of a relcache entry.

I took a look at this. It looks sane, though I've got a few minor 
comment tweaks:

+ *    Remove negative cache tuples maching a partial key.
s/maching/matching/

+/* searching with a paritial key needs scanning the whole cache */

s/needs/means/

+ * a negative cache entry cannot be referenced so we can remove

s/referenced/referenced,/

I was wondering if there's a way to test the performance impact of 
deleting negative entries.

> 2. The second patch also works, but flushing negative entries by
>    hash value is inefficient. It scans the bucket corresponding to
>    the given hash value for OIDs, then flushes negative entries
>    iterating over all the collected OIDs. So this costs more time
>    than 1 and flushes entries that do not need to be removed. If
>    this feature is valuable but such side effects are not
>    acceptable, a new invalidation category based on a cacheid-oid
>    pair would be needed.

I glanced at this and it looks sane. Didn't go any farther since this 
one's pretty up in the air. ISTM it'd be better to do some kind of aging 
instead of patch 2.

The other (possibly naive) question I have is how useful negative 
entries really are? Will Postgres regularly incur negative lookups, or 
will these only happen due to user activity? I can't think of a case 
where an app would need to depend on fast negative lookup (in other 
words, it should be considered a bug in the app). I can see where 
getting rid of them completely might be problematic, but maybe we can 
just keep a relatively small number of them around. I'm thinking a 
simple LRU list of X number of negative entries; when that fills you 
reuse the oldest one. You'd have to pay the LRU maintenance cost on 
every negative hit, but if those shouldn't be that common it shouldn't 
be bad.

That might well necessitate another GUC, but it seems a lot simpler than 
most of the other ideas.
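
A self-contained sketch of that LRU idea, using the dlist primitives
from lib/ilist.h; the entry struct, the cap (which would presumably
become a GUC), and the eviction hook are placeholders:

#include "postgres.h"
#include "lib/ilist.h"

typedef struct NegEntry
{
    dlist_node  lru_node;       /* position in the negative-entry LRU */
    /* ... cache id, hash value, keys ... */
} NegEntry;

static dlist_head neg_lru = DLIST_STATIC_INIT(neg_lru);
static int  neg_lru_count = 0;
static int  neg_lru_cap = 1000;     /* would be a GUC */

/* on every hit of a negative entry: move it to the front */
static void
neg_entry_touch(NegEntry *e)
{
    dlist_move_head(&neg_lru, &e->lru_node);
}

/* on creation of a new negative entry */
static void
neg_entry_insert(NegEntry *e)
{
    dlist_push_head(&neg_lru, &e->lru_node);
    if (++neg_lru_count > neg_lru_cap)
    {
        /* evict the least recently used negative entry */
        dlist_node *tail = dlist_tail_node(&neg_lru);
        NegEntry   *victim = dlist_container(NegEntry, lru_node, tail);

        dlist_delete(&victim->lru_node);
        neg_lru_count--;
        /* remove_negative_entry(victim);  -- cache-specific teardown */
    }
}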
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Tom Lane
Date:
Jim Nasby <Jim.Nasby@bluetreble.com> writes:
> The other (possibly naive) question I have is how useful negative 
> entries really are? Will Postgres regularly incur negative lookups, or 
> will these only happen due to user activity?

It varies depending on the particular syscache, but in at least some
of them, negative cache entries are critical for performance.
See for example RelnameGetRelid(), which basically does a RELNAMENSP
cache lookup for each schema down the search path until it finds a
match.  For any user table name with the standard search_path, there's
a guaranteed failure in pg_catalog before you can hope to find a match.
If we don't have negative cache entries, then *every invocation of this
function has to go to disk* (or at least to shared buffers).
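
A simplified sketch of that lookup pattern (the real code is
RelnameGetRelid() in namespace.c); each get_relname_relid() call is a
RELNAMENSP syscache probe, and with the default search_path the
pg_catalog probe fails for every user table name, so without negative
entries each such miss would go back to the catalog:

#include "postgres.h"
#include "catalog/namespace.h"
#include "nodes/pg_list.h"

static Oid
relname_lookup_sketch(const char *relname)
{
    List       *path = fetch_search_path(false);
    ListCell   *lc;
    Oid         relid = InvalidOid;

    foreach(lc, path)
    {
        /* RELNAMENSP syscache probe; a miss leaves a negative entry */
        relid = get_relname_relid(relname, lfirst_oid(lc));
        if (OidIsValid(relid))
            break;
    }
    list_free(path);
    return relid;
}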

It's possible that we could revise all our lookup patterns to avoid this
sort of thing.  But I don't have much faith in that always being possible,
and exactly none that we won't introduce new lookup patterns that need it
in future.  I spent some time, for instance, wondering if RelnameGetRelid
could use a SearchSysCacheList lookup instead, doing the lookup on table
name only and then inspecting the whole list to see which entry is
frontmost according to the current search path.  But that has performance
failure modes of its own, for example if you have identical table names in
a boatload of different schemas.  We do it that way for some other cases
such as function lookups, but I think it's much less likely that people
have identical function names in N schemas than that they have identical
table names in N schemas.

If you want to poke into this for particular test scenarios, building with
CATCACHE_STATS defined will yield a bunch of numbers dumped to the
postmaster log at each backend exit.
        regards, tom lane



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Jim Nasby
Date:
On 1/21/17 8:54 PM, Tom Lane wrote:
> Jim Nasby <Jim.Nasby@bluetreble.com> writes:
>> The other (possibly naive) question I have is how useful negative
>> entries really are? Will Postgres regularly incur negative lookups, or
>> will these only happen due to user activity?
> It varies depending on the particular syscache, but in at least some
> of them, negative cache entries are critical for performance.
> See for example RelnameGetRelid(), which basically does a RELNAMENSP
> cache lookup for each schema down the search path until it finds a
> match.

Ahh, I hadn't considered that. So one idea would be to only track 
negative entries on caches where we know they're actually useful. That 
might make the performance hit of some of the other ideas more 
tolerable. Presumably you're much less likely to pollute the namespace 
cache than some of the others.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Jim Nasby
Date:
On 1/22/17 4:41 PM, Jim Nasby wrote:
> On 1/21/17 8:54 PM, Tom Lane wrote:
>> Jim Nasby <Jim.Nasby@bluetreble.com> writes:
>>> The other (possibly naive) question I have is how useful negative
>>> entries really are? Will Postgres regularly incur negative lookups, or
>>> will these only happen due to user activity?
>> It varies depending on the particular syscache, but in at least some
>> of them, negative cache entries are critical for performance.
>> See for example RelnameGetRelid(), which basically does a RELNAMENSP
>> cache lookup for each schema down the search path until it finds a
>> match.
>
> Ahh, I hadn't considered that. So one idea would be to only track
> negative entries on caches where we know they're actually useful. That
> might make the performance hit of some of the other ideas more
> tolerable. Presumably you're much less likely to pollute the namespace
> cache than some of the others.

Ok, after reading the code I see I only partly understood what you
were saying. In any case, it might still be useful to do some testing
with CATCACHE_STATS defined to see if there are caches that don't
accumulate a lot of negative entries.

Attached is a patch that tries to document some of this.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)


Attachments

Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Tom Lane
Date:
Jim Nasby <Jim.Nasby@bluetreble.com> writes:
>> Ahh, I hadn't considered that. So one idea would be to only track
>> negative entries on caches where we know they're actually useful. That
>> might make the performance hit of some of the other ideas more
>> tolerable. Presumably you're much less likely to pollute the namespace
>> cache than some of the others.

> Ok, after reading the code I see I only partly understood what you were 
> saying. In any case, it might still be useful to do some testing with 
> CATCACHE_STATS defined to see if there's caches that don't accumulate a 
> lot of negative entries.

There definitely are, according to my testing, but by the same token
it's not clear that a shutoff check would save anything.
        regards, tom lane



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Jim Nasby
Date:
On 1/22/17 5:03 PM, Tom Lane wrote:
>> Ok, after reading the code I see I only partly understood what you were
>> saying. In any case, it might still be useful to do some testing with
>> CATCACHE_STATS defined to see if there's caches that don't accumulate a
>> lot of negative entries.
> There definitely are, according to my testing, but by the same token
> it's not clear that a shutoff check would save anything.

Currently they wouldn't, but there are concerns about the performance of 
some of the other ideas in this thread. Getting rid of negative entries 
that don't really help could reduce some of those concerns. Or perhaps 
the original complaint about STATRELATTINH could be solved by just 
disabling negative entries on that cache.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Jim Nasby
Date:
On 1/21/17 6:42 PM, Jim Nasby wrote:
> On 12/26/16 2:31 AM, Kyotaro HORIGUCHI wrote:
>> The points of discussion are the following, I think.
>>
>> 1. The first patch seems to be working well. It costs the time to
>>    scan the whole of a catcache that has negative entries for other
>>    reloids. However, such negative entries are created by rather
>>    unusual usage: accessing undefined columns, and accessing columns
>>    on which no statistics have been created. The whole-catcache scan
>>    occurs on ATTNAME, ATTNUM and STATRELATTINH for every
>>    invalidation of a relcache entry.
>
> I took a look at this. It looks sane, though I've got a few minor
> comment tweaks:
>
> + *    Remove negative cache tuples maching a partial key.
> s/maching/matching/
>
> +/* searching with a paritial key needs scanning the whole cache */
>
> s/needs/means/
>
> + * a negative cache entry cannot be referenced so we can remove
>
> s/referenced/referenced,/
>
> I was wondering if there's a way to test the performance impact of
> deleting negative entries.

I did a make installcheck run with CATCACHE_STATS to see how often we 
get negative entries in the 3 caches affected by this patch. The caches 
on pg_attribute get almost no negative entries. pg_statistic gets a good 
amount of negative entries, presumably because we start off with no 
entries in there. On a stable system that presumably won't be an issue, 
but if temporary tables are in use and being analyzed I'd think there 
could be a moderate amount of inval traffic on that cache. I'll leave it 
to a committer to decide if they thing that's an issue, but you might 
want to try and quantify how big a hit that is. I think it'd also be 
useful to know how much bloat you were seeing in the field.

The patch is currently conflicting against master though, due to some 
caches being added. Can you rebase? BTW, if you set a slightly larger 
context size on the patch you might be able to avoid rebases; right now 
the patch doesn't include enough context to uniquely identify the chunks 
against cacheinfo[].
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Kyotaro HORIGUCHI
Date:
Hello, thank you for looking at this.

At Mon, 23 Jan 2017 16:54:36 -0600, Jim Nasby <Jim.Nasby@BlueTreble.com> wrote in
<21803f50-a823-c444-ee2b-9a153114f454@BlueTreble.com>
> On 1/21/17 6:42 PM, Jim Nasby wrote:
> > On 12/26/16 2:31 AM, Kyotaro HORIGUCHI wrote:
> >> The points of discussion are the following, I think.
> >>
> >> 1. The first patch seems to be working well. It costs the time to
> >>    scan the whole of a catcache that has negative entries for other
> >>    reloids. However, such negative entries are created by rather
> >>    unusual usage: accessing undefined columns, and accessing columns
> >>    on which no statistics have been created. The whole-catcache scan
> >>    occurs on ATTNAME, ATTNUM and STATRELATTINH for every
> >>    invalidation of a relcache entry.
> >
> > I took a look at this. It looks sane, though I've got a few minor
> > comment tweaks:
> >
> > + *    Remove negative cache tuples maching a partial key.
> > s/maching/matching/
> >
> > +/* searching with a paritial key needs scanning the whole cache */
> >
> > s/needs/means/
> >
> > + * a negative cache entry cannot be referenced so we can remove
> >
> > s/referenced/referenced,/
> >
> > I was wondering if there's a way to test the performance impact of
> > deleting negative entries.

Thanks for pointing these out. They are addressed.

> I did a make installcheck run with CATCACHE_STATS to see how often we
> get negative entries in the 3 caches affected by this patch. The
> caches on pg_attribute get almost no negative entries. pg_statistic
> gets a good amount of negative entries, presumably because we start
> off with no entries in there. On a stable system that presumably won't
> be an issue, but if temporary tables are in use and being analyzed I'd
> think there could be a moderate amount of inval traffic on that
> cache. I'll leave it to a committer to decide if they think that's an
> issue, but you might want to try and quantify how big a hit that is. I
> think it'd also be useful to know how much bloat you were seeing in
> the field.
> 
> The patch is currently conflicting against master though, due to some
> caches being added. Can you rebase?

Six new syscaches added in 665d1fa conflicted, and the 3-way merge
worked correctly. The new syscaches don't seem to be targets of
this patch.

> BTW, if you set a slightly larger
> context size on the patch you might be able to avoid rebases; right
> now the patch doesn't include enough context to uniquely identify the
> chunks against cacheinfo[].

git format-patch -U5 fuses all hunks on cacheinfo[] together. I'm
not sure that such a hunk can avoid rebases. Is this what you
suggested? -U4 added an identifiable forward context line for
some elements so the attached patch is made with four context
lines.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Kyotaro HORIGUCHI
Date:
Hello,

I have tried capping the number of negative entries myself (by
removing negative entries in least-recently-created-first order),
but the ceilings cannot be reasonably determined, either absolutely
or relative to the number of positive entries. Apparently it
differs widely among caches and applications.

At Mon, 23 Jan 2017 08:16:49 -0600, Jim Nasby <Jim.Nasby@BlueTreble.com> wrote in
<6519b7ad-0aa6-c9f4-8869-20691107fb69@BlueTreble.com>
> On 1/22/17 5:03 PM, Tom Lane wrote:
> >> Ok, after reading the code I see I only partly understood what you
> >> were
> >> saying. In any case, it might still be useful to do some testing with
> >> CATCACHE_STATS defined to see if there's caches that don't accumulate
> >> a
> >> lot of negative entries.
> > There definitely are, according to my testing, but by the same token
> > it's not clear that a shutoff check would save anything.
> 
> Currently they wouldn't, but there's concerns about the performance of
> some of the other ideas in this thread. Getting rid of negative
> entries that don't really help could reduce some of those concerns. Or
> perhaps the original complaint about STATRELATTINH could be solved by
> just disabling negative entries on that cache.

As for STATRELATTINH, planning involving small temporary tables
that are frequently accessed will benefit from negative entries,
but the benefit might be negligibly small. ATTNAME, ATTNUM and
RELNAMENSP also might not gain much from negative entries. If these
are true, the whole mechanism this patch adds could be replaced
with just a boolean in cachedesc that inhibits negative entries.
Anyway, this patch doesn't cover the case of cache bloat related to
function references. I'm not sure how that could be reproduced,
though.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Michael Paquier
Date:
On Tue, Jan 24, 2017 at 4:58 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Six new syscaches in 665d1fa was conflicted and 3-way merge
> worked correctly. The new syscaches don't seem to be targets of
> this patch.

To be honest, I am not completely sure what to think about this patch.
Moved to next CF as there is a new version, and no new reviews to make
the discussion perhaps move on.
-- 
Michael



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Kyotaro HORIGUCHI
Date:
Hello, thank you for moving this to the next CF.

At Wed, 1 Feb 2017 13:09:51 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqRFhUv+GX=eH1bo7xYHS79-gRj1ecu2QoQtHvX9RS=JYA@mail.gmail.com>
> On Tue, Jan 24, 2017 at 4:58 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Six new syscaches in 665d1fa was conflicted and 3-way merge
> > worked correctly. The new syscaches don't seem to be targets of
> > this patch.
> 
> To be honest, I am not completely sure what to think about this patch.
> Moved to next CF as there is a new version, and no new reviews to make
> the discussion perhaps move on.

I'm thinking the following is the status of this topic.

- The patch still applies without conflicts.

- This is not a holistic measure against the memory leak, but it surely saves some existing cases.

- A shared catcache is a separate discussion (and won't really be proposed in a short time due to the issues with locking).

- As I mentioned, a patch that caps the number of negative entries is available (working in a first-created, first-deleted manner), but it has a loose end: how to determine the limit.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
David Steele
Date:
On 2/1/17 1:25 AM, Kyotaro HORIGUCHI wrote:
> Hello, thank you for moving this to the next CF.
> 
> At Wed, 1 Feb 2017 13:09:51 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqRFhUv+GX=eH1bo7xYHS79-gRj1ecu2QoQtHvX9RS=JYA@mail.gmail.com>
>> On Tue, Jan 24, 2017 at 4:58 PM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>> Six new syscaches in 665d1fa was conflicted and 3-way merge
>>> worked correctly. The new syscaches don't seem to be targets of
>>> this patch.
>>
>> To be honest, I am not completely sure what to think about this patch.
>> Moved to next CF as there is a new version, and no new reviews to make
>> the discussion perhaps move on.
> 
> I'm thinking the following is the status of this topic.
> 
> - The patch still applies without conflicts.
> 
> - This is not a holistic measure against the memory leak, but it
>   surely saves some existing cases.
> 
> - A shared catcache is a separate discussion (and won't really be
>   proposed in a short time due to the issues with locking).
> 
> - As I mentioned, a patch that caps the number of negative entries
>   is available (working in a first-created, first-deleted manner),
>   but it has a loose end: how to determine the limit.

While preventing bloat in the syscache is a worthwhile goal, it appears
there are a number of loose ends here and a new patch has not been provided.

It's a pretty major change so I recommend moving this patch to the
2017-07 CF.

-- 
-David
david@pgmasters.net



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
David Steele
Date:
On 3/3/17 4:54 PM, David Steele wrote:

> On 2/1/17 1:25 AM, Kyotaro HORIGUCHI wrote:
>> Hello, thank you for moving this to the next CF.
>>
>> At Wed, 1 Feb 2017 13:09:51 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqRFhUv+GX=eH1bo7xYHS79-gRj1ecu2QoQtHvX9RS=JYA@mail.gmail.com>
>>> On Tue, Jan 24, 2017 at 4:58 PM, Kyotaro HORIGUCHI
>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>>> Six new syscaches in 665d1fa was conflicted and 3-way merge
>>>> worked correctly. The new syscaches don't seem to be targets of
>>>> this patch.
>>> To be honest, I am not completely sure what to think about this patch.
>>> Moved to next CF as there is a new version, and no new reviews to make
>>> the discussion perhaps move on.
>> I'm thinking the following is the status of this topic.
>>
>> - The patch still applies without conflicts.
>>
>> - This is not a holistic measure against the memory leak, but it
>>    surely saves some existing cases.
>>
>> - A shared catcache is a separate discussion (and won't really be
>>    proposed in a short time due to the issues with locking).
>>
>> - As I mentioned, a patch that caps the number of negative entries
>>    is available (working in a first-created, first-deleted manner),
>>    but it has a loose end: how to determine the limit.
> While preventing bloat in the syscache is a worthwhile goal, it appears
> there are a number of loose ends here and a new patch has not been provided.
>
> It's a pretty major change so I recommend moving this patch to the
> 2017-07 CF.

Not hearing any opinions pro or con, I'm moving this patch to the 
2017-07 CF.

-- 
-David
david@pgmasters.net




Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Kyotaro HORIGUCHI
Date:
At Tue, 7 Mar 2017 19:23:14 -0800, David Steele <david@pgmasters.net> wrote in
<3b7b7f90-db46-8c37-c4f7-443330c3ae33@pgmasters.net>
> On 3/3/17 4:54 PM, David Steele wrote:
> 
> > On 2/1/17 1:25 AM, Kyotaro HORIGUCHI wrote:
> >> Hello, thank you for moving this to the next CF.
> >>
> >> At Wed, 1 Feb 2017 13:09:51 +0900, Michael Paquier
> >> <michael.paquier@gmail.com> wrote in
> >> <CAB7nPqRFhUv+GX=eH1bo7xYHS79-gRj1ecu2QoQtHvX9RS=JYA@mail.gmail.com>
> >>> On Tue, Jan 24, 2017 at 4:58 PM, Kyotaro HORIGUCHI
> >>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >>>> Six new syscaches in 665d1fa was conflicted and 3-way merge
> >>>> worked correctly. The new syscaches don't seem to be targets of
> >>>> this patch.
> >>> To be honest, I am not completely sure what to think about this patch.
> >>> Moved to next CF as there is a new version, and no new reviews to make
> >>> the discussion perhaps move on.
> >> I'm thinking the following is the status of this topic.
> >>
> >> - The patch still applies without conflicts.
> >>
> >> - This is not a holistic measure against the memory leak, but it
> >>    surely saves some existing cases.
> >>
> >> - A shared catcache is a separate discussion (and won't really be
> >>    proposed in a short time due to the issues with locking).
> >>
> >> - As I mentioned, a patch that caps the number of negative entries
> >>    is available (working in a first-created, first-deleted manner),
> >>    but it has a loose end: how to determine the limit.
> > While preventing bloat in the syscache is a worthwhile goal, it
> > appears
> > there are a number of loose ends here and a new patch has not been
> > provided.
> >
> > It's a pretty major change so I recommend moving this patch to the
> > 2017-07 CF.
> 
> Not hearing any opinions pro or con, I'm moving this patch to the
> 2017-07 CF.

Ah. Yes, I agree on this. Thanks.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Peter Eisentraut
Date:
On 1/24/17 02:58, Kyotaro HORIGUCHI wrote:
>> BTW, if you set a slightly larger
>> context size on the patch you might be able to avoid rebases; right
>> now the patch doesn't include enough context to uniquely identify the
>> chunks against cacheinfo[].
> git format-patch -U5 fuses all hunks on cacheinfo[] together. I'm
> not sure that such a hunk can avoid rebases. Is this what you
> suggested? -U4 added an identifiable forward context line for
> some elements so the attached patch is made with four context
> lines.

This patch needs another rebase for the upcoming commit fest.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Kyotaro HORIGUCHI
Date:
Thank you for your attention.

At Mon, 14 Aug 2017 17:33:48 -0400, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote in
<09fa011f-4536-b05d-0625-11f3625d8332@2ndquadrant.com>
> On 1/24/17 02:58, Kyotaro HORIGUCHI wrote:
> >> BTW, if you set a slightly larger
> >> context size on the patch you might be able to avoid rebases; right
> >> now the patch doesn't include enough context to uniquely identify the
> >> chunks against cacheinfo[].
> > git format-patch -U5 fuses all hunks on cacheinfo[] together. I'm
> > not sure that such a hunk can avoid rebases. Is this what you
> > suggested? -U4 added an identifiable forward context line for
> > some elements so the attached patch is made with four context
> > lines.
> 
> This patch needs another rebase for the upcoming commit fest.

This patch has had interference from several commits since the
last submission. I have amended it to follow them (up to f97c55c),
removed an unnecessary branch, and edited some comments.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] Protect syscache from bloating with negative cache entries

From:
Robert Haas
Date:
On Mon, Aug 28, 2017 at 5:24 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> This patch has had interference from several commits since the
> last submission. I have amended it to follow them (up to f97c55c),
> removed an unnecessary branch, and edited some comments.

I think the core problem for this patch is that there's no consensus
on what approach to take.  Until that somehow gets sorted out, I think
this isn't going to make any progress.  Unfortunately, I don't have a
clear idea what sort of solution everybody could tolerate.

I still think that some kind of slow-expire behavior -- like a clock
hand that hits each backend every 10 minutes and expires entries not
used since the last hit -- is actually pretty sensible.  It ensures
that idle or long-running backends don't accumulate infinite bloat
while still allowing the cache to grow large enough for good
performance when all entries are being regularly used.  But Tom
doesn't like it.  Other approaches were also discussed; none of them
seem like an obvious slam-dunk.

Turning to the patch itself, I don't know how we decide whether the
patch is worth it.  Scanning the whole (potentially large) cache to
remove negative entries has a cost, mostly in CPU cycles; keeping
those negative entries around for a long time also has a cost, mostly
in memory.  I don't know how to decide whether these patches will help
more people than they hurt, or the other way around -- and it's not
clear that anyone else has a good idea about that either.

Typos: funciton, paritial.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From
Thomas Munro
Date:
On Mon, Aug 28, 2017 at 9:24 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> This patch has had interference from several commits since the last
> submission. I amended the patch to follow them (up to f97c55c),
> removed an unnecessary branch, and edited some comments.

Hi Kyotaro-san,

This applies but several regression tests fail for me.  Here is a
sample backtrace:
   frame #3: 0x000000010f0614c0 postgres`ExceptionalCondition(conditionName="!(attnum < 0 ? attnum == (-2) : cache->cc_tupdesc->attrs[attnum].atttypid == 26)", errorType="FailedAssertion", fileName="catcache.c", lineNumber=1384) + 128 at assert.c:54
   frame #4: 0x000000010f03b5fd postgres`CollectOIDsForHashValue(cache=0x00007fe273821268, hashValue=994410284, attnum=0) + 253 at catcache.c:1383
   frame #5: 0x000000010f055e8e postgres`SysCacheSysCacheInvalCallback(arg=140610577303984, cacheid=0, hashValue=994410284) + 94 at syscache.c:1692
   frame #6: 0x000000010f03fbbb postgres`CallSyscacheCallbacks(cacheid=0, hashvalue=994410284) + 219 at inval.c:1468
   frame #7: 0x000000010f03f878 postgres`LocalExecuteInvalidationMessage(msg=0x00007fff51213ff8) + 88 at inval.c:566
   frame #8: 0x000000010ee7a3f2 postgres`ReceiveSharedInvalidMessages(invalFunction=(postgres`LocalExecuteInvalidationMessage at inval.c:555), resetFunction=(postgres`InvalidateSystemCaches at inval.c:647)) + 354 at sinval.c:121
   frame #9: 0x000000010f03fcb7 postgres`AcceptInvalidationMessages + 23 at inval.c:686
   frame #10: 0x000000010eade609 postgres`AtStart_Cache + 9 at xact.c:987
   frame #11: 0x000000010ead8c2f postgres`StartTransaction + 655 at xact.c:1921
   frame #12: 0x000000010ead8896 postgres`StartTransactionCommand + 70 at xact.c:2691
   frame #13: 0x000000010eea9746 postgres`start_xact_command + 22 at postgres.c:2438
   frame #14: 0x000000010eea722e postgres`exec_simple_query(query_string="RESET SESSION AUTHORIZATION;") + 126 at postgres.c:913
   frame #15: 0x000000010eea68d7 postgres`PostgresMain(argc=1, argv=0x00007fe2738036a8, dbname="regression", username="munro") + 2375 at postgres.c:4090
   frame #16: 0x000000010eded40e postgres`BackendRun(port=0x00007fe2716001a0) + 654 at postmaster.c:4357
   frame #17: 0x000000010edec793 postgres`BackendStartup(port=0x00007fe2716001a0) + 483 at postmaster.c:4029
   frame #18: 0x000000010edeb785 postgres`ServerLoop + 597 at postmaster.c:1753
   frame #19: 0x000000010ede8f71 postgres`PostmasterMain(argc=8, argv=0x00007fe271403860) + 5553 at postmaster.c:1361
   frame #20: 0x000000010ed0ccd9 postgres`main(argc=8, argv=0x00007fe271403860) + 761 at main.c:228
   frame #21: 0x00007fff8333a5ad libdyld.dylib`start + 1

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Protect syscache from bloating with negative cache entries

From
Kyotaro HORIGUCHI
Date:
Thank you for reviewing this.

At Sat, 2 Sep 2017 12:12:47 +1200, Thomas Munro <thomas.munro@enterprisedb.com> wrote in
<CAEepm=3wqPFFSKP_yhkuHLZtOOwZskGuHJdSctVnbHQ4DFEH+Q@mail.gmail.com>
> On Mon, Aug 28, 2017 at 9:24 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > This patch has had interference from several commits since the last
> > submission. I amended the patch to follow them (up to f97c55c),
> > removed an unnecessary branch, and edited some comments.
> 
> Hi Kyotaro-san,
> 
> This applies but several regression tests fail for me.  Here is a
> sample backtrace:

Sorry for the silly mistake. STATEXTNAMENSP and STATRELATTINH were
missing the additional elements in their definitions; somehow I had
removed them.

The attached patch no longer crashes in the regression tests. It also
fixes some typos pointed out by Robert and a few I found myself.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] Protect syscache from bloating with negative cache entries

From
Kyotaro HORIGUCHI
Date:
Thank you for the comment.

At Mon, 28 Aug 2017 21:31:58 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoZjn28uYJRQ2K+5idhYxWBDER68sctoc2p_nW7h7JbhYw@mail.gmail.com>
> On Mon, Aug 28, 2017 at 5:24 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > This patch has had interference from several commits since the last
> > submission. I amended the patch to follow them (up to f97c55c),
> > removed an unnecessary branch, and edited some comments.
> 
> I think the core problem for this patch is that there's no consensus
> on what approach to take.  Until that somehow gets sorted out, I think
> this isn't going to make any progress.  Unfortunately, I don't have a
> clear idea what sort of solution everybody could tolerate.
> 
> I still think that some kind of slow-expire behavior -- like a clock
> hand that hits each backend every 10 minutes and expires entries not
> used since the last hit -- is actually pretty sensible.  It ensures
> that idle or long-running backends don't accumulate infinite bloat
> while still allowing the cache to grow large enough for good
> performance when all entries are being regularly used.  But Tom
> doesn't like it.  Other approaches were also discussed; none of them
> seem like an obvious slam-dunk.

I suppose that it would slow intermittent lookups of non-existent
objects. I tried a slightly different approach: removing entries by
'age', while preserving a specified number (or a ratio relative to
live entries) of the youngest negative entries. The problem with that
approach was that I couldn't find a good way to determine the number
of entries to preserve, and I didn't want to offer additional knobs
for it. In the end I proposed the patch upthread, since it doesn't
need any assumption about usage.

Though I could make another patch that does the same thing based on
LRU, the same how-many-to-preserve problem would have to be resolved
in order to avoid slowing the intermittent lookups.
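
For illustration, a rough sketch of that cap-by-count idea follows; the
per-cache list cc_neglist (ordered by entry age, oldest at the tail),
the dlist link neg_lru_elem and the knob neg_cache_limit are assumptions
for illustration and not part of the posted patches, while cc_nnegtup is
the counter the first patch adds. Like the patch itself, such code would
sit in catcache.c:

/* Hypothetical: drop the oldest negative entries once a cap is exceeded. */
static void
PruneOldNegEntries(CatCache *cache, int neg_cache_limit)
{
    while (cache->cc_nnegtup > neg_cache_limit)
    {
        /* assumed invariant: only unpinned negative entries are on the list */
        CatCTup    *ct = dlist_container(CatCTup, neg_lru_elem,
                                         dlist_tail_node(&cache->cc_neglist));

        CatCacheRemoveCTup(cache, ct);  /* also decrements cc_nnegtup */
    }
}

The open question stays the same: there is no obvious way to choose
neg_cache_limit without adding yet another knob.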

> Turning to the patch itself, I don't know how we decide whether the
> patch is worth it.  Scanning the whole (potentially large) cache to
> remove negative entries has a cost, mostly in CPU cycles; keeping
> those negative entries around for a long time also has a cost, mostly
> in memory.  I don't know how to decide whether these patches will help
> more people than they hurt, or the other way around -- and it's not
> clear that anyone else has a good idea about that either.

Scanning a hash on invalidation of several catalogs slows (hopefully
only slightly) a certain percentage of invalidations in probably most
workloads. Holding entries that are never looked up again, on the
other hand, will surely kill a backend sooner or later under certain
workloads. This doesn't cover the pg_proc case, but it does cover the
pg_statistic and pg_class cases. I'm not sure which other catalogs
can bloat.

I could reduce the complexity of this. The inval mechanism conveys
only a hash value, so this scans the whole cache for the target OIDs
(with possible spurious targets). That could be resolved by letting
the inval mechanism convey an OID (but this may need additional
members in an inval entry).

Still, the full scan performed in CleanupCatCacheNegEntries doesn't
seem easily avoidable. Partitioning the hash by the OID of the key,
or providing a special dlist that points to the tuples in the
buckets, would introduce complexity of its own.
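
To make that trade-off concrete, here is a sketch of the special-dlist
alternative; the neg_by_reloid hash table, the NegEntryGroup struct and
the neg_group_elem link are hypothetical and not in the posted patches:

#include "utils/hsearch.h"

/* Hypothetical per-relation bucket of negative entries */
typedef struct NegEntryGroup
{
    Oid         reloid;         /* hash key */
    dlist_head  entries;        /* negative entries keyed by this relation */
} NegEntryGroup;

/* Drop all negative entries for one relation without scanning every bucket. */
static void
DropNegEntriesForRel(CatCache *cache, Oid reloid)
{
    bool            found;
    NegEntryGroup  *group;

    group = (NegEntryGroup *) hash_search(cache->neg_by_reloid, &reloid,
                                          HASH_FIND, &found);
    if (!found)
        return;

    while (!dlist_is_empty(&group->entries))
    {
        CatCTup    *ct = dlist_container(CatCTup, neg_group_elem,
                                         dlist_pop_head_node(&group->entries));

        CatCacheRemoveCTup(cache, ct);
    }

    (void) hash_search(cache->neg_by_reloid, &reloid, HASH_REMOVE, NULL);
}

The bookkeeping needed to keep such a structure in sync on every
insertion and removal is exactly the additional complexity mentioned
above.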


> Typos: funciton, paritial.

Thanks. ispell told me about additional typos: corresnpond, belive
and undistinguisable.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] Protect syscache from bloating with negative cache entries

From
Kyotaro HORIGUCHI
Date:
This is a rebased version of the patch.

At Fri, 17 Mar 2017 14:23:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170317.142313.232290068.horiguchi.kyotaro@lab.ntt.co.jp>
> At Tue, 7 Mar 2017 19:23:14 -0800, David Steele <david@pgmasters.net> wrote in
<3b7b7f90-db46-8c37-c4f7-443330c3ae33@pgmasters.net>
> > On 3/3/17 4:54 PM, David Steele wrote:
> > 
> > > On 2/1/17 1:25 AM, Kyotaro HORIGUCHI wrote:
> > >> Hello, thank you for moving this to the next CF.
> > >>
> > >> At Wed, 1 Feb 2017 13:09:51 +0900, Michael Paquier
> > >> <michael.paquier@gmail.com> wrote in
> > >> <CAB7nPqRFhUv+GX=eH1bo7xYHS79-gRj1ecu2QoQtHvX9RS=JYA@mail.gmail.com>
> > >>> On Tue, Jan 24, 2017 at 4:58 PM, Kyotaro HORIGUCHI
> > >>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > >>>> Six new syscaches in 665d1fa was conflicted and 3-way merge
> > >>>> worked correctly. The new syscaches don't seem to be targets of
> > >>>> this patch.
> > >>> To be honest, I am not completely sure what to think about this patch.
> > >>> Moved to next CF as there is a new version, and no new reviews to make
> > >>> the discussion perhaps move on.
> > >> I'm thinking the following is the status of this topic.
> > >>
> > >> - The patch still does not have conflicts.
> > >>
> > >> - This is not a holistic measure for memory leaks but surely
> > >>    saves some existing cases.
> > >>
> > >> - Shared catcache is another discussion (and won't really
> > >>    proposed in a short time due to the issue on locking.)
> > >>
> > >> - As I mentioned, a patch that caps the number of negative
> > >>    entries is available (in first-created, first-deleted manner),
> > >>    but it has the loose end of how to determine the limit.
> > > While preventing bloat in the syscache is a worthwhile goal, it
> > > appears
> > > there are a number of loose ends here and a new patch has not been
> > > provided.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 9f2c81dbc9bc344cafd6995dfc5969d55a8457d9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 28 Aug 2017 11:36:21 +0900
Subject: [PATCH 1/2] Cleanup negative cache of pg_statistic when dropping a relation.

Accessing columns that don't have statistics leaves negative entries
in the catcache for pg_statistic, but there is no chance to remove
them. In particular, repeatedly creating and then dropping temporary
tables bloats the catcache so much that memory pressure becomes
significant. This patch removes negative entries in STATRELATTINH,
ATTNAME and ATTNUM when the corresponding relation is dropped.
---
 src/backend/utils/cache/catcache.c |  58 ++++++-
 src/backend/utils/cache/syscache.c | 302 +++++++++++++++++++++++++++----------
 src/include/utils/catcache.h       |   3 +
 src/include/utils/syscache.h       |   2 +
 4 files changed, 282 insertions(+), 83 deletions(-)
 

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 95a0742..bd303f3 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -423,10 +423,11 @@ CatCachePrintStats(int code, Datum arg)
 		if (cache->cc_ntup == 0 && cache->cc_searches == 0)
 			continue;			/* don't print unused caches */
 
-		elog(DEBUG2, "catcache %s/%u: %d tup, %ld srch, %ld+%ld=%ld hits, %ld+%ld=%ld loads, %ld invals, %ld lsrch, %ld lhits",
+		elog(DEBUG2, "catcache %s/%u: %d tup, %d negtup, %ld srch, %ld+%ld=%ld hits, %ld+%ld=%ld loads, %ld invals, %ld lsrch, %ld lhits",
 			 cache->cc_relname,
 			 cache->cc_indexoid,
 			 cache->cc_ntup,
+			 cache->cc_nnegtup,
 			 cache->cc_searches,
 			 cache->cc_hits,
 			 cache->cc_neg_hits,
@@ -495,8 +496,11 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
 	 * point into tuple, allocated together with the CatCTup.
 	 */
 	if (ct->negative)
+	{
 		CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys,
 						 cache->cc_keyno, ct->keys);
+		--cache->cc_nnegtup;
+	}
 
 	pfree(ct);
 
@@ -697,6 +701,51 @@ ResetCatalogCache(CatCache *cache)
 }
 
 /*
+ *        CleanupCatCacheNegEntries
+ *
+ *    Remove negative cache tuples matching a partial key.
+ *
+ */
+void
+CleanupCatCacheNegEntries(CatCache *cache, ScanKeyData *skey)
+{
+    int i;
+
+    /* If this cache has no negative entries, nothing to do */
+    if (cache->cc_nnegtup == 0)
+        return;
+
+    /* searching with a partial key means scanning the whole cache */
+    for (i = 0; i < cache->cc_nbuckets; i++)
+    {
+        dlist_head *bucket = &cache->cc_bucket[i];
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, bucket)
+        {
+            const CCFastEqualFN *cc_fastequal = cache->cc_fastequal;
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            int            oid_attnum = skey->sk_attno - 1;
+
+            if (!ct->negative)
+                continue;
+
+            /* Compare the OIDs */
+            if (!(cc_fastequal[oid_attnum])(ct->keys[oid_attnum],
+                                            skey[0].sk_argument))
+                continue;
+
+            /*
+             * the negative cache entries can no longer be referenced, so we
+             * can remove it unconditionally
+             */
+            CatCacheRemoveCTup(cache, ct);
+        }
+    }
+}
+
+
+/*
  *		ResetCatalogCaches
  *
  * Reset all caches when a shared cache inval event forces it
@@ -845,6 +894,7 @@ InitCatCache(int id,
 	cp->cc_relisshared = false; /* temporary */
 	cp->cc_tupdesc = (TupleDesc) NULL;
 	cp->cc_ntup = 0;
+	cp->cc_nnegtup = 0;
 	cp->cc_nbuckets = nbuckets;
 	cp->cc_nkeys = nkeys;
 	for (i = 0; i < nkeys; ++i)
@@ -1420,8 +1470,8 @@ SearchCatCacheMiss(CatCache *cache,
 		CACHE4_elog(DEBUG2, "SearchCatCache(%s): Contains %d/%d tuples",
 					cache->cc_relname, cache->cc_ntup, CacheHdr->ch_ntup);
 
-		CACHE3_elog(DEBUG2, "SearchCatCache(%s): put neg entry in bucket %d",
-					cache->cc_relname, hashIndex);
+		CACHE4_elog(DEBUG2, "SearchCatCache(%s): put neg entry in bucket %d, total %d",
+					cache->cc_relname, hashIndex, cache->cc_nnegtup);
 
 		/*
 		 * We are not returning the negative entry to the caller, so leave its
@@ -1906,6 +1956,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
 	cache->cc_ntup++;
 	CacheHdr->ch_ntup++;
+	if (negative)
+		cache->cc_nnegtup++;
 
 	/*
 	 * If the hash table has become too full, enlarge the buckets array. Quite
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index 888edbb..753c5f1 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -75,6 +75,8 @@
 #include "catalog/pg_user_mapping.h"
 #include "utils/rel.h"
 #include "utils/catcache.h"
+#include "utils/fmgroids.h"
+#include "utils/inval.h"
 #include "utils/syscache.h"
 
@@ -118,6 +120,10 @@ struct cachedesc
 	int			nkeys;			/* # of keys needed for cache lookup */
 	int			key[4];			/* attribute numbers of key attrs */
 	int			nbuckets;		/* number of hash buckets for this cache */
+
+	/* relcache invalidation stuff */
+	AttrNumber	relattrnum;		/* attrnum to retrieve reloid for
+								 * invalidation, 0 if not needed */
 };
 
 static const struct cachedesc cacheinfo[] = {
-        16
+        16,
+        0    },    {AccessMethodRelationId,    /* AMNAME */        AmNameIndexId,
@@ -141,7 +148,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        4
+        4,
+        0    },    {AccessMethodRelationId,    /* AMOID */        AmOidIndexId,
@@ -152,7 +160,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        4
+        4,
+        0    },    {AccessMethodOperatorRelationId,    /* AMOPOPID */        AccessMethodOperatorIndexId,
@@ -163,7 +172,8 @@ static const struct cachedesc cacheinfo[] = {            Anum_pg_amop_amopfamily,            0
 },
 
-        64
+        64,
+        0    },    {AccessMethodOperatorRelationId,    /* AMOPSTRATEGY */        AccessMethodStrategyIndexId,
@@ -174,7 +184,8 @@ static const struct cachedesc cacheinfo[] = {            Anum_pg_amop_amoprighttype,
Anum_pg_amop_amopstrategy       },
 
-        64
+        64,
+        0    },    {AccessMethodProcedureRelationId,    /* AMPROCNUM */        AccessMethodProcedureIndexId,
@@ -185,7 +196,8 @@ static const struct cachedesc cacheinfo[] = {            Anum_pg_amproc_amprocrighttype,
Anum_pg_amproc_amprocnum       },
 
-        16
+        16,
+        0    },    {AttributeRelationId,        /* ATTNAME */        AttributeRelidNameIndexId,
@@ -196,7 +208,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        32
+        32,
+        Anum_pg_attribute_attrelid    },    {AttributeRelationId,        /* ATTNUM */
AttributeRelidNumIndexId,
@@ -207,7 +220,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        128
+        128,
+        Anum_pg_attribute_attrelid    },    {AuthMemRelationId,            /* AUTHMEMMEMROLE */
AuthMemMemRoleIndexId,
@@ -218,7 +232,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        8
+        8,
+        0    },    {AuthMemRelationId,            /* AUTHMEMROLEMEM */        AuthMemRoleMemIndexId,
@@ -229,7 +244,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        8
+        8,
+        0    },    {AuthIdRelationId,            /* AUTHNAME */        AuthIdRolnameIndexId,
@@ -240,7 +256,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        8
+        8,
+        0    },    {AuthIdRelationId,            /* AUTHOID */        AuthIdOidIndexId,
@@ -251,10 +268,10 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        8
+        8,
+        0    },
-    {
-        CastRelationId,            /* CASTSOURCETARGET */
+    {CastRelationId,            /* CASTSOURCETARGET */        CastSourceTargetIndexId,        2,        {
@@ -263,7 +280,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        256
+        256,
+        0    },    {OperatorClassRelationId,    /* CLAAMNAMENSP */        OpclassAmNameNspIndexId,
@@ -274,7 +292,8 @@ static const struct cachedesc cacheinfo[] = {            Anum_pg_opclass_opcnamespace,            0
      },
 
-        8
+        8,
+        0    },    {OperatorClassRelationId,    /* CLAOID */        OpclassOidIndexId,
@@ -285,7 +304,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        8
+        8,
+        0    },    {CollationRelationId,        /* COLLNAMEENCNSP */        CollationNameEncNspIndexId,
@@ -296,7 +316,8 @@ static const struct cachedesc cacheinfo[] = {            Anum_pg_collation_collnamespace,
0        },
 
-        8
+        8,
+        0    },    {CollationRelationId,        /* COLLOID */        CollationOidIndexId,
@@ -307,7 +328,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        8
+        8,
+        0    },    {ConversionRelationId,        /* CONDEFAULT */        ConversionDefaultIndexId,
@@ -318,7 +340,8 @@ static const struct cachedesc cacheinfo[] = {            Anum_pg_conversion_contoencoding,
 ObjectIdAttributeNumber,        },
 
-        8
+        8,
+        0    },    {ConversionRelationId,        /* CONNAMENSP */        ConversionNameNspIndexId,
@@ -329,7 +352,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        8
+        8,
+        0    },    {ConstraintRelationId,        /* CONSTROID */        ConstraintOidIndexId,
@@ -340,7 +364,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        16
+        16,
+        0    },    {ConversionRelationId,        /* CONVOID */        ConversionOidIndexId,
@@ -351,7 +376,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        8
+        8,
+        0    },    {DatabaseRelationId,        /* DATABASEOID */        DatabaseOidIndexId,
@@ -362,7 +388,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        4
+        4,
+        0    },    {DefaultAclRelationId,        /* DEFACLROLENSPOBJ */        DefaultAclRoleNspObjIndexId,
@@ -373,7 +400,8 @@ static const struct cachedesc cacheinfo[] = {            Anum_pg_default_acl_defaclobjtype,
  0        },
 
-        8
+        8,
+        0    },    {EnumRelationId,            /* ENUMOID */        EnumOidIndexId,
@@ -384,7 +412,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        8
+        8,
+        0    },    {EnumRelationId,            /* ENUMTYPOIDNAME */        EnumTypIdLabelIndexId,
@@ -395,7 +424,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        8
+        8,
+        0    },    {EventTriggerRelationId,    /* EVENTTRIGGERNAME */        EventTriggerNameIndexId,
@@ -406,7 +436,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        8
+        8,
+        0    },    {EventTriggerRelationId,    /* EVENTTRIGGEROID */        EventTriggerOidIndexId,
@@ -417,7 +448,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        8
+        8,
+        0    },    {ForeignDataWrapperRelationId,    /* FOREIGNDATAWRAPPERNAME */
ForeignDataWrapperNameIndexId,
@@ -428,7 +460,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        2
+        2,
+        0    },    {ForeignDataWrapperRelationId,    /* FOREIGNDATAWRAPPEROID */        ForeignDataWrapperOidIndexId,
@@ -439,7 +472,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        2
+        2,
+        0    },    {ForeignServerRelationId,    /* FOREIGNSERVERNAME */        ForeignServerNameIndexId,
@@ -450,7 +484,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        2
+        2,
+        0    },    {ForeignServerRelationId,    /* FOREIGNSERVEROID */        ForeignServerOidIndexId,
@@ -461,7 +496,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        2
+        2,
+        0    },    {ForeignTableRelationId,    /* FOREIGNTABLEREL */        ForeignTableRelidIndexId,
@@ -472,7 +508,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        4
+        4,
+        0    },    {IndexRelationId,            /* INDEXRELID */        IndexRelidIndexId,
@@ -483,7 +520,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        64
+        64,
+        0    },    {LanguageRelationId,        /* LANGNAME */        LanguageNameIndexId,
@@ -494,7 +532,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        4
+        4,
+        0    },    {LanguageRelationId,        /* LANGOID */        LanguageOidIndexId,
@@ -505,7 +544,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        4
+        4,
+        0    },    {NamespaceRelationId,        /* NAMESPACENAME */        NamespaceNameIndexId,
@@ -516,7 +556,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        4
+        4,
+        0    },    {NamespaceRelationId,        /* NAMESPACEOID */        NamespaceOidIndexId,
@@ -527,7 +568,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        16
+        16,
+        0    },    {OperatorRelationId,        /* OPERNAMENSP */        OperatorNameNspIndexId,
@@ -538,7 +580,8 @@ static const struct cachedesc cacheinfo[] = {            Anum_pg_operator_oprright,
Anum_pg_operator_oprnamespace       },
 
-        256
+        256,
+        0    },    {OperatorRelationId,        /* OPEROID */        OperatorOidIndexId,
@@ -549,7 +592,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        32
+        32,
+        0    },    {OperatorFamilyRelationId,    /* OPFAMILYAMNAMENSP */        OpfamilyAmNameNspIndexId,
@@ -560,7 +604,8 @@ static const struct cachedesc cacheinfo[] = {            Anum_pg_opfamily_opfnamespace,
0       },
 
-        8
+        8,
+        0    },    {OperatorFamilyRelationId,    /* OPFAMILYOID */        OpfamilyOidIndexId,
@@ -571,7 +616,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        8
+        8,
+        0    },    {PartitionedRelationId,        /* PARTRELID */        PartitionedRelidIndexId,
@@ -582,7 +628,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        32
+        32,
+        0    },    {ProcedureRelationId,        /* PROCNAMEARGSNSP */        ProcedureNameArgsNspIndexId,
@@ -593,7 +640,8 @@ static const struct cachedesc cacheinfo[] = {            Anum_pg_proc_pronamespace,            0
   },
 
-        128
+        128,
+        0    },    {ProcedureRelationId,        /* PROCOID */        ProcedureOidIndexId,
@@ -604,7 +652,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        128
+        128,
+        0    },    {PublicationRelationId,        /* PUBLICATIONNAME */        PublicationNameIndexId,
@@ -615,7 +664,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        8
+        8,
+        0    },    {PublicationRelationId,        /* PUBLICATIONOID */        PublicationObjectIndexId,
@@ -626,7 +676,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        8
+        8,
+        0    },    {PublicationRelRelationId,    /* PUBLICATIONREL */        PublicationRelObjectIndexId,
@@ -637,7 +688,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        64
+        64,
+        0    },    {PublicationRelRelationId,    /* PUBLICATIONRELMAP */        PublicationRelPrrelidPrpubidIndexId,
@@ -648,7 +700,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        64
+        64,
+        0    },    {RangeRelationId,            /* RANGETYPE */        RangeTypidIndexId,
@@ -659,7 +712,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        4
+        4,
+        0    },    {RelationRelationId,        /* RELNAMENSP */        ClassNameNspIndexId,
@@ -670,7 +724,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        128
+        128,
+        0    },    {RelationRelationId,        /* RELOID */        ClassOidIndexId,
@@ -681,7 +736,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        128
+        128,
+        0    },    {ReplicationOriginRelationId,    /* REPLORIGIDENT */        ReplicationOriginIdentIndex,
@@ -692,7 +748,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        16
+        16,
+        0    },    {ReplicationOriginRelationId,    /* REPLORIGNAME */        ReplicationOriginNameIndex,
@@ -703,7 +760,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        16
+        16,
+        0    },    {RewriteRelationId,            /* RULERELNAME */        RewriteRelRulenameIndexId,
@@ -714,7 +772,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        8
+        8,
+        0    },    {SequenceRelationId,        /* SEQRELID */        SequenceRelidIndexId,
@@ -725,7 +784,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        32
+        32,
+        0    },    {StatisticExtRelationId,    /* STATEXTNAMENSP */        StatisticExtNameIndexId,
@@ -736,7 +796,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        4
+        4,
+        0    },    {StatisticExtRelationId,    /* STATEXTOID */        StatisticExtOidIndexId,
@@ -747,7 +808,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        4
+        4,
+        0    },    {StatisticRelationId,        /* STATRELATTINH */        StatisticRelidAttnumInhIndexId,
@@ -758,7 +820,8 @@ static const struct cachedesc cacheinfo[] = {            Anum_pg_statistic_stainherit,            0
      },
 
-        128
+        128,
+        Anum_pg_statistic_starelid    },    {SubscriptionRelationId,    /* SUBSCRIPTIONNAME */
SubscriptionNameIndexId,
@@ -769,7 +832,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        4
+        4,
+        0    },    {SubscriptionRelationId,    /* SUBSCRIPTIONOID */        SubscriptionObjectIndexId,
@@ -780,7 +844,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        4
+        4,
+        0    },    {SubscriptionRelRelationId, /* SUBSCRIPTIONRELMAP */        SubscriptionRelSrrelidSrsubidIndexId,
@@ -791,7 +856,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        64
+        64,
+        0    },    {TableSpaceRelationId,        /* TABLESPACEOID */        TablespaceOidIndexId,
@@ -802,7 +868,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0,        },
-        4
+        4,
+        0    },    {TransformRelationId,        /* TRFOID */        TransformOidIndexId,
@@ -813,7 +880,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0,        },
-        16
+        16,
+        0    },    {TransformRelationId,        /* TRFTYPELANG */        TransformTypeLangIndexId,
@@ -824,7 +892,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0,        },
-        16
+        16,
+        0    },    {TSConfigMapRelationId,        /* TSCONFIGMAP */        TSConfigMapIndexId,
@@ -835,7 +904,8 @@ static const struct cachedesc cacheinfo[] = {            Anum_pg_ts_config_map_mapseqno,
0       },
 
-        2
+        2,
+        0    },    {TSConfigRelationId,        /* TSCONFIGNAMENSP */        TSConfigNameNspIndexId,
@@ -846,7 +916,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        2
+        2,
+        0    },    {TSConfigRelationId,        /* TSCONFIGOID */        TSConfigOidIndexId,
@@ -857,7 +928,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        2
+        2,
+        0    },    {TSDictionaryRelationId,    /* TSDICTNAMENSP */        TSDictionaryNameNspIndexId,
@@ -868,7 +940,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        2
+        2,
+        0    },    {TSDictionaryRelationId,    /* TSDICTOID */        TSDictionaryOidIndexId,
@@ -879,7 +952,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        2
+        2,
+        0    },    {TSParserRelationId,        /* TSPARSERNAMENSP */        TSParserNameNspIndexId,
@@ -890,7 +964,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        2
+        2,
+        0    },    {TSParserRelationId,        /* TSPARSEROID */        TSParserOidIndexId,
@@ -901,7 +976,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        2
+        2,
+        0    },    {TSTemplateRelationId,        /* TSTEMPLATENAMENSP */        TSTemplateNameNspIndexId,
@@ -912,7 +988,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        2
+        2,
+        0    },    {TSTemplateRelationId,        /* TSTEMPLATEOID */        TSTemplateOidIndexId,
@@ -923,7 +1000,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        2
+        2,
+        0    },    {TypeRelationId,            /* TYPENAMENSP */        TypeNameNspIndexId,
@@ -934,7 +1012,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        64
+        64,
+        0    },    {TypeRelationId,            /* TYPEOID */        TypeOidIndexId,
@@ -945,7 +1024,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        64
+        64,
+        0    },    {UserMappingRelationId,        /* USERMAPPINGOID */        UserMappingOidIndexId,
@@ -956,7 +1036,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        2
+        2,
+        0    },    {UserMappingRelationId,        /* USERMAPPINGUSERSERVER */        UserMappingUserServerIndexId,
@@ -967,7 +1048,8 @@ static const struct cachedesc cacheinfo[] = {            0,            0        },
-        2
+        2,
+        0    }};
@@ -983,8 +1065,23 @@ static int	SysCacheRelationOidSize;
 static Oid	SysCacheSupportingRelOid[SysCacheSize * 2];
 static int	SysCacheSupportingRelOidSize;
 
-static int    oid_compare(const void *a, const void *b);
+/*
+ * stuff for negative cache flushing by relcache invalidation
+ */
+#define MAX_RELINVAL_CALLBACKS 4
+typedef struct RELINVALCBParam
+{
+    CatCache *cache;
+    int          relkeynum;
+}  RELINVALCBParam;
+
+RELINVALCBParam relinval_callback_list[MAX_RELINVAL_CALLBACKS];
+static int relinval_callback_count = 0;
+
+static ScanKeyData    oideqscankey; /* ScanKey for reloid match  */
+static int    oid_compare(const void *a, const void *b);
+static void SysCacheRelInvalCallback(Datum arg, Oid reloid);
 
 /*
  * InitCatalogCache - initialize the caches
@@ -1028,6 +1125,21 @@ InitCatalogCache(void)
 			cacheinfo[cacheId].indoid;
 		/* see comments for RelationInvalidatesSnapshotsOnly */
 		Assert(!RelationInvalidatesSnapshotsOnly(cacheinfo[cacheId].reloid));
 
+
+        /*
+         * If this syscache is requesting relcache invalidation, register a
+         * callback
+         */
+        if (cacheinfo[cacheId].relattrnum > 0)
+        {
+            Assert(relinval_callback_count < MAX_RELINVAL_CALLBACKS);
+
+            relinval_callback_list[relinval_callback_count].cache  =
+                SysCache[cacheId];
+            relinval_callback_list[relinval_callback_count].relkeynum =
+                cacheinfo[cacheId].relattrnum;
+            relinval_callback_count++;
+		}
 	}
 
 	Assert(SysCacheRelationOidSize <= lengthof(SysCacheRelationOid));
@@ -1052,10 +1164,40 @@ InitCatalogCache(void)
 	}
 	SysCacheSupportingRelOidSize = j + 1;
 
+    /*
+     * prepare the scankey for reloid comparison and register a relcache inval
+     * callback.
+     */
+    oideqscankey.sk_strategy = BTEqualStrategyNumber;
+    oideqscankey.sk_subtype = InvalidOid;
+    oideqscankey.sk_collation = InvalidOid;
+    fmgr_info_cxt(F_OIDEQ, &oideqscankey.sk_func, CacheMemoryContext);
+    CacheRegisterRelcacheCallback(SysCacheRelInvalCallback, (Datum) 0);
+
 	CacheInitialized = true;
 }
 
 /*
+ * Callback function for negative cache flushing by relcache invalidation
+ * scankey for this function has been prepared in InitCatalogCache.
+ */
+static void
+SysCacheRelInvalCallback(Datum arg, Oid reloid)
+{
+    int i;
+
+    for(i = 0 ; i < relinval_callback_count ; i++)
+    {
+        ScanKeyData skey;
+
+        memcpy(&skey, &oideqscankey, sizeof(skey));
+        skey.sk_attno = relinval_callback_list[i].relkeynum;
+        skey.sk_argument = ObjectIdGetDatum(reloid);
+        CleanupCatCacheNegEntries(relinval_callback_list[i].cache, &skey);
+    }
+}
+
+/*
  * InitCatalogCachePhase2 - finish initializing the caches
  *
  * Finish initializing all the caches, including necessary database
 
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 74535eb..7564f42 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -59,6 +59,7 @@ typedef struct catcache
 	Oid			cc_indexoid;	/* OID of index matching cache keys */
 	bool		cc_relisshared; /* is relation shared across databases? */
 	slist_node	cc_next;		/* list link */
+	int			cc_nnegtup;		/* # of negative tuples */
 	ScanKeyData cc_skey[CATCACHE_MAXKEYS];	/* precomputed key info for heap
 											 * scans */
 
@@ -217,6 +218,8 @@ extern CatCList *SearchCatCacheList(CatCache *cache, int nkeys,
 					Datum v3, Datum v4);
 extern void ReleaseCatCacheList(CatCList *list);
 
+extern void
+CleanupCatCacheNegEntries(CatCache *cache, ScanKeyData *skey);
 extern void ResetCatalogCaches(void);
 extern void CatalogCacheFlushCatalog(Oid catId);
 extern void CatCacheInvalidate(CatCache *cache, uint32 hashValue);
 
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index 8a0be41..26ac57c 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -132,6 +132,8 @@ extern HeapTuple SearchSysCache4(int cacheId,
 				 Datum key1, Datum key2, Datum key3, Datum key4);
 extern void ReleaseSysCache(HeapTuple tuple);
 
+extern void CleanupNegativeCache(int cacheid, int nkeys,
+							Datum key1, Datum key2, Datum key3, Datum key4);
 /* convenience routines */
 extern HeapTuple SearchSysCacheCopy(int cacheId,
 
-- 
2.9.2

From 56b1eede29631df78cc622386693381b7aa76a51 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 28 Aug 2017 12:18:17 +0900
Subject: [PATCH 2/2] Cleanup negative cache of pg_class when dropping a schema

This feature in turn is triggered by catcache invalidation. This patch
provides a syscache invalidation callback to flush negative cache
entries corresponding to invalidated objects.
---
 src/backend/utils/cache/catcache.c |  42 +++++
 src/backend/utils/cache/inval.c    |   7 +-
 src/backend/utils/cache/syscache.c | 327 ++++++++++++++++++++++++++++---------
 src/include/utils/catcache.h       |   3 +
 4 files changed, 300 insertions(+), 79 deletions(-)
 

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index bd303f3..a9ef028 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -1555,6 +1555,48 @@ GetCatCacheHashValue(CatCache *cache,
 	return CatalogCacheComputeHashValue(cache, cache->cc_nkeys, v1, v2, v3, v4);
 }
 
+/*
+ * CollectOIDsForHashValue
+ *
+ * Collect OIDs correspond to a hash value. attnum is the column to retrieve
+ * the OIDs.
+ */
+List *
+CollectOIDsForHashValue(CatCache *cache, uint32 hashValue, int attnum)
+{
+    Index         hashIndex = HASH_INDEX(hashValue, cache->cc_nbuckets);
+    dlist_head    *bucket = &cache->cc_bucket[hashIndex];
+    dlist_iter     iter;
+    List *ret = NIL;
+
+    /* Nothing to return before initialization */
+    if (cache->cc_tupdesc == NULL)
+        return ret;
+
+    /* Currently only OID key is supported */
+    Assert(attnum <= cache->cc_tupdesc->natts);
+    Assert(attnum < 0 ? attnum == ObjectIdAttributeNumber :
+           cache->cc_tupdesc->attrs[attnum].atttypid == OIDOID);
+
+    dlist_foreach(iter, bucket)
+    {
+        CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+        bool    isNull;
+        Datum    oid;
+
+        if (ct->dead)
+            continue;            /* ignore dead entries */
+
+        if (ct->hash_value != hashValue)
+            continue;            /* quickly skip entry if wrong hash val */
+
+        oid = heap_getattr(&ct->tuple, attnum, cache->cc_tupdesc, &isNull);
+        if (!isNull)
+            ret = lappend_oid(ret, DatumGetObjectId(oid));
+    }
+
+    return ret;
+}
 
 /*
  *	SearchCatCacheList
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 0e61b4b..86e6f07 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -559,9 +559,14 @@ LocalExecuteInvalidationMessage(SharedInvalidationMessage *msg)
 		{
 			InvalidateCatalogSnapshot();
 
+			/*
+			 * Call the callbacks first so that the callbacks can access the
+			 * entries corresponding to the hashValue.
+			 */
+			CallSyscacheCallbacks(msg->cc.id, msg->cc.hashValue);
+
 			SysCacheInvalidate(msg->cc.id, msg->cc.hashValue);
-
-			CallSyscacheCallbacks(msg->cc.id, msg->cc.hashValue);
 		}
 	}
 	else if (msg->id == SHAREDINVALCATALOG_ID)
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index 753c5f1..7dd61cd 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -111,6 +111,16 @@
  */
 
 /*
+ *    struct for flushing negative cache by syscache invalidation
+ */
+typedef struct SysCacheCBParam_T
+{
+    int    trig_attnum;
+    int    target_cacheid;
+    ScanKeyData skey;
+} SysCacheCBParam;
+
+/*
  *		struct cachedesc: information defining a single syscache
  */
 struct cachedesc
@@ -124,6 +134,14 @@ struct cachedesc
 	/* relcache invalidation stuff */
 	AttrNumber	relattrnum;		/* attrnum to retrieve reloid for
 								 * invalidation, 0 if not needed */
 
+
+    /* catcache invalidation stuff */
+    int            trig_cacheid;    /* cache id of triggering syscache: -1 means
+                                 * no triggering cache */
+    int16        trig_attnum;    /* key column in triggering cache. Must be an
+                                 * OID */
+    int16        target_attnum;    /* corresponding column in this cache. Must be
+								 * an OID */
 };
 
 static const struct cachedesc cacheinfo[] = {
@@ -137,7 +155,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        16,
-        0
+        0,
+        -1, 0, 0    },    {AccessMethodRelationId,    /* AMNAME */        AmNameIndexId,
@@ -149,7 +168,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        4,
-        0
+        0,
+        -1, 0, 0    },    {AccessMethodRelationId,    /* AMOID */        AmOidIndexId,
@@ -161,7 +181,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        4,
-        0
+        0,
+        -1, 0, 0    },    {AccessMethodOperatorRelationId,    /* AMOPOPID */        AccessMethodOperatorIndexId,
@@ -173,7 +194,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        64,
-        0
+        0,
+        -1, 0, 0    },    {AccessMethodOperatorRelationId,    /* AMOPSTRATEGY */        AccessMethodStrategyIndexId,
@@ -185,7 +207,8 @@ static const struct cachedesc cacheinfo[] = {            Anum_pg_amop_amopstrategy        },
64,
-        0
+        0,
+        -1, 0, 0    },    {AccessMethodProcedureRelationId,    /* AMPROCNUM */        AccessMethodProcedureIndexId,
@@ -197,7 +220,8 @@ static const struct cachedesc cacheinfo[] = {            Anum_pg_amproc_amprocnum        },
16,
-        0
+        0,
+        -1, 0, 0    },    {AttributeRelationId,        /* ATTNAME */        AttributeRelidNameIndexId,
@@ -209,7 +233,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        32,
-        Anum_pg_attribute_attrelid
+        Anum_pg_attribute_attrelid,
+        -1, 0, 0    },    {AttributeRelationId,        /* ATTNUM */        AttributeRelidNumIndexId,
@@ -221,7 +246,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        128,
-        Anum_pg_attribute_attrelid
+        Anum_pg_attribute_attrelid,
+        -1, 0, 0    },    {AuthMemRelationId,            /* AUTHMEMMEMROLE */        AuthMemMemRoleIndexId,
@@ -233,7 +259,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        8,
-        0
+        0,
+        -1, 0, 0    },    {AuthMemRelationId,            /* AUTHMEMROLEMEM */        AuthMemRoleMemIndexId,
@@ -245,7 +272,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        8,
-        0
+        0,
+        -1, 0, 0    },    {AuthIdRelationId,            /* AUTHNAME */        AuthIdRolnameIndexId,
@@ -257,7 +285,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        8,
-        0
+        0,
+        -1, 0, 0    },    {AuthIdRelationId,            /* AUTHOID */        AuthIdOidIndexId,
@@ -269,7 +298,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        8,
-        0
+        0,
+        -1, 0, 0    },    {CastRelationId,            /* CASTSOURCETARGET */        CastSourceTargetIndexId,
@@ -281,7 +311,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        256,
-        0
+        0,
+        -1, 0, 0    },    {OperatorClassRelationId,    /* CLAAMNAMENSP */        OpclassAmNameNspIndexId,
@@ -293,7 +324,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        8,
-        0
+        0,
+        -1, 0, 0    },    {OperatorClassRelationId,    /* CLAOID */        OpclassOidIndexId,
@@ -305,7 +337,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        8,
-        0
+        0,
+        -1, 0, 0    },    {CollationRelationId,        /* COLLNAMEENCNSP */        CollationNameEncNspIndexId,
@@ -317,7 +350,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        8,
-        0
+        0,
+        -1, 0, 0    },    {CollationRelationId,        /* COLLOID */        CollationOidIndexId,
@@ -329,7 +363,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        8,
-        0
+        0,
+        -1, 0, 0    },    {ConversionRelationId,        /* CONDEFAULT */        ConversionDefaultIndexId,
@@ -341,7 +376,8 @@ static const struct cachedesc cacheinfo[] = {            ObjectIdAttributeNumber,        },
8,
-        0
+        0,
+        -1, 0, 0    },    {ConversionRelationId,        /* CONNAMENSP */        ConversionNameNspIndexId,
@@ -353,7 +389,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        8,
-        0
+        0,
+        -1, 0, 0    },    {ConstraintRelationId,        /* CONSTROID */        ConstraintOidIndexId,
@@ -365,7 +402,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        16,
-        0
+        0,
+        -1, 0, 0    },    {ConversionRelationId,        /* CONVOID */        ConversionOidIndexId,
@@ -377,7 +415,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        8,
-        0
+        0,
+        -1, 0, 0    },    {DatabaseRelationId,        /* DATABASEOID */        DatabaseOidIndexId,
@@ -389,7 +428,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        4,
-        0
+        0,
+        -1, 0, 0    },    {DefaultAclRelationId,        /* DEFACLROLENSPOBJ */        DefaultAclRoleNspObjIndexId,
@@ -401,7 +441,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        8,
-        0
+        0,
+        -1, 0, 0    },    {EnumRelationId,            /* ENUMOID */        EnumOidIndexId,
@@ -413,7 +454,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        8,
-        0
+        0,
+        -1, 0, 0    },    {EnumRelationId,            /* ENUMTYPOIDNAME */        EnumTypIdLabelIndexId,
@@ -425,7 +467,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        8,
-        0
+        0,
+        -1, 0, 0    },    {EventTriggerRelationId,    /* EVENTTRIGGERNAME */        EventTriggerNameIndexId,
@@ -437,7 +480,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        8,
-        0
+        0,
+        -1, 0, 0    },    {EventTriggerRelationId,    /* EVENTTRIGGEROID */        EventTriggerOidIndexId,
@@ -449,7 +493,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        8,
-        0
+        0,
+        -1, 0, 0    },    {ForeignDataWrapperRelationId,    /* FOREIGNDATAWRAPPERNAME */
ForeignDataWrapperNameIndexId,
@@ -461,7 +506,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        2,
-        0
+        0,
+        -1, 0, 0    },    {ForeignDataWrapperRelationId,    /* FOREIGNDATAWRAPPEROID */
ForeignDataWrapperOidIndexId,
@@ -473,7 +519,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        2,
-        0
+        0,
+        -1, 0, 0    },    {ForeignServerRelationId,    /* FOREIGNSERVERNAME */        ForeignServerNameIndexId,
@@ -485,7 +532,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        2,
-        0
+        0,
+        -1, 0, 0    },    {ForeignServerRelationId,    /* FOREIGNSERVEROID */        ForeignServerOidIndexId,
@@ -497,7 +545,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        2,
-        0
+        0,
+        -1, 0, 0    },    {ForeignTableRelationId,    /* FOREIGNTABLEREL */        ForeignTableRelidIndexId,
@@ -509,7 +558,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        4,
-        0
+        0,
+        -1, 0, 0    },    {IndexRelationId,            /* INDEXRELID */        IndexRelidIndexId,
@@ -521,7 +571,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        64,
-        0
+        0,
+        -1, 0, 0    },    {LanguageRelationId,        /* LANGNAME */        LanguageNameIndexId,
@@ -533,7 +584,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        4,
-        0
+        0,
+        -1, 0, 0    },    {LanguageRelationId,        /* LANGOID */        LanguageOidIndexId,
@@ -545,7 +597,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        4,
-        0
+        0,
+        -1, 0, 0    },    {NamespaceRelationId,        /* NAMESPACENAME */        NamespaceNameIndexId,
@@ -557,7 +610,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        4,
-        0
+        0,
+        -1, 0, 0    },    {NamespaceRelationId,        /* NAMESPACEOID */        NamespaceOidIndexId,
@@ -569,7 +623,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        16,
-        0
+        0,
+        -1, 0, 0    },    {OperatorRelationId,        /* OPERNAMENSP */        OperatorNameNspIndexId,
@@ -581,7 +636,8 @@ static const struct cachedesc cacheinfo[] = {            Anum_pg_operator_oprnamespace        },
   256,
 
-        0
+        0,
+        -1, 0, 0    },    {OperatorRelationId,        /* OPEROID */        OperatorOidIndexId,
@@ -593,7 +649,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        32,
-        0
+        0,
+        -1, 0, 0    },    {OperatorFamilyRelationId,    /* OPFAMILYAMNAMENSP */        OpfamilyAmNameNspIndexId,
@@ -605,7 +662,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        8,
-        0
+        0,
+        -1, 0, 0    },    {OperatorFamilyRelationId,    /* OPFAMILYOID */        OpfamilyOidIndexId,
@@ -617,7 +675,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        8,
-        0
+        0,
+        -1, 0, 0    },    {PartitionedRelationId,        /* PARTRELID */        PartitionedRelidIndexId,
@@ -629,7 +688,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        32,
-        0
+        0,
+        -1, 0, 0    },    {ProcedureRelationId,        /* PROCNAMEARGSNSP */        ProcedureNameArgsNspIndexId,
@@ -641,7 +701,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        128,
-        0
+        0,
+        -1, 0, 0    },    {ProcedureRelationId,        /* PROCOID */        ProcedureOidIndexId,
@@ -653,7 +714,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        128,
-        0
+        0,
+        -1, 0, 0    },    {PublicationRelationId,        /* PUBLICATIONNAME */        PublicationNameIndexId,
@@ -665,7 +727,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        8,
-        0
+        0,
+        -1, 0, 0    },    {PublicationRelationId,        /* PUBLICATIONOID */        PublicationObjectIndexId,
@@ -677,7 +740,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        8,
-        0
+        0,
+        -1, 0, 0    },    {PublicationRelRelationId,    /* PUBLICATIONREL */        PublicationRelObjectIndexId,
@@ -689,7 +753,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        64,
-        0
+        0,
+        -1, 0, 0    },    {PublicationRelRelationId,    /* PUBLICATIONRELMAP */
PublicationRelPrrelidPrpubidIndexId,
@@ -701,7 +766,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        64,
-        0
+        0,
+        -1, 0, 0    },    {RangeRelationId,            /* RANGETYPE */        RangeTypidIndexId,
@@ -713,7 +779,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        4,
-        0
+        0,
+        -1, 0, 0    },    {RelationRelationId,        /* RELNAMENSP */        ClassNameNspIndexId,
@@ -725,7 +792,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        128,
-        0
+        0,
+        NAMESPACEOID, ObjectIdAttributeNumber, Anum_pg_class_relnamespace    },    {RelationRelationId,        /*
RELOID*/        ClassOidIndexId,
 
@@ -737,7 +805,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        128,
-        0
+        0,
+        -1, 0, 0    },    {ReplicationOriginRelationId,    /* REPLORIGIDENT */        ReplicationOriginIdentIndex,
@@ -749,7 +818,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        16,
-        0
+        0,
+        -1, 0, 0    },    {ReplicationOriginRelationId,    /* REPLORIGNAME */        ReplicationOriginNameIndex,
@@ -761,7 +831,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        16,
-        0
+        0,
+        -1, 0, 0    },    {RewriteRelationId,            /* RULERELNAME */        RewriteRelRulenameIndexId,
@@ -773,7 +844,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        8,
-        0
+        0,
+        -1, 0, 0    },    {SequenceRelationId,        /* SEQRELID */        SequenceRelidIndexId,
@@ -785,7 +857,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        32,
-        0
+        0,
+        -1, 0, 0    },    {StatisticExtRelationId,    /* STATEXTNAMENSP */        StatisticExtNameIndexId,
@@ -797,7 +870,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        4,
-        0
+        0,
+        -1, 0, 0    },    {StatisticExtRelationId,    /* STATEXTOID */        StatisticExtOidIndexId,
@@ -809,7 +883,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        4,
-        0
+        0,
+        -1, 0, 0    },    {StatisticRelationId,        /* STATRELATTINH */        StatisticRelidAttnumInhIndexId,
@@ -821,7 +896,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        128,
-        Anum_pg_statistic_starelid
+        Anum_pg_statistic_starelid,
+        -1, 0, 0    },    {SubscriptionRelationId,    /* SUBSCRIPTIONNAME */        SubscriptionNameIndexId,
@@ -833,7 +909,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        4,
-        0
+        0,
+        -1, 0, 0    },    {SubscriptionRelationId,    /* SUBSCRIPTIONOID */        SubscriptionObjectIndexId,
@@ -845,7 +922,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        4,
-        0
+        0,
+        -1, 0, 0    },    {SubscriptionRelRelationId, /* SUBSCRIPTIONRELMAP */
SubscriptionRelSrrelidSrsubidIndexId,
@@ -857,7 +935,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        64,
-        0
+        0,
+        -1, 0, 0    },    {TableSpaceRelationId,        /* TABLESPACEOID */        TablespaceOidIndexId,
@@ -869,7 +948,8 @@ static const struct cachedesc cacheinfo[] = {            0,        },        4,
-        0
+        0,
+        -1, 0, 0    },    {TransformRelationId,        /* TRFOID */        TransformOidIndexId,
@@ -881,7 +961,8 @@ static const struct cachedesc cacheinfo[] = {            0,        },        16,
-        0
+        0,
+        -1, 0, 0    },    {TransformRelationId,        /* TRFTYPELANG */        TransformTypeLangIndexId,
@@ -893,7 +974,8 @@ static const struct cachedesc cacheinfo[] = {            0,        },        16,
-        0
+        0,
+        -1, 0, 0    },    {TSConfigMapRelationId,        /* TSCONFIGMAP */        TSConfigMapIndexId,
@@ -905,7 +987,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        2,
-        0
+        0,
+        -1, 0, 0    },    {TSConfigRelationId,        /* TSCONFIGNAMENSP */        TSConfigNameNspIndexId,
@@ -917,7 +1000,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        2,
-        0
+        0,
+        -1, 0, 0    },    {TSConfigRelationId,        /* TSCONFIGOID */        TSConfigOidIndexId,
@@ -929,7 +1013,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        2,
-        0
+        0,
+        -1, 0, 0    },    {TSDictionaryRelationId,    /* TSDICTNAMENSP */        TSDictionaryNameNspIndexId,
@@ -941,7 +1026,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        2,
-        0
+        0,
+        -1, 0, 0    },    {TSDictionaryRelationId,    /* TSDICTOID */        TSDictionaryOidIndexId,
@@ -953,7 +1039,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        2,
-        0
+        0,
+        -1, 0, 0    },    {TSParserRelationId,        /* TSPARSERNAMENSP */        TSParserNameNspIndexId,
@@ -965,7 +1052,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        2,
-        0
+        0,
+        -1, 0, 0    },    {TSParserRelationId,        /* TSPARSEROID */        TSParserOidIndexId,
@@ -977,7 +1065,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        2,
-        0
+        0,
+        -1, 0, 0    },    {TSTemplateRelationId,        /* TSTEMPLATENAMENSP */        TSTemplateNameNspIndexId,
@@ -989,7 +1078,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        2,
-        0
+        0,
+        -1, 0, 0    },    {TSTemplateRelationId,        /* TSTEMPLATEOID */        TSTemplateOidIndexId,
@@ -1001,7 +1091,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        2,
-        0
+        0,
+        -1, 0, 0    },    {TypeRelationId,            /* TYPENAMENSP */        TypeNameNspIndexId,
@@ -1013,7 +1104,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        64,
-        0
+        0,
+        -1, 0, 0    },    {TypeRelationId,            /* TYPEOID */        TypeOidIndexId,
@@ -1025,7 +1117,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        64,
-        0
+        0,
+        -1, 0, 0    },    {UserMappingRelationId,        /* USERMAPPINGOID */        UserMappingOidIndexId,
@@ -1037,7 +1130,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        2,
-        0
+        0,
+        -1, 0, 0    },    {UserMappingRelationId,        /* USERMAPPINGUSERSERVER */
UserMappingUserServerIndexId,
@@ -1049,7 +1143,8 @@ static const struct cachedesc cacheinfo[] = {            0        },        2,
-        0
+        0,
+        -1, 0, 0    }};
@@ -1082,7 +1177,8 @@ static ScanKeyData	oideqscankey; /* ScanKey for reloid match  */
 static int	oid_compare(const void *a, const void *b);
 static void SysCacheRelInvalCallback(Datum arg, Oid reloid);
 
-
+static void SysCacheSysCacheInvalCallback(Datum arg, int cacheid,
+                                          uint32 hashvalue);
 
 /*
  * InitCatalogCache - initialize the caches
  *
@@ -1140,6 +1236,34 @@ InitCatalogCache(void)
                 cacheinfo[cacheId].relattrnum;
             relinval_callback_count++;
         }
 
+
+        /*
+         * If this syscache has a syscache invalidation trigger, register
+         * it.
+         */
+        if (cacheinfo[cacheId].trig_cacheid >= 0)
+        {
+            SysCacheCBParam *param;
+
+            param = MemoryContextAlloc(CacheMemoryContext,
+                                       sizeof(SysCacheCBParam));
+            param->target_cacheid = cacheId;
+
+            /*
+             * XXX: Create a ScanKeyData for OID comparison. We have no way
+             * to check the type of the column in the system catalog at this
+             * point, so we have to trust the definition.
+             */
+            fmgr_info_cxt(F_OIDEQ, &param->skey.sk_func, CacheMemoryContext);
+            param->skey.sk_attno = cacheinfo[cacheId].target_attnum;
+            param->trig_attnum = cacheinfo[cacheId].trig_attnum;
+            param->skey.sk_strategy = BTEqualStrategyNumber;
+            param->skey.sk_subtype = InvalidOid;
+            param->skey.sk_collation = InvalidOid;
+            CacheRegisterSyscacheCallback(cacheinfo[cacheId].trig_cacheid,
+                                          SysCacheSysCacheInvalCallback,
+                                          PointerGetDatum(param));
+        }
     }
 
     Assert(SysCacheRelationOidSize <= lengthof(SysCacheRelationOid));
@@ -1623,6 +1747,53 @@ RelationInvalidatesSnapshotsOnly(Oid relid)
 }
 
 /*
+ * SysCacheSysCacheInvalCallback
+ *
+ * Callback function for negative cache flushing by syscache invalidation.
+ * Fetches an OID (not restricted to the system oid column) from the
+ * invalidated tuple and flushes negative entries that match the OID in the
+ * target syscache.
+ */
+static void
+SysCacheSysCacheInvalCallback(Datum arg, int cacheid, uint32 hashValue)
+{
+    SysCacheCBParam *param;
+    CatCache   *trigger_cache;      /* triggering catcache */
+    CatCache   *target_cache;       /* target catcache */
+    List       *oids;
+    ListCell   *lc;
+    int         trigger_cacheid = cacheid;
+    int         target_cacheid;
+
+    param = (SysCacheCBParam *) DatumGetPointer(arg);
+    target_cacheid = param->target_cacheid;
+
+    trigger_cache = SysCache[trigger_cacheid];
+    target_cache = SysCache[target_cacheid];
+
+    /*
+     * Collect candidate OIDs for target syscache entries. The result
+     * contains just one value in most cases, or two or more when the hash
+     * value has synonyms. At least one of them is the right OID, but it
+     * cannot be distinguished from the others given only the hash value.
+     * As a result some unrelated entries may be flushed, but that does far
+     * less harm than letting them bloat the catcaches.
+     */
+    oids =
+        CollectOIDsForHashValue(trigger_cache, hashValue, param->trig_attnum);
+
+    foreach(lc, oids)
+    {
+        ScanKeyData skey;
+        Oid         oid = lfirst_oid(lc);
+
+        memcpy(&skey, &param->skey, sizeof(skey));
+        skey.sk_argument = ObjectIdGetDatum(oid);
+        CleanupCatCacheNegEntries(target_cache, &skey);
+    }
+}
+
+/*
  * Test whether a relation has a system cache.
  */
 bool
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 7564f42..562810f 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -213,6 +213,9 @@ extern uint32 GetCatCacheHashValue(CatCache *cache,
                      Datum v1, Datum v2,
                      Datum v3, Datum v4);
 
+extern List *CollectOIDsForHashValue(CatCache *cache,
+                                     uint32 hashValue, int attnum);
+
 extern CatCList *SearchCatCacheList(CatCache *cache, int nkeys,
                     Datum v1, Datum v2,
                     Datum v3, Datum v4);
 
-- 
2.9.2



Re: Protect syscache from bloating with negative cache entries

От
Michael Paquier
Дата:
On Tue, Oct 31, 2017 at 6:46 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> This is a rebased version of the patch.

As far as I can see, the patch still applies, compiles, and got no
reviews. So moved to next CF.
-- 
Michael


Re: Protect syscache from bloating with negative cache entries

От
Robert Haas
Дата:
On Wed, Nov 29, 2017 at 8:25 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Oct 31, 2017 at 6:46 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> This is a rebased version of the patch.
>
> As far as I can see, the patch still applies, compiles, and got no
> reviews. So moved to next CF.

I think we have to mark this as returned with feedback or rejected for
the reasons mentioned here:

http://postgr.es/m/CA+TgmoZjn28uYJRQ2K+5idhYxWBDER68sctoc2p_nW7h7JbhYw@mail.gmail.com

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Protect syscache from bloating with negative cache entries

От
Michael Paquier
Дата:
On Thu, Nov 30, 2017 at 12:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Nov 29, 2017 at 8:25 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Tue, Oct 31, 2017 at 6:46 PM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>> This is a rebased version of the patch.
>>
>> As far as I can see, the patch still applies, compiles, and got no
>> reviews. So moved to next CF.
>
> I think we have to mark this as returned with feedback or rejected for
> the reasons mentioned here:
>
> http://postgr.es/m/CA+TgmoZjn28uYJRQ2K+5idhYxWBDER68sctoc2p_nW7h7JbhYw@mail.gmail.com

Good point. I forgot this bit. Thanks for mentioning it I am switching
the patch as returned with feedback.
-- 
Michael


Re: Protect syscache from bloating with negative cache entries

От
Tom Lane
Дата:
Michael Paquier <michael.paquier@gmail.com> writes:
> On Thu, Nov 30, 2017 at 12:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think we have to mark this as returned with feedback or rejected for
>> the reasons mentioned here:
>> http://postgr.es/m/CA+TgmoZjn28uYJRQ2K+5idhYxWBDER68sctoc2p_nW7h7JbhYw@mail.gmail.com

> Good point. I forgot this bit. Thanks for mentioning it I am switching
> the patch as returned with feedback.

We had a bug report just today that seemed to me to trace to relcache
bloat:
https://www.postgresql.org/message-id/flat/20171129100649.1473.73990%40wrigleys.postgresql.org

ISTM that there's definitely work to be done here, but as I said upthread,
I think we need a more holistic approach than just focusing on negative
catcache entries, or even just catcache entries.

The thing that makes me uncomfortable about this is that we used to have a
catcache size limitation mechanism, and ripped it out because it had too
much overhead (see commit 8b9bc234a).  I'm not sure how we can avoid that
problem within a fresh implementation.
        regards, tom lane


Re: Protect syscache from bloating with negative cache entries

От
Robert Haas
Дата:
On Wed, Nov 29, 2017 at 11:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The thing that makes me uncomfortable about this is that we used to have a
> catcache size limitation mechanism, and ripped it out because it had too
> much overhead (see commit 8b9bc234a).  I'm not sure how we can avoid that
> problem within a fresh implementation.

At the risk of beating a dead horse, I still think that the amount of
wall clock time that has elapsed since an entry was last accessed is
very relevant.  The problem with a fixed maximum size is that you can
hit it arbitrarily frequently; time-based expiration solves that
problem.  It allows backends that are actively using a lot of stuff to
hold on to as many cache entries as they need, while forcing backends
that have moved on to a different set of tables -- or that are
completely idle -- to let go of cache entries that are no longer being
actively used.  I think that's what we want.  Nobody wants to keep the
cache size small when a big cache is necessary for good performance,
but what people do want to avoid is having long-running backends
eventually accumulate huge numbers of cache entries most of which
haven't been touched in hours or, maybe, weeks.

To put that another way, we should only hang on to a cache entry for
so long as the bytes of memory that it consumes are more valuable than
some other possible use of those bytes of memory.  That is very likely
to be true when we've accessed those bytes recently, but progressively
less likely to be true the more time has passed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Protect syscache from bloating with negative cache entries

От
Robert Haas
Дата:
On Thu, Nov 30, 2017 at 11:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Nov 29, 2017 at 11:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> The thing that makes me uncomfortable about this is that we used to have a
>>> catcache size limitation mechanism, and ripped it out because it had too
>>> much overhead (see commit 8b9bc234a).  I'm not sure how we can avoid that
>>> problem within a fresh implementation.
>
>> At the risk of beating a dead horse, I still think that the amount of
>> wall clock time that has elapsed since an entry was last accessed is
>> very relevant.
>
> While I don't object to that statement, I'm not sure how it helps us
> here.  If we couldn't afford DLMoveToFront(), doing a gettimeofday()
> during each syscache access is surely right out.

Well, yeah, that would be insane.  But I think even something very
rough could work well enough.  I think our goal should be to eliminate
cache entries that are have gone unused for many *minutes*, and
there's no urgency about getting it to any sort of exact value.  For
non-idle backends, using the most recent statement start time as a
proxy would probably be plenty good enough.  Idle backends might need
a bit more thought.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Protect syscache from bloating with negative cache entries

От
Andres Freund
Дата:
On 2017-12-01 16:20:44 -0500, Robert Haas wrote:
> Well, yeah, that would be insane.  But I think even something very
> rough could work well enough.  I think our goal should be to eliminate
> cache entries that are have gone unused for many *minutes*, and
> there's no urgency about getting it to any sort of exact value.  For
> non-idle backends, using the most recent statement start time as a
> proxy would probably be plenty good enough.  Idle backends might need
> a bit more thought.

Our timer framework is flexible enough that we can install a
once-a-minute timer without much overhead. That timer could increment a
'cache generation' integer. Upon cache access we write the current
generation into relcache / syscache (and potentially also plancache?)
entries. Not entirely free, but cheap enough. In those once-a-minute
passes entries that haven't been touched in X cycles get pruned.
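
To make that concrete, here is a minimal self-contained C sketch of the
generation-counter idea -- a toy simulation rather than PostgreSQL code.
The names (CacheEntry, bump_generation, prune_old_entries) and the pruning
threshold are invented for illustration; the real thing would hook into the
timeout framework and the catcache/relcache internals.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct CacheEntry
{
    int         key;            /* stand-in for the real cache key */
    unsigned    generation;     /* generation at last access */
    bool        valid;          /* still present in the cache? */
} CacheEntry;

/* bumped by a periodic (e.g. once-a-minute) timer */
static unsigned cache_generation = 0;

/* what the timer callback would do */
void
bump_generation(void)
{
    cache_generation++;
}

/* what a cache lookup would do on each hit */
void
touch_entry(CacheEntry *e)
{
    e->generation = cache_generation;
}

/* drop entries not touched for more than max_age generations */
void
prune_old_entries(CacheEntry *entries, size_t n, unsigned max_age)
{
    for (size_t i = 0; i < n; i++)
    {
        if (entries[i].valid &&
            cache_generation - entries[i].generation > max_age)
            entries[i].valid = false;   /* stands in for freeing the entry */
    }
}

int
main(void)
{
    CacheEntry  cache[3] = {{1, 0, true}, {2, 0, true}, {3, 0, true}};

    touch_entry(&cache[0]);             /* used only at the very start */
    for (int minute = 1; minute <= 12; minute++)
    {
        bump_generation();
        if (minute == 6)
            touch_entry(&cache[1]);     /* used again later */
    }
    prune_old_entries(cache, 3, 10);    /* entries 1 and 3 get pruned */

    for (int i = 0; i < 3; i++)
        printf("entry %d: %s\n", cache[i].key,
               cache[i].valid ? "kept" : "pruned");
    return 0;
}

The point is just that each access costs a single integer store, and the
work of tracking recency is paid only when the timer fires or when we decide
to prune.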

Greetings,

Andres Freund


Re: Protect syscache from bloating with negative cache entries

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> On 2017-12-01 16:20:44 -0500, Robert Haas wrote:
>> Well, yeah, that would be insane.  But I think even something very
>> rough could work well enough.  I think our goal should be to eliminate
>> cache entries that are have gone unused for many *minutes*, and
>> there's no urgency about getting it to any sort of exact value.  For
>> non-idle backends, using the most recent statement start time as a
>> proxy would probably be plenty good enough.  Idle backends might need
>> a bit more thought.

> Our timer framework is flexible enough that we can install a
> once-a-minute timer without much overhead. That timer could increment a
> 'cache generation' integer. Upon cache access we write the current
> generation into relcache / syscache (and potentially also plancache?)
> entries. Not entirely free, but cheap enough. In those once-a-minute
> passes entries that haven't been touched in X cycles get pruned.

I have no faith in either of these proposals, because they both assume
that the problem only arises over the course of many minutes.  In the
recent complaint about pg_dump causing relcache bloat, it probably does
not take nearly that long for the bloat to occur.

Maybe you could make it work on the basis of number of cache accesses,
or some other normalized-to-workload-not-wall-clock time reference.

            regards, tom lane


Re: Protect syscache from bloating with negative cache entries

От
Andres Freund
Дата:
On 2017-12-01 16:40:23 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-12-01 16:20:44 -0500, Robert Haas wrote:
> >> Well, yeah, that would be insane.  But I think even something very
> >> rough could work well enough.  I think our goal should be to eliminate
> >> cache entries that are have gone unused for many *minutes*, and
> >> there's no urgency about getting it to any sort of exact value.  For
> >> non-idle backends, using the most recent statement start time as a
> >> proxy would probably be plenty good enough.  Idle backends might need
> >> a bit more thought.
> 
> > Our timer framework is flexible enough that we can install a
> > once-a-minute timer without much overhead. That timer could increment a
> > 'cache generation' integer. Upon cache access we write the current
> > generation into relcache / syscache (and potentially also plancache?)
> > entries. Not entirely free, but cheap enough. In those once-a-minute
> > passes entries that haven't been touched in X cycles get pruned.
> 
> I have no faith in either of these proposals, because they both assume
> that the problem only arises over the course of many minutes.  In the
> recent complaint about pg_dump causing relcache bloat, it probably does
> not take nearly that long for the bloat to occur.

To me that's a bit of a different problem than what I was discussing
here.  It also actually doesn't seem that hard - if your caches are
growing fast, you'll continually get hash-resizing of the
various. Adding cache-pruning to the resizing code doesn't seem hard,
and wouldn't add meaningful overhead.

Greetings,

Andres Freund


Re: Protect syscache from bloating with negative cache entries

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> On 2017-12-01 16:40:23 -0500, Tom Lane wrote:
>> I have no faith in either of these proposals, because they both assume
>> that the problem only arises over the course of many minutes.  In the
>> recent complaint about pg_dump causing relcache bloat, it probably does
>> not take nearly that long for the bloat to occur.

> To me that's a bit of a different problem than what I was discussing
> here.  It also actually doesn't seem that hard - if your caches are
> growing fast, you'll continually get hash-resizing of the
> various. Adding cache-pruning to the resizing code doesn't seem hard,
> and wouldn't add meaningful overhead.

That's an interesting way to think about it, as well, though I'm not
sure it's quite that simple.  If you tie this to cache resizing then
the cache will have to grow up to the newly increased size before
you'll prune it again.  That doesn't sound like it will lead to nice
steady-state behavior.

            regards, tom lane


Re: Protect syscache from bloating with negative cache entries

От
Andres Freund
Дата:
On 2017-12-01 17:03:28 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-12-01 16:40:23 -0500, Tom Lane wrote:
> >> I have no faith in either of these proposals, because they both assume
> >> that the problem only arises over the course of many minutes.  In the
> >> recent complaint about pg_dump causing relcache bloat, it probably does
> >> not take nearly that long for the bloat to occur.
> 
> > To me that's a bit of a different problem than what I was discussing
> > here.  It also actually doesn't seem that hard - if your caches are
> > growing fast, you'll continually get hash-resizing of the
> > various. Adding cache-pruning to the resizing code doesn't seem hard,
> > and wouldn't add meaningful overhead.
> 
> That's an interesting way to think about it, as well, though I'm not
> sure it's quite that simple.  If you tie this to cache resizing then
> the cache will have to grow up to the newly increased size before
> you'll prune it again.  That doesn't sound like it will lead to nice
> steady-state behavior.

Yea, it's not perfect - but if we do pruning both at resize *and* on
regular intervals, like once-a-minute as I was suggesting, I don't think
it's that bad. The steady state won't be reached within seconds, true,
but the negative consequences of only attempting to shrink the cache
upon resizing when the cache size is growing fast anyway doesn't seem
that large.

I don't think we need to be super accurate here, there just needs to be
*some* backpressure.

I've had cases in the past where just occasionally blasting the cache
away would've been good enough.

Greetings,

Andres Freund


Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
At Fri, 1 Dec 2017 14:12:20 -0800, Andres Freund <andres@anarazel.de> wrote in
<20171201221220.z5e6wtlpl264wzik@alap3.anarazel.de>
> On 2017-12-01 17:03:28 -0500, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > On 2017-12-01 16:40:23 -0500, Tom Lane wrote:
> > >> I have no faith in either of these proposals, because they both assume
> > >> that the problem only arises over the course of many minutes.  In the
> > >> recent complaint about pg_dump causing relcache bloat, it probably does
> > >> not take nearly that long for the bloat to occur.
> > 
> > > To me that's a bit of a different problem than what I was discussing
> > > here.  It also actually doesn't seem that hard - if your caches are
> > > growing fast, you'll continually get hash-resizing of the
> > > various. Adding cache-pruning to the resizing code doesn't seem hard,
> > > and wouldn't add meaningful overhead.
> > 
> > That's an interesting way to think about it, as well, though I'm not
> > sure it's quite that simple.  If you tie this to cache resizing then
> > the cache will have to grow up to the newly increased size before
> > you'll prune it again.  That doesn't sound like it will lead to nice
> > steady-state behavior.
> 
> Yea, it's not perfect - but if we do pruning both at resize *and* on
> regular intervals, like once-a-minute as I was suggesting, I don't think
> it's that bad. The steady state won't be reached within seconds, true,
> but the negative consequences of only attempting to shrink the cache
> upon resizing when the cache size is growing fast anyway doesn't seem
> that large.
> 
> I don't think we need to be super accurate here, there just needs to be
> *some* backpressure.
> 
> I've had cases in the past where just occasionally blasting the cache
> away would've been good enough.

Thank you very much for the valuable suggestions. I still would
like to solve this problem, and a counter running freely at
minute (or several-second) resolution, combined with pruning of
too-long-unaccessed entries at resize time, seems to me to work
well enough for at least several known bloat cases. It still has
the defect that it does not handle very quick bloating. I'll keep
thinking about that remaining issue.

If no one has immediate objection to the direction, I'll come up
with an implementation.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

От
Robert Haas
Дата:
On Wed, Dec 13, 2017 at 11:20 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Thank you very much for the valuable suggestions. I still would
> like to solve this problem and the
> a-counter-freely-running-in-minute(or several seconds)-resolution
> and pruning-too-long-unaccessed-entries-on-resizing seems to me
> to work enough for at least several known bloat cases. This still
> has a defect that this is not workable for a very quick
> bloating. I'll try thinking about the remaining issue.

I'm not sure we should regard very quick bloating as a problem in need
of solving.  Doesn't that just mean we need the cache to be bigger, at
least temporarily?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Protect syscache from bloating with negative cache entries

От
Andres Freund
Дата:
On 2017-12-16 22:25:48 -0500, Robert Haas wrote:
> On Wed, Dec 13, 2017 at 11:20 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Thank you very much for the valuable suggestions. I still would
> > like to solve this problem and the
> > a-counter-freely-running-in-minute(or several seconds)-resolution
> > and pruning-too-long-unaccessed-entries-on-resizing seems to me
> > to work enough for at least several known bloat cases. This still
> > has a defect that this is not workable for a very quick
> > bloating. I'll try thinking about the remaining issue.
> 
> I'm not sure we should regard very quick bloating as a problem in need
> of solving.  Doesn't that just mean we need the cache to be bigger, at
> least temporarily?

Leaving that aside, is that actually not at least to a good degree,
solved by that problem? By bumping the generation on hash resize, we
have recency information we can take into account.

Greetings,

Andres Freund


Re: Protect syscache from bloating with negative cache entries

От
Robert Haas
Дата:
On Sat, Dec 16, 2017 at 11:42 PM, Andres Freund <andres@anarazel.de> wrote:
>> I'm not sure we should regard very quick bloating as a problem in need
>> of solving.  Doesn't that just mean we need the cache to be bigger, at
>> least temporarily?
>
> Leaving that aside, is that actually not at least to a good degree,
> solved by that problem? By bumping the generation on hash resize, we
> have recency information we can take into account.

I agree that we can do it.  I'm just not totally sure it's a good
idea.  I'm also not totally sure it's a bad idea, either.  That's why
I asked the question.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Protect syscache from bloating with negative cache entries

От
Andres Freund
Дата:
On 2017-12-17 19:23:45 -0500, Robert Haas wrote:
> On Sat, Dec 16, 2017 at 11:42 PM, Andres Freund <andres@anarazel.de> wrote:
> >> I'm not sure we should regard very quick bloating as a problem in need
> >> of solving.  Doesn't that just mean we need the cache to be bigger, at
> >> least temporarily?
> >
> > Leaving that aside, is that actually not at least to a good degree,
> > solved by that problem? By bumping the generation on hash resize, we
> > have recency information we can take into account.
>
> I agree that we can do it.  I'm just not totally sure it's a good
> idea.  I'm also not totally sure it's a bad idea, either.  That's why
> I asked the question.

I'm not 100% convinced either - but I also don't think it matters all
that terribly much. As long as the overall hash hit rate is decent,
minor increases in the absolute number of misses don't really matter
that much for syscache imo.  I'd personally go for something like:

1) When about to resize, check if there's entries of a generation -2
   around.

   Don't resize if more than 15% of entries could be freed. Also, stop
   reclaiming at that threshold, to avoid unnecessary purging cache
   entries.

   Using two generations allows a bit more time for cache entries to
   marked as fresh before resizing next.

2) While resizing increment generation count by one.

3) Once a minute, increment generation count by one.


The one thing I don't quite have a good handle on is how much cache
reclamation, if any, to do at 3). We don't really want to throw away
all the caches just because a connection has been idle for a few
minutes, in a connection pool that can happen occasionally. I think I'd
for now *not* do any reclamation except at resize boundaries.
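
As a rough illustration of 1) and 2) above, the resize-time decision might
look something like the following C sketch. The 15% figure comes from the
proposal, but the names (CatEntry, try_reclaim_before_resize) and the flat
array representation are invented here; real code would walk the catcache
buckets instead.

#include <stdbool.h>
#include <stddef.h>

typedef struct CatEntry
{
    unsigned    generation;     /* generation of last access */
    bool        valid;
} CatEntry;

/*
 * Called when the hash would otherwise be enlarged.  Returns true if enough
 * stale (generation <= current - 2) entries were reclaimed that the caller
 * can skip the resize; returns false if reclaiming isn't worthwhile and the
 * caller should resize (and bump the generation) instead.
 */
bool
try_reclaim_before_resize(CatEntry *entries, size_t nentries,
                          unsigned current_gen)
{
    size_t      stale = 0;

    for (size_t i = 0; i < nentries; i++)
    {
        if (entries[i].valid && current_gen - entries[i].generation >= 2)
            stale++;
    }

    /* fewer than 15% reclaimable: not worth it, let the cache grow */
    if (stale * 100 < nentries * 15)
        return false;

    /* reclaim, but stop once we are back at the threshold */
    for (size_t i = 0; i < nentries && stale * 100 >= nentries * 15; i++)
    {
        if (entries[i].valid && current_gen - entries[i].generation >= 2)
        {
            entries[i].valid = false;   /* stands in for removing the entry */
            stale--;
        }
    }
    return true;
}

How aggressively to reclaim once the threshold is crossed is exactly the
judgment call discussed in the following messages.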

Greetings,

Andres Freund


Re: Protect syscache from bloating with negative cache entries

От
Robert Haas
Дата:
On Mon, Dec 18, 2017 at 11:46 AM, Andres Freund <andres@anarazel.de> wrote:
> I'm not 100% convinced either - but I also don't think it matters all
> that terribly much. As long as the overall hash hit rate is decent,
> minor increases in the absolute number of misses don't really matter
> that much for syscache imo.  I'd personally go for something like:
>
> 1) When about to resize, check if there's entries of a generation -2
>    around.
>
>    Don't resize if more than 15% of entries could be freed. Also, stop
>    reclaiming at that threshold, to avoid unnecessary purging cache
>    entries.
>
>    Using two generations allows a bit more time for cache entries to
>    marked as fresh before resizing next.
>
> 2) While resizing increment generation count by one.
>
> 3) Once a minute, increment generation count by one.
>
>
> The one thing I'm not quite have a good handle upon is how much, and if
> any, cache reclamation to do at 3). We don't really want to throw away
> all the caches just because a connection has been idle for a few
> minutes, in a connection pool that can happen occasionally. I think I'd
> for now *not* do any reclamation except at resize boundaries.

My starting inclination was almost the opposite.  I think that you
might be right that a minute or two of idle time isn't sufficient
reason to flush our local cache, but I'd be inclined to fix that by
incrementing the generation count every 10 minutes or so rather than
every minute, and still flush things more than 1 generation old.  The
reason for that is that I think we should ensure that the system
doesn't sit there idle forever with a giant cache.  If it's not using
those cache entries, I'd rather have it discard them and rebuild the
cache when it becomes active again.

Now, I also see your point about trying to clean up before
resizing.  That does seem like a good idea, although we have to be
careful not to be too eager to clean up there, or we'll just end up
artificially limiting the cache size when it's unwise to do so.  But I
guess that's what you meant by "Also, stop reclaiming at that
threshold, to avoid unnecessary purging cache entries."  I think the
idea you are proposing is that:

1. The first time we are due to expand the hash table, we check
whether we can forestall that expansion by doing a cleanup; if so, we
do that instead.

2. After that, we just expand.

That seems like a fairly good idea, although it might be a better idea
to allow cleanup if enough time has passed.  If we hit the expansion
threshold twice an hour apart, there's no reason not to try cleanup
again.
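
A hedged sketch of that policy, with the time-based reset added -- the
routine names and the one-hour interval are placeholders here, not anything
that exists in PostgreSQL:

#include <stdbool.h>
#include <time.h>

#define MIN_CLEANUP_INTERVAL 3600       /* seconds between cleanup attempts */

static time_t last_cleanup_attempt = 0;

/* assumed helpers, not real PostgreSQL functions */
extern bool cleanup_freed_enough_entries(void);
extern void enlarge_hash_table(void);

void
expand_or_cleanup(void)
{
    time_t      now = time(NULL);

    /* only retry a cleanup if enough wall-clock time has passed */
    if (now - last_cleanup_attempt >= MIN_CLEANUP_INTERVAL)
    {
        last_cleanup_attempt = now;
        if (cleanup_freed_enough_entries())
            return;             /* cleanup forestalled the expansion */
    }
    enlarge_hash_table();
}

Because last_cleanup_attempt starts at zero, the very first expansion always
tries a cleanup, matching point 1 above.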

Generally, the way I'm viewing this is that a syscache entry means
paying memory to save CPU time.  Each 8kB of memory we use to store
system cache entries is one less block we have for the OS page cache
to hold onto our data blocks.  If we had an oracle (the kind from
Delphi, not Redwood City) that told us with perfect accuracy when to
discard syscache entries, it would throw away syscache entries
whenever the marginal execution-time performance we could buy from
another 8kB in the page cache is greater than the marginal
execution-time performance we could buy from those syscache entries.
In reality, it's hard to know which of those things is of greater
value.  If the system isn't meaningfully memory-constrained, we ought
to just always hang onto the syscache entries, as we do today, but
it's hard to know that.  I think the place where this really becomes a
problem is on system with hundreds of connections + thousands of
tables + connection pooling; without some back-pressure, every backend
eventually caches everything, putting the system under severe memory
pressure for basically no performance gain.  Each new use of the
connection is probably for a limited set of tables, and only those
tables really need syscache entries; holding onto things used long in the
past doesn't save enough to justify the memory used.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
At Mon, 18 Dec 2017 12:14:24 -0500, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoaWLBzUasvVs-q=dfBr3pLWSUCQnbqLk-MT7iX4eyrinA@mail.gmail.com>
> On Mon, Dec 18, 2017 at 11:46 AM, Andres Freund <andres@anarazel.de> wrote:
> > I'm not 100% convinced either - but I also don't think it matters all
> > that terribly much. As long as the overall hash hit rate is decent,
> > minor increases in the absolute number of misses don't really matter
> > that much for syscache imo.  I'd personally go for something like:
> >
> > 1) When about to resize, check if there's entries of a generation -2
> >    around.
> >
> >    Don't resize if more than 15% of entries could be freed. Also, stop
> >    reclaiming at that threshold, to avoid unnecessary purging cache
> >    entries.
> >
> >    Using two generations allows a bit more time for cache entries to
> >    marked as fresh before resizing next.
> >
> > 2) While resizing increment generation count by one.
> >
> > 3) Once a minute, increment generation count by one.
> >
> >
> > The one thing I'm not quite have a good handle upon is how much, and if
> > any, cache reclamation to do at 3). We don't really want to throw away
> > all the caches just because a connection has been idle for a few
> > minutes, in a connection pool that can happen occasionally. I think I'd
> > for now *not* do any reclamation except at resize boundaries.
> 
> My starting inclination was almost the opposite.  I think that you
> might be right that a minute or two of idle time isn't sufficient
> reason to flush our local cache, but I'd be inclined to fix that by
> incrementing the generation count every 10 minutes or so rather than
> every minute, and still flush things more then 1 generation old.  The
> reason for that is that I think we should ensure that the system
> doesn't sit there idle forever with a giant cache.  If it's not using
> those cache entries, I'd rather have it discard them and rebuild the
> cache when it becomes active again.

I see three kinds of syscache entries.

A. An entry for an actually existing object.

  This is a syscache entry in the ordinary sense. This kind of entry
  does not need to be removed, but it can be removed after it has
  gone unaccessed for a certain period of time.

B. An entry for an object which once existed but no longer.

  This can be removed any time after the removal of the object,
  and it is a main cause of the stats bloat or relcache bloat that
  motivated this thread. We can tell whether entries of this kind
  are removable by using the cache invalidation mechanism (the
  patch upthread).

  We can queue the OIDs that identify the entries to remove, then
  actually remove them at the next resize. (That queue could itself
  become another cause of bloat, so we could forcibly flush a hash
  when the OID list grows longer than some threshold.)

C. An entry for a just non-existent objects.

  I'm not sure how we should treat this, since the usefulness of an
  entry of this kind depends purely on whether the entry will be
  accessed again sometime. But we could apply the same assumption
  as for A.


> Now, I also see that your point about trying to clean up before
> resizing.  That does seem like a good idea, although we have to be
> careful not to be too eager to clean up there, or we'll just result in
> artificially limiting the cache size when it's unwise to do so.  But I
> guess that's what you meant by "Also, stop reclaiming at that
> threshold, to avoid unnecessary purging cache entries."  I think the
> idea you are proposing is that:
> 
> 1. The first time we are due to expand the hash table, we check
> whether we can forestall that expansion by doing a cleanup; if so, we
> do that instead.
> 
> 2. After that, we just expand.
> 
> That seems like a fairly good idea, although it might be a better idea
> to allow cleanup if enough time has passed.  If we hit the expansion
> threshold twice an hour apart, there's no reason not to try cleanup
> again.

A session that intermittently executes queries which each run in a
very short time could be considered an example workload where
cleanup under such criteria is unwelcome. But the syscache won't
bloat in that case.

> Generally, the way I'm viewing this is that a syscache entry means
> paying memory to save CPU time.  Each 8kB of memory we use to store
> system cache entries is one less block we have for the OS page cache
> to hold onto our data blocks.  If we had an oracle (the kind from

Sure

> Delphi, not Redwood City) that told us with perfect accuracy when to
> discard syscache entries, it would throw away syscache entries

Except for case B above. The logic seems somewhat alien to
time-based cleanup, but it can serve as the measure against quick
bloat of some syscaches.

> whenever the marginal execution-time performance we could buy from
> another 8kB in the page cache is greater than the marginal
> execution-time performance we could buy from those syscache entries.
> In reality, it's hard to know which of those things is of greater
> value.  If the system isn't meaningfully memory-constrained, we ought
> to just always hang onto the syscache entries, as we do today, but
> it's hard to know that.  I think the place where this really becomes a
> problem is on system with hundreds of connections + thousands of
> tables + connection pooling; without some back-pressure, every backend
> eventually caches everything, putting the system under severe memory
> pressure for basically no performance gain.  Each new use of the
> connection is probably for a limited set of tables, and only those
> tables really syscache entries; holding onto things used long in the
> past doesn't save enough to justify the memory used.

Agreed. The following is the overall picture of the measures
against syscache bloat, taking "quick bloat" into account. (I
still think it is wanted in some situations.)


1. When any object whose removal makes some syscache entries
  stale is dropped (this cannot be detected without scanning a
  whole hash), just queue its OID into, for example, a
  recently_removed_relations OID hash.

2. If the number of OID-hash entries reaches 1000 or 10000
  (admittedly quite arbitrary), immediately clean up the syscaches
  that accept/need removed-reloid cleanup.  (The OID hash might be
  needed separately for each target cache to avoid redundant
  scans, or to avoid needing a kind of generation management in
  the OID hash.)  A rough sketch of steps 1 and 2 appears after
  this list.

3.
> 1. The first time we are due to expand the hash table, we check
> whether we can forestall that expansion by doing a cleanup; if so, we
> do that instead.

  And if there's any entry in the removed-reloid hash it is
  considered while cleanup.

4.
> 2. After that, we just expand.
> 
> That seems like a fairly good idea, although it might be a better idea
> to allow cleanup if enough time has passed.  If we hit the expansion
> threshold twice an hour apart, there's no reason not to try cleanup
> again.

1 + 2 and 3 + 4 can be implemented as separate patches and I'll
do the latter first.
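
For concreteness, a minimal C sketch of steps 1 and 2 above (queue the OIDs
of dropped relations, and flush the affected caches when the queue gets
long). The names (RemovedRelQueue, remember_removed_relation,
cleanup_caches_for_removed_rels) and the threshold are invented for this
example and are not part of any existing PostgreSQL API.

#include <stddef.h>

typedef unsigned int Oid;               /* stand-in for PostgreSQL's Oid */

#define REMOVED_REL_THRESHOLD 1000      /* arbitrary, as noted above */

typedef struct RemovedRelQueue
{
    Oid         oids[REMOVED_REL_THRESHOLD];
    int         count;
} RemovedRelQueue;

static RemovedRelQueue removed_rels;

/* assumed helper that scans the target caches for the listed relations */
extern void cleanup_caches_for_removed_rels(const Oid *oids, int count);

/* step 1: called from the invalidation path when a relation is dropped */
void
remember_removed_relation(Oid relid)
{
    removed_rels.oids[removed_rels.count++] = relid;

    /* step 2: queue is full, clean the affected syscaches immediately */
    if (removed_rels.count >= REMOVED_REL_THRESHOLD)
    {
        cleanup_caches_for_removed_rels(removed_rels.oids,
                                        removed_rels.count);
        removed_rels.count = 0;
    }
}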

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

От
Robert Haas
Дата:
On Tue, Dec 19, 2017 at 3:31 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> I see three kinds of syscache entries.
>
> A. An entry for an actually existing object.
> B. An entry for an object which once existed but no longer.
> C. An entry for a just non-existent objects.

I'm not convinced that it's useful to divide things up this way.
Regardless of whether the syscache entry is a positive entry, a
negative entry for a dropped object, or a negative entry for an
object that never existed in the first place, it's valuable if it's
likely to get used again and worthless if not.  Positive entries may
get used repeatedly, or not; negative entries may get used repeatedly,
or not.

>> Generally, the way I'm viewing this is that a syscache entry means
>> paying memory to save CPU time.  Each 8kB of memory we use to store
>> system cache entries is one less block we have for the OS page cache
>> to hold onto our data blocks.  If we had an oracle (the kind from
>
> Sure
>
>> Delphi, not Redwood City) that told us with perfect accuracy when to
>> discard syscache entries, it would throw away syscache entries
>
> Except for the B in the aboves. The logic seems somewhat alien to
> the time-based cleanup but this can be the measure for quick
> bloat of some syscahces.

I guess I still don't see why B is different.  If somebody sits there
and runs queries against non-existent table names at top speed, maybe
they'll query the same non-existent table entries more than once, in
which case keeping the negative entries for the non-existent table
names around until they stop doing it may improve performance.  If
they are sitting there and running queries against randomly-generated
non-existent table names at top speed, then they'll generate a lot of
catcache bloat, but that's not really any different from a database
with a large number of tables that DO exist which are queried at
random.  Workloads that access a lot of objects, whether those objects
exist or not, are going to use up a lot of cache entries, and I guess
that just seems OK to me.

> Agreed. The following is the whole image of the measure for
> syscache bloat considering "quick bloat". (I still think it is
> wanted under some situations.)
>
> 1. If a removal of any objects that make some syscache entries
>   stale (this cannot be checked without scanning whole a hash so
>   just queue it into, for exameple, recently_removed_relations
>   OID hash.)

If we just let some sort of cleanup process that generally blows away
rarely-used entries get rid of those entries too, then it should
handle this case, too, because the cache entries pertaining to removed
relations (or schemas) probably won't get used after that (and if they
do, we should keep them).  So I don't see that there is a need for
this, and it drew objections upthread because of the cost of scanning
the whole hash table.  Batching relations together might help, but it
doesn't really seem worth trying to sort out the problems with this
idea when we can do something better and more general.

> 2. If the number of the oid-hash entries reasches 1000 or 10000
>   (mmm. quite arbitrary..), Immediately clean up syscaches that
>   accepts/needs removed-reloid cleanup.  (The oid hash might be
>   needed separately for each target cache to avoid readandunt
>   scan, or to get rid a kind of generation management in the oid
>   hash.)

That is bound to draw a strong negative response from Tom, and for
good reason.  If the number of relations in the working set is 1001
and your cleanup threshold is 1000, cleanups will happen constantly
and performance will be poor.  This is exactly why, as I said in the
second email on this thread, the limit of on the size of the relcache
was removed.

>> 1. The first time we are due to expand the hash table, we check
>> whether we can forestall that expansion by doing a cleanup; if so, we
>> do that instead.
>
>   And if there's any entry in the removed-reloid hash it is
>   considered while cleanup.

As I say, I don't think there's any need for a removed-reloid hash.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Protect syscache from bloating with negative cache entries

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Dec 19, 2017 at 3:31 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> I see three kinds of syscache entries.
>> 
>> A. An entry for an actually existing object.
>> B. An entry for an object which once existed but no longer.
>> C. An entry for a just non-existent objects.

> I'm not convinced that it's useful to divide things up this way.

Actually, I don't believe that case B exists at all; such an entry
should get blown away by syscache invalidation when we commit the
DROP command.  If one were to stick around, we'd risk false positive
lookups later.

> I guess I still don't see why B is different.  If somebody sits there
> and runs queries against non-existent table names at top speed, maybe
> they'll query the same non-existent table entries more than once, in
> which case keeping the negative entries for the non-existent table
> names around until they stop doing it may improve performance.

FWIW, my recollection is that the reason for negative cache entries
is that there are some very common patterns where we probe for object
names (not just table names, either) that aren't there, typically as
a side effect of walking through the search_path looking for a match
to an unqualified object name.  Those cache entries aren't going to
get any less useful than the positive entry for the ultimately-found
object.  So from a lifespan point of view I'm not very sure that it's
worth distinguishing cases A and C.

It's conceivable that we could rewrite all the lookup algorithms
so that they didn't require negative cache entries to have good
performance ... but I doubt that that's easy to do.

            regards, tom lane


Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
At Tue, 19 Dec 2017 13:14:09 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <748.1513707249@sss.pgh.pa.us>
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Tue, Dec 19, 2017 at 3:31 AM, Kyotaro HORIGUCHI
> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >> I see three kinds of syscache entries.
> >> 
> >> A. An entry for an actually existing object.
> >> B. An entry for an object which once existed but no longer.
> >> C. An entry for a just non-existent objects.
> 
> > I'm not convinced that it's useful to divide things up this way.
> 
> Actually, I don't believe that case B exists at all; such an entry
> should get blown away by syscache invalidation when we commit the
> DROP command.  If one were to stick around, we'd risk false positive
> lookups later.

As I have shown upthread, access to a temporary table (*1) leaves
several STATRELATTINH entries behind after the DROP, and they never
get a chance to be deleted. SELECTing a nonexistent table in a
schema (*2) likewise leaves a RELNAMENSP entry behind after the
schema is dropped. I'm not sure the latter happens very frequently,
but the former happens rather often and quickly bloats the syscache
once it does. No false positives can occur, since such entries
cannot be reached without their parent objects, but on the other
hand they never get a chance to be deleted.

*1: begin;
    create temp table t1 (a int, b int, c int, d int, e int,
                          f int, g int, h int, i int, j int) on commit drop;
    insert into t1 values (1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
    select * from t1;
    commit;

*2: create schema foo; select * from foo.invalid; drop schema foo;

> > I guess I still don't see why B is different.  If somebody sits there
> > and runs queries against non-existent table names at top speed, maybe
> > they'll query the same non-existent table entries more than once, in
> > which case keeping the negative entries for the non-existent table
> > names around until they stop doing it may improve performance.
> 
> FWIW, my recollection is that the reason for negative cache entries
> is that there are some very common patterns where we probe for object
> names (not just table names, either) that aren't there, typically as
> a side effect of walking through the search_path looking for a match
> to an unqualified object name.  Those cache entries aren't going to
> get any less useful than the positive entry for the ultimately-found
> object.  So from a lifespan point of view I'm not very sure that it's
> worth distinguishing cases A and C.

Agreed.

> It's conceivable that we could rewrite all the lookup algorithms
> so that they didn't require negative cache entries to have good
> performance ... but I doubt that that's easy to do.

That sounds to me like making a systable scan perform as well as
the local hash does. A lockless systable (index) scan might work,
if that is possible?

Anyway, I think we have reached a consensus that time-tick-based
expiration is promising, so I'll work on that approach as the
first step.

Thanks!

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
At Fri, 22 Dec 2017 13:47:16 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20171222.134716.88479707.horiguchi.kyotaro@lab.ntt.co.jp>
> Anyway, I think we are reached to a consensus that the
> time-tick-based expiration is promising. So I'll work on the way
> as the first step.

So here is the patch. It has become simpler.

# I have come to think that the second step is not needed.

I'm not sure that no syscache access happens outside a statement,
but the operations that lead to the bloat seem to be performed
while processing a statement, so the statement timestamp seems
sufficient as the aging clock.

At first I tried the simple strategy of removing entries that have
been left alone for 30 minutes or more, but I still want to
alleviate quick bloat (from never-reused entries), so I also
introduced a clock-sweep-like aging mechanism. An entry is created
with naccess = 0, which is incremented up to 2 each time the entry
is accessed. The removal side decrements naccess for entries older
than 600 seconds and actually removes an entry once it reaches 0.
Entries that are created and never used go away in 600 seconds,
while entries that have been accessed several times get 1800
seconds' grace after the last access.

We could shrink the bucket array as well, but I didn't, since it
is not that large and tends to grow back to the same size again
shortly.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 733,738 **** void
--- 733,741 ----
  SetCurrentStatementStartTimestamp(void)
  {
      stmtStartTimestamp = GetCurrentTimestamp();
+ 
+     /* Set this time stamp as aproximated current time */
+     SetCatCacheClock(stmtStartTimestamp);
  }
  
  /*
*** a/src/backend/utils/cache/catcache.c
--- b/src/backend/utils/cache/catcache.c
***************
*** 74,79 ****
--- 74,82 ----
  /* Cache management header --- pointer is NULL until created */
  static CatCacheHeader *CacheHdr = NULL;
  
+ /* Timestamp used for any operation on caches. */
+ TimestampTz    catcacheclock = 0;
+ 
  static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                         int nkeys,
                         Datum v1, Datum v2,
***************
*** 866,875 **** InitCatCache(int id,
--- 869,969 ----
       */
      MemoryContextSwitchTo(oldcxt);
  
+     /* initilize catcache reference clock if haven't done yet */
+     if (catcacheclock == 0)
+         catcacheclock = GetCurrentTimestamp();
+ 
      return cp;
  }
  
  /*
+  * Remove entries that haven't been accessed for a certain time.
+  *
+  * Sometimes catcache entries are left unremoved for several reasons. We
+  * cannot allow them to eat up the usable memory and still it is better to
+  * remove entries that are no longer accessed from the perspective of memory
+  * performance ratio. Unfortunately we cannot predict that but we can assume
+  * that entries that are not accessed for long time no longer contribute to
+  * performance.
+  */
+ static bool
+ CatCacheCleanupOldEntries(CatCache *cp)
+ {
+     int            i;
+     int            nremoved = 0;
+ #ifdef CATCACHE_STATS
+     int            ntotal = 0;
+     int            tm[] = {30, 60, 600, 1200, 1800, 0};
+     int            cn[6] = {0, 0, 0, 0, 0};
+     int            cage[3] = {0, 0, 0};
+ #endif
+ 
+     /* Move all entries from old hash table to new. */
+     for (i = 0; i < cp->cc_nbuckets; i++)
+     {
+         dlist_mutable_iter iter;
+ 
+         dlist_foreach_modify(iter, &cp->cc_bucket[i])
+         {
+             CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+             long s;
+             int us;
+ 
+ 
+             TimestampDifference(ct->lastaccess, catcacheclock, &s, &us);
+ 
+ #ifdef CATCACHE_STATS
+             {
+                 int j;
+ 
+                 ntotal++;
+                 for (j = 0 ; tm[j] != 0 && s > tm[j] ; j++);
+                 if (tm[j] == 0) j--;
+                 cn[j]++;
+             }
+ #endif
+ 
+             /*
+              * Remove entries older than 600 seconds but not recently used.
+              * Entries that are not accessed after creation are removed in 600
+              * seconds, and that has been used several times are removed after
+              * 30 minumtes ignorance. We don't try shrink buckets since they
+              * are not the major part of syscache bloat and they are expected
+              * to be filled shortly again.
+              */
+             if (s > 600)
+             {
+ #ifdef CATCACHE_STATS
+                 Assert (ct->naccess >= 0 && ct->naccess <= 2);
+                 cage[ct->naccess]++;
+ #endif
+                 if (ct->naccess > 0)
+                     ct->naccess--;
+                 else
+                 {
+                     if (!ct->c_list || ct->c_list->refcount == 0)
+                     {
+                         CatCacheRemoveCTup(cp, ct);
+                         nremoved++;
+                     }
+                 }
+             }
+         }
+     }
+ 
+ #ifdef CATCACHE_STATS
+     ereport(DEBUG2,
+             (errmsg ("removed %d/%d, age(-30s:%d, -60s:%d, -600s:%d, -1200s:%d, -1800:%d) naccessed(0:%d, 1:%d,
2:%d)",
+                 nremoved, ntotal,
+                 cn[0], cn[1], cn[2], cn[3], cn[4],
+                 cage[0], cage[1], cage[2]),
+              errhidestmt(true)));
+ #endif
+ 
+     return nremoved > 0;
+ }
+ 
+ /*
   * Enlarge a catcache, doubling the number of buckets.
   */
  static void
***************
*** 1282,1287 **** SearchCatCacheInternal(CatCache *cache,
--- 1376,1389 ----
           */
          dlist_move_head(bucket, &ct->cache_elem);
  
+ 
+         /*
+          * Update the last access time of this entry
+          */
+         if (ct->naccess < 2)
+             ct->naccess++;
+         ct->lastaccess = catcacheclock;
+ 
          /*
           * If it's a positive entry, bump its refcount and return it. If it's
           * negative, we can report failure to the caller.
***************
*** 1901,1906 **** CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
--- 2003,2010 ----
      ct->dead = false;
      ct->negative = negative;
      ct->hash_value = hashValue;
+     ct->naccess = 0;
+     ct->lastaccess = catcacheclock;
  
      dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
  
***************
*** 1911,1917 **** CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
       * If the hash table has become too full, enlarge the buckets array. Quite
       * arbitrarily, we enlarge when fill factor > 2.
       */
!     if (cache->cc_ntup > cache->cc_nbuckets * 2)
          RehashCatCache(cache);
  
      return ct;
--- 2015,2022 ----
       * If the hash table has become too full, enlarge the buckets array. Quite
       * arbitrarily, we enlarge when fill factor > 2.
       */
!     if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
!         !CatCacheCleanupOldEntries(cache))
          RehashCatCache(cache);
  
      return ct;
*** a/src/include/utils/catcache.h
--- b/src/include/utils/catcache.h
***************
*** 119,124 **** typedef struct catctup
--- 119,126 ----
      bool        dead;            /* dead but not yet removed? */
      bool        negative;        /* negative cache entry? */
      HeapTupleData tuple;        /* tuple management header */
+     int            naccess;        /* # of access to this entry  */
+     TimestampTz    lastaccess;        /* approx. TS of the last access/modification */
  
      /*
       * The tuple may also be a member of at most one CatCList.  (If a single
***************
*** 189,194 **** typedef struct catcacheheader
--- 191,203 ----
  /* this extern duplicates utils/memutils.h... */
  extern PGDLLIMPORT MemoryContext CacheMemoryContext;
  
+ extern PGDLLIMPORT TimestampTz catcacheclock;
+ static inline void
+ SetCatCacheClock(TimestampTz ts)
+ {
+     catcacheclock = ts;
+ }
+ 
  extern void CreateCacheMemoryContext(void);
  
  extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,

Re: Protect syscache from bloating with negative cache entries

От
Andres Freund
Дата:
On 2017-12-26 18:19:16 +0900, Kyotaro HORIGUCHI wrote:
> --- a/src/backend/access/transam/xact.c
> +++ b/src/backend/access/transam/xact.c
> @@ -733,6 +733,9 @@ void
>  SetCurrentStatementStartTimestamp(void)
>  {
>      stmtStartTimestamp = GetCurrentTimestamp();
> +
> +    /* Set this time stamp as aproximated current time */
> +    SetCatCacheClock(stmtStartTimestamp);
>  }

Hm.


> + * Remove entries that haven't been accessed for a certain time.
> + *
> + * Sometimes catcache entries are left unremoved for several reasons.

I'm unconvinced that that's ok for positive entries, entirely regardless
of this patch.


> We
> + * cannot allow them to eat up the usable memory and still it is better to
> + * remove entries that are no longer accessed from the perspective of memory
> + * performance ratio. Unfortunately we cannot predict that but we can assume
> + * that entries that are not accessed for long time no longer contribute to
> + * performance.
> + */

This needs polish.


> +static bool
> +CatCacheCleanupOldEntries(CatCache *cp)
> +{
> +    int            i;
> +    int            nremoved = 0;
> +#ifdef CATCACHE_STATS
> +    int            ntotal = 0;
> +    int            tm[] = {30, 60, 600, 1200, 1800, 0};
> +    int            cn[6] = {0, 0, 0, 0, 0};
> +    int            cage[3] = {0, 0, 0};
> +#endif

This doesn't look nice: the names aren't descriptive enough to be self
evident, and there are no comments explaining what these random arrays
mean. And some specify a length (and have differing numbers of
elements!) and others don't.


> +    /* Move all entries from old hash table to new. */
> +    for (i = 0; i < cp->cc_nbuckets; i++)
> +    {
> +        dlist_mutable_iter iter;
> +
> +        dlist_foreach_modify(iter, &cp->cc_bucket[i])
> +        {
> +            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
> +            long s;
> +            int us;
> +
> +
> +            TimestampDifference(ct->lastaccess, catcacheclock, &s, &us);
> +
> +#ifdef CATCACHE_STATS
> +            {
> +                int j;
> +
> +                ntotal++;
> +                for (j = 0 ; tm[j] != 0 && s > tm[j] ; j++);
> +                if (tm[j] == 0) j--;
> +                cn[j]++;
> +            }
> +#endif

What?


> +            /*
> +             * Remove entries older than 600 seconds but not recently used.
> +             * Entries that are not accessed after creation are removed in 600
> +             * seconds, and that has been used several times are removed after
> +             * 30 minumtes ignorance. We don't try shrink buckets since they
> +             * are not the major part of syscache bloat and they are expected
> +             * to be filled shortly again.
> +             */
> +            if (s > 600)
> +            {

So this is hardcoded, without any sort of cache pressure logic? Doesn't
that mean we'll often *severely* degrade performance if a backend is
idle for a while?


Greetings,

Andres Freund


Re: Protect syscache from bloating with negative cache entries

From
Robert Haas
Date:
On Thu, Mar 1, 2018 at 1:54 PM, Andres Freund <andres@anarazel.de> wrote:
> So this is hardcoded, without any sort of cache pressure logic? Doesn't
> that mean we'll often *severely* degrade performance if a backend is
> idle for a while?

Well, it is true that if we flush cache entries that haven't been used
in a long time, a backend that is idle for a long time might be a bit
slow when (and if) it eventually becomes non-idle, because it may have
to reload some of those flushed entries.  On the other hand, a backend
that holds onto a large number of cache entries that it's not using
for tens of minutes at a time degrades the performance of the whole
system unless, of course, you're running on a machine that is under no
memory pressure at all.  I don't understand why people keep acting as
if holding onto cache entries regardless of how infrequently they're
being used is an unalloyed good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Protect syscache from bloating with negative cache entries

From
Andres Freund
Date:
Hi,

On 2018-03-01 14:24:56 -0500, Robert Haas wrote:
> On Thu, Mar 1, 2018 at 1:54 PM, Andres Freund <andres@anarazel.de> wrote:
> > So this is hardcoded, without any sort of cache pressure logic? Doesn't
> > that mean we'll often *severely* degrade performance if a backend is
> > idle for a while?
> 
> Well, it is true that if we flush cache entries that haven't been used
> in a long time, a backend that is idle for a long time might be a bit
> slow when (and if) it eventually becomes non-idle, because it may have
> to reload some of those flushed entries.

Right. Which might be very painful latency wise. And with poolers it's
pretty easy to get into situations like that, without the app
influencing it.


> On the other hand, a backend that holds onto a large number of cache
> entries that it's not using for tens of minutes at a time degrades the
> performance of the whole system unless, of course, you're running on a
> machine that is under no memory pressure at all.

But it's *extremely* common to have no memory pressure these days. The
inverse definitely also exists.


> I don't understand why people keep acting as if holding onto cache
> entries regardless of how infrequently they're being used is an
> unalloyed good.

Huh? I'm definitely not arguing for that? I think we want a feature like
this, I just don't think the logic when to prune is quite sophisticated
enough?

Greetings,

Andres Freund


Re: Protect syscache from bloating with negative cache entries

From
Robert Haas
Date:
On Thu, Mar 1, 2018 at 2:29 PM, Andres Freund <andres@anarazel.de> wrote:
> Right. Which might be very painful latency wise. And with poolers it's
> pretty easy to get into situations like that, without the app
> influencing it.

Really?  I'm not sure I believe that.  You're talking perhaps a few
milliseconds - maybe less - of additional latency on a connection
that's been idle for many minutes.  You need to have a workload that
involves leaving connections idle for very long periods but has
extremely tight latency requirements when it does finally send a
query.  I suppose such workloads exist, but I would not think them
common.

Anyway, I don't mind making the exact timeout a GUC (with 0 disabling
the feature altogether) if that addresses your concern, but in general
I think that it's reasonable to accept that a connection that's been
idle for a long time may have a little bit more latency than usual
when you start using it again.  That could happen for other reasons
anyway -- e.g. the cache could have been flushed because of concurrent
DDL on the objects you were accessing, by a syscache reset caused by a
flood of temp objects being created, or by the operating system
deciding to page out some of your data, or by your data getting
evicted from the CPU caches, or by being scheduled onto a NUMA node
different than the one that contains its data.  Operating systems have
been optimizing for the performance of relatively active processes
over ones that have been idle for a long time since the 1960s or
earlier, and I don't know of any reason why PostgreSQL shouldn't do
the same.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Protect syscache from bloating with negative cache entries

From
Andres Freund
Date:
On 2018-03-01 14:49:26 -0500, Robert Haas wrote:
> On Thu, Mar 1, 2018 at 2:29 PM, Andres Freund <andres@anarazel.de> wrote:
> > Right. Which might be very painful latency wise. And with poolers it's
> > pretty easy to get into situations like that, without the app
> > influencing it.
> 
> Really?  I'm not sure I believe that.  You're talking perhaps a few
> milliseconds - maybe less - of additional latency on a connection
> that's been idle for many minutes.

I've seen latency increases in second+ ranges due to empty cat/sys/rel
caches.  And the connection doesn't have to be idle, it might just have
been active for a different application doing different things, thus
accessing different cache entries.  With a pooler you can trivially end
up switch connections occasionally between different [parts of]
applications, and you don't want performance to suck after each time.
You also don't want to use up all memory, I entirely agree on that.


> Anyway, I don't mind making the exact timeout a GUC (with 0 disabling
> the feature altogether) if that addresses your concern, but in general
> I think that it's reasonable to accept that a connection that's been
> idle for a long time may have a little bit more latency than usual
> when you start using it again.

I don't think that'd quite address my concern. I just don't think that
the granularity (drop all entries older than xxx sec at the next resize)
is right. For one I don't want to drop stuff if the cache size isn't a
problem for the current memory budget. For another, I'm not convinced
that dropping entries from the current "generation" at resize won't end
up throwing away too much.

If we'd a guc 'syscache_memory_target' and we'd only start pruning if
above it, I'd be much happier.


Greetings,

Andres Freund


Re: Protect syscache from bloating with negative cache entries

From
Robert Haas
Date:
On Thu, Mar 1, 2018 at 3:01 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2018-03-01 14:49:26 -0500, Robert Haas wrote:
>> On Thu, Mar 1, 2018 at 2:29 PM, Andres Freund <andres@anarazel.de> wrote:
>> > Right. Which might be very painful latency wise. And with poolers it's
>> > pretty easy to get into situations like that, without the app
>> > influencing it.
>>
>> Really?  I'm not sure I believe that.  You're talking perhaps a few
>> milliseconds - maybe less - of additional latency on a connection
>> that's been idle for many minutes.
>
> I've seen latency increases in second+ ranges due to empty cat/sys/rel
> caches.

How is that even possible unless the system is grossly overloaded?

>> Anyway, I don't mind making the exact timeout a GUC (with 0 disabling
>> the feature altogether) if that addresses your concern, but in general
>> I think that it's reasonable to accept that a connection that's been
>> idle for a long time may have a little bit more latency than usual
>> when you start using it again.
>
> I don't think that'd quite address my concern. I just don't think that
> the granularity (drop all entries older than xxx sec at the next resize)
> is right. For one I don't want to drop stuff if the cache size isn't a
> problem for the current memory budget. For another, I'm not convinced
> that dropping entries from the current "generation" at resize won't end
> up throwing away too much.

I think that a fixed memory budget for the syscache is an idea that
was tried many years ago and basically failed, because it's very easy
to end up with terrible eviction patterns -- e.g. if you are accessing
11 relations in round-robin fashion with a 10-relation cache, your
cache nets you a 0% hit rate but takes a lot more maintenance than
having no cache at all.  The time-based approach lets the cache grow
with no fixed upper limit without allowing unused entries to stick
around forever.
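
To make that concrete, here is a minimal standalone sketch (purely
illustrative C, not taken from PostgreSQL, and the sizes are arbitrary)
of the 11-versus-10 case: a strict LRU cache with 10 slots, accessed
round-robin with 11 keys, never gets a single hit but still pays
move-to-front maintenance on every access.

/*
 * Minimal illustration (not PostgreSQL code): 11 keys accessed
 * round-robin against a strict LRU cache with 10 slots never hit,
 * yet every access still pays the full maintenance cost.
 */
#include <stdio.h>

#define NSLOTS  10
#define NKEYS   (NSLOTS + 1)

int
main(void)
{
    int     lru[NSLOTS];        /* lru[0] is the most recently used key */
    int     nused = 0;
    long    hits = 0;
    long    accesses = 10000;
    long    a;

    for (a = 0; a < accesses; a++)
    {
        int     key = a % NKEYS;    /* round-robin access pattern */
        int     i;

        /* look the key up */
        for (i = 0; i < nused && lru[i] != key; i++)
            ;
        if (i < nused)
            hits++;             /* found */
        else if (nused < NSLOTS)
            nused++;            /* miss, a free slot is still available */
        else
            i = NSLOTS - 1;     /* miss, evict the least recently used */

        /* move-to-front maintenance, paid on every access */
        for (; i > 0; i--)
            lru[i] = lru[i - 1];
        lru[0] = key;
    }

    printf("hit rate: %.1f%%\n", 100.0 * hits / accesses);  /* prints 0.0% */
    return 0;
}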

> If we'd a guc 'syscache_memory_target' and we'd only start pruning if
> above it, I'd be much happier.

It does seem reasonable to skip pruning altogether if the cache is
below some threshold size.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Protect syscache from bloating with negative cache entries

From
Andres Freund
Date:
Hi,

On 2018-03-01 15:19:26 -0500, Robert Haas wrote:
> On Thu, Mar 1, 2018 at 3:01 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2018-03-01 14:49:26 -0500, Robert Haas wrote:
> >> On Thu, Mar 1, 2018 at 2:29 PM, Andres Freund <andres@anarazel.de> wrote:
> >> > Right. Which might be very painful latency wise. And with poolers it's
> >> > pretty easy to get into situations like that, without the app
> >> > influencing it.
> >>
> >> Really?  I'm not sure I believe that.  You're talking perhaps a few
> >> milliseconds - maybe less - of additional latency on a connection
> >> that's been idle for many minutes.
> >
> > I've seen latency increases in second+ ranges due to empty cat/sys/rel
> > caches.
> 
> How is that even possible unless the system is grossly overloaded?

You just need to have catalog contents out of cache and statements
touching a few relations, functions, etc. Indexscan + heap fetch
latencies do add up quite quickly if done sequentially.
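As a rough illustration, with assumed rather than measured numbers: at,
say, a millisecond of random I/O per cold catalog lookup, and a few
hundred lookups for the relations, attributes, types, operators and
functions a statement touches, you are already at several hundred
milliseconds before the query proper even starts.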


> > I don't think that'd quite address my concern. I just don't think that
> > the granularity (drop all entries older than xxx sec at the next resize)
> > is right. For one I don't want to drop stuff if the cache size isn't a
> > problem for the current memory budget. For another, I'm not convinced
> > that dropping entries from the current "generation" at resize won't end
> > up throwing away too much.
> 
> I think that a fixed memory budget for the syscache is an idea that
> was tried many years ago and basically failed, because it's very easy
> to end up with terrible eviction patterns -- e.g. if you are accessing
> 11 relations in round-robin fashion with a 10-relation cache, your
> cache nets you a 0% hit rate but takes a lot more maintenance than
> having no cache at all.  The time-based approach lets the cache grow
> with no fixed upper limit without allowing unused entries to stick
> around forever.

I definitely think we want a time based component to this, I just want
to not prune at all if we're below a certain size.


> > If we'd a guc 'syscache_memory_target' and we'd only start pruning if
> > above it, I'd be much happier.
> 
> It does seem reasonable to skip pruning altogether if the cache is
> below some threshold size.

Cool. There might be some issues making that check performant enough,
but I don't have a good intuition on it.

Greetings,

Andres Freund


Re: Protect syscache from bloating with negative cache entries

From
Kyotaro HORIGUCHI
Date:
Hello.

Thank you for the discussion, and sorry for being late to come.

At Thu, 1 Mar 2018 12:26:30 -0800, Andres Freund <andres@anarazel.de> wrote in
<20180301202630.2s6untij2x5hpksn@alap3.anarazel.de>
> Hi,
> 
> On 2018-03-01 15:19:26 -0500, Robert Haas wrote:
> > On Thu, Mar 1, 2018 at 3:01 PM, Andres Freund <andres@anarazel.de> wrote:
> > > On 2018-03-01 14:49:26 -0500, Robert Haas wrote:
> > >> On Thu, Mar 1, 2018 at 2:29 PM, Andres Freund <andres@anarazel.de> wrote:
> > >> > Right. Which might be very painful latency wise. And with poolers it's
> > >> > pretty easy to get into situations like that, without the app
> > >> > influencing it.
> > >>
> > >> Really?  I'm not sure I believe that.  You're talking perhaps a few
> > >> milliseconds - maybe less - of additional latency on a connection
> > >> that's been idle for many minutes.
> > >
> > > I've seen latency increases in second+ ranges due to empty cat/sys/rel
> > > caches.
> > 
> > How is that even possible unless the system is grossly overloaded?
> 
> You just need to have catalog contents out of cache and statements
> touching a few relations, functions, etc. Indexscan + heap fetch
> latencies do add up quite quickly if done sequentially.
> 
> 
> > > I don't think that'd quite address my concern. I just don't think that
> > > the granularity (drop all entries older than xxx sec at the next resize)
> > > is right. For one I don't want to drop stuff if the cache size isn't a
> > > problem for the current memory budget. For another, I'm not convinced
> > > that dropping entries from the current "generation" at resize won't end
> > > up throwing away too much.
> > 
> > I think that a fixed memory budget for the syscache is an idea that
> > was tried many years ago and basically failed, because it's very easy
> > to end up with terrible eviction patterns -- e.g. if you are accessing
> > 11 relations in round-robin fashion with a 10-relation cache, your
> > cache nets you a 0% hit rate but takes a lot more maintenance than
> > having no cache at all.  The time-based approach lets the cache grow
> > with no fixed upper limit without allowing unused entries to stick
> > around forever.
> 
> I definitely think we want a time based component to this, I just want
> to not prune at all if we're below a certain size.
> 
> 
> > > If we'd a guc 'syscache_memory_target' and we'd only start pruning if
> > > above it, I'd be much happier.
> > 
> > It does seem reasonable to skip pruning altogether if the cache is
> > below some threshold size.
> 
> Cool. There might be some issues making that check performant enough,
> but I don't have a good intuition on it.

So..

- Now it gets two new GUC variables named syscache_prune_min_age
  and syscache_memory_target. The former replaces the previous
  magic number 600 and defaults to the same value. The latter
  prevents syscache pruning until the cache exceeds that size and
  defaults to 0, meaning that pruning is always considered.
  Documentation for the two variables is also added.

- Revised the comment for CatCacheCleanupOldEntries that was
  pointed out as cryptic, and added some more comments.

- Fixed the name of the variables for CATCACHE_STATS to be more
  descriptive, and added some comments for the code.

Catcache entries accessed within the current transaction won't be
pruned, so theoretically a long transaction can still bloat the
catcache. But I believe that is quite rare, and at least this
covers most other cases.
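
For a concrete sense of the threshold: the size check compares only the
bucket array, cc_nbuckets * sizeof(dlist_head) (16 bytes per bucket on
typical 64-bit builds), against syscache_memory_target; the memory used
by the entries themselves is not counted. So a setting of, say, 512kB
lets a cache grow to 32768 buckets, roughly 64k entries at the fill
factor of 2, before pruning is considered at all.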

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From d3b73b68ed4ce246a0892ac72ec2eed1a47429f2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 26 Dec 2017 17:43:09 +0900
Subject: [PATCH] Remove entries that haven't been used for a certain time

Catcache entries can be left unused for long periods for several
reasons, and it is not desirable that they keep eating up memory.
With this patch, entries that haven't been used for a certain time
are considered for removal before the hash array is enlarged.
---
 doc/src/sgml/config.sgml                      |  38 +++++++
 src/backend/access/transam/xact.c             |   3 +
 src/backend/utils/cache/catcache.c            | 158 +++++++++++++++++++++++++-
 src/backend/utils/misc/guc.c                  |  23 ++++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/utils/catcache.h                  |  19 ++++
 6 files changed, 238 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 259a2d83b4..fd25669abc 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1556,6 +1556,44 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target">
+      <term><varname>syscache_memory_target</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_memory_target</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory up to which the syscache can
+        grow without pruning. The value defaults to 0, indicating that
+        pruning is always considered. Once this size is exceeded, syscache
+        pruning is considered according to
+        <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep a
+        certain amount of syscache entries that are used only intermittently,
+        try increasing this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age">
+      <term><varname>syscache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the minimum amount of time in seconds for which a syscache
+        entry must remain unused before it becomes a candidate for removal.
+        -1 disables syscache pruning entirely. The value defaults to 600
+        seconds (<literal>10 minutes</literal>). Syscache entries that have
+        not been used for this duration can be removed to prevent syscache
+        bloat. This behavior is suppressed until the size of the syscache
+        exceeds <xref linkend="guc-syscache-memory-target"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index dbaaf8e005..86d76917bb 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -733,6 +733,9 @@ void
 SetCurrentStatementStartTimestamp(void)
 {
     stmtStartTimestamp = GetCurrentTimestamp();
+
+    /* Set this timestamp as the approximated current time */
+    SetCatCacheClock(stmtStartTimestamp);
 }
 
 /*
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 5ddbf6eab1..56d4f10019 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -71,9 +71,23 @@
 #define CACHE6_elog(a,b,c,d,e,f,g)
 #endif
 
+/*
+ * GUC variable to define the minimum hash size to consider entry eviction.
+ * Let the name be the same as the GUC variable name, not using 'catcache'.
+ */
+int syscache_memory_target = 0;
+
+/* GUC variable to define the minimum age in seconds of entries that will
+ * be considered for eviction.
+ */
+int syscache_prune_min_age = 600;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Timestamp used for any operation on caches. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -866,9 +880,133 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize the catcache reference clock if not done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left unused for long periods for several reasons.
+ * We remove entries that have not been accessed for a certain time to keep
+ * the catcache from bloating. The eviction uses an algorithm similar to
+ * buffer eviction, based on an access counter: entries that have been
+ * accessed several times can live longer than those with no recent access.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int            i;
+    int            nremoved = 0;
+    size_t        hash_size;
+#ifdef CATCACHE_STATS
+    /* These variables are only for debugging purposes */
+    int            ntotal = 0;
+    /*
+     * The nth element of nentries stores the number of cache entries that
+     * have lived unaccessed for the corresponding ageclass multiple of
+     * syscache_prune_min_age. The index of nremoved_entry is the value of
+     * the clock-sweep counter, which ranges from 0 up to 2.
+     */
+    double        ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+    int            nentries[] = {0, 0, 0, 0, 0, 0};
+    int            nremoved_entry[3] = {0, 0, 0};
+    int            j;
+#endif
+
+    /* Return immediately if no pruning is wanted */
+    if (syscache_prune_min_age < 0)
+        return false;
+
+    /*
+     * Return without pruning if the size of the hash is below the target.
+     * Since the area for bucket array is dominant, consider only it.
+     */
+    hash_size = cp->cc_nbuckets * sizeof(dlist_head);
+    if (hash_size < syscache_memory_target * 1024)
+        return false;
+    
+    /* Search the whole hash for entries to remove */
+    for (i = 0; i < cp->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+
+            /*
+             * Calculate the time elapsed since the last access relative to
+             * the "current" time. Since catcacheclock is not advanced within
+             * a transaction, entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+#ifdef CATCACHE_STATS
+            /* count catcache entries for each age class */
+            ntotal++;
+            for (j = 0 ;
+                 ageclass[j] != 0.0 &&
+                     entry_age > syscache_prune_min_age * ageclass[j] ;
+                 j++);
+            if (ageclass[j] == 0.0) j--;
+            nentries[j]++;
+#endif
+
+            /*
+             * Remove entries older than syscache_prune_min_age seconds that
+             * have not been used recently.  Entries never accessed since
+             * creation are removed after that many seconds, while entries
+             * that have been used several times are removed only after
+             * being left alone for up to three times that duration. We
+             * don't try to shrink the buckets since pruning effectively
+             * prevents the catcache from being enlarged in the long term.
+             */
+            if (entry_age > syscache_prune_min_age)
+            {
+#ifdef CATCACHE_STATS
+                Assert (ct->naccess >= 0 && ct->naccess <= 2);
+                nremoved_entry[ct->naccess]++;
+#endif
+                if (ct->naccess > 0)
+                    ct->naccess--;
+                else
+                {
+                    if (!ct->c_list || ct->c_list->refcount == 0)
+                    {
+                        CatCacheRemoveCTup(cp, ct);
+                        nremoved++;
+                    }
+                }
+            }
+        }
+    }
+
+#ifdef CATCACHE_STATS
+    ereport(DEBUG2,
+            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d), hash_size = %lu bytes, %d",
+                     nremoved, ntotal,
+                     ageclass[0] * syscache_prune_min_age, nentries[0],
+                     ageclass[1] * syscache_prune_min_age, nentries[1],
+                     ageclass[2] * syscache_prune_min_age, nentries[2],
+                     ageclass[3] * syscache_prune_min_age, nentries[3],
+                     ageclass[4] * syscache_prune_min_age, nentries[4],
+                     nremoved_entry[0], nremoved_entry[1], nremoved_entry[2],
+                     hash_size, syscache_memory_target
+                ),
+             errhidestmt(true)));
+#endif
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -1282,6 +1420,14 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+
+        /*
+         * Update the last access time of this entry
+         */
+        if (ct->naccess < 2)
+            ct->naccess++;
+        ct->lastaccess = catcacheclock;
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1813,7 +1959,6 @@ ReleaseCatCacheList(CatCList *list)
         CatCacheRemoveCList(list->my_cache, list);
 }
 
-
 /*
  * CatalogCacheCreateEntry
  *        Create a new CatCTup entry, copying the given HeapTuple and other
@@ -1906,6 +2051,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
@@ -1913,10 +2060,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     CacheHdr->ch_ntup++;
 
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, try cleanup by removing
+     * infrequently used entries to make room for the new entry. If that
+     * fails to remove any entry, enlarge the bucket array instead.  Quite
+     * arbitrarily, we try this when fill factor > 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
 
     return ct;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1db7845d5a..a63bc5eb79 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -78,6 +78,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/memutils.h"
 #include "utils/pg_locale.h"
@@ -1971,6 +1972,28 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"syscache_memory_target", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum syscache size to keep."),
+            gettext_noop("Syscache entries are not pruned until the size of the syscache exceeds this size."),
+            GUC_UNIT_KB
+        },
+        &syscache_memory_target,
+        0, 0, INT_MAX,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"syscache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum unused time before considering syscache pruning."),
+            gettext_noop("Syscache entries that have lived less than this many seconds will not be considered for pruning."),
+            GUC_UNIT_S
+        },
+        &syscache_prune_min_age,
+        600, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 39272925fb..0bda73d080 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -124,6 +124,7 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#syscache_prune_min_age = 600s    # minimum unused time before an entry can be pruned
 #max_stack_depth = 2MB            # min 100kB
 #dynamic_shared_memory_type = posix    # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 7b22f9c7bc..eb89a9f0d7 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -119,6 +120,8 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
+    int            naccess;        /* # of accesses to this entry */
+    TimestampTz    lastaccess;        /* approx. timestamp of the last usage */
 
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
@@ -189,6 +192,22 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int syscache_prune_min_age;
+extern int syscache_memory_target;
+
+/* to use as access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/*
+ * SetCatCacheClock - set timestamp for catcache access record
+ */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.16.2


Re: Protect syscache from bloating with negative cache entries

From
Kyotaro HORIGUCHI
Date:
Oops! The previous patch contained garbage in the debugging
output.

The attached is a new version without the garbage. In addition,
I changed my mind and now use DEBUG1 for the debug message, since
its frequency is quite low.

No changes in the following cited previous mail.

At Wed, 07 Mar 2018 16:19:23 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20180307.161923.178158050.horiguchi.kyotaro@lab.ntt.co.jp>
==================
Hello.

Thank you for the discussion, and sorry for being late to come.

At Thu, 1 Mar 2018 12:26:30 -0800, Andres Freund <andres@anarazel.de> wrote in
<20180301202630.2s6untij2x5hpksn@alap3.anarazel.de>
> Hi,
> 
> On 2018-03-01 15:19:26 -0500, Robert Haas wrote:
> > On Thu, Mar 1, 2018 at 3:01 PM, Andres Freund <andres@anarazel.de> wrote:
> > > On 2018-03-01 14:49:26 -0500, Robert Haas wrote:
> > >> On Thu, Mar 1, 2018 at 2:29 PM, Andres Freund <andres@anarazel.de> wrote:
> > >> > Right. Which might be very painful latency wise. And with poolers it's
> > >> > pretty easy to get into situations like that, without the app
> > >> > influencing it.
> > >>
> > >> Really?  I'm not sure I believe that.  You're talking perhaps a few
> > >> milliseconds - maybe less - of additional latency on a connection
> > >> that's been idle for many minutes.
> > >
> > > I've seen latency increases in second+ ranges due to empty cat/sys/rel
> > > caches.
> > 
> > How is that even possible unless the system is grossly overloaded?
> 
> You just need to have catalog contents out of cache and statements
> touching a few relations, functions, etc. Indexscan + heap fetch
> latencies do add up quite quickly if done sequentially.
> 
> 
> > > I don't think that'd quite address my concern. I just don't think that
> > > the granularity (drop all entries older than xxx sec at the next resize)
> > > is right. For one I don't want to drop stuff if the cache size isn't a
> > > problem for the current memory budget. For another, I'm not convinced
> > > that dropping entries from the current "generation" at resize won't end
> > > up throwing away too much.
> > 
> > I think that a fixed memory budget for the syscache is an idea that
> > was tried many years ago and basically failed, because it's very easy
> > to end up with terrible eviction patterns -- e.g. if you are accessing
> > 11 relations in round-robin fashion with a 10-relation cache, your
> > cache nets you a 0% hit rate but takes a lot more maintenance than
> > having no cache at all.  The time-based approach lets the cache grow
> > with no fixed upper limit without allowing unused entries to stick
> > around forever.
> 
> I definitely think we want a time based component to this, I just want
> to not prune at all if we're below a certain size.
> 
> 
> > > If we'd a guc 'syscache_memory_target' and we'd only start pruning if
> > > above it, I'd be much happier.
> > 
> > It does seem reasonable to skip pruning altogether if the cache is
> > below some threshold size.
> 
> Cool. There might be some issues making that check performant enough,
> but I don't have a good intuition on it.

So..

- Now it gets two new GUC variables named syscache_prune_min_age
  and syscache_memory_target. The former replaces the previous
  magic number 600 and defaults to the same value. The latter
  prevents syscache pruning until the cache exceeds that size and
  defaults to 0, meaning that pruning is always considered.
  Documentation for the two variables is also added.

- Revised the comment for CatCacheCleanupOldEntries that was
  pointed out as cryptic, and added some more comments.

- Fixed the name of the variables for CATCACHE_STATS to be more
  descriptive, and added some comments for the code.

Catcache entries accessed within the current transaction won't be
pruned, so theoretically a long transaction can still bloat the
catcache. But I believe that is quite rare, and at least this
covers most other cases.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 975e7e82d4eeb7d7b7ecf981141a8924297c46ef Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 26 Dec 2017 17:43:09 +0900
Subject: [PATCH] Remove entries that haven't been used for a certain time

Catcache entries can be left unused for long periods for several
reasons, and it is not desirable that they keep eating up memory.
With this patch, entries that haven't been used for a certain time
are considered for removal before the hash array is enlarged.
---
 doc/src/sgml/config.sgml                      |  38 +++++++
 src/backend/access/transam/xact.c             |   3 +
 src/backend/utils/cache/catcache.c            | 152 +++++++++++++++++++++++++-
 src/backend/utils/misc/guc.c                  |  23 ++++
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/utils/catcache.h                  |  19 ++++
 6 files changed, 233 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 259a2d83b4..782b506984 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1556,6 +1556,44 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target">
+      <term><varname>syscache_memory_target</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_memory_target</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory up to which the syscache can
+        grow without pruning. The value defaults to 0, indicating that
+        pruning is always considered. Once this size is exceeded, syscache
+        pruning is considered according to
+        <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep a
+        certain amount of syscache entries that are used only intermittently,
+        try increasing this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age">
+      <term><varname>syscache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the minimum amount of time in seconds for which a syscache
+        entry must remain unused before it becomes a candidate for removal.
+        -1 disables syscache pruning entirely. The value defaults to 600
+        seconds (<literal>10 minutes</literal>). Syscache entries that have
+        not been used for this duration can be removed to prevent syscache
+        bloat. This behavior is suppressed until the size of the syscache
+        exceeds <xref linkend="guc-syscache-memory-target"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index dbaaf8e005..86d76917bb 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -733,6 +733,9 @@ void
 SetCurrentStatementStartTimestamp(void)
 {
     stmtStartTimestamp = GetCurrentTimestamp();
+
+    /* Set this timestamp as the approximated current time */
+    SetCatCacheClock(stmtStartTimestamp);
 }
 
 /*
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 5ddbf6eab1..e4a9ab8789 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -71,9 +71,23 @@
 #define CACHE6_elog(a,b,c,d,e,f,g)
 #endif
 
+/*
+ * GUC variable to define the minimum hash size to consider entry eviction.
+ * Let the name be the same as the GUC variable name, not using 'catcache'.
+ */
+int syscache_memory_target = 0;
+
+/* GUC variable to define the minimum age in seconds of entries that will be
+ * considered for eviction. Ditto for the name.
+ */
+int syscache_prune_min_age = 600;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Timestamp used for any operation on caches. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -866,9 +880,130 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize the catcache reference clock if not done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left unused for long periods for several reasons.
+ * We remove them if they have not been accessed for a certain time to keep
+ * the catcache from bloating. The eviction uses an algorithm similar to
+ * buffer eviction, based on an access counter: entries that have been
+ * accessed several times can live longer than those with no recent access.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int            i;
+    int            nremoved = 0;
+    size_t        hash_size;
+#ifdef CATCACHE_STATS
+    /* These variables are only for debugging purposes */
+    int            ntotal = 0;
+    /*
+     * The nth element of nentries stores the number of cache entries that
+     * have lived unaccessed for the corresponding ageclass multiple of
+     * syscache_prune_min_age. The index of nremoved_entry is the value of
+     * the clock-sweep counter, which ranges from 0 up to 2.
+     */
+    double        ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+    int            nentries[] = {0, 0, 0, 0, 0, 0};
+    int            nremoved_entry[3] = {0, 0, 0};
+    int            j;
+#endif
+
+    /* Return immediately if no pruning is wanted */
+    if (syscache_prune_min_age < 0)
+        return false;
+
+    /*
+     * Return without pruning if the size of the hash is below the target.
+     * Since the area for bucket array is dominant, consider only it.
+     */
+    hash_size = cp->cc_nbuckets * sizeof(dlist_head);
+    if (hash_size < syscache_memory_target * 1024)
+        return false;
+    
+    /* Search the whole hash for entries to remove */
+    for (i = 0; i < cp->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+
+            /*
+             * Calculate the duration from the time of the last access to the
+             * "current" time. Since catcacheclock is not advanced within a
+             * transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+#ifdef CATCACHE_STATS
+            /* count catcache entries for each age class */
+            ntotal++;
+            for (j = 0 ;
+                 ageclass[j] != 0.0 &&
+                     entry_age > syscache_prune_min_age * ageclass[j] ;
+                 j++);
+            if (ageclass[j] == 0.0) j--;
+            nentries[j]++;
+#endif
+
+            /*
+             * Try to remove entries older than syscache_prune_min_age seconds.
+             * Entries not accessed since the last pruning are removed after
+             * that many seconds, while entries that have been accessed several
+             * times are removed only after being left alone for up to three
+             * times that duration. We don't try to shrink the buckets since
+             * pruning effectively caps catcache expansion in the long term.
+             */
+            if (entry_age > syscache_prune_min_age)
+            {
+#ifdef CATCACHE_STATS
+                Assert (ct->naccess >= 0 && ct->naccess <= 2);
+                nremoved_entry[ct->naccess]++;
+#endif
+                if (ct->naccess > 0)
+                    ct->naccess--;
+                else
+                {
+                    if (!ct->c_list || ct->c_list->refcount == 0)
+                    {
+                        CatCacheRemoveCTup(cp, ct);
+                        nremoved++;
+                    }
+                }
+            }
+        }
+    }
+
+#ifdef CATCACHE_STATS
+    ereport(DEBUG1,
+            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)",
+                     nremoved, ntotal,
+                     ageclass[0] * syscache_prune_min_age, nentries[0],
+                     ageclass[1] * syscache_prune_min_age, nentries[1],
+                     ageclass[2] * syscache_prune_min_age, nentries[2],
+                     ageclass[3] * syscache_prune_min_age, nentries[3],
+                     ageclass[4] * syscache_prune_min_age, nentries[4],
+                     nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]),
+             errhidestmt(true)));
+#endif
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -1282,6 +1417,11 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /* Update access information for pruning */
+        if (ct->naccess < 2)
+            ct->naccess++;
+        ct->lastaccess = catcacheclock;
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1813,7 +1953,6 @@ ReleaseCatCacheList(CatCList *list)
         CatCacheRemoveCList(list->my_cache, list);
 }
 
-
 /*
  * CatalogCacheCreateEntry
  *        Create a new CatCTup entry, copying the given HeapTuple and other
@@ -1906,6 +2045,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
@@ -1913,10 +2054,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     CacheHdr->ch_ntup++;
 
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, try cleanup by removing
+     * infrequently used entries to make room for the new entry. If that
+     * fails, enlarge the bucket array instead.  Quite arbitrarily, we try
+     * this when fill factor > 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
 
     return ct;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1db7845d5a..33abe04efe 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -78,6 +78,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/memutils.h"
 #include "utils/pg_locale.h"
@@ -1971,6 +1972,28 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"syscache_memory_target", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum syscache size to keep."),
+            gettext_noop("Syscache is not pruned before exceeding this size."),
+            GUC_UNIT_KB
+        },
+        &syscache_memory_target,
+        0, 0, INT_MAX,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"syscache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum unused duration before a syscache entry can be removed."),
+            gettext_noop("Syscache entries that stay unused for longer than this many seconds are considered for removal."),
+            GUC_UNIT_S
+        },
+        &syscache_prune_min_age,
+        600, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 39272925fb..5a5729a88f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -124,6 +124,8 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#syscache_memory_target = 0kB    # in kB. zero means pruning is always considered
+#syscache_prune_min_age = 600s    # -1 disables the feature
 #max_stack_depth = 2MB            # min 100kB
 #dynamic_shared_memory_type = posix    # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 7b22f9c7bc..c3c4d65998 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -119,6 +120,8 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
+    int            naccess;        /* # of accesses to this entry, up to 2 */
+    TimestampTz    lastaccess;        /* approx. timestamp of the last usage */
 
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
@@ -189,6 +192,22 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int syscache_prune_min_age;
+extern int syscache_memory_target;
+
+/* to use as access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/*
+ * SetCatCacheClock - set timestamp for catcache access record
+ */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.16.2


Re: Protect syscache from bloating with negative cache entries

From
Alvaro Herrera
Date:
The thing that comes to mind when reading this patch is that some time
ago we made fun of other database software, "they are so complicated to
configure, they have some magical settings that few people understand
how to set".  Postgres was so much better because it was simple to set
up, no magic crap.  But now it becomes apparent that that only was so
because Postgres sucked, ie., we hadn't yet gotten to the point where we
*needed* to introduce settings like that.  Now we finally are?

I have to admit being a little disappointed about that outcome.

I wonder if this is just because we refuse to acknowledge the notion of
a connection pooler.  If we did, and the pooler told us "here, this
session is being given back to us by the application, we'll keep it
around until the next app comes along", could we clean the oldest
inactive cache entries at that point?  Currently they use DISCARD for
that.  Though this does nothing to fix hypothetical cache bloat for
pg_dump in bug #14936.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Protect syscache from bloating with negative cache entries

From
Andres Freund
Date:
Hi,

On 2018-03-07 08:01:38 -0300, Alvaro Herrera wrote:
> I wonder if this is just because we refuse to acknowledge the notion of
> a connection pooler.  If we did, and the pooler told us "here, this
> session is being given back to us by the application, we'll keep it
> around until the next app comes along", could we clean the oldest
> inactive cache entries at that point?  Currently they use DISCARD for
> that.  Though this does nothing to fix hypothetical cache bloat for
> pg_dump in bug #14936.

I'm not seeing how this solves anything?  You don't want to throw all
caches away, therefore you need a target size.  Then there's also the
case of the cache being too large in a single "session".

Greetings,

Andres Freund


Re: Protect syscache from bloating with negative cache entries

From
Alvaro Herrera
Date:
Hello,

Andres Freund wrote:

> On 2018-03-07 08:01:38 -0300, Alvaro Herrera wrote:
> > I wonder if this is just because we refuse to acknowledge the notion of
> > a connection pooler.  If we did, and the pooler told us "here, this
> > session is being given back to us by the application, we'll keep it
> > around until the next app comes along", could we clean the oldest
> > inactive cache entries at that point?  Currently they use DISCARD for
> > that.  Though this does nothing to fix hypothetical cache bloat for
> > pg_dump in bug #14936.
> 
> I'm not seeing how this solves anything?  You don't want to throw all
> caches away, therefore you need a target size.  Then there's also the
> case of the cache being too large in a single "session".

Oh, I wasn't suggesting to throw away the whole cache at that point;
only that that is a convenient point to do whatever cleanup we want to do.
What I'm not clear about is exactly what is the cleanup that we want to
do at that point.  You say it should be based on some configured size;
Robert says any predefined size breaks [performance for] the case where
the workload uses size+1, so let's use time instead (evict anything not
used in more than X seconds?), but keeping in mind that a workload that
requires X+1 would also break.  So it seems we've arrived at the
conclusion that the only possible solution is to let the user tell us
what time/size to use.  But that sucks, because the user doesn't know
either (maybe they can measure, but how?), and they don't even know that
this setting is there to be tweaked; and if there is a performance
problem, how do they figure whether or not it can be fixed by fooling
with this parameter?  I mean, maybe it's set to 10 and we suggest "maybe
11 works better" but it turns out not to, so "maybe 12 works better"?
How do you know when to stop increasing it?

This seems a bit like max_fsm_pages, that is to say, a disaster that was
only fixed by removing it.

Thanks,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Protect syscache from bloating with negative cache entries

From
Andres Freund
Date:
On 2018-03-07 14:48:48 -0300, Alvaro Herrera wrote:
> Oh, I wasn't suggesting to throw away the whole cache at that point;
> only that that is a convenient point to do whatever cleanup we want to do.

But why is that better than doing so continuously?


> What I'm not clear about is exactly what is the cleanup that we want to
> do at that point.  You say it should be based on some configured size;
> Robert says any predefined size breaks [performance for] the case where
> the workload uses size+1, so let's use time instead (evict anything not
> used in more than X seconds?), but keeping in mind that a workload that
> requires X+1 would also break.

We mostly seem to have found that adding a *minimum* size before
starting evicting based on time solves both of our concerns?


> So it seems we've arrived at the
> conclusion that the only possible solution is to let the user tell us
> what time/size to use.  But that sucks, because the user doesn't know
> either (maybe they can measure, but how?), and they don't even know that
> this setting is there to be tweaked; and if there is a performance
> problem, how do they figure whether or not it can be fixed by fooling
> with this parameter?  I mean, maybe it's set to 10 and we suggest "maybe
> 11 works better" but it turns out not to, so "maybe 12 works better"?
> How do you know when to stop increasing it?

I don't think it's that complicated, for the size figure. Having a knob
that controls how much memory a backend uses isn't a new concept, and
can definitely depend on the usecase.


> This seems a bit like max_fsm_pages, that is to say, a disaster that was
> only fixed by removing it.

I don't think that's a meaningful comparison. max_fsm_pages had a
persistent effect, couldn't be tuned without restarts, and the
performance dropoffs were much more "cliff"-like.

Greetings,

Andres Freund


Re: Protect syscache from bloating with negative cache entries

From
Robert Haas
Date:
On Wed, Mar 7, 2018 at 6:01 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> The thing that comes to mind when reading this patch is that some time
> ago we made fun of other database software, "they are so complicated to
> configure, they have some magical settings that few people understand
> how to set".  Postgres was so much better because it was simple to set
> up, no magic crap.  But now it becomes apparent that that only was so
> because Postgres sucked, i.e., we hadn't yet gotten to the point where we
> *needed* to introduce settings like that.  Now we finally are?
>
> I have to admit being a little disappointed about that outcome.

I think your disappointment is a little excessive.  I am not convinced
of the need either for this to have any GUCs at all, but if it makes
other people happy to have them, then I think it's worth accepting
that as the price of getting the feature into the tree.  These are
scarcely the first GUCs we have that are hard to tune.  work_mem is a
terrible knob, and there are probably very few people who know
how to set ssl_ecdh_curve to anything other than the default, and
what's geqo_selection_bias good for, anyway?  I'm not sure what makes
the settings we're adding here any different.  Most people will ignore
them, and a few people who really care can change the values.

> I wonder if this is just because we refuse to acknowledge the notion of
> a connection pooler.  If we did, and the pooler told us "here, this
> session is being given back to us by the application, we'll keep it
> around until the next app comes along", could we clean the oldest
> inactive cache entries at that point?  Currently they use DISCARD for
> that.  Though this does nothing to fix hypothetical cache bloat for
> pg_dump in bug #14936.

We could certainly clean the oldest inactive cache entries at that
point, but there's no guarantee that would be the right thing to do.
If the working set across all applications is small enough that you
can keep them all in the caches all the time, then you should do that,
for maximum performance.  If not, DISCARD ALL should probably flush
everything that the last application needed and the next application
won't.  But without some configuration knob, you have zero way of
knowing how concerned the user is about saving memory in this place
vs. improving performance by reducing catalog scans.  Even with such a
knob it's a little difficult to say which things actually ought to be
thrown away.

I think this is a related problem, but a different one.  I also think
we ought to have built-in connection pooling.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


RE: Protect syscache from bloating with negative cache entries

От
"Tsunakawa, Takayuki"
Дата:
From: Alvaro Herrera [mailto:alvherre@alvh.no-ip.org]
> The thing that comes to mind when reading this patch is that some time ago
> we made fun of other database software, "they are so complicated to configure,
> they have some magical settings that few people understand how to set".
> Postgres was so much better because it was simple to set up, no magic crap.
> But now it becomes apparent that that only was so because Postgres sucked,
> i.e., we hadn't yet gotten to the point where we
> *needed* to introduce settings like that.  Now we finally are?

Yes.  We are now facing the problem of too much memory use by PostgreSQL, where some applications randomly access
about 200,000 tables.  It is estimated, based on a small experiment, that each backend will use several to ten GB of
local memory for CacheMemoryContext.  The total memory use will become over 1 TB when the expected maximum number of
connections is in use.
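
(For scale, with figures assumed purely for illustration: at 5 GB of
CacheMemoryContext per backend, 200 such connections already amount to
5 GB x 200 = 1 TB in aggregate, in line with the estimate above.)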
 

I haven't looked at this patch, but does it evict all kinds of entries in CacheMemoryContext, i.e. relcache, plancache,
etc.?

Regards
Takayuki Tsunakawa







Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
Hello,

At Thu, 8 Mar 2018 00:28:04 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in
<0A3221C70F24FB45833433255569204D1F8FF0D9@G01JPEXMBYT05>
> From: Alvaro Herrera [mailto:alvherre@alvh.no-ip.org]
> > The thing that comes to mind when reading this patch is that some time ago
> > we made fun of other database software, "they are so complicated to configure,
> > they have some magical settings that few people understand how to set".
> > Postgres was so much better because it was simple to set up, no magic crap.
> > But now it becomes apparent that that only was so because Postgres sucked,
> > i.e., we hadn't yet gotten to the point where we
> > *needed* to introduce settings like that.  Now we finally are?
> 
> Yes.  We are now facing the problem of too much memory use by PostgreSQL, where some applications randomly access
> about 200,000 tables.  It is estimated, based on a small experiment, that each backend will use several to ten GB of
> local memory for CacheMemoryContext.  The total memory use will become over 1 TB when the expected maximum number of
> connections is in use.
 
> 
> I haven't looked at this patch, but does it evict all kinds of entries in CacheMemoryContext, i.e. relcache,
> plancache, etc.?
 

This works only for syscaches, which could bloat with entries for
nonexistent objects.

Plan cache is an utterly different thing. It is abandoned at the
end of a transaction or the like.

Relcache is not based on catcache and is out of the scope of this
patch, since it doesn't bloat with entries for nonexistent objects. It
uses dynahash, and we could introduce a similar feature to it if
we are willing to cap relcache size.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

От
Tom Lane
Дата:
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
> At Thu, 8 Mar 2018 00:28:04 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in
<0A3221C70F24FB45833433255569204D1F8FF0D9@G01JPEXMBYT05>
>> Yes.  We are now facing the problem of too much memory use by PostgreSQL, where some applications randomly access
>> about 200,000 tables.  It is estimated, based on a small experiment, that each backend will use several to ten GB of
>> local memory for CacheMemoryContext.  The total memory use will become over 1 TB when the expected maximum number of
>> connections is in use.
>>
>> I haven't looked at this patch, but does it evict all kinds of entries in CacheMemoryContext, i.e. relcache,
>> plancache, etc.?

> This works only for syscaches, which could bloat with entries for
> nonexistent objects.

> Plan cache is an utterly different thing. It is abandoned at the
> end of a transaction or the like.

When I was at Salesforce, we had *substantial* problems with plancache
bloat.  The driving factor there was plans associated with plpgsql
functions, which Salesforce had a huge number of.  In an environment
like that, there would be substantial value in being able to prune
both the plancache and plpgsql's function cache.  (Note that neither
of those things are "abandoned at the end of a transaction".)

> > Relcache is not based on catcache and is out of the scope of this
> > patch, since it doesn't bloat with entries for nonexistent objects. It
> > uses dynahash, and we could introduce a similar feature to it if
> > we are willing to cap relcache size.

I think if the case of concern is an application with 200,000 tables,
it's just nonsense to claim that relcache size isn't an issue.

In short, it's not really apparent to me that negative syscache entries
are the major problem of this kind.  I'm afraid that you're drawing very
large conclusions from a specific workload.  Maybe we could fix that
workload some other way.

            regards, tom lane


Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
At Wed, 07 Mar 2018 23:12:29 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <352.1520482349@sss.pgh.pa.us>
> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
> > At Thu, 8 Mar 2018 00:28:04 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in
<0A3221C70F24FB45833433255569204D1F8FF0D9@G01JPEXMBYT05>
> >> Yes.  We are now facing the problem of too much memory use by PostgreSQL, where some applications randomly access
> >> about 200,000 tables.  It is estimated, based on a small experiment, that each backend will use several to ten GB of
> >> local memory for CacheMemoryContext.  The total memory use will become over 1 TB when the expected maximum number of
> >> connections is in use.
 
> >> 
> >> I haven't looked at this patch, but does it evict all kinds of entries in CacheMemoryContext, i.e. relcache,
> >> plancache, etc.?
 
> 
> > This works only for syscaches, which could bloat with entries for
> > nonexistent objects.
> 
> > Plan cache is an utterly different thing. It is abandoned at the
> > end of a transaction or the like.
> 
> When I was at Salesforce, we had *substantial* problems with plancache
> bloat.  The driving factor there was plans associated with plpgsql
> functions, which Salesforce had a huge number of.  In an environment
> like that, there would be substantial value in being able to prune
> both the plancache and plpgsql's function cache.  (Note that neither
> of those things are "abandoned at the end of a transaction".)

Mmm. Right. Thanks for pointing that out. Anyway, plan cache seems to be
a different thing.

> > Relcache is not based on catcache and is out of the scope of this
> > patch, since it doesn't bloat with entries for nonexistent objects. It
> > uses dynahash, and we could introduce a similar feature to it if
> > we are willing to cap relcache size.
> 
> I think if the case of concern is an application with 200,000 tables,
> it's just nonsense to claim that relcache size isn't an issue.
> 
> In short, it's not really apparent to me that negative syscache entries
> are the major problem of this kind.  I'm afraid that you're drawing very
> large conclusions from a specific workload.  Maybe we could fix that
> workload some other way.

The current patch doesn't consider whether an entry is negative
or positive(?). It just cleans up all entries based on time.

If relcache has to have the same characteristics as syscaches, it
might be better to base it on the catcache mechanism, instead of adding
the same pruning mechanism to dynahash.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
At Fri, 09 Mar 2018 17:40:01 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20180309.174001.202113825.horiguchi.kyotaro@lab.ntt.co.jp>
> > In short, it's not really apparent to me that negative syscache entries
> > are the major problem of this kind.  I'm afraid that you're drawing very
> > large conclusions from a specific workload.  Maybe we could fix that
> > workload some other way.
> 
> The current patch doesn't consider whether an entry is negative
> or positive(?). It just cleans up all entries based on time.
> 
> If relcache has to have the same characteristics as syscaches, it
> might be better to base it on the catcache mechanism, instead of adding
> the same pruning mechanism to dynahash.

For the moment, I added such a feature to dynahash and let only
relcache use it in this patch. Hash elements have a different shape
in a "prunable" hash, and pruning is performed in a similar way,
sharing the setting with syscache. This seems to be working fine.

It is a bit uneasy that all syscaches and the relcache share the same
value of syscache_memory_target...

Something like the attached test script causes relcache
"bloat". The server emits the following log entries at the DEBUG1 message
level.

DEBUG:  removed 11240/32769 entries from hash "Relcache by OID" at character 15

# The last few words are just garbage I mentioned in another thread.

The last two patches do that (as a PoC).

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 705b67a79ef7e27a450083944f8d970b7eb9e619 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 26 Dec 2017 17:43:09 +0900
Subject: [PATCH 1/3] Remove entries that haven't been used for a certain time

Catcache entries can be left alone for several reasons. It is not
desirable that they eat up memory. This patch adds consideration of
removing entries that haven't been used for a certain time before
enlarging the hash array.
---
 doc/src/sgml/config.sgml                      |  38 +++++++
 src/backend/access/transam/xact.c             |   3 +
 src/backend/utils/cache/catcache.c            | 152 +++++++++++++++++++++++++-
 src/backend/utils/misc/guc.c                  |  23 ++++
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/utils/catcache.h                  |  19 ++++
 6 files changed, 233 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3a8fc7d803..394e0703f8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1557,6 +1557,44 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target">
+      <term><varname>syscache_memory_target</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_memory_target</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to which syscache is expanded
+        without pruning. The value defaults to 0, indicating that pruning is
+        always considered. After exceeding this size, syscache pruning is
+        considered according to
+        <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep
+        certain amount of syscache entries with intermittent usage, try
+        increase this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age">
+      <term><varname>syscache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the minimum time in seconds for which a syscache entry must
+        stay unused before it can be removed. -1 indicates that syscache
+        pruning is disabled entirely. The value defaults to 600 seconds
+        (<literal>10 minutes</literal>). Syscache entries that are not
+        used for this duration can be removed to prevent syscache bloat. This
+        behavior is suppressed until the size of syscache exceeds
+        <xref linkend="guc-syscache-memory-target"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index dbaaf8e005..86d76917bb 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -733,6 +733,9 @@ void
 SetCurrentStatementStartTimestamp(void)
 {
     stmtStartTimestamp = GetCurrentTimestamp();
+
+    /* Set this timestamp as the approximated current time */
+    SetCatCacheClock(stmtStartTimestamp);
 }
 
 /*
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 5ddbf6eab1..0236a05127 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -71,9 +71,23 @@
 #define CACHE6_elog(a,b,c,d,e,f,g)
 #endif
 
+/*
+ * GUC variable to define the minimum hash size at which to consider entry eviction.
+ * Let the name be the same as the GUC variable name, not using 'catcache'.
+ */
+int syscache_memory_target = 0;
+
+/* GUC variable to define the minimum age in seconds of entries that will be
+ * considered for eviction. Ditto for the name.
+ */
+int syscache_prune_min_age = 600;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Timestamp used for any operation on caches. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -866,9 +880,130 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize catcache reference clock if not done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left alone for several reasons. We remove them if
+ * they not accessed for a certain time to prevent catcache from bloating. The
+ * eviction is performed with the similar algorithm with buffer eviction using
+ * access counter. Entries that are accessed several times can live longer
+ * than those that have had no access in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int            i;
+    int            nremoved = 0;
+    size_t        hash_size;
+#ifdef CATCACHE_STATS
+    /* These variables are only for debugging purpose */
+    int            ntotal = 0;
+    /*
+     * nth element in nentries stores the number of cache entries that have
+     * lived unaccessed for corresponding multiple in ageclass of
+     * syscache_prune_min_age. The index of nremoved_entry is the value of the
+     * clock-sweep counter, which takes from 0 up to 2.
+     */
+    double        ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+    int            nentries[] = {0, 0, 0, 0, 0, 0};
+    int            nremoved_entry[3] = {0, 0, 0};
+    int            j;
+#endif
+
+    /* Return immediately if no pruning is wanted */
+    if (syscache_prune_min_age < 0)
+        return false;
+
+    /*
+     * Return without pruning if the size of the hash is below the target.
+     * Since the area for bucket array is dominant, consider only it.
+     */
+    hash_size = cp->cc_nbuckets * sizeof(dlist_head);
+    if (hash_size < (Size) syscache_memory_target * 1024L)
+        return false;
+    
+    /* Search the whole hash for entries to remove */
+    for (i = 0; i < cp->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+
+            /*
+             * Calculate the duration from the time of the last access to the
+             * "current" time. Since catcacheclock is not advanced within a
+             * transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+#ifdef CATCACHE_STATS
+            /* count catcache entries for each age class */
+            ntotal++;
+            for (j = 0 ;
+                 ageclass[j] != 0.0 &&
+                     entry_age > syscache_prune_min_age * ageclass[j] ;
+                 j++);
+            if (ageclass[j] == 0.0) j--;
+            nentries[j]++;
+#endif
+
+            /*
+             * Try to remove entries older than syscache_prune_min_age
+             * seconds.  Entries not accessed since the last pruning are
+             * removed after that interval, while entries accessed several
+             * times are removed only after being left alone for up to three
+             * times that duration. We don't try to shrink buckets since
+             * pruning effectively caps catcache expansion in the long term.
+             */
+            if (entry_age > syscache_prune_min_age)
+            {
+#ifdef CATCACHE_STATS
+                Assert (ct->naccess >= 0 && ct->naccess <= 2);
+                nremoved_entry[ct->naccess]++;
+#endif
+                if (ct->naccess > 0)
+                    ct->naccess--;
+                else
+                {
+                    if (!ct->c_list || ct->c_list->refcount == 0)
+                    {
+                        CatCacheRemoveCTup(cp, ct);
+                        nremoved++;
+                    }
+                }
+            }
+        }
+    }
+
+#ifdef CATCACHE_STATS
+    ereport(DEBUG1,
+            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d,
2:%d)",
+                     nremoved, ntotal,
+                     ageclass[0] * syscache_prune_min_age, nentries[0],
+                     ageclass[1] * syscache_prune_min_age, nentries[1],
+                     ageclass[2] * syscache_prune_min_age, nentries[2],
+                     ageclass[3] * syscache_prune_min_age, nentries[3],
+                     ageclass[4] * syscache_prune_min_age, nentries[4],
+                     nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]),
+             errhidestmt(true)));
+#endif
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -1282,6 +1417,11 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /* Update access information for pruning */
+        if (ct->naccess < 2)
+            ct->naccess++;
+        ct->lastaccess = catcacheclock;
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1813,7 +1953,6 @@ ReleaseCatCacheList(CatCList *list)
         CatCacheRemoveCList(list->my_cache, list);
 }
 
-
 /*
  * CatalogCacheCreateEntry
  *        Create a new CatCTup entry, copying the given HeapTuple and other
@@ -1906,6 +2045,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
@@ -1913,10 +2054,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     CacheHdr->ch_ntup++;
 
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, try cleaning it up by removing
+     * infrequently used entries to make room for the new entry. If that
+     * fails, enlarge the bucket array instead.  Quite arbitrarily, we try
+     * this when fill factor > 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
 
     return ct;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a4f9b3668e..5e0d18657f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -78,6 +78,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/memutils.h"
 #include "utils/pg_locale.h"
@@ -1972,6 +1973,28 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"syscache_memory_target", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum syscache size to keep."),
+            gettext_noop("Syscache is not pruned before exceeding this size."),
+            GUC_UNIT_KB
+        },
+        &syscache_memory_target,
+        0, 0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"syscache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum duration of an unused syscache entry to remove."),
+            gettext_noop("Syscache entries that live unused for longer than this seconds are considered to be
removed."),
+            GUC_UNIT_S
+        },
+        &syscache_prune_min_age,
+        600, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 39272925fb..5a5729a88f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -124,6 +124,8 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#syscache_memory_target = 0kB    # in kB. zero means pruning is always considered
+#syscache_prune_min_age = 600s    # -1 disables the feature
 #max_stack_depth = 2MB            # min 100kB
 #dynamic_shared_memory_type = posix    # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 7b22f9c7bc..c3c4d65998 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -119,6 +120,8 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
+    int            naccess;        /* # of access to this entry, up to 2  */
+    TimestampTz    lastaccess;        /* approx. timestamp of the last usage */
 
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
@@ -189,6 +192,22 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLPMPORT'ed */
+extern int syscache_prune_min_age;
+extern int syscache_memory_target;
+
+/* to use as access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/*
+ * SetCatCacheClock - set timestamp for catcache access record
+ */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.16.2

From 74545dc6f52d42cf93d1353e205bb38581269c5f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 12 Mar 2018 15:52:18 +0900
Subject: [PATCH 2/3] introduce dynahash pruning

---
 src/backend/utils/hash/dynahash.c | 159 +++++++++++++++++++++++++++++++++-----
 src/include/utils/catcache.h      |  12 +++
 src/include/utils/hsearch.h       |  19 ++++-
 3 files changed, 170 insertions(+), 20 deletions(-)

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 5281cd5410..5a8b15652a 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -88,6 +88,7 @@
 #include "access/xact.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
+#include "utils/catcache.h"
 #include "utils/dynahash.h"
 #include "utils/memutils.h"
 
@@ -184,6 +185,8 @@ struct HASHHDR
     long        ssize;            /* segment size --- must be power of 2 */
     int            sshift;            /* segment shift = log2(ssize) */
     int            nelem_alloc;    /* number of entries to allocate at once */
+    bool        prunable;        /* true if prunable */
+    HASH_PRUNE_CB    prune_cb;    /* pruning callback. see above. */
 
 #ifdef HASH_STATISTICS
 
@@ -227,16 +230,18 @@ struct HTAB
     int            sshift;            /* segment shift = log2(ssize) */
 };
 
+#define HASHELEMENT_SIZE(ctlp) MAXALIGN(ctlp->prunable ? sizeof(PRUNABLE_HASHELEMENT) : sizeof(HASHELEMENT))
+
 /*
  * Key (also entry) part of a HASHELEMENT
  */
-#define ELEMENTKEY(helem)  (((char *)(helem)) + MAXALIGN(sizeof(HASHELEMENT)))
+#define ELEMENTKEY(helem, ctlp)  (((char *)(helem)) + HASHELEMENT_SIZE(ctlp))
 
 /*
  * Obtain element pointer given pointer to key
  */
-#define ELEMENT_FROM_KEY(key)  \
-    ((HASHELEMENT *) (((char *) (key)) - MAXALIGN(sizeof(HASHELEMENT))))
+#define ELEMENT_FROM_KEY(key, ctlp)                                        \
+    ((HASHELEMENT *) (((char *) (key)) - HASHELEMENT_SIZE(ctlp)))
 
 /*
  * Fast MOD arithmetic, assuming that y is a power of 2 !
@@ -257,6 +262,7 @@ static HASHSEGMENT seg_alloc(HTAB *hashp);
 static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
+static bool prune_entries(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
 static void hdefault(HTAB *hashp);
 static int    choose_nelem_alloc(Size entrysize);
@@ -497,6 +503,17 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
         hctl->entrysize = info->entrysize;
     }
 
+    /*
+     * hash table runs pruning
+     */
+    if (flags & HASH_PRUNABLE)
+    {
+        hctl->prunable = true;
+        hctl->prune_cb = info->prune_cb;
+    }
+    else
+        hctl->prunable = false;
+
     /* make local copies of heavily-used constant fields */
     hashp->keysize = hctl->keysize;
     hashp->ssize = hctl->ssize;
@@ -982,7 +999,7 @@ hash_search_with_hash_value(HTAB *hashp,
     while (currBucket != NULL)
     {
         if (currBucket->hashvalue == hashvalue &&
-            match(ELEMENTKEY(currBucket), keyPtr, keysize) == 0)
+            match(ELEMENTKEY(currBucket, hctl), keyPtr, keysize) == 0)
             break;
         prevBucketPtr = &(currBucket->link);
         currBucket = *prevBucketPtr;
@@ -995,6 +1012,17 @@ hash_search_with_hash_value(HTAB *hashp,
     if (foundPtr)
         *foundPtr = (bool) (currBucket != NULL);
 
+    /* Update access counter if needed */
+    if (hctl->prunable && currBucket &&
+        (action == HASH_FIND || action == HASH_ENTER))
+    {
+        PRUNABLE_HASHELEMENT *prunable_elm =
+            (PRUNABLE_HASHELEMENT *) currBucket;
+        if (prunable_elm->naccess < 2)
+            prunable_elm->naccess++;
+        prunable_elm->last_access = GetCatCacheClock();
+    }
+
     /*
      * OK, now what?
      */
@@ -1002,7 +1030,8 @@ hash_search_with_hash_value(HTAB *hashp,
     {
         case HASH_FIND:
             if (currBucket != NULL)
-                return (void *) ELEMENTKEY(currBucket);
+                return (void *) ELEMENTKEY(currBucket, hctl);
+
             return NULL;
 
         case HASH_REMOVE:
@@ -1031,7 +1060,7 @@ hash_search_with_hash_value(HTAB *hashp,
                  * element, because someone else is going to reuse it the next
                  * time something is added to the table
                  */
-                return (void *) ELEMENTKEY(currBucket);
+                return (void *) ELEMENTKEY(currBucket, hctl);
             }
             return NULL;
 
@@ -1043,7 +1072,7 @@ hash_search_with_hash_value(HTAB *hashp,
         case HASH_ENTER:
             /* Return existing element if found, else create one */
             if (currBucket != NULL)
-                return (void *) ELEMENTKEY(currBucket);
+                return (void *) ELEMENTKEY(currBucket, hctl);
 
             /* disallow inserts if frozen */
             if (hashp->frozen)
@@ -1073,8 +1102,18 @@ hash_search_with_hash_value(HTAB *hashp,
 
             /* copy key into record */
             currBucket->hashvalue = hashvalue;
-            hashp->keycopy(ELEMENTKEY(currBucket), keyPtr, keysize);
+            hashp->keycopy(ELEMENTKEY(currBucket, hctl), keyPtr, keysize);
 
+            /* set access counter */
+            if (hctl->prunable)
+            {
+                PRUNABLE_HASHELEMENT *prunable_elm =
+                    (PRUNABLE_HASHELEMENT *) currBucket;
+                if (prunable_elm->naccess < 2)
+                    prunable_elm->naccess++;
+                prunable_elm->last_access = GetCatCacheClock();
+            }
+            
             /*
              * Caller is expected to fill the data field on return.  DO NOT
              * insert any code that could possibly throw error here, as doing
@@ -1082,7 +1121,7 @@ hash_search_with_hash_value(HTAB *hashp,
              * caller's data structure.
              */
 
-            return (void *) ELEMENTKEY(currBucket);
+            return (void *) ELEMENTKEY(currBucket, hctl);
     }
 
     elog(ERROR, "unrecognized hash action code: %d", (int) action);
@@ -1114,7 +1153,7 @@ hash_update_hash_key(HTAB *hashp,
                      void *existingEntry,
                      const void *newKeyPtr)
 {
-    HASHELEMENT *existingElement = ELEMENT_FROM_KEY(existingEntry);
+    HASHELEMENT *existingElement = ELEMENT_FROM_KEY(existingEntry, hashp->hctl);
     HASHHDR    *hctl = hashp->hctl;
     uint32        newhashvalue;
     Size        keysize;
@@ -1198,7 +1237,7 @@ hash_update_hash_key(HTAB *hashp,
     while (currBucket != NULL)
     {
         if (currBucket->hashvalue == newhashvalue &&
-            match(ELEMENTKEY(currBucket), newKeyPtr, keysize) == 0)
+            match(ELEMENTKEY(currBucket, hctl), newKeyPtr, keysize) == 0)
             break;
         prevBucketPtr = &(currBucket->link);
         currBucket = *prevBucketPtr;
@@ -1232,7 +1271,7 @@ hash_update_hash_key(HTAB *hashp,
 
     /* copy new key into record */
     currBucket->hashvalue = newhashvalue;
-    hashp->keycopy(ELEMENTKEY(currBucket), newKeyPtr, keysize);
+    hashp->keycopy(ELEMENTKEY(currBucket, hctl), newKeyPtr, keysize);
 
     /* rest of record is untouched */
 
@@ -1386,8 +1425,8 @@ hash_seq_init(HASH_SEQ_STATUS *status, HTAB *hashp)
 void *
 hash_seq_search(HASH_SEQ_STATUS *status)
 {
-    HTAB       *hashp;
-    HASHHDR    *hctl;
+    HTAB       *hashp = status->hashp;
+    HASHHDR    *hctl = hashp->hctl;
     uint32        max_bucket;
     long        ssize;
     long        segment_num;
@@ -1402,15 +1441,13 @@ hash_seq_search(HASH_SEQ_STATUS *status)
         status->curEntry = curElem->link;
         if (status->curEntry == NULL)    /* end of this bucket */
             ++status->curBucket;
-        return (void *) ELEMENTKEY(curElem);
+        return (void *) ELEMENTKEY(curElem, hctl);
     }
 
     /*
      * Search for next nonempty bucket starting at curBucket.
      */
     curBucket = status->curBucket;
-    hashp = status->hashp;
-    hctl = hashp->hctl;
     ssize = hashp->ssize;
     max_bucket = hctl->max_bucket;
 
@@ -1456,7 +1493,7 @@ hash_seq_search(HASH_SEQ_STATUS *status)
     if (status->curEntry == NULL)    /* end of this bucket */
         ++curBucket;
     status->curBucket = curBucket;
-    return (void *) ELEMENTKEY(curElem);
+    return (void *) ELEMENTKEY(curElem, hctl);
 }
 
 void
@@ -1550,6 +1587,10 @@ expand_table(HTAB *hashp)
      */
     if ((uint32) new_bucket > hctl->high_mask)
     {
+        /* try pruning before expansion. return true on success */
+        if (hctl->prunable && prune_entries(hashp))
+            return true;
+
         hctl->low_mask = hctl->high_mask;
         hctl->high_mask = (uint32) new_bucket | hctl->low_mask;
     }
@@ -1592,6 +1633,86 @@ expand_table(HTAB *hashp)
     return true;
 }
 
+static bool
+prune_entries(HTAB *hashp)
+{
+    HASHHDR           *hctl = hashp->hctl;
+    HASH_SEQ_STATUS status;
+    void            *elm;
+    TimestampTz        currclock = GetCatCacheClock();
+    int                nall = 0,
+                    nremoved = 0;
+
+    Assert(hctl->prunable);
+
+    /* not called for frozen or under seqscan. see
+     * hash_search_with_hash_value. */
+    Assert(IS_PARTITIONED(hctl) ||
+        hashp->frozen ||
+        hctl->freeList[0].nentries / (long) (hctl->max_bucket + 1) <
+        hctl->ffactor ||
+        has_seq_scans(hashp));
+
+    /* This setting prevents pruning */
+    if (syscache_prune_min_age < 0)
+        return false;
+
+    /*
+     * Return false immediately if this hash is small enough. We only consider
+     * the bucket array size since it is the significant part of memory usage.
+     * The setting is shared with syscache.
+     */
+    if (hctl->dsize * sizeof(HASHBUCKET) * hashp->ssize <
+        (Size) syscache_memory_target * 1024L)
+        return false;
+
+    /*
+     * OK, this hash can be pruned; start pruning. This function is called
+     * early enough to do this safely via the public API.
+     */
+    hash_seq_init(&status, hashp);
+    while ((elm = hash_seq_search(&status)) != NULL)
+    {
+        PRUNABLE_HASHELEMENT *helm =
+            (PRUNABLE_HASHELEMENT *)ELEMENT_FROM_KEY(elm, hctl);
+        long    entry_age;
+        int        us;
+
+        nall++;
+
+        TimestampDifference(helm->last_access, currclock, &entry_age, &us);
+
+        /* the setting is shared with syscache */
+        if (entry_age > syscache_prune_min_age)
+        {
+            /* Wait for the next chance if this is recently used */
+            if (helm->naccess > 0)
+                helm->naccess--;
+            else
+            {
+                /* just call it if callback is provided, remove otherwise */
+                if (hctl->prune_cb)
+                {
+                    if (hctl->prune_cb(hashp, (void *)elm))
+                        nremoved++;
+                }
+                else
+                {
+                    bool found;
+                    
+                    hash_search(hashp, elm, HASH_REMOVE, &found);
+                    Assert(found);
+                    nremoved++;
+                }
+            }
+        }
+    }
+
+    elog(DEBUG1, "removed %d/%d entries from hash \"%s\"",
+         nremoved, nall, hashp->tabname);
+
+    return nremoved > 0;
+}
 
 static bool
 dir_realloc(HTAB *hashp)
@@ -1665,7 +1786,7 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
         return false;
 
     /* Each element has a HASHELEMENT header plus user data. */
-    elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
+    elementSize = HASHELEMENT_SIZE(hctl) + MAXALIGN(hctl->entrysize);
 
     CurrentDynaHashCxt = hashp->hcxt;
     firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index c3c4d65998..fcc680bb82 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -208,6 +208,18 @@ SetCatCacheClock(TimestampTz ts)
     catcacheclock = ts;
 }
 
+/*
+ * GetCatCacheClock - get timestamp for catcache access record
+ *
+ * This clock is basically provided for catcache usage, but dynahash has a
+ * similar pruning mechanism and wants to use the same clock.
+ */
+static inline TimestampTz
+GetCatCacheClock(void)
+{
+    return catcacheclock;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 8357faac5a..df12352a46 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -13,7 +13,7 @@
  */
 #ifndef HSEARCH_H
 #define HSEARCH_H
-
+#include "datatype/timestamp.h"
 
 /*
  * Hash functions must have this signature.
@@ -47,6 +47,7 @@ typedef void *(*HashAllocFunc) (Size request);
  * HASHELEMENT is the private part of a hashtable entry.  The caller's data
  * follows the HASHELEMENT structure (on a MAXALIGN'd boundary).  The hash key
  * is expected to be at the start of the caller's hash entry data structure.
+ * If this hash is prunable, PRUNABLE_HASHELEMENT is used instead.
  */
 typedef struct HASHELEMENT
 {
@@ -54,12 +55,26 @@ typedef struct HASHELEMENT
     uint32        hashvalue;        /* hash function result for this entry */
 } HASHELEMENT;
 
+typedef struct PRUNABLE_HASHELEMENT
+{
+    struct HASHELEMENT *link;    /* link to next entry in same bucket */
+    uint32        hashvalue;        /* hash function result for this entry */
+    TimestampTz    last_access;    /* timestamp of the last usage */
+    int            naccess;        /* takes 0 to 2, counted up when used */
+} PRUNABLE_HASHELEMENT;
+
 /* Hash table header struct is an opaque type known only within dynahash.c */
 typedef struct HASHHDR HASHHDR;
 
 /* Hash table control struct is an opaque type known only within dynahash.c */
 typedef struct HTAB HTAB;
 
+/*
+ * Hash pruning callback. This is called for entries that are about to be
+ * removed without the owner's intention.
+ */
+typedef bool (*HASH_PRUNE_CB)(HTAB *hashp, void *ent);
+
 /* Parameter data structure for hash_create */
 /* Only those fields indicated by hash_flags need be set */
 typedef struct HASHCTL
@@ -77,6 +92,7 @@ typedef struct HASHCTL
     HashAllocFunc alloc;        /* memory allocator */
     MemoryContext hcxt;            /* memory context to use for allocations */
     HASHHDR    *hctl;            /* location of header in shared mem */
+    HASH_PRUNE_CB    prune_cb;    /* pruning callback. see above. */
 } HASHCTL;
 
 /* Flags to indicate which parameters are supplied */
@@ -94,6 +110,7 @@ typedef struct HASHCTL
 #define HASH_SHARED_MEM 0x0800    /* Hashtable is in shared memory */
 #define HASH_ATTACH        0x1000    /* Do not initialize hctl */
 #define HASH_FIXED_SIZE 0x2000    /* Initial size is a hard limit */
+#define HASH_PRUNABLE    0x4000    /* pruning setting */
 
 
 /* max_dsize value to indicate expansible directory */
-- 
2.16.2

From debface28e2261b0d819c46e52942ba500143581 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 12 Mar 2018 17:31:43 +0900
Subject: [PATCH 3/3] Apply pruning to relcache

---
 src/backend/utils/cache/relcache.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 9ee78f885f..f344771d57 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3503,6 +3503,29 @@ RelationSetNewRelfilenode(Relation relation, char persistence,
 
 #define INITRELCACHESIZE        400
 
+/* callback function for hash pruning */
+static bool
+relcache_prune_cb(HTAB *hashp, void *ent)
+{
+    RelIdCacheEnt  *relent = (RelIdCacheEnt *) ent;
+    Relation        relation;
+
+    /* this relation is requested to be removed.  */
+    RelationIdCacheLookup(relent->reloid, relation);
+
+    /* but we cannot remove an active cache entry */
+    if (!RelationHasReferenceCountZero(relation))
+        return false;
+
+    /*
+     * Otherwise we are allowed to forget it unconditionally; see
+     * RelationForgetRelation.
+     */
+    RelationClearRelation(relation, false);
+
+    return true;
+}
+
 void
 RelationCacheInitialize(void)
 {
@@ -3520,8 +3543,11 @@ RelationCacheInitialize(void)
     MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(RelIdCacheEnt);
+
+    /* use the same setting as syscache */
+    ctl.prune_cb = relcache_prune_cb;
     RelationIdCache = hash_create("Relcache by OID", INITRELCACHESIZE,
-                                  &ctl, HASH_ELEM | HASH_BLOBS);
+                                  &ctl, HASH_ELEM | HASH_BLOBS | HASH_PRUNABLE);
 
     /*
      * relation mapper needs to be initialized too
-- 
2.16.2


Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
Oops.

At Mon, 12 Mar 2018 17:34:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20180312.173408.162882093.horiguchi.kyotaro@lab.ntt.co.jp>
> Something like the attached test script causes relcache

Here it is.

#! /usr/bin/perl


# printf("drop schema if exists test_schema;\n", $i);
printf("create schema test_schema;\n", $i);
printf("create table test_schema.t%06d ();\n", $i);

for $i (0..100000) {
    printf("create table test_schema.t%06d ();\n", $i);
}

printf("set syscache_memory_target = \'1kB\';\n");
printf("set syscache_prune_min_age = \'15s\';\n");

for $i (0..100000) {
    printf("select * from test_schema.t%06d;\n", $i);
}



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
At Mon, 12 Mar 2018 17:34:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20180312.173408.162882093.horiguchi.kyotaro@lab.ntt.co.jp>
> > > In short, it's not really apparent to me that negative syscache entries
> > > are the major problem of this kind.  I'm afraid that you're drawing very
> > > large conclusions from a specific workload.  Maybe we could fix that
> > > workload some other way.
> > 
> > The current patch doesn't consider whether an entry is negative
> > or positive(?). It just cleans up all entries based on time.
> > 
> > If relcache has to have the same characteristics as syscaches, it
> > might be better to base it on the catcache mechanism, instead of adding
> > the same pruning mechanism to dynahash.
> 
> For the moment, I added such a feature to dynahash and let only
> relcache use it in this patch. Hash elements have a different shape
> in a "prunable" hash, and pruning is performed in a similar way,
> sharing the setting with syscache. This seems to be working fine.

I gave some consideration to plancache. The characteristic that most
differs from catcache and relcache is the fact that it is
not voluntarily removable, since CachedPlanSource, the root struct
of a plan cache, holds some indispensable information. In regard
to prepared queries, even if we store the information in
another location, for example in a "Prepared Queries" hash, that is
merely moving a big chunk of data to another place.

Looking into CachedPlanSource, the generic plan is a part that is
safely removable, since it is rebuilt as necessary. Keeping "old"
plancache entries without their generic plans can reduce memory
usage.
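
A rough sketch of that idea, written as if it lived in plancache.c:
CachedPlanSource, saved_plan_list, gplan and ReleaseGenericPlan() are existing
plancache internals, while the helper itself, the last_access field and the
prune_min_age argument are assumptions for illustration and not necessarily
what the 0004 patch does.

    /* Sketch: drop the rebuildable generic plan of saved plans left unused too long. */
    static void
    PruneSavedPlans(TimestampTz now, int prune_min_age)
    {
        dlist_iter    iter;

        dlist_foreach(iter, &saved_plan_list)
        {
            CachedPlanSource *plansource = dlist_container(CachedPlanSource,
                                                           node, iter.cur);
            long        secs;
            int         usecs;

            /* last_access would be a new field updated on every use of the plan */
            TimestampDifference(plansource->last_access, now, &secs, &usecs);

            /* keep recently used entries; only the generic plan is released */
            if (secs > prune_min_age && plansource->gplan != NULL)
                ReleaseGenericPlan(plansource);    /* rebuilt on next use */
        }
    }

The min_cached_plans floor mentioned below would additionally exempt some
number of entries from this loop; that part is left out of the sketch.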

For testing purposes, I made 50000 prepared statements like
"select sum(c) from p where e < $" on 100 partitions,

With the feature (the 0004 patch) disabled, the VSZ of the backend
exceeds 3GB (it is still increasing at the moment), while it
stops increasing at about 997MB for min_cached_plans = 1000 and
plancache_prune_min_age = '10s'.

# 10s is apparently too short for actual use, of course.

The saving is expected to be a significant amount if the plans are large
enough, but I'm still not sure it is worth doing, or whether this is the
right way.


The attached is the patch set including this plancache stuff.

0001- catcache time-based expiration (The origin of this thread)
0002- introduces dynahash pruning feature
0003- implement relcache pruning using 0002
0004- (perhaps) independent of the three above. PoC of
      plancache pruning. Details are shown above.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 705b67a79ef7e27a450083944f8d970b7eb9e619 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 26 Dec 2017 17:43:09 +0900
Subject: [PATCH 1/4] Remove entries that haven't been used for a certain time

Catcache entries can be left alone for several reasons. It is not
desirable that they eat up memory. This patch adds consideration of
removing entries that haven't been used for a certain time before
enlarging the hash array.
---
 doc/src/sgml/config.sgml                      |  38 +++++++
 src/backend/access/transam/xact.c             |   3 +
 src/backend/utils/cache/catcache.c            | 152 +++++++++++++++++++++++++-
 src/backend/utils/misc/guc.c                  |  23 ++++
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/utils/catcache.h                  |  19 ++++
 6 files changed, 233 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3a8fc7d803..394e0703f8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1557,6 +1557,44 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target">
+      <term><varname>syscache_memory_target</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_memory_target</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to which syscache is expanded
+        without pruning. The value defaults to 0, indicating that pruning is
+        always considered. After exceeding this size, syscache pruning is
+        considered according to
+        <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep
+        certain amount of syscache entries with intermittent usage, try
+        increase this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age">
+      <term><varname>syscache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the minimum time in seconds for which a syscache entry must
+        stay unused before it can be removed. -1 indicates that syscache
+        pruning is disabled entirely. The value defaults to 600 seconds
+        (<literal>10 minutes</literal>). Syscache entries that are not
+        used for this duration can be removed to prevent syscache bloat. This
+        behavior is suppressed until the size of syscache exceeds
+        <xref linkend="guc-syscache-memory-target"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index dbaaf8e005..86d76917bb 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -733,6 +733,9 @@ void
 SetCurrentStatementStartTimestamp(void)
 {
     stmtStartTimestamp = GetCurrentTimestamp();
+
+    /* Set this timestamp as the approximated current time */
+    SetCatCacheClock(stmtStartTimestamp);
 }
 
 /*
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 5ddbf6eab1..0236a05127 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -71,9 +71,23 @@
 #define CACHE6_elog(a,b,c,d,e,f,g)
 #endif
 
+/*
+ * GUC variable to define the minimum hash size at which to consider entry eviction.
+ * Let the name be the same as the GUC variable name, not using 'catcache'.
+ */
+int syscache_memory_target = 0;
+
+/* GUC variable to define the minimum age in seconds of entries that will be
+ * considered for eviction. Ditto for the name.
+ */
+int syscache_prune_min_age = 600;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Timestamp used for any operation on caches. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -866,9 +880,130 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize catcache reference clock if not done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left alone for several reasons. We remove them if
+ * they have not been accessed for a certain time, to prevent the catcache
+ * from bloating. The eviction is performed with an algorithm similar to
+ * buffer eviction, using an access counter. Entries that are accessed several
+ * times can live longer than those that have had no access in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int            i;
+    int            nremoved = 0;
+    size_t        hash_size;
+#ifdef CATCACHE_STATS
+    /* These variables are only for debugging purpose */
+    int            ntotal = 0;
+    /*
+     * nth element in nentries stores the number of cache entries that have
+     * lived unaccessed for corresponding multiple in ageclass of
+     * syscache_prune_min_age. The index of nremoved_entry is the value of the
+     * clock-sweep counter, which takes from 0 up to 2.
+     */
+    double        ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+    int            nentries[] = {0, 0, 0, 0, 0, 0};
+    int            nremoved_entry[3] = {0, 0, 0};
+    int            j;
+#endif
+
+    /* Return immediately if no pruning is wanted */
+    if (syscache_prune_min_age < 0)
+        return false;
+
+    /*
+     * Return without pruning if the size of the hash is below the target.
+     * Since the area for bucket array is dominant, consider only it.
+     */
+    hash_size = cp->cc_nbuckets * sizeof(dlist_head);
+    if (hash_size < (Size) syscache_memory_target * 1024L)
+        return false;
+    
+    /* Search the whole hash for entries to remove */
+    for (i = 0; i < cp->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+
+            /*
+             * Calculate the duration from the time of the last access to the
+             * "current" time. Since catcacheclock is not advanced within a
+             * transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+#ifdef CATCACHE_STATS
+            /* count catcache entries for each age class */
+            ntotal++;
+            for (j = 0 ;
+                 ageclass[j] != 0.0 &&
+                     entry_age > syscache_prune_min_age * ageclass[j] ;
+                 j++);
+            if (ageclass[j] == 0.0) j--;
+            nentries[j]++;
+#endif
+
+            /*
+             * Try to remove entries older than syscache_prune_min_age
+             * seconds.  Entries not accessed since the last pruning are
+             * removed after that interval, while entries accessed several
+             * times are removed only after being left alone for up to three
+             * times that duration. We don't try to shrink buckets since
+             * pruning effectively caps catcache expansion in the long term.
+             */
+            if (entry_age > syscache_prune_min_age)
+            {
+#ifdef CATCACHE_STATS
+                Assert (ct->naccess >= 0 && ct->naccess <= 2);
+                nremoved_entry[ct->naccess]++;
+#endif
+                if (ct->naccess > 0)
+                    ct->naccess--;
+                else
+                {
+                    if (!ct->c_list || ct->c_list->refcount == 0)
+                    {
+                        CatCacheRemoveCTup(cp, ct);
+                        nremoved++;
+                    }
+                }
+            }
+        }
+    }
+
+#ifdef CATCACHE_STATS
+    ereport(DEBUG1,
+            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d,
2:%d)",
+                     nremoved, ntotal,
+                     ageclass[0] * syscache_prune_min_age, nentries[0],
+                     ageclass[1] * syscache_prune_min_age, nentries[1],
+                     ageclass[2] * syscache_prune_min_age, nentries[2],
+                     ageclass[3] * syscache_prune_min_age, nentries[3],
+                     ageclass[4] * syscache_prune_min_age, nentries[4],
+                     nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]),
+             errhidestmt(true)));
+#endif
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -1282,6 +1417,11 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /* Update access information for pruning */
+        if (ct->naccess < 2)
+            ct->naccess++;
+        ct->lastaccess = catcacheclock;
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1813,7 +1953,6 @@ ReleaseCatCacheList(CatCList *list)
         CatCacheRemoveCList(list->my_cache, list);
 }
 
-
 /*
  * CatalogCacheCreateEntry
  *        Create a new CatCTup entry, copying the given HeapTuple and other
@@ -1906,6 +2045,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
@@ -1913,10 +2054,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     CacheHdr->ch_ntup++;
 
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, try cleanup by removing
+     * infrequently used entries to make a room for the new entry. If it
+     * failed, enlarge the bucket array instead.  Quite arbitrarily, we try
+     * this when fill factor > 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
 
     return ct;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a4f9b3668e..5e0d18657f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -78,6 +78,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/memutils.h"
 #include "utils/pg_locale.h"
@@ -1972,6 +1973,28 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"syscache_memory_target", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum syscache size to keep."),
+            gettext_noop("Syscache is not pruned before exceeding this size."),
+            GUC_UNIT_KB
+        },
+        &syscache_memory_target,
+        0, 0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"syscache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum duration for which an unused syscache entry is kept before removal."),
+            gettext_noop("Syscache entries that have been unused for longer than this many seconds become candidates for removal."),
+            GUC_UNIT_S
+        },
+        &syscache_prune_min_age,
+        600, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 39272925fb..5a5729a88f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -124,6 +124,8 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#syscache_memory_target = 0kB    # in kB. zero disables the feature
+#syscache_prune_min_age = 600s    # -1 disables the feature
 #max_stack_depth = 2MB            # min 100kB
 #dynamic_shared_memory_type = posix    # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 7b22f9c7bc..c3c4d65998 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -119,6 +120,8 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
+    int            naccess;        /* # of accesses to this entry, up to 2 */
+    TimestampTz    lastaccess;        /* approx. timestamp of the last usage */
 
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
@@ -189,6 +192,22 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int syscache_prune_min_age;
+extern int syscache_memory_target;
+
+/* to use as access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/*
+ * SetCatCacheClock - set timestamp for catcache access record
+ */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.16.2

From 037f3534f5274eb7bcdb5adee262b5af624175e2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 12 Mar 2018 15:52:18 +0900
Subject: [PATCH 2/4] Introduce dynahash pruning

---
 src/backend/utils/hash/dynahash.c | 169 +++++++++++++++++++++++++++++++++-----
 src/include/utils/catcache.h      |  12 +++
 src/include/utils/hsearch.h       |  21 ++++-
 3 files changed, 182 insertions(+), 20 deletions(-)

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 5281cd5410..a5b4979662 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -88,6 +88,7 @@
 #include "access/xact.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
+#include "utils/catcache.h"
 #include "utils/dynahash.h"
 #include "utils/memutils.h"
 
@@ -184,6 +185,10 @@ struct HASHHDR
     long        ssize;            /* segment size --- must be power of 2 */
     int            sshift;            /* segment shift = log2(ssize) */
     int            nelem_alloc;    /* number of entries to allocate at once */
+    bool        prunable;        /* true if prunable */
+    HASH_PRUNE_CB    prune_cb;    /* pruning callback. see above. */
+    int           *memory_target;    /* pointer to memory target */
+    int           *prune_min_age;    /* pointer to prune minimum age */
 
 #ifdef HASH_STATISTICS
 
@@ -227,16 +232,18 @@ struct HTAB
     int            sshift;            /* segment shift = log2(ssize) */
 };
 
+#define HASHELEMENT_SIZE(ctlp) MAXALIGN(ctlp->prunable ? sizeof(PRUNABLE_HASHELEMENT) : sizeof(HASHELEMENT))
+
 /*
  * Key (also entry) part of a HASHELEMENT
  */
-#define ELEMENTKEY(helem)  (((char *)(helem)) + MAXALIGN(sizeof(HASHELEMENT)))
+#define ELEMENTKEY(helem, ctlp)  (((char *)(helem)) + HASHELEMENT_SIZE(ctlp))
 
 /*
  * Obtain element pointer given pointer to key
  */
-#define ELEMENT_FROM_KEY(key)  \
-    ((HASHELEMENT *) (((char *) (key)) - MAXALIGN(sizeof(HASHELEMENT))))
+#define ELEMENT_FROM_KEY(key, ctlp)                                        \
+    ((HASHELEMENT *) (((char *) (key)) - HASHELEMENT_SIZE(ctlp)))
 
 /*
  * Fast MOD arithmetic, assuming that y is a power of 2 !
@@ -257,6 +264,7 @@ static HASHSEGMENT seg_alloc(HTAB *hashp);
 static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
+static bool prune_entries(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
 static void hdefault(HTAB *hashp);
 static int    choose_nelem_alloc(Size entrysize);
@@ -497,6 +505,25 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
         hctl->entrysize = info->entrysize;
     }
 
+    /*
+     * hash table runs pruning
+     */
+    if (flags & HASH_PRUNABLE)
+    {
+        hctl->prunable = true;
+        hctl->prune_cb = info->prune_cb;
+        if (info->memory_target)
+            hctl->memory_target = info->memory_target;
+        else
+            hctl->memory_target = &syscache_memory_target;
+        if (info->prune_min_age)
+            hctl->prune_min_age = info->prune_min_age;
+        else
+            hctl->prune_min_age = &syscache_prune_min_age;
+    }
+    else
+        hctl->prunable = false;
+
     /* make local copies of heavily-used constant fields */
     hashp->keysize = hctl->keysize;
     hashp->ssize = hctl->ssize;
@@ -982,7 +1009,7 @@ hash_search_with_hash_value(HTAB *hashp,
     while (currBucket != NULL)
     {
         if (currBucket->hashvalue == hashvalue &&
-            match(ELEMENTKEY(currBucket), keyPtr, keysize) == 0)
+            match(ELEMENTKEY(currBucket, hctl), keyPtr, keysize) == 0)
             break;
         prevBucketPtr = &(currBucket->link);
         currBucket = *prevBucketPtr;
@@ -995,6 +1022,17 @@ hash_search_with_hash_value(HTAB *hashp,
     if (foundPtr)
         *foundPtr = (bool) (currBucket != NULL);
 
+    /* Update access counter if needed */
+    if (hctl->prunable && currBucket &&
+        (action == HASH_FIND || action == HASH_ENTER))
+    {
+        PRUNABLE_HASHELEMENT *prunable_elm =
+            (PRUNABLE_HASHELEMENT *) currBucket;
+        if (prunable_elm->naccess < 2)
+            prunable_elm->naccess++;
+        prunable_elm->last_access = GetCatCacheClock();
+    }
+
     /*
      * OK, now what?
      */
@@ -1002,7 +1040,8 @@ hash_search_with_hash_value(HTAB *hashp,
     {
         case HASH_FIND:
             if (currBucket != NULL)
-                return (void *) ELEMENTKEY(currBucket);
+                return (void *) ELEMENTKEY(currBucket, hctl);
+
             return NULL;
 
         case HASH_REMOVE:
@@ -1031,7 +1070,7 @@ hash_search_with_hash_value(HTAB *hashp,
                  * element, because someone else is going to reuse it the next
                  * time something is added to the table
                  */
-                return (void *) ELEMENTKEY(currBucket);
+                return (void *) ELEMENTKEY(currBucket, hctl);
             }
             return NULL;
 
@@ -1043,7 +1082,7 @@ hash_search_with_hash_value(HTAB *hashp,
         case HASH_ENTER:
             /* Return existing element if found, else create one */
             if (currBucket != NULL)
-                return (void *) ELEMENTKEY(currBucket);
+                return (void *) ELEMENTKEY(currBucket, hctl);
 
             /* disallow inserts if frozen */
             if (hashp->frozen)
@@ -1073,8 +1112,18 @@ hash_search_with_hash_value(HTAB *hashp,
 
             /* copy key into record */
             currBucket->hashvalue = hashvalue;
-            hashp->keycopy(ELEMENTKEY(currBucket), keyPtr, keysize);
+            hashp->keycopy(ELEMENTKEY(currBucket, hctl), keyPtr, keysize);
 
+            /* set access counter */
+            if (hctl->prunable)
+            {
+                PRUNABLE_HASHELEMENT *prunable_elm =
+                    (PRUNABLE_HASHELEMENT *) currBucket;
+                if (prunable_elm->naccess < 2)
+                    prunable_elm->naccess++;
+                prunable_elm->last_access = GetCatCacheClock();
+            }
+            
             /*
              * Caller is expected to fill the data field on return.  DO NOT
              * insert any code that could possibly throw error here, as doing
@@ -1082,7 +1131,7 @@ hash_search_with_hash_value(HTAB *hashp,
              * caller's data structure.
              */
 
-            return (void *) ELEMENTKEY(currBucket);
+            return (void *) ELEMENTKEY(currBucket, hctl);
     }
 
     elog(ERROR, "unrecognized hash action code: %d", (int) action);
@@ -1114,7 +1163,7 @@ hash_update_hash_key(HTAB *hashp,
                      void *existingEntry,
                      const void *newKeyPtr)
 {
-    HASHELEMENT *existingElement = ELEMENT_FROM_KEY(existingEntry);
+    HASHELEMENT *existingElement = ELEMENT_FROM_KEY(existingEntry, hashp->hctl);
     HASHHDR    *hctl = hashp->hctl;
     uint32        newhashvalue;
     Size        keysize;
@@ -1198,7 +1247,7 @@ hash_update_hash_key(HTAB *hashp,
     while (currBucket != NULL)
     {
         if (currBucket->hashvalue == newhashvalue &&
-            match(ELEMENTKEY(currBucket), newKeyPtr, keysize) == 0)
+            match(ELEMENTKEY(currBucket, hctl), newKeyPtr, keysize) == 0)
             break;
         prevBucketPtr = &(currBucket->link);
         currBucket = *prevBucketPtr;
@@ -1232,7 +1281,7 @@ hash_update_hash_key(HTAB *hashp,
 
     /* copy new key into record */
     currBucket->hashvalue = newhashvalue;
-    hashp->keycopy(ELEMENTKEY(currBucket), newKeyPtr, keysize);
+    hashp->keycopy(ELEMENTKEY(currBucket, hctl), newKeyPtr, keysize);
 
     /* rest of record is untouched */
 
@@ -1386,8 +1435,8 @@ hash_seq_init(HASH_SEQ_STATUS *status, HTAB *hashp)
 void *
 hash_seq_search(HASH_SEQ_STATUS *status)
 {
-    HTAB       *hashp;
-    HASHHDR    *hctl;
+    HTAB       *hashp = status->hashp;
+    HASHHDR    *hctl = hashp->hctl;
     uint32        max_bucket;
     long        ssize;
     long        segment_num;
@@ -1402,15 +1451,13 @@ hash_seq_search(HASH_SEQ_STATUS *status)
         status->curEntry = curElem->link;
         if (status->curEntry == NULL)    /* end of this bucket */
             ++status->curBucket;
-        return (void *) ELEMENTKEY(curElem);
+        return (void *) ELEMENTKEY(curElem, hctl);
     }
 
     /*
      * Search for next nonempty bucket starting at curBucket.
      */
     curBucket = status->curBucket;
-    hashp = status->hashp;
-    hctl = hashp->hctl;
     ssize = hashp->ssize;
     max_bucket = hctl->max_bucket;
 
@@ -1456,7 +1503,7 @@ hash_seq_search(HASH_SEQ_STATUS *status)
     if (status->curEntry == NULL)    /* end of this bucket */
         ++curBucket;
     status->curBucket = curBucket;
-    return (void *) ELEMENTKEY(curElem);
+    return (void *) ELEMENTKEY(curElem, hctl);
 }
 
 void
@@ -1550,6 +1597,10 @@ expand_table(HTAB *hashp)
      */
     if ((uint32) new_bucket > hctl->high_mask)
     {
+        /* try pruning before expansion. return true on success */
+        if (hctl->prunable && prune_entries(hashp))
+            return true;
+
         hctl->low_mask = hctl->high_mask;
         hctl->high_mask = (uint32) new_bucket | hctl->low_mask;
     }
@@ -1592,6 +1643,86 @@ expand_table(HTAB *hashp)
     return true;
 }
 
+static bool
+prune_entries(HTAB *hashp)
+{
+    HASHHDR           *hctl = hashp->hctl;
+    HASH_SEQ_STATUS status;
+    void            *elm;
+    TimestampTz        currclock = GetCatCacheClock();
+    int                nall = 0,
+                    nremoved = 0;
+
+    Assert(hctl->prunable);
+
+    /* not called for frozen or under seqscan. see
+     * hash_search_with_hash_value. */
+    Assert(IS_PARTITIONED(hctl) ||
+        hashp->frozen ||
+        hctl->freeList[0].nentries / (long) (hctl->max_bucket + 1) <
+        hctl->ffactor ||
+        has_seq_scans(hashp));
+
+    /* This setting prevents pruning */
+    if (*hctl->prune_min_age < 0)
+        return false;
+
+    /*
+     * return false immediately if this hash is small enough. We only consider
+     * bucket array size since it is the significant part of memory usage.
+     * settings is shared with syscache
+     */
+    if (hctl->dsize * sizeof(HASHBUCKET) * hashp->ssize <
+        (Size) *hctl->memory_target * 1024L)
+        return false;
+
+    /*
+     * OK, this hash can be pruned; start pruning.  This function is called
+     * early enough that we can do this via the public API.
+     */
+    hash_seq_init(&status, hashp);
+    while ((elm = hash_seq_search(&status)) != NULL)
+    {
+        PRUNABLE_HASHELEMENT *helm =
+            (PRUNABLE_HASHELEMENT *)ELEMENT_FROM_KEY(elm, hctl);
+        long    entry_age;
+        int        us;
+
+        nall++;
+
+        TimestampDifference(helm->last_access, currclock, &entry_age, &us);
+
+        /* settings is shared with syscache */
+        if (entry_age > *hctl->prune_min_age)
+        {
+            /* Wait for the next chance if this is recently used */
+            if (helm->naccess > 0)
+                helm->naccess--;
+            else
+            {
+                /* just call it if callback is provided, remove otherwise */
+                if (hctl->prune_cb)
+                {
+                    if (hctl->prune_cb(hashp, (void *)elm))
+                        nremoved++;
+                }
+                else
+                {
+                    bool found;
+                    
+                    hash_search(hashp, elm, HASH_REMOVE, &found);
+                    Assert(found);
+                    nremoved++;
+                }
+            }
+        }
+    }
+
+    elog(DEBUG1, "removed %d/%d entries from hash \"%s\"",
+         nremoved, nall, hashp->tabname);
+
+    return nremoved > 0;
+}
 
 static bool
 dir_realloc(HTAB *hashp)
@@ -1665,7 +1796,7 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
         return false;
 
     /* Each element has a HASHELEMENT header plus user data. */
-    elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
+    elementSize = HASHELEMENT_SIZE(hctl) + MAXALIGN(hctl->entrysize);
 
     CurrentDynaHashCxt = hashp->hcxt;
     firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index c3c4d65998..fcc680bb82 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -208,6 +208,18 @@ SetCatCacheClock(TimestampTz ts)
     catcacheclock = ts;
 }
 
+/*
+ * GetCatCacheClock - get timestamp for catcache access record
+ *
+ * This clock is basically provided for catcache usage, but dynahash has a
+ * similar pruning mechanism and wants to use the same clock.
+ */
+static inline TimestampTz
+GetCatCacheClock(void)
+{
+    return catcacheclock;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 8357faac5a..7ea3c75423 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -13,7 +13,7 @@
  */
 #ifndef HSEARCH_H
 #define HSEARCH_H
-
+#include "datatype/timestamp.h"
 
 /*
  * Hash functions must have this signature.
@@ -47,6 +47,7 @@ typedef void *(*HashAllocFunc) (Size request);
  * HASHELEMENT is the private part of a hashtable entry.  The caller's data
  * follows the HASHELEMENT structure (on a MAXALIGN'd boundary).  The hash key
  * is expected to be at the start of the caller's hash entry data structure.
+ * If this hash is prunable, PRUNABLE_HASHELEMENT is used instead.
  */
 typedef struct HASHELEMENT
 {
@@ -54,12 +55,26 @@ typedef struct HASHELEMENT
     uint32        hashvalue;        /* hash function result for this entry */
 } HASHELEMENT;
 
+typedef struct PRUNABLE_HASHELEMENT
+{
+    struct HASHELEMENT *link;    /* link to next entry in same bucket */
+    uint32        hashvalue;        /* hash function result for this entry */
+    TimestampTz    last_access;    /* timestamp of the last usage */
+    int            naccess;        /* takes 0 to 2, counted up when used */
+} PRUNABLE_HASHELEMENT;
+
 /* Hash table header struct is an opaque type known only within dynahash.c */
 typedef struct HASHHDR HASHHDR;
 
 /* Hash table control struct is an opaque type known only within dynahash.c */
 typedef struct HTAB HTAB;
 
+/*
+ * Hash pruning callback. This is called for an entry that is about to be
+ * removed without the owner's explicit request.
+ */
+typedef bool (*HASH_PRUNE_CB)(HTAB *hashp, void *ent);
+
 /* Parameter data structure for hash_create */
 /* Only those fields indicated by hash_flags need be set */
 typedef struct HASHCTL
@@ -77,6 +92,9 @@ typedef struct HASHCTL
     HashAllocFunc alloc;        /* memory allocator */
     MemoryContext hcxt;            /* memory context to use for allocations */
     HASHHDR    *hctl;            /* location of header in shared mem */
+    HASH_PRUNE_CB    prune_cb;    /* pruning callback. see above. */
+    int           *memory_target;    /* pointer to memory target */
+    int           *prune_min_age;    /* pointer to prune minimum age */
 } HASHCTL;
 
 /* Flags to indicate which parameters are supplied */
@@ -94,6 +112,7 @@ typedef struct HASHCTL
 #define HASH_SHARED_MEM 0x0800    /* Hashtable is in shared memory */
 #define HASH_ATTACH        0x1000    /* Do not initialize hctl */
 #define HASH_FIXED_SIZE 0x2000    /* Initial size is a hard limit */
+#define HASH_PRUNABLE    0x4000    /* pruning setting */
 
 
 /* max_dsize value to indicate expansible directory */
-- 
2.16.2

From 94c85baed46e1a8330af7d664c44289d97d6df26 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 12 Mar 2018 17:31:43 +0900
Subject: [PATCH 3/4] Apply pruning to relcache

---
 src/backend/utils/cache/relcache.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 9ee78f885f..da9ecee15b 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3503,6 +3503,29 @@ RelationSetNewRelfilenode(Relation relation, char persistence,
 
 #define INITRELCACHESIZE        400
 
+/* callback function for hash pruning */
+static bool
+relcache_prune_cb(HTAB *hashp, void *ent)
+{
+    RelIdCacheEnt  *relent = (RelIdCacheEnt *) ent;
+    Relation        relation;
+
+    /* this relation is requested to be removed.  */
+    RelationIdCacheLookup(relent->reloid, relation);
+
+    /* but cannot remove cache entries currently in use */
+    if (!RelationHasReferenceCountZero(relation))
+        return false;
+
+    /*
+     * Otherwise we are allowed to forget it unconditionally.  See
+     * RelationForgetRelation.
+     */
+    RelationClearRelation(relation, false);
+
+    return true;
+}
+
 void
 RelationCacheInitialize(void)
 {
@@ -3520,8 +3543,11 @@ RelationCacheInitialize(void)
     MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(RelIdCacheEnt);
+
+    /* use the same settings as syscache */
+    ctl.prune_cb = relcache_prune_cb;
     RelationIdCache = hash_create("Relcache by OID", INITRELCACHESIZE,
-                                  &ctl, HASH_ELEM | HASH_BLOBS);
+                                  &ctl, HASH_ELEM | HASH_BLOBS | HASH_PRUNABLE);
 
     /*
      * relation mapper needs to be initialized too
-- 
2.16.2

From 89bb807c11ec411d1e25b0aa03792ae341435fec Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 13 Mar 2018 17:29:32 +0900
Subject: [PATCH 4/4] PoC of generic plan removal of CachedPlanSource.

---
 src/backend/utils/cache/plancache.c | 157 ++++++++++++++++++++++++++++++++++++
 src/backend/utils/hash/dynahash.c   |  16 +++-
 src/backend/utils/misc/guc.c        |  21 +++++
 src/backend/utils/mmgr/mcxt.c       |   1 +
 src/include/commands/prepare.h      |   4 +
 src/include/utils/hsearch.h         |   2 +
 src/include/utils/plancache.h       |  14 +++-
 7 files changed, 208 insertions(+), 7 deletions(-)

diff --git a/src/backend/utils/cache/plancache.c b/src/backend/utils/cache/plancache.c
index 8d7d8e04c9..9e34e4098e 100644
--- a/src/backend/utils/cache/plancache.c
+++ b/src/backend/utils/cache/plancache.c
@@ -63,12 +63,14 @@
 #include "storage/lmgr.h"
 #include "tcop/pquery.h"
 #include "tcop/utility.h"
+#include "utils/catcache.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
 #include "utils/resowner_private.h"
 #include "utils/rls.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/timestamp.h"
 
 
 /*
@@ -86,6 +88,13 @@
  * guarantee to save a CachedPlanSource without error.
  */
 static CachedPlanSource *first_saved_plan = NULL;
+static CachedPlanSource *last_saved_plan = NULL;
+static int                 num_saved_plans = 0;
+static TimestampTz         oldest_saved_plan = 0;
+
+/* GUC variables */
+int                         min_cached_plans = 1000;
+int                         plancache_prune_min_age = 600;
 
 static void ReleaseGenericPlan(CachedPlanSource *plansource);
 static List *RevalidateCachedQuery(CachedPlanSource *plansource,
@@ -105,6 +114,7 @@ static TupleDesc PlanCacheComputeResultDesc(List *stmt_list);
 static void PlanCacheRelCallback(Datum arg, Oid relid);
 static void PlanCacheFuncCallback(Datum arg, int cacheid, uint32 hashvalue);
 static void PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue);
+static void PruneCachedPlan(void);
 
 
 /*
@@ -207,6 +217,8 @@ CreateCachedPlan(RawStmt *raw_parse_tree,
     plansource->generic_cost = -1;
     plansource->total_custom_cost = 0;
     plansource->num_custom_plans = 0;
+    plansource->last_access = GetCatCacheClock();
+    
 
     MemoryContextSwitchTo(oldcxt);
 
@@ -422,6 +434,28 @@ CompleteCachedPlan(CachedPlanSource *plansource,
     plansource->is_valid = true;
 }
 
+/* move the plansource to the front of the list */
+static inline void
+MovePlansourceToFirst(CachedPlanSource *plansource)
+{
+    if (first_saved_plan != plansource)
+    {
+        /* delink this element */
+        if (plansource->next_saved)
+            plansource->next_saved->prev_saved = plansource->prev_saved;
+        if (plansource->prev_saved)
+            plansource->prev_saved->next_saved = plansource->next_saved;
+        if (last_saved_plan == plansource)
+            last_saved_plan = plansource->prev_saved;
+
+        /* insert at the beginning */
+        first_saved_plan->prev_saved = plansource;
+        plansource->next_saved = first_saved_plan;
+        plansource->prev_saved = NULL;
+        first_saved_plan = plansource;
+    }
+}
+
 /*
  * SaveCachedPlan: save a cached plan permanently
  *
@@ -469,6 +503,11 @@ SaveCachedPlan(CachedPlanSource *plansource)
      * Add the entry to the global list of cached plans.
      */
     plansource->next_saved = first_saved_plan;
+    if (first_saved_plan)
+        first_saved_plan->prev_saved = plansource;
+    else
+        last_saved_plan = plansource;
+    plansource->prev_saved = NULL;
     first_saved_plan = plansource;
 
     plansource->is_saved = true;
@@ -491,7 +530,11 @@ DropCachedPlan(CachedPlanSource *plansource)
     if (plansource->is_saved)
     {
         if (first_saved_plan == plansource)
+        {
             first_saved_plan = plansource->next_saved;
+            if (first_saved_plan)
+                first_saved_plan->prev_saved = NULL;
+        }
         else
         {
             CachedPlanSource *psrc;
@@ -501,10 +544,19 @@ DropCachedPlan(CachedPlanSource *plansource)
                 if (psrc->next_saved == plansource)
                 {
                     psrc->next_saved = plansource->next_saved;
+                    if (psrc->next_saved)
+                        psrc->next_saved->prev_saved = psrc;
                     break;
                 }
             }
         }
+
+        if (last_saved_plan == plansource)
+        {
+            last_saved_plan = plansource->prev_saved;
+            if (last_saved_plan)
+                last_saved_plan->next_saved = NULL;
+        }
         plansource->is_saved = false;
     }
 
@@ -536,6 +588,11 @@ ReleaseGenericPlan(CachedPlanSource *plansource)
         Assert(plan->magic == CACHEDPLAN_MAGIC);
         plansource->gplan = NULL;
         ReleaseCachedPlan(plan, false);
+        if (plansource->is_saved)
+        {
+            Assert (num_saved_plans >= 1);
+            num_saved_plans--;
+        }
     }
 }
 
@@ -1146,6 +1203,15 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams,
     if (useResOwner && !plansource->is_saved)
         elog(ERROR, "cannot apply ResourceOwner to non-saved cached plan");
 
+    /* increment access counter and set timestamp */
+    if (plansource->is_saved)
+    {
+        plansource->last_access = GetCatCacheClock();
+
+        /* move this plan to the first of the list if needed */
+        MovePlansourceToFirst(plansource);
+    }
+
     /* Make sure the querytree list is valid and we have parse-time locks */
     qlist = RevalidateCachedQuery(plansource, queryEnv);
 
@@ -1154,6 +1220,11 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams,
 
     if (!customplan)
     {
+        /* Prune cached plans if needed */
+        if (plansource->is_saved &&
+            (min_cached_plans < 0 || num_saved_plans > min_cached_plans))
+                PruneCachedPlan();
+
         if (CheckCachedPlan(plansource))
         {
             /* We want a generic plan, and we already have a valid one */
@@ -1166,6 +1237,12 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams,
             plan = BuildCachedPlan(plansource, qlist, NULL, queryEnv);
             /* Just make real sure plansource->gplan is clear */
             ReleaseGenericPlan(plansource);
+
+
+            /* Prune cached plans if needed */
+            if (plansource->is_saved)
+                num_saved_plans++;
+
             /* Link the new generic plan into the plansource */
             plansource->gplan = plan;
             plan->refcount++;
@@ -1853,6 +1930,86 @@ PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue)
     ResetPlanCache();
 }
 
+/*
+ * PruneCachedPlan: invalidate "old" cached plans.
+ */
+static void
+PruneCachedPlan(void)
+{
+    CachedPlanSource *plansource;
+    TimestampTz          currclock = GetCatCacheClock();
+    long              age;
+    int                  us;
+    int                  nremoved = 0;
+
+    /* do nothing if not wanted */
+    if (plancache_prune_min_age < 0 || num_saved_plans <= min_cached_plans)
+        return;
+
+    /* Fast check for oldest cache */
+    if (oldest_saved_plan > 0)
+    {
+        TimestampDifference(oldest_saved_plan, currclock, &age, &us);
+        if (age < plancache_prune_min_age)
+            return;
+    }        
+
+    /* last plan is the oldest. */
+    for (plansource = last_saved_plan; plansource; plansource = plansource->prev_saved)
+    {
+        long    plan_age;
+        int        us;
+
+        Assert(plansource->magic == CACHEDPLANSOURCE_MAGIC);
+
+        /*
+         * No work if it already has no generic plan; it will be moved to the
+         * beginning anyway so that we don't see it again next time.
+         */
+        if (!plansource->gplan)
+            continue;
+
+        /*
+         * Check age for pruning. Can exit immediately when finding a
+         * not-older element.
+         */
+        TimestampDifference(plansource->last_access, currclock, &plan_age, &us);
+        if (plan_age <= plancache_prune_min_age)
+        {
+            /* this entry is the next oldest */
+            oldest_saved_plan = plansource->last_access;
+            break;
+        }
+
+        /*
+         * Here, remove the generic plan of this plansource if it is not
+         * actually in use, then move it to the beginning of the list.  If the
+         * plan is in use, just update last_access and move it to the beginning.
+         */
+        if (plansource->gplan->refcount <= 1)
+        {
+            ReleaseGenericPlan(plansource);
+            nremoved++;
+        }
+
+        plansource->last_access = currclock;
+    }
+
+    /* move the "removed" plansources to the beginning of the list */
+    if (plansource != last_saved_plan && plansource)
+    {
+        plansource->next_saved->prev_saved = NULL;
+        first_saved_plan->prev_saved = last_saved_plan;
+         last_saved_plan->next_saved = first_saved_plan;
+        first_saved_plan = plansource->next_saved;
+        plansource->next_saved = NULL;
+        last_saved_plan = plansource;
+    }
+
+    if (nremoved > 0)
+        elog(DEBUG1, "plancache removed %d/%d", nremoved, num_saved_plans);
+}
+
 /*
  * ResetPlanCache: invalidate all cached plans.
  */
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 5a8b15652a..a5b4979662 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -187,6 +187,8 @@ struct HASHHDR
     int            nelem_alloc;    /* number of entries to allocate at once */
     bool        prunable;        /* true if prunable */
     HASH_PRUNE_CB    prune_cb;    /* pruning callback. see above. */
+    int           *memory_target;    /* pointer to memory target */
+    int           *prune_min_age;    /* pointer to prune minimum age */
 
 #ifdef HASH_STATISTICS
 
@@ -510,6 +512,14 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
     {
         hctl->prunable = true;
         hctl->prune_cb = info->prune_cb;
+        if (info->memory_target)
+            hctl->memory_target = info->memory_target;
+        else
+            hctl->memory_target = &syscache_memory_target;
+        if (info->prune_min_age)
+            hctl->prune_min_age = info->prune_min_age;
+        else
+            hctl->prune_min_age = &syscache_prune_min_age;
     }
     else
         hctl->prunable = false;
@@ -1654,7 +1664,7 @@ prune_entries(HTAB *hashp)
         has_seq_scans(hashp));
 
     /* This setting prevents pruning */
-    if (syscache_prune_min_age < 0)
+    if (*hctl->prune_min_age < 0)
         return false;
 
     /*
@@ -1663,7 +1673,7 @@ prune_entries(HTAB *hashp)
      * settings is shared with syscache
      */
     if (hctl->dsize * sizeof(HASHBUCKET) * hashp->ssize <
-        (Size) syscache_memory_target * 1024L)
+        (Size) *hctl->memory_target * 1024L)
         return false;
 
     /*
@@ -1683,7 +1693,7 @@ prune_entries(HTAB *hashp)
         TimestampDifference(helm->last_access, currclock, &entry_age, &us);
 
         /* settings is shared with syscache */
-        if (entry_age > syscache_prune_min_age)
+        if (entry_age > *hctl->prune_min_age)
         {
             /* Wait for the next chance if this is recently used */
             if (helm->naccess > 0)
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5e0d18657f..45aab61d62 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1995,6 +1995,27 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"min_cached_plans", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum number of cached plans kept on memory."),
+            gettext_noop("Timeout invalidation of plancache is not activated until the number of plancaches reaches
thisvalue. -1 means timeout invalidation is always active.")
 
+        },
+        &min_cached_plans,
+        1000, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"plancache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum duration of plancache entries to remove."),
+            gettext_noop("Plancache items that live unused for loger than this seconds are considered to be
removed."),
+             GUC_UNIT_S
+        },
+        &plancache_prune_min_age,
+        600, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index d7baa54808..db225a06da 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -194,6 +194,7 @@ MemoryContextResetChildren(MemoryContext context)
  * but we have to recurse to handle the children.
  * We must also delink the context from its parent, if it has one.
  */
+int hoge = 0;
 void
 MemoryContextDelete(MemoryContext context)
 {
diff --git a/src/include/commands/prepare.h b/src/include/commands/prepare.h
index ffec029df4..1a8e8dd50e 100644
--- a/src/include/commands/prepare.h
+++ b/src/include/commands/prepare.h
@@ -31,6 +31,10 @@ typedef struct
     CachedPlanSource *plansource;    /* the actual cached plan */
     bool        from_sql;        /* prepared via SQL, not FE/BE protocol? */
     TimestampTz prepare_time;    /* the time when the stmt was prepared */
+    RawStmt       *raw_stmt;
+    int            num_params;
+    Oid           *param_types;
+    List       *query_list;
 } PreparedStatement;
 
 
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index df12352a46..7ea3c75423 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -93,6 +93,8 @@ typedef struct HASHCTL
     MemoryContext hcxt;            /* memory context to use for allocations */
     HASHHDR    *hctl;            /* location of header in shared mem */
     HASH_PRUNE_CB    prune_cb;    /* pruning callback. see above. */
+    int           *memory_target;    /* pointer to memory target */
+    int           *prune_min_age;    /* pointer to prune minimum age */
 } HASHCTL;
 
 /* Flags to indicate which parameters are supplied */
diff --git a/src/include/utils/plancache.h b/src/include/utils/plancache.h
index ab20aa04b0..b5d439985c 100644
--- a/src/include/utils/plancache.h
+++ b/src/include/utils/plancache.h
@@ -72,10 +72,11 @@ struct RawStmt;
  * is no way to free memory short of clearing that entire context.  A oneshot
  * plan is always treated as unsaved.
  *
- * Note: the string referenced by commandTag is not subsidiary storage;
- * it is assumed to be a compile-time-constant string.  As with portals,
- * commandTag shall be NULL if and only if the original query string (before
- * rewriting) was an empty string.
+ * Note: the string referenced by commandTag is not subsidiary storage; it is
+ * assumed to be a compile-time-constant string.  As with portals, commandTag
+ * shall be NULL if and only if the original query string (before rewriting)
+ * was an empty string.  For memory-saving purposes, this struct is separated
+ * into two parts; the latter part is removable while the entry is inactive.
  */
 typedef struct CachedPlanSource
 {
@@ -110,11 +111,13 @@ typedef struct CachedPlanSource
     bool        is_valid;        /* is the query_list currently valid? */
     int            generation;        /* increments each time we create a plan */
     /* If CachedPlanSource has been saved, it is a member of a global list */
+    struct CachedPlanSource *prev_saved;    /* list link, if so */
     struct CachedPlanSource *next_saved;    /* list link, if so */
     /* State kept to help decide whether to use custom or generic plans: */
     double        generic_cost;    /* cost of generic plan, or -1 if not known */
     double        total_custom_cost;    /* total cost of custom plans so far */
     int            num_custom_plans;    /* number of plans included in total */
+    TimestampTz    last_access;    /* timestamp of the last usage */
 } CachedPlanSource;
 
 /*
@@ -143,6 +146,9 @@ typedef struct CachedPlan
     MemoryContext context;        /* context containing this CachedPlan */
 } CachedPlan;
 
+/* GUC variables */
+extern int min_cached_plans;
+extern int plancache_prune_min_age;
 
 extern void InitPlanCache(void);
 extern void ResetPlanCache(void);
-- 
2.16.2


Re: Re: Protect syscache from bloating with negative cache entries

From
David Steele
Date:
On 3/15/18 1:12 AM, Kyotaro HORIGUCHI wrote:
> At Mon, 12 Mar 2018 17:34:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
in> 
 
> The attached is the patch set including this plancache stuff.
> 
> 0001- catcache time-based expiration (The origin of this thread)
> 0002- introduces dynahash pruning feature
> 0003- implement relcache pruning using 0002
> 0004- (perhaps) independent from the three above. PoC of
>       plancache pruning. Details are shown above.

It looks like this should be marked Needs Review so I have done so.  If
that's not right please change it back or let me know and I will.

Regards,
-- 
-David
david@pgmasters.net


Re: Protect syscache from bloating with negative cache entries

From
Kyotaro HORIGUCHI
Date:
Hello.

At Wed, 21 Mar 2018 15:28:07 -0400, David Steele <david@pgmasters.net> wrote in
<43095b16-14fc-e4d8-3310-2b86eaaab662@pgmasters.net>
> On 3/15/18 1:12 AM, Kyotaro HORIGUCHI wrote:
> > At Mon, 12 Mar 2018 17:34:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in

> > The attached is the patch set including this plancache stuff.
> > 
> > 0001- catcache time-based expiration (The origin of this thread)
> > 0002- introduces dynahash pruning feature
> > 0003- implement relcache pruning using 0002
> > 0004- (perhaps) independent from the three above. PoC of
> >       plancache pruning. Details are shown above.
> 
> It looks like this should be marked Needs Review so I have done so.  If
> that's not right please change it back or let me know and I will.

Mmm. I haven't noticed that. Thanks!

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

From
Andres Freund
Date:
On 2018-03-23 17:01:11 +0900, Kyotaro HORIGUCHI wrote:
> Hello.
> 
> At Wed, 21 Mar 2018 15:28:07 -0400, David Steele <david@pgmasters.net> wrote in
<43095b16-14fc-e4d8-3310-2b86eaaab662@pgmasters.net>
> > On 3/15/18 1:12 AM, Kyotaro HORIGUCHI wrote:
> > > At Mon, 12 Mar 2018 17:34:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in

> > > The attached is the patch set including this plancache stuff.
> > > 
> > > 0001- catcache time-based expiration (The origin of this thread)
> > > 0002- introduces dynahash pruning feature
> > > 0003- implement relcache pruning using 0002
> > > 0004- (perhaps) independent from the three above. PoC of
> > >       plancache pruning. Details are shown above.
> > 
> > It looks like this should be marked Needs Review so I have done so.  If
> > that's not right please change it back or let me know and I will.
> 
> Mmm. I haven't noticed that. Thanks!

I actually think this should be marked as returned with feedback, or at
the very least moved to the next CF.  This is entirely new development
within the last CF. There's no realistic way we can get this into v11.

Greetings,

Andres Freund


Re: Protect syscache from bloating with negative cache entries

From
Kyotaro HORIGUCHI
Date:
At Thu, 29 Mar 2018 18:22:59 -0700, Andres Freund <andres@anarazel.de> wrote in
<20180330012259.7k3442yz7jighg2t@alap3.anarazel.de>
> On 2018-03-23 17:01:11 +0900, Kyotaro HORIGUCHI wrote:
> > Hello.
> > 
> > At Wed, 21 Mar 2018 15:28:07 -0400, David Steele <david@pgmasters.net> wrote in
<43095b16-14fc-e4d8-3310-2b86eaaab662@pgmasters.net>
> > > On 3/15/18 1:12 AM, Kyotaro HORIGUCHI wrote:
> > > > At Mon, 12 Mar 2018 17:34:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in

> > > > The attached is the patch set including this plancache stuff.
> > > > 
> > > > 0001- catcache time-based expiration (The origin of this thread)
> > > > 0002- introduces dynahash pruning feature
> > > > 0003- implement relcache pruning using 0002
> > > > 0004- (perhaps) independent from the three above. PoC of
> > > >       plancache pruning. Details are shown above.
> > > 
> > > It looks like this should be marked Needs Review so I have done so.  If
> > > that's not right please change it back or let me know and I will.
> > 
> > Mmm. I haven't noticed that. Thanks!
> 
> I actually think this should be marked as returned with feedback, or at
> the very least moved to the next CF.  This is entirely new development
> within the last CF. There's no realistic way we can get this into v11.

0002-0004 are new, in response to the comment that caches other
than the catcache ought to get the same feature. They can be
developed separately from 0001 for v12; I don't see a measure
that would catch all the cases at once.

If we agree on that point, I wish to discuss only 0001 for v11.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

From
Andres Freund
Date:
Hi,

On 2018-03-30 10:35:48 +0900, Kyotaro HORIGUCHI wrote:
> 0002-0004 are new, in response to the comment that caches other
> than the catcache ought to get the same feature. They can be
> developed separately from 0001 for v12; I don't see a measure
> that would catch all the cases at once.
> 
> If we agree on that point, I wish to discuss only 0001 for v11.

I'd personally not want to commit a solution for catcaches without also
committing a solution for at least relcaches in the same release cycle. I
think this patch simply has missed the window for v11.

Greetings,

Andres Freund


Re: Protect syscache from bloating with negative cache entries

From
Kyotaro HORIGUCHI
Date:
At Thu, 29 Mar 2018 18:51:45 -0700, Andres Freund <andres@anarazel.de> wrote in
<20180330015145.pvsr6kjtf6tw4uwe@alap3.anarazel.de>
> Hi,
> 
> On 2018-03-30 10:35:48 +0900, Kyotaro HORIGUCHI wrote:
> > 0002-0004 is new, in response to the comment that caches other
> > than the catcache ought to get the same feature. These can be a
> > separate development from 0001 for v12. I don't find a measures
> > to catch the all case at once.
> > 
> > If we agree on the point. I wish to discuss only 0001 for v11.
> 
> I'd personally not want to commit a solution for catcaches without also
> committing a solution for at least relcaches in the same release cycle. I
> think this patch simply has missed the window for v11.

Ok. Agreed. I moved this to the next CF.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

From
Kyotaro HORIGUCHI
Date:
Hello. I rebased this patchset.

At Thu, 15 Mar 2018 14:12:46 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20180315.141246.130742928.horiguchi.kyotaro@lab.ntt.co.jp>
> At Mon, 12 Mar 2018 17:34:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180312.173408.162882093.horiguchi.kyotaro@lab.ntt.co.jp>

> > > > In short, it's not really apparent to me that negative syscache entries
> > > > are the major problem of this kind.  I'm afraid that you're drawing very
> > > > large conclusions from a specific workload.  Maybe we could fix that
> > > > workload some other way.
> > > 
> > > The current patch doesn't consider whether an entry is negative
> > > or positive(?). Just clean up all entries based on time.
> > > 
> > > If relation has to have the same characterictics to syscaches, it
> > > might be better be on the catcache mechanism, instaed of adding
> > > the same pruning mechanism to dynahash..

This means unifying catcache and dynahash, which doesn't seem to
be a win-win consolidation. In addition, relcache links palloc'ed
memory, which needs additional treatment.

Or we could abstract a pruning mechanism applicable to both
machineries, specifically by unifying CatCacheCleanupOldEntries in
0001 and prune_entries in 0002. Or we could refactor dynahash and
rebuild catcache on top of dynahash.
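
To make the idea concrete, a rough sketch of the shared part could
look like the following (all names are hypothetical and this is not
part of the posted patches; it only factors out the age/naccess test
that CatCacheCleanupOldEntries and prune_entries already perform):

typedef struct PrunableEntryHeader
{
    TimestampTz last_access;    /* timestamp of the last usage */
    int         naccess;        /* 0..2, counted up when used */
} PrunableEntryHeader;

/*
 * Return true if the entry may be pruned now.  A recently used entry
 * gets its grace counter decremented instead of being removed, the
 * same way both existing functions behave.
 */
static bool
entry_is_prunable(PrunableEntryHeader *hdr, TimestampTz now, int prune_min_age)
{
    long    secs;
    int     usecs;

    if (prune_min_age < 0)
        return false;           /* pruning is disabled */

    TimestampDifference(hdr->last_access, now, &secs, &usecs);
    if (secs <= prune_min_age)
        return false;           /* accessed recently enough */

    if (hdr->naccess > 0)
    {
        hdr->naccess--;         /* recently used entries get another round */
        return false;
    }

    return true;                /* old and no longer used */
}

Each caller would then be responsible only for how to unlink and free
its own entry type.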

> > For the moment, I added such a feature to dynahash and let only
> > relcache use it in this patch. A hash element has a different
> > shape in a "prunable" hash, and pruning is performed in a similar
> > way, sharing the settings with syscache. This seems to work fine.
> 
> I gave consideration to plancache. The biggest difference from
> catcache and relcache is the fact that it is not voluntarily
> removable, since CachedPlanSource, the root struct of a plan
> cache, holds some indispensable information. In regard to
> prepared queries, even if we store the information in another
> location, for example in a "Prepared Queries" hash, it is merely
> moving big data into another place.
> 
> Looking into CachedPlanSource, the generic plan is a part that is
> safely removable since it is rebuilt as necessary. Having "old"
> plancache entries not hold a generic plan can reduce memory
> usage.
> 
> For testing purposes, I made 50000 prepared statements like
> "select sum(c) from p where e < $" on 100 partitions.
> 
> With the feature disabled (0004 patch), VSZ of the backend
> exceeds 3GB (it is still increasing at the moment), while it
> stops increasing at about 997MB with min_cached_plans = 1000 and
> plancache_prune_min_age = '10s'.
> 
> # 10s is apparently too short for actual use, of course.
> 
> The saving is expected to be a significant amount if the plans
> are large enough, but I'm still not sure it is worth doing, or
> whether this is the right way.
> 
> 
> The attached is the patch set including this plancache stuff.
> 
> 0001- catcache time-based expiration (The origin of this thread)
> 0002- introduces dynahash pruning feature
> 0003- implement relcache pruning using 0002
> 0004- (perhaps) independent from the three above. PoC of
>       plancache pruning. Details are shown above.
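
For reference, the two knobs exercised in the quoted test above are
the GUCs added by 0004. Just as an illustration, in the style of
postgresql.conf.sample (these lines are not part of any posted patch;
defaults are taken from the guc.c hunks):

#min_cached_plans = 1000		# -1 means timeout invalidation is always active
#plancache_prune_min_age = 600s	# -1 disables pruning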

I found up to v3 in this thread so I named this version 4.

regards.


-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 842f7b9fd47c6ee4daf1316547679d4298538940 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 15 Mar 2018 12:04:43 +0900
Subject: [PATCH 4/4] Generic plan removal of CachedPlanSource.

We cannot remove saved cached plans themselves while pruning, since
they are pointed to from other structures, but we can still remove the
generic plan of each saved plan. The behavior is controlled by two
additional GUC variables, min_cached_plans and cache_prune_min_age. The
former tells how many generic plans to keep without pruning; the latter
tells how long we should keep generic plans before pruning.
---
 src/backend/utils/cache/plancache.c | 163 ++++++++++++++++++++++++++++++++++++
 src/backend/utils/misc/guc.c        |  10 +++
 src/include/utils/plancache.h       |   7 +-
 3 files changed, 179 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/cache/plancache.c b/src/backend/utils/cache/plancache.c
index 0ad3e3c736..701ead152c 100644
--- a/src/backend/utils/cache/plancache.c
+++ b/src/backend/utils/cache/plancache.c
@@ -63,12 +63,14 @@
 #include "storage/lmgr.h"
 #include "tcop/pquery.h"
 #include "tcop/utility.h"
+#include "utils/catcache.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
 #include "utils/resowner_private.h"
 #include "utils/rls.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/timestamp.h"
 
 
 /*
@@ -86,6 +88,12 @@
  * guarantee to save a CachedPlanSource without error.
  */
 static CachedPlanSource *first_saved_plan = NULL;
+static CachedPlanSource *last_saved_plan = NULL;
+static int                 num_saved_plans = 0;
+static TimestampTz         oldest_saved_plan = 0;
+
+/* GUC variables */
+int                         min_cached_plans = 1000;
 
 static void ReleaseGenericPlan(CachedPlanSource *plansource);
 static List *RevalidateCachedQuery(CachedPlanSource *plansource,
@@ -105,6 +113,7 @@ static TupleDesc PlanCacheComputeResultDesc(List *stmt_list);
 static void PlanCacheRelCallback(Datum arg, Oid relid);
 static void PlanCacheFuncCallback(Datum arg, int cacheid, uint32 hashvalue);
 static void PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue);
+static void PruneCachedPlan(void);
 
 
 /*
@@ -208,6 +217,8 @@ CreateCachedPlan(RawStmt *raw_parse_tree,
     plansource->generic_cost = -1;
     plansource->total_custom_cost = 0;
     plansource->num_custom_plans = 0;
+    plansource->last_access = GetCatCacheClock();
+    
 
     MemoryContextSwitchTo(oldcxt);
 
@@ -423,6 +434,28 @@ CompleteCachedPlan(CachedPlanSource *plansource,
     plansource->is_valid = true;
 }
 
+/* move the plansource to the front of the list */
+static inline void
+MovePlansourceToFirst(CachedPlanSource *plansource)
+{
+    if (first_saved_plan != plansource)
+    {
+        /* delink this element */
+        if (plansource->next_saved)
+            plansource->next_saved->prev_saved = plansource->prev_saved;
+        if (plansource->prev_saved)
+            plansource->prev_saved->next_saved = plansource->next_saved;
+        if (last_saved_plan == plansource)
+            last_saved_plan = plansource->prev_saved;
+
+        /* insert at the beginning */
+        first_saved_plan->prev_saved = plansource;
+        plansource->next_saved = first_saved_plan;
+        plansource->prev_saved = NULL;
+        first_saved_plan = plansource;
+    }
+}
+
 /*
  * SaveCachedPlan: save a cached plan permanently
  *
@@ -470,6 +503,11 @@ SaveCachedPlan(CachedPlanSource *plansource)
      * Add the entry to the global list of cached plans.
      */
     plansource->next_saved = first_saved_plan;
+    if (first_saved_plan)
+        first_saved_plan->prev_saved = plansource;
+    else
+        last_saved_plan = plansource;
+    plansource->prev_saved = NULL;
     first_saved_plan = plansource;
 
     plansource->is_saved = true;
@@ -492,7 +530,11 @@ DropCachedPlan(CachedPlanSource *plansource)
     if (plansource->is_saved)
     {
         if (first_saved_plan == plansource)
+        {
             first_saved_plan = plansource->next_saved;
+            if (first_saved_plan)
+                first_saved_plan->prev_saved = NULL;
+        }
         else
         {
             CachedPlanSource *psrc;
@@ -502,10 +544,19 @@ DropCachedPlan(CachedPlanSource *plansource)
                 if (psrc->next_saved == plansource)
                 {
                     psrc->next_saved = plansource->next_saved;
+                    if (psrc->next_saved)
+                        psrc->next_saved->prev_saved = psrc;
                     break;
                 }
             }
         }
+
+        if (last_saved_plan == plansource)
+        {
+            last_saved_plan = plansource->prev_saved;
+            if (last_saved_plan)
+                last_saved_plan->next_saved = NULL;
+        }
         plansource->is_saved = false;
     }
 
@@ -537,6 +588,13 @@ ReleaseGenericPlan(CachedPlanSource *plansource)
         Assert(plan->magic == CACHEDPLAN_MAGIC);
         plansource->gplan = NULL;
         ReleaseCachedPlan(plan, false);
+
+        /* decrement "saved plans" counter */
+        if (plansource->is_saved)
+        {
+            Assert (num_saved_plans > 0);
+            num_saved_plans--;
+        }
     }
 }
 
@@ -1148,6 +1206,17 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams,
     if (useResOwner && !plansource->is_saved)
         elog(ERROR, "cannot apply ResourceOwner to non-saved cached plan");
 
+    /*
+     * set last-accessed timestamp and move this plan to the first of the list
+     */
+    if (plansource->is_saved)
+    {
+        plansource->last_access = GetCatCacheClock();
+
+        /* move this plan to the first of the list */
+        MovePlansourceToFirst(plansource);
+    }
+
     /* Make sure the querytree list is valid and we have parse-time locks */
     qlist = RevalidateCachedQuery(plansource, queryEnv);
 
@@ -1156,6 +1225,11 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams,
 
     if (!customplan)
     {
+        /* Prune cached plans if needed */
+        if (plansource->is_saved &&
+            min_cached_plans >= 0 && num_saved_plans > min_cached_plans)
+                PruneCachedPlan();
+
         if (CheckCachedPlan(plansource))
         {
             /* We want a generic plan, and we already have a valid one */
@@ -1168,6 +1242,11 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams,
             plan = BuildCachedPlan(plansource, qlist, NULL, queryEnv);
             /* Just make real sure plansource->gplan is clear */
             ReleaseGenericPlan(plansource);
+
+            /* count this new saved plan */
+            if (plansource->is_saved)
+                num_saved_plans++;
+
             /* Link the new generic plan into the plansource */
             plansource->gplan = plan;
             plan->refcount++;
@@ -1856,6 +1935,90 @@ PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue)
     ResetPlanCache();
 }
 
+/*
+ * PruneCachedPlan: remove the generic plans of "old" saved plans.
+ */
+static void
+PruneCachedPlan(void)
+{
+    CachedPlanSource *plansource;
+    TimestampTz          currclock = GetCatCacheClock();
+    long              age;
+    int                  us;
+    int                  nremoved = 0;
+
+    /* do nothing if not wanted */
+    if (cache_prune_min_age < 0 || num_saved_plans <= min_cached_plans)
+        return;
+
+    /* Fast check for oldest cache */
+    if (oldest_saved_plan > 0)
+    {
+        TimestampDifference(oldest_saved_plan, currclock, &age, &us);
+        if (age < cache_prune_min_age)
+            return;
+    }        
+
+    /* last plan is the oldest. */
+    for (plansource = last_saved_plan; plansource; plansource = plansource->prev_saved)
+    {
+        long    plan_age;
+        int        us;
+
+        Assert(plansource->magic == CACHEDPLANSOURCE_MAGIC);
+
+        /* we want to prune no more plans */
+        if (num_saved_plans <= min_cached_plans)
+            break;
+
+        /*
+         * No work if it already has no generic plan; it will be moved to the
+         * beginning anyway so that we don't see it again next time.
+         */
+        if (!plansource->gplan)
+            continue;
+
+        /*
+         * Check age for pruning. Can exit immediately when finding a
+         * not-older element.
+         */
+        TimestampDifference(plansource->last_access, currclock, &plan_age, &us);
+        if (plan_age <= cache_prune_min_age)
+        {
+            /* this entry is the next oldest */
+            oldest_saved_plan = plansource->last_access;
+            break;
+        }
+
+        /*
+         * Here, remove the generic plan of this plansource if it is not
+         * actually in use, and move it to the beginning of the list. If the
+         * plan is in use, just update last_access and move it to the beginning.
+         */
+        if (plansource->gplan->refcount <= 1)
+        {
+            ReleaseGenericPlan(plansource);
+            nremoved++;
+        }
+
+        plansource->last_access = currclock;
+    }
+
+    /* move the "removed" plansrouces altogehter to the beginning of the list */
+    if (plansource != last_saved_plan && plansource)
+    {
+        plansource->next_saved->prev_saved = NULL;
+        first_saved_plan->prev_saved = last_saved_plan;
+         last_saved_plan->next_saved = first_saved_plan;
+        first_saved_plan = plansource->next_saved;
+        plansource->next_saved = NULL;
+        last_saved_plan = plansource;
+    }
+
+    if (nremoved > 0)
+        elog(DEBUG1, "plancache removed %d/%d", nremoved, num_saved_plans);
+}
+
 /*
  * ResetPlanCache: invalidate all cached plans.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 9800252965..478bfe96a4 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2128,6 +2128,16 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"min_cached_plans", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum number of cached plans kept on memory."),
+            gettext_noop("Timeout invalidation of plancache is not activated until the number of plancaches reaches
thisvalue. -1 means timeout invalidation is always active.")
 
+        },
+        &min_cached_plans,
+        1000, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/include/utils/plancache.h b/src/include/utils/plancache.h
index ab20aa04b0..f3c5b2010d 100644
--- a/src/include/utils/plancache.h
+++ b/src/include/utils/plancache.h
@@ -110,11 +110,13 @@ typedef struct CachedPlanSource
     bool        is_valid;        /* is the query_list currently valid? */
     int            generation;        /* increments each time we create a plan */
     /* If CachedPlanSource has been saved, it is a member of a global list */
-    struct CachedPlanSource *next_saved;    /* list link, if so */
+    struct CachedPlanSource *prev_saved;    /* list prev link, if so */
+    struct CachedPlanSource *next_saved;    /* list next link, if so */
     /* State kept to help decide whether to use custom or generic plans: */
     double        generic_cost;    /* cost of generic plan, or -1 if not known */
     double        total_custom_cost;    /* total cost of custom plans so far */
     int            num_custom_plans;    /* number of plans included in total */
+    TimestampTz    last_access;    /* timestamp of the last usage */
 } CachedPlanSource;
 
 /*
@@ -143,6 +145,9 @@ typedef struct CachedPlan
     MemoryContext context;        /* context containing this CachedPlan */
 } CachedPlan;
 
+/* GUC variables */
+extern int min_cached_plans;
+extern int plancache_prune_min_age;
 
 extern void InitPlanCache(void);
 extern void ResetPlanCache(void);
-- 
2.16.3
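
The plancache part above boils down to simple LRU bookkeeping: saved plans sit on a doubly linked list, every access moves the entry to the front and refreshes last_access, and pruning walks from the tail until it reaches an entry that is still young or the count drops to min_cached_plans. The following is a minimal, standalone C sketch of that idea with illustrative names only; it is not the patch code, and it removes whole entries where the real patch keeps the CachedPlanSource and only drops its generic plan.

#include <stdlib.h>
#include <time.h>

typedef struct Plan
{
    struct Plan *prev;
    struct Plan *next;
    time_t       last_access;
    int          in_use;        /* stands in for gplan->refcount > 1 */
} Plan;

static Plan *head;              /* most recently used */
static Plan *tail;              /* least recently used */
static int   nplans;

/* unlink from the current position and push to the front (MRU end) */
static void
move_to_front(Plan *p)
{
    if (head == p)
        return;
    if (p->prev)
        p->prev->next = p->next;
    if (p->next)
        p->next->prev = p->prev;
    if (tail == p)
        tail = p->prev;
    p->prev = NULL;
    p->next = head;
    if (head)
        head->prev = p;
    head = p;
    if (tail == NULL)
        tail = p;
}

/* called on every access, as GetCachedPlan() does in the patch */
static void
touch_plan(Plan *p)
{
    p->last_access = time(NULL);
    move_to_front(p);
}

/* walk from the LRU end; stop at the first entry that is still young */
static void
prune_plans(int min_cached, long prune_min_age)
{
    time_t now = time(NULL);

    while (tail && nplans > min_cached)
    {
        Plan *victim = tail;

        if (now - victim->last_access <= prune_min_age)
            break;              /* everything closer to the head is newer */

        if (victim->in_use)
        {
            touch_plan(victim); /* busy: refresh and re-queue at the front */
            continue;
        }

        /* drop the LRU entry (simplification: remove it entirely) */
        tail = victim->prev;
        if (tail)
            tail->next = NULL;
        else
            head = NULL;
        free(victim);
        nplans--;
    }
}

int
main(void)
{
    /* create five saved plans that were last used an hour ago */
    for (int i = 0; i < 5; i++)
    {
        Plan *p = calloc(1, sizeof(Plan));

        p->last_access = time(NULL) - 3600;
        p->next = head;
        if (head)
            head->prev = p;
        else
            tail = p;
        head = p;
        nplans++;
    }

    prune_plans(2, 600);        /* drops the three oldest, keeps two */
    return nplans;              /* 2 */
}

The design point the sketch tries to make visible is that a single scan from the tail is enough: as soon as one entry is found to be younger than the threshold, everything ahead of it is younger still.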

From 06bf577b5092a9fa443122bc8eef51284c6aa339 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 12 Mar 2018 17:31:43 +0900
Subject: [PATCH 3/3] Apply pruning to relcache

---
 src/backend/utils/cache/relcache.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index d85dc92505..dbbf9855b0 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3442,6 +3442,26 @@ RelationSetNewRelfilenode(Relation relation, char persistence,
 
 #define INITRELCACHESIZE        400
 
+/* callback function for hash pruning */
+static bool
+relcache_prune_cb(HTAB *hashp, void *ent)
+{
+    RelIdCacheEnt  *relent = (RelIdCacheEnt *) ent;
+    Relation        relation;
+
+    /* look up the relation whose cache entry is being considered for removal */
+    RelationIdCacheLookup(relent->reloid, relation);
+
+    /* don't remove if currently in use */
+    if (!RelationHasReferenceCountZero(relation))
+        return false;
+
+    /* otherwise we can forget it unconditionally */
+    RelationClearRelation(relation, false);
+
+    return true;
+}
+
 void
 RelationCacheInitialize(void)
 {
@@ -3459,8 +3479,11 @@ RelationCacheInitialize(void)
     MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(RelIdCacheEnt);
+
+    /* use the same settings as syscache */
+    ctl.prune_cb = relcache_prune_cb;
     RelationIdCache = hash_create("Relcache by OID", INITRELCACHESIZE,
-                                  &ctl, HASH_ELEM | HASH_BLOBS);
+                                  &ctl, HASH_ELEM | HASH_BLOBS | HASH_PRUNABLE);
 
     /*
      * relation mapper needs to be initialized too
-- 
2.16.3
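
The callback contract this patch relies on (added by patch 0002) is small: the hash machinery picks a stale entry and hands it to prune_cb, which either removes it and returns true, or returns false to veto the removal, just as relcache_prune_cb vetoes relations whose reference count is not zero. A standalone C sketch of that contract, with hypothetical names, follows.

#include <stdbool.h>
#include <stdio.h>

typedef struct Entry
{
    int  id;
    int  refcount;              /* > 0 means the entry is currently in use */
    bool valid;                 /* still present in the "cache"? */
} Entry;

/* the callback: mirrors the decision relcache_prune_cb() makes */
static bool
prune_cb(Entry *ent)
{
    if (ent->refcount > 0)
        return false;           /* in use: veto the removal */
    ent->valid = false;         /* otherwise forget it unconditionally */
    return true;
}

/* the caller: mirrors the hash code handing stale entries to the callback */
static int
prune_stale(Entry *entries, int n)
{
    int nremoved = 0;

    for (int i = 0; i < n; i++)
    {
        if (entries[i].valid && prune_cb(&entries[i]))
            nremoved++;
    }
    return nremoved;
}

int
main(void)
{
    Entry cache[] = {{1, 0, true}, {2, 3, true}, {3, 0, true}};

    /* entries 1 and 3 are removed; entry 2 is pinned and survives */
    printf("removed %d of 3\n", prune_stale(cache, 3));
    return 0;
}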

From 0c0f8ff6e786dc20aa43636cad57f3713c0c89dd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 12 Mar 2018 15:52:18 +0900
Subject: [PATCH 2/3] introduce dynahash pruning

---
 src/backend/utils/hash/dynahash.c | 166 +++++++++++++++++++++++++++++++++-----
 src/include/utils/catcache.h      |  12 +++
 src/include/utils/hsearch.h       |  21 ++++-
 3 files changed, 179 insertions(+), 20 deletions(-)

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 785e0faffb..261f8d9577 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -88,6 +88,7 @@
 #include "access/xact.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
+#include "utils/catcache.h"
 #include "utils/dynahash.h"
 #include "utils/memutils.h"
 
@@ -184,6 +185,12 @@ struct HASHHDR
     long        ssize;            /* segment size --- must be power of 2 */
     int            sshift;            /* segment shift = log2(ssize) */
     int            nelem_alloc;    /* number of entries to allocate at once */
+    bool        prunable;        /* true if prunable */
+    HASH_PRUNE_CB    prune_cb;    /* function to call instead of just deleting */
+
+    /* These fields point to variables to control pruning */
+    int           *memory_target;    /* pointer to memory target value in kB */
+    int           *prune_min_age;    /* pointer to prune minimum age value in sec */
 
 #ifdef HASH_STATISTICS
 
@@ -227,16 +234,18 @@ struct HTAB
     int            sshift;            /* segment shift = log2(ssize) */
 };
 
+#define HASHELEMENT_SIZE(ctlp) MAXALIGN(ctlp->prunable ? sizeof(PRUNABLE_HASHELEMENT) : sizeof(HASHELEMENT))
+
 /*
  * Key (also entry) part of a HASHELEMENT
  */
-#define ELEMENTKEY(helem)  (((char *)(helem)) + MAXALIGN(sizeof(HASHELEMENT)))
+#define ELEMENTKEY(helem, ctlp)  (((char *)(helem)) + HASHELEMENT_SIZE(ctlp))
 
 /*
  * Obtain element pointer given pointer to key
  */
-#define ELEMENT_FROM_KEY(key)  \
-    ((HASHELEMENT *) (((char *) (key)) - MAXALIGN(sizeof(HASHELEMENT))))
+#define ELEMENT_FROM_KEY(key, ctlp)                                        \
+    ((HASHELEMENT *) (((char *) (key)) - HASHELEMENT_SIZE(ctlp)))
 
 /*
  * Fast MOD arithmetic, assuming that y is a power of 2 !
@@ -257,6 +266,7 @@ static HASHSEGMENT seg_alloc(HTAB *hashp);
 static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
+static bool prune_entries(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
 static void hdefault(HTAB *hashp);
 static int    choose_nelem_alloc(Size entrysize);
@@ -499,6 +509,29 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
         hctl->entrysize = info->entrysize;
     }
 
+    /*
+     * Set up pruning.
+     *
+     * We have two knobs to control pruning; a hash table can either provide
+     * its own settings or share those of the syscache.
+     *
+     */
+    if (flags & HASH_PRUNABLE)
+    {
+        hctl->prunable = true;
+        hctl->prune_cb = info->prune_cb;
+        if (info->memory_target)
+            hctl->memory_target = info->memory_target;
+        else
+            hctl->memory_target = &cache_memory_target;
+        if (info->prune_min_age)
+            hctl->prune_min_age = info->prune_min_age;
+        else
+            hctl->prune_min_age = &cache_prune_min_age;
+    }
+    else
+        hctl->prunable = false;
+
     /* make local copies of heavily-used constant fields */
     hashp->keysize = hctl->keysize;
     hashp->ssize = hctl->ssize;
@@ -984,7 +1017,7 @@ hash_search_with_hash_value(HTAB *hashp,
     while (currBucket != NULL)
     {
         if (currBucket->hashvalue == hashvalue &&
-            match(ELEMENTKEY(currBucket), keyPtr, keysize) == 0)
+            match(ELEMENTKEY(currBucket, hctl), keyPtr, keysize) == 0)
             break;
         prevBucketPtr = &(currBucket->link);
         currBucket = *prevBucketPtr;
@@ -997,6 +1030,17 @@ hash_search_with_hash_value(HTAB *hashp,
     if (foundPtr)
         *foundPtr = (bool) (currBucket != NULL);
 
+    /* Update access counter if needed */
+    if (hctl->prunable && currBucket &&
+        (action == HASH_FIND || action == HASH_ENTER))
+    {
+        PRUNABLE_HASHELEMENT *prunable_elm =
+            (PRUNABLE_HASHELEMENT *) currBucket;
+        if (prunable_elm->naccess < 2)
+            prunable_elm->naccess++;
+        prunable_elm->last_access = GetCatCacheClock();
+    }
+
     /*
      * OK, now what?
      */
@@ -1004,7 +1048,8 @@ hash_search_with_hash_value(HTAB *hashp,
     {
         case HASH_FIND:
             if (currBucket != NULL)
-                return (void *) ELEMENTKEY(currBucket);
+                return (void *) ELEMENTKEY(currBucket, hctl);
+
             return NULL;
 
         case HASH_REMOVE:
@@ -1033,7 +1078,7 @@ hash_search_with_hash_value(HTAB *hashp,
                  * element, because someone else is going to reuse it the next
                  * time something is added to the table
                  */
-                return (void *) ELEMENTKEY(currBucket);
+                return (void *) ELEMENTKEY(currBucket, hctl);
             }
             return NULL;
 
@@ -1045,7 +1090,7 @@ hash_search_with_hash_value(HTAB *hashp,
         case HASH_ENTER:
             /* Return existing element if found, else create one */
             if (currBucket != NULL)
-                return (void *) ELEMENTKEY(currBucket);
+                return (void *) ELEMENTKEY(currBucket, hctl);
 
             /* disallow inserts if frozen */
             if (hashp->frozen)
@@ -1075,8 +1120,18 @@ hash_search_with_hash_value(HTAB *hashp,
 
             /* copy key into record */
             currBucket->hashvalue = hashvalue;
-            hashp->keycopy(ELEMENTKEY(currBucket), keyPtr, keysize);
+            hashp->keycopy(ELEMENTKEY(currBucket, hctl), keyPtr, keysize);
 
+            /* set access counter */
+            if (hctl->prunable)
+            {
+                PRUNABLE_HASHELEMENT *prunable_elm =
+                    (PRUNABLE_HASHELEMENT *) currBucket;
+                if (prunable_elm->naccess < 2)
+                    prunable_elm->naccess++;
+                prunable_elm->last_access = GetCatCacheClock();
+            }
+            
             /*
              * Caller is expected to fill the data field on return.  DO NOT
              * insert any code that could possibly throw error here, as doing
@@ -1084,7 +1139,7 @@ hash_search_with_hash_value(HTAB *hashp,
              * caller's data structure.
              */
 
-            return (void *) ELEMENTKEY(currBucket);
+            return (void *) ELEMENTKEY(currBucket, hctl);
     }
 
     elog(ERROR, "unrecognized hash action code: %d", (int) action);
@@ -1116,7 +1171,7 @@ hash_update_hash_key(HTAB *hashp,
                      void *existingEntry,
                      const void *newKeyPtr)
 {
-    HASHELEMENT *existingElement = ELEMENT_FROM_KEY(existingEntry);
+    HASHELEMENT *existingElement = ELEMENT_FROM_KEY(existingEntry, hashp->hctl);
     HASHHDR    *hctl = hashp->hctl;
     uint32        newhashvalue;
     Size        keysize;
@@ -1200,7 +1255,7 @@ hash_update_hash_key(HTAB *hashp,
     while (currBucket != NULL)
     {
         if (currBucket->hashvalue == newhashvalue &&
-            match(ELEMENTKEY(currBucket), newKeyPtr, keysize) == 0)
+            match(ELEMENTKEY(currBucket, hctl), newKeyPtr, keysize) == 0)
             break;
         prevBucketPtr = &(currBucket->link);
         currBucket = *prevBucketPtr;
@@ -1234,7 +1289,7 @@ hash_update_hash_key(HTAB *hashp,
 
     /* copy new key into record */
     currBucket->hashvalue = newhashvalue;
-    hashp->keycopy(ELEMENTKEY(currBucket), newKeyPtr, keysize);
+    hashp->keycopy(ELEMENTKEY(currBucket, hctl), newKeyPtr, keysize);
 
     /* rest of record is untouched */
 
@@ -1388,8 +1443,8 @@ hash_seq_init(HASH_SEQ_STATUS *status, HTAB *hashp)
 void *
 hash_seq_search(HASH_SEQ_STATUS *status)
 {
-    HTAB       *hashp;
-    HASHHDR    *hctl;
+    HTAB       *hashp = status->hashp;
+    HASHHDR    *hctl = hashp->hctl;
     uint32        max_bucket;
     long        ssize;
     long        segment_num;
@@ -1404,15 +1459,13 @@ hash_seq_search(HASH_SEQ_STATUS *status)
         status->curEntry = curElem->link;
         if (status->curEntry == NULL)    /* end of this bucket */
             ++status->curBucket;
-        return (void *) ELEMENTKEY(curElem);
+        return (void *) ELEMENTKEY(curElem, hctl);
     }
 
     /*
      * Search for next nonempty bucket starting at curBucket.
      */
     curBucket = status->curBucket;
-    hashp = status->hashp;
-    hctl = hashp->hctl;
     ssize = hashp->ssize;
     max_bucket = hctl->max_bucket;
 
@@ -1458,7 +1511,7 @@ hash_seq_search(HASH_SEQ_STATUS *status)
     if (status->curEntry == NULL)    /* end of this bucket */
         ++curBucket;
     status->curBucket = curBucket;
-    return (void *) ELEMENTKEY(curElem);
+    return (void *) ELEMENTKEY(curElem, hctl);
 }
 
 void
@@ -1552,6 +1605,10 @@ expand_table(HTAB *hashp)
      */
     if ((uint32) new_bucket > hctl->high_mask)
     {
+        /* try pruning before expansion. return true on success */
+        if (hctl->prunable && prune_entries(hashp))
+            return true;
+
         hctl->low_mask = hctl->high_mask;
         hctl->high_mask = (uint32) new_bucket | hctl->low_mask;
     }
@@ -1594,6 +1651,77 @@ expand_table(HTAB *hashp)
     return true;
 }
 
+static bool
+prune_entries(HTAB *hashp)
+{
+    HASHHDR           *hctl = hashp->hctl;
+    HASH_SEQ_STATUS status;
+    void            *elm;
+    TimestampTz        currclock = GetCatCacheClock();
+    int                nall = 0,
+                    nremoved = 0;
+
+    Assert(hctl->prunable);
+
+    /* Return if pruning is currently disabled or not doable */
+    if (*hctl->prune_min_age < 0 || hashp->frozen || has_seq_scans(hashp))
+        return false;
+
+    /*
+     * We don't prune before the hash reaches this size. Only the bucket
+     * array size is considered, since it is the dominant part of memory usage.
+     */
+    if (hctl->dsize * sizeof(HASHBUCKET) * hashp->ssize <
+        (Size) *hctl->memory_target * 1024L)
+        return false;
+
+    /* Ok, start pruning. we can use seq scan here. */
+    hash_seq_init(&status, hashp);
+    while ((elm = hash_seq_search(&status)) != NULL)
+    {
+        PRUNABLE_HASHELEMENT *helm =
+            (PRUNABLE_HASHELEMENT *)ELEMENT_FROM_KEY(elm, hctl);
+        long    entry_age;
+        int        us;
+
+        nall++;
+
+        TimestampDifference(helm->last_access, currclock, &entry_age, &us);
+
+        /*
+         * consider pruning if this entry has not been accessed for a certain
+         * time
+         */
+        if (entry_age > *hctl->prune_min_age)
+        {
+            /* Wait for the next chance if this is recently used */
+            if (helm->naccess > 0)
+                helm->naccess--;
+            else
+            {
+                /* just call it if callback is provided, remove otherwise */
+                if (hctl->prune_cb)
+                {
+                    if (hctl->prune_cb(hashp, (void *)elm))
+                        nremoved++;
+                }
+                else
+                {
+                    bool found;
+                    
+                    hash_search(hashp, elm, HASH_REMOVE, &found);
+                    Assert(found);
+                    nremoved++;
+                }
+            }
+        }
+    }
+
+    elog(DEBUG1, "removed %d/%d entries from hash \"%s\"",
+         nremoved, nall, hashp->tabname);
+
+    return nremoved > 0;
+}
 
 static bool
 dir_realloc(HTAB *hashp)
@@ -1667,7 +1795,7 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
         return false;
 
     /* Each element has a HASHELEMENT header plus user data. */
-    elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
+    elementSize = HASHELEMENT_SIZE(hctl) + MAXALIGN(hctl->entrysize);
 
     CurrentDynaHashCxt = hashp->hcxt;
     firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 599303be56..b3f73f53d2 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -208,6 +208,18 @@ SetCatCacheClock(TimestampTz ts)
     catcacheclock = ts;
 }
 
+/*
+ * GetCatCacheClock - get timestamp for catcache access record
+ *
+ * This clock is basically provided for catcache usage, but dynahash has a
+ * similar pruning mechanism and wants to use the same clock.
+ */
+static inline TimestampTz
+GetCatCacheClock(void)
+{
+    return catcacheclock;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 8357faac5a..6e9fa74a4f 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -13,7 +13,7 @@
  */
 #ifndef HSEARCH_H
 #define HSEARCH_H
-
+#include "datatype/timestamp.h"
 
 /*
  * Hash functions must have this signature.
@@ -47,6 +47,7 @@ typedef void *(*HashAllocFunc) (Size request);
  * HASHELEMENT is the private part of a hashtable entry.  The caller's data
  * follows the HASHELEMENT structure (on a MAXALIGN'd boundary).  The hash key
  * is expected to be at the start of the caller's hash entry data structure.
+ * If this hash is prunable, PRUNABLE_HASHELEMENT is used instead.
  */
 typedef struct HASHELEMENT
 {
@@ -54,12 +55,26 @@ typedef struct HASHELEMENT
     uint32        hashvalue;        /* hash function result for this entry */
 } HASHELEMENT;
 
+typedef struct PRUNABLE_HASHELEMENT
+{
+    struct HASHELEMENT *link;    /* link to next entry in same bucket */
+    uint32        hashvalue;        /* hash function result for this entry */
+    TimestampTz    last_access;    /* timestamp of last usage */
+    int            naccess;        /* takes 0 to 2, counted up when used */
+} PRUNABLE_HASHELEMENT;
+
 /* Hash table header struct is an opaque type known only within dynahash.c */
 typedef struct HASHHDR HASHHDR;
 
 /* Hash table control struct is an opaque type known only within dynahash.c */
 typedef struct HTAB HTAB;
 
+/*
+ * Hash pruning callback, called for an entry that is about to be pruned;
+ * it returns false if the entry should be kept.
+ */
+typedef bool (*HASH_PRUNE_CB)(HTAB *hashp, void *ent);
+
 /* Parameter data structure for hash_create */
 /* Only those fields indicated by hash_flags need be set */
 typedef struct HASHCTL
@@ -77,6 +92,9 @@ typedef struct HASHCTL
     HashAllocFunc alloc;        /* memory allocator */
     MemoryContext hcxt;            /* memory context to use for allocations */
     HASHHDR    *hctl;            /* location of header in shared mem */
+    HASH_PRUNE_CB    prune_cb;    /* pruning callback. see above. */
+    int           *memory_target;    /* pointer to memory target */
+    int           *prune_min_age;    /* pointer to prune minimum age */
 } HASHCTL;
 
 /* Flags to indicate which parameters are supplied */
@@ -94,6 +112,7 @@ typedef struct HASHCTL
 #define HASH_SHARED_MEM 0x0800    /* Hashtable is in shared memory */
 #define HASH_ATTACH        0x1000    /* Do not initialize hctl */
 #define HASH_FIXED_SIZE 0x2000    /* Initial size is a hard limit */
+#define HASH_PRUNABLE    0x4000    /* pruning setting */
 
 
 /* max_dsize value to indicate expansible directory */
-- 
2.16.3
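
One detail of this patch that is easy to miss is why ELEMENTKEY() and ELEMENT_FROM_KEY() now take the table header: the caller's entry data is stored immediately after the element header, and a prunable table uses a larger header (extra last_access and naccess fields), so the header-to-entry offset becomes a per-table property. The standalone C sketch below, using hypothetical names, shows just that offset computation.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAXALIGN(x) (((x) + 7) & ~((size_t) 7))

typedef struct Element             /* plain header */
{
    struct Element *link;
    unsigned int    hashvalue;
} Element;

typedef struct PrunableElement     /* header with pruning bookkeeping */
{
    struct Element *link;
    unsigned int    hashvalue;
    time_t          last_access;
    int             naccess;
} PrunableElement;

typedef struct Table
{
    int    prunable;
    size_t entrysize;
} Table;

static size_t
header_size(const Table *t)
{
    return MAXALIGN(t->prunable ? sizeof(PrunableElement) : sizeof(Element));
}

/* the entry (key) lives immediately after the header, like ELEMENTKEY() */
static void *
element_key(const Table *t, void *element)
{
    return (char *) element + header_size(t);
}

int
main(void)
{
    Table   t = {1, sizeof(int)};
    void   *elem = calloc(1, header_size(&t) + MAXALIGN(t.entrysize));
    int    *entry = element_key(&t, elem);

    *entry = 42;
    printf("header %zu bytes, entry value %d\n", header_size(&t), *entry);
    free(elem);
    return 0;
}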

From 870ca3f1403310493b2580314c8b1b478dbff028 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 26 Dec 2017 17:43:09 +0900
Subject: [PATCH 1/3] Remove entries that haven't been used for a certain time

Catcache entries can be left unused for long periods for several reasons,
and it is not desirable that they eat up memory. With this patch, entries
that haven't been used for a certain time are considered for removal
before the hash array is enlarged.
---
 doc/src/sgml/config.sgml                      |  38 ++++++
 src/backend/access/transam/xact.c             |   3 +
 src/backend/utils/cache/catcache.c            | 153 +++++++++++++++++++++++-
 src/backend/utils/cache/plancache.c           | 163 ++++++++++++++++++++++++++
 src/backend/utils/misc/guc.c                  |  33 ++++++
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/utils/catcache.h                  |  19 +++
 src/include/utils/plancache.h                 |   7 +-
 8 files changed, 413 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7bfbc87109..4ba4327007 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1617,6 +1617,44 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target">
+      <term><varname>syscache_memory_target</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_memory_target</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to which syscache is expanded
+        without pruning. The value defaults to 0, indicating that pruning is
+        always considered. After exceeding this size, syscache pruning is
+        considered according to
+        <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep
+        certain amount of syscache entries with intermittent usage, try
+        increase this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age">
+      <term><varname>syscache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the minimum amount of unused time in seconds after which a
+        syscache entry is considered for removal. -1 indicates that syscache
+        pruning is disabled entirely. The value defaults to 600 seconds
+        (<literal>10 minutes</literal>). Syscache entries that have not been
+        used for that duration can be removed to prevent syscache bloat. This
+        behavior is suppressed until the size of syscache exceeds
+        <xref linkend="guc-syscache-memory-target"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8e6aef332c..e4a4a5874c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -732,6 +732,9 @@ void
 SetCurrentStatementStartTimestamp(void)
 {
     stmtStartTimestamp = GetCurrentTimestamp();
+
+    /* Set this timestamp as the approximated current time */
+    SetCatCacheClock(stmtStartTimestamp);
 }
 
 /*
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 5ddbf6eab1..9f421cd242 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -71,9 +71,24 @@
 #define CACHE6_elog(a,b,c,d,e,f,g)
 #endif
 
+/*
+ * GUC variable to define the minimum hash size at which entry eviction is considered.
+ * This variable is shared among various cache mechanisms.
+ */
+int cache_memory_target = 0;
+
+/* GUC variable to define the minimum age of entries that will be considered to
+ * be evicted in seconds. This variable is shared among various cache
+ * mechanisms.
+ */
+int cache_prune_min_age = 600;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Timestamp used for any operation on caches. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -866,9 +881,130 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize the catcache reference clock if it hasn't been done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left alone for several reasons. We remove them if
+ * they are not accessed for a certain time to prevent catcache from
+ * bloating. Eviction uses an algorithm similar to buffer eviction, based on
+ * an access counter. Entries that are accessed several times can
+ * live longer than those that have had no access in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int            i;
+    int            nremoved = 0;
+    size_t        hash_size;
+#ifdef CATCACHE_STATS
+    /* These variables are only for debugging purpose */
+    int            ntotal = 0;
+    /*
+     * The nth element in nentries stores the number of cache entries that
+     * have lived unaccessed for the corresponding multiple (in ageclass) of
+     * cache_prune_min_age. The index of nremoved_entry is the value of the
+     * clock-sweep counter, which ranges from 0 to 2.
+     */
+    double        ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+    int            nentries[] = {0, 0, 0, 0, 0, 0};
+    int            nremoved_entry[3] = {0, 0, 0};
+    int            j;
+#endif
+
+    /* Return immediately if no pruning is wanted */
+    if (cache_prune_min_age < 0)
+        return false;
+
+    /*
+     * Return without pruning if the size of the hash is below the target.
+     * Since the bucket array is the dominant part, only it is considered.
+     */
+    hash_size = cp->cc_nbuckets * sizeof(dlist_head);
+    if (hash_size < (Size) cache_memory_target * 1024L)
+        return false;
+    
+    /* Search the whole hash for entries to remove */
+    for (i = 0; i < cp->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+
+            /*
+             * Calculate the duration from the time of the last access to the
+             * "current" time. Since catcacheclock is not advanced within a
+             * transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+#ifdef CATCACHE_STATS
+            /* count catcache entries for each age class */
+            ntotal++;
+            for (j = 0 ;
+                 ageclass[j] != 0.0 &&
+                     entry_age > cache_prune_min_age * ageclass[j] ;
+                 j++);
+            if (ageclass[j] == 0.0) j--;
+            nentries[j]++;
+#endif
+
+            /*
+             * Try to remove entries older than cache_prune_min_age seconds.
+             * Entries not accessed since the last pruning are removed after
+             * that interval, while entries accessed several times are removed
+             * only after being left alone for up to three times that duration.
+             * We don't try to shrink the buckets since pruning effectively
+             * caps catcache expansion in the long term.
+             */
+            if (entry_age > cache_prune_min_age)
+            {
+#ifdef CATCACHE_STATS
+                Assert (ct->naccess >= 0 && ct->naccess <= 2);
+                nremoved_entry[ct->naccess]++;
+#endif
+                if (ct->naccess > 0)
+                    ct->naccess--;
+                else
+                {
+                    if (!ct->c_list || ct->c_list->refcount == 0)
+                    {
+                        CatCacheRemoveCTup(cp, ct);
+                        nremoved++;
+                    }
+                }
+            }
+        }
+    }
+
+#ifdef CATCACHE_STATS
+    ereport(DEBUG1,
+            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d,
2:%d)",
+                     nremoved, ntotal,
+                     ageclass[0] * cache_prune_min_age, nentries[0],
+                     ageclass[1] * cache_prune_min_age, nentries[1],
+                     ageclass[2] * cache_prune_min_age, nentries[2],
+                     ageclass[3] * cache_prune_min_age, nentries[3],
+                     ageclass[4] * cache_prune_min_age, nentries[4],
+                     nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]),
+             errhidestmt(true)));
+#endif
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -1282,6 +1418,11 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /* Update access information for pruning */
+        if (ct->naccess < 2)
+            ct->naccess++;
+        ct->lastaccess = catcacheclock;
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1813,7 +1954,6 @@ ReleaseCatCacheList(CatCList *list)
         CatCacheRemoveCList(list->my_cache, list);
 }
 
-
 /*
  * CatalogCacheCreateEntry
  *        Create a new CatCTup entry, copying the given HeapTuple and other
@@ -1906,6 +2046,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
@@ -1913,10 +2055,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     CacheHdr->ch_ntup++;
 
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, try cleanup by removing
+     * infrequently used entries to make room for the new entry. If that
+     * fails, enlarge the bucket array instead.  Quite arbitrarily, we try
+     * this when fill factor > 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
 
     return ct;
diff --git a/src/backend/utils/cache/plancache.c b/src/backend/utils/cache/plancache.c
index 0ad3e3c736..701ead152c 100644
--- a/src/backend/utils/cache/plancache.c
+++ b/src/backend/utils/cache/plancache.c
@@ -63,12 +63,14 @@
 #include "storage/lmgr.h"
 #include "tcop/pquery.h"
 #include "tcop/utility.h"
+#include "utils/catcache.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
 #include "utils/resowner_private.h"
 #include "utils/rls.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/timestamp.h"
 
 
 /*
@@ -86,6 +88,12 @@
  * guarantee to save a CachedPlanSource without error.
  */
 static CachedPlanSource *first_saved_plan = NULL;
+static CachedPlanSource *last_saved_plan = NULL;
+static int                 num_saved_plans = 0;
+static TimestampTz         oldest_saved_plan = 0;
+
+/* GUC variables */
+int                         min_cached_plans = 1000;
 
 static void ReleaseGenericPlan(CachedPlanSource *plansource);
 static List *RevalidateCachedQuery(CachedPlanSource *plansource,
@@ -105,6 +113,7 @@ static TupleDesc PlanCacheComputeResultDesc(List *stmt_list);
 static void PlanCacheRelCallback(Datum arg, Oid relid);
 static void PlanCacheFuncCallback(Datum arg, int cacheid, uint32 hashvalue);
 static void PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue);
+static void PruneCachedPlan(void);
 
 
 /*
@@ -208,6 +217,8 @@ CreateCachedPlan(RawStmt *raw_parse_tree,
     plansource->generic_cost = -1;
     plansource->total_custom_cost = 0;
     plansource->num_custom_plans = 0;
+    plansource->last_access = GetCatCacheClock();
+    
 
     MemoryContextSwitchTo(oldcxt);
 
@@ -423,6 +434,28 @@ CompleteCachedPlan(CachedPlanSource *plansource,
     plansource->is_valid = true;
 }
 
+/* move the plansource to the front of the list */
+static inline void
+MovePlansourceToFirst(CachedPlanSource *plansource)
+{
+    if (first_saved_plan != plansource)
+    {
+        /* delink this element */
+        if (plansource->next_saved)
+            plansource->next_saved->prev_saved = plansource->prev_saved;
+        if (plansource->prev_saved)
+            plansource->prev_saved->next_saved = plansource->next_saved;
+        if (last_saved_plan == plansource)
+            last_saved_plan = plansource->prev_saved;
+
+        /* insert at the beginning */
+        first_saved_plan->prev_saved = plansource;
+        plansource->next_saved = first_saved_plan;
+        plansource->prev_saved = NULL;
+        first_saved_plan = plansource;
+    }
+}
+
 /*
  * SaveCachedPlan: save a cached plan permanently
  *
@@ -470,6 +503,11 @@ SaveCachedPlan(CachedPlanSource *plansource)
      * Add the entry to the global list of cached plans.
      */
     plansource->next_saved = first_saved_plan;
+    if (first_saved_plan)
+        first_saved_plan->prev_saved = plansource;
+    else
+        last_saved_plan = plansource;
+    plansource->prev_saved = NULL;
     first_saved_plan = plansource;
 
     plansource->is_saved = true;
@@ -492,7 +530,11 @@ DropCachedPlan(CachedPlanSource *plansource)
     if (plansource->is_saved)
     {
         if (first_saved_plan == plansource)
+        {
             first_saved_plan = plansource->next_saved;
+            if (first_saved_plan)
+                first_saved_plan->prev_saved = NULL;
+        }
         else
         {
             CachedPlanSource *psrc;
@@ -502,10 +544,19 @@ DropCachedPlan(CachedPlanSource *plansource)
                 if (psrc->next_saved == plansource)
                 {
                     psrc->next_saved = plansource->next_saved;
+                    if (psrc->next_saved)
+                        psrc->next_saved->prev_saved = psrc;
                     break;
                 }
             }
         }
+
+        if (last_saved_plan == plansource)
+        {
+            last_saved_plan = plansource->prev_saved;
+            if (last_saved_plan)
+                last_saved_plan->next_saved = NULL;
+        }
         plansource->is_saved = false;
     }
 
@@ -537,6 +588,13 @@ ReleaseGenericPlan(CachedPlanSource *plansource)
         Assert(plan->magic == CACHEDPLAN_MAGIC);
         plansource->gplan = NULL;
         ReleaseCachedPlan(plan, false);
+
+        /* decrement "saved plans" counter */
+        if (plansource->is_saved)
+        {
+            Assert (num_saved_plans > 0);
+            num_saved_plans--;
+        }
     }
 }
 
@@ -1148,6 +1206,17 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams,
     if (useResOwner && !plansource->is_saved)
         elog(ERROR, "cannot apply ResourceOwner to non-saved cached plan");
 
+    /*
+     * set last-accessed timestamp and move this plan to the first of the list
+     */
+    if (plansource->is_saved)
+    {
+        plansource->last_access = GetCatCacheClock();
+
+        /* move this plan to the first of the list */
+        MovePlansourceToFirst(plansource);
+    }
+
     /* Make sure the querytree list is valid and we have parse-time locks */
     qlist = RevalidateCachedQuery(plansource, queryEnv);
 
@@ -1156,6 +1225,11 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams,
 
     if (!customplan)
     {
+        /* Prune cached plans if needed */
+        if (plansource->is_saved &&
+            min_cached_plans >= 0 && num_saved_plans > min_cached_plans)
+                PruneCachedPlan();
+
         if (CheckCachedPlan(plansource))
         {
             /* We want a generic plan, and we already have a valid one */
@@ -1168,6 +1242,11 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams,
             plan = BuildCachedPlan(plansource, qlist, NULL, queryEnv);
             /* Just make real sure plansource->gplan is clear */
             ReleaseGenericPlan(plansource);
+
+            /* count this new saved plan */
+            if (plansource->is_saved)
+                num_saved_plans++;
+
             /* Link the new generic plan into the plansource */
             plansource->gplan = plan;
             plan->refcount++;
@@ -1856,6 +1935,90 @@ PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue)
     ResetPlanCache();
 }
 
+/*
+ * PruneCachedPlan: removes generic plans of "old" saved plans.
+ */
+static void
+PruneCachedPlan(void)
+{
+    CachedPlanSource *plansource;
+    TimestampTz          currclock = GetCatCacheClock();
+    long              age;
+    int                  us;
+    int                  nremoved = 0;
+
+    /* do nothing if not wanted */
+    if (cache_prune_min_age < 0 || num_saved_plans <= min_cached_plans)
+        return;
+
+    /* Fast check for oldest cache */
+    if (oldest_saved_plan > 0)
+    {
+        TimestampDifference(oldest_saved_plan, currclock, &age, &us);
+        if (age < cache_prune_min_age)
+            return;
+    }        
+
+    /* last plan is the oldest. */
+    for (plansource = last_saved_plan; plansource; plansource = plansource->prev_saved)
+    {
+        long    plan_age;
+        int        us;
+
+        Assert(plansource->magic == CACHEDPLANSOURCE_MAGIC);
+
+        /* stop once we have pruned down to min_cached_plans */
+        if (num_saved_plans <= min_cached_plans)
+            break;
+
+        /*
+         * Nothing to do if this plansource has no generic plan; it will be
+         * moved to the front below so that we don't visit it again next time
+         */
+        if (!plansource->gplan)
+            continue;
+
+        /*
+         * Check age for pruning. Can exit immediately when finding a
+         * not-older element.
+         */
+        TimestampDifference(plansource->last_access, currclock, &plan_age, &us);
+        if (plan_age <= cache_prune_min_age)
+        {
+            /* this entry is the next oldest */
+            oldest_saved_plan = plansource->last_access;
+            break;
+        }
+
+        /*
+         * Here, remove the generic plan of this plansource if it is not
+         * actually in use, and move it to the beginning of the list. If the
+         * plan is in use, just update last_access and move it to the beginning.
+         */
+        if (plansource->gplan->refcount <= 1)
+        {
+            ReleaseGenericPlan(plansource);
+            nremoved++;
+        }
+
+        plansource->last_access = currclock;
+    }
+
+    /* move the "removed" plansrouces altogehter to the beginning of the list */
+    if (plansource != last_saved_plan && plansource)
+    {
+        plansource->next_saved->prev_saved = NULL;
+        first_saved_plan->prev_saved = last_saved_plan;
+         last_saved_plan->next_saved = first_saved_plan;
+        first_saved_plan = plansource->next_saved;
+        plansource->next_saved = NULL;
+        last_saved_plan = plansource;
+    }
+
+    if (nremoved > 0)
+        elog(DEBUG1, "plancache removed %d/%d", nremoved, num_saved_plans);
+}
+
 /*
  * ResetPlanCache: invalidate all cached plans.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 859ef931e7..774a87ed2c 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -79,6 +79,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/memutils.h"
 #include "utils/pg_locale.h"
@@ -2105,6 +2106,38 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"cache_memory_target", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum syscache size to keep."),
+            gettext_noop("Cache is not pruned before exceeding this size."),
+            GUC_UNIT_KB
+        },
+        &cache_memory_target,
+        0, 0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum unused duration of cache entries before removal."),
+            gettext_noop("Cache entries that live unused for longer than this seconds are considered to be
removed."),
+            GUC_UNIT_S
+        },
+        &cache_prune_min_age,
+        600, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"min_cached_plans", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum number of cached plans kept on memory."),
+            gettext_noop("Timeout invalidation of plancache is not activated until the number of plancaches reaches
thisvalue. -1 means timeout invalidation is always active.")
 
+        },
+        &min_cached_plans,
+        1000, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9e39baf466..3f2760ef9d 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -126,6 +126,8 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#cache_memory_target = 0kB    # in kB
+#cache_prune_min_age = 600s    # -1 disables pruning
 #max_stack_depth = 2MB            # min 100kB
 #dynamic_shared_memory_type = posix    # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 7b22f9c7bc..599303be56 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -119,6 +120,8 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
+    int            naccess;        /* # of access to this entry, up to 2  */
+    TimestampTz    lastaccess;        /* approx. timestamp of the last usage */
 
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
@@ -189,6 +192,22 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLPMPORT'ed */
+extern int cache_prune_min_age;
+extern int cache_memory_target;
+
+/* to use as access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/*
+ * SetCatCacheClock - set timestamp for catcache access record
+ */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
diff --git a/src/include/utils/plancache.h b/src/include/utils/plancache.h
index ab20aa04b0..f3c5b2010d 100644
--- a/src/include/utils/plancache.h
+++ b/src/include/utils/plancache.h
@@ -110,11 +110,13 @@ typedef struct CachedPlanSource
     bool        is_valid;        /* is the query_list currently valid? */
     int            generation;        /* increments each time we create a plan */
     /* If CachedPlanSource has been saved, it is a member of a global list */
-    struct CachedPlanSource *next_saved;    /* list link, if so */
+    struct CachedPlanSource *prev_saved;    /* list prev link, if so */
+    struct CachedPlanSource *next_saved;    /* list next link, if so */
     /* State kept to help decide whether to use custom or generic plans: */
     double        generic_cost;    /* cost of generic plan, or -1 if not known */
     double        total_custom_cost;    /* total cost of custom plans so far */
     int            num_custom_plans;    /* number of plans included in total */
+    TimestampTz    last_access;    /* timestamp of the last usage */
 } CachedPlanSource;
 
 /*
@@ -143,6 +145,9 @@ typedef struct CachedPlan
     MemoryContext context;        /* context containing this CachedPlan */
 } CachedPlan;
 
+/* GUC variables */
+extern int min_cached_plans;
+extern int plancache_prune_min_age;
 
 extern void InitPlanCache(void);
 extern void ResetPlanCache(void);
-- 
2.16.3
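
The eviction rule in this patch is a small clock sweep: each catcache entry keeps an access counter clamped at 2 that is bumped on every hit together with a refreshed timestamp, and a pruning pass over an entry idle longer than cache_prune_min_age decrements the counter before anything is removed, so a frequently hit entry survives up to roughly three times the threshold. Below is a standalone C sketch of just that rule, with hypothetical names; it is not the patch code itself.

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

typedef struct CacheEntry
{
    int    naccess;             /* 0..2, bumped on every hit */
    time_t lastaccess;          /* time of the last hit */
    bool   valid;
} CacheEntry;

/* what a cache hit does to the entry's bookkeeping */
static void
hit(CacheEntry *e, time_t now)
{
    if (e->naccess < 2)
        e->naccess++;
    e->lastaccess = now;
}

/* what one pruning pass does to one entry */
static bool
maybe_evict(CacheEntry *e, time_t now, long prune_min_age)
{
    if (now - e->lastaccess <= prune_min_age)
        return false;           /* recently used: keep */
    if (e->naccess > 0)
    {
        e->naccess--;           /* give it another chance */
        return false;
    }
    e->valid = false;           /* idle and out of chances: evict */
    return true;
}

int
main(void)
{
    CacheEntry e = {0, 0, true};
    time_t     now = 0;

    hit(&e, now);               /* naccess -> 1 */
    hit(&e, now);               /* naccess -> 2 */

    /* three pruning passes, each past the 600-second threshold */
    for (int pass = 1; pass <= 3; pass++)
    {
        now += 601;
        printf("pass %d: evicted=%d naccess=%d\n",
               pass, maybe_evict(&e, now, 600), e.naccess);
    }
    return 0;
}

With a 600-second threshold, the entry hit twice survives the first two passes and is only evicted on the third, which is the behavior described in the patch's comment about entries living up to three times the configured duration.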


Re: Protect syscache from bloating with negative cache entries

От
Andrew Dunstan
Дата:

On 06/26/2018 05:00 AM, Kyotaro HORIGUCHI wrote:
>
>> The attached is the patch set including this plancache stuff.
>>
>> 0001- catcache time-based expiration (The origin of this thread)
>> 0002- introduces dynahash pruning feature
>> 0003- implement relcache pruning using 0002
>> 0004- (perhaps) independent from the three above. PoC of
>>        plancache pruning. Details are shown above.
> I found up to v3 in this thread so I named this version 4.
>


Andres suggested back in March (and again privately to me) that given 
how much this has changed from the original this CF item should be 
marked Returned With Feedback and the current patchset submitted as a 
new item.

Does anyone object to that course of action?

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
Hello. The previous v4 patchset was just broken.

At Tue, 26 Jun 2018 18:00:03 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20180626.180003.127457941.horiguchi.kyotaro@lab.ntt.co.jp>
> Hello. I rebased this patchset.
..
> > The attached is the patch set including this plancache stuff.
> > 
> > 0001- catcache time-based expiration (The origin of this thread)
> > 0002- introduces dynahash pruning feature
> > 0003- implement relcache pruning using 0002
> > 0004- (perhaps) independent from the three above. PoC of
> >       plancache pruning. Details are shown above.
> 
> I found up to v3 in this thread so I named this version 4.

Somehow the 0004 part was merged into 0003, and applying 0004
resulted in failure. I removed the 0004 part from 0003, rebased,
and am reposting it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From e267985853b100a8ecfd10cc02f464f8c802d19e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 26 Dec 2017 17:43:09 +0900
Subject: [PATCH 1/4] Remove entries that haven't been used for a certain time

Catcache entries can be left unused for long periods for several reasons,
and it is not desirable that they eat up memory. With this patch, entries
that haven't been used for a certain time are considered for removal
before the hash array is enlarged.
---
 doc/src/sgml/config.sgml                      |  38 ++++++
 src/backend/access/transam/xact.c             |   3 +
 src/backend/utils/cache/catcache.c            | 153 +++++++++++++++++++++++-
 src/backend/utils/cache/plancache.c           | 163 ++++++++++++++++++++++++++
 src/backend/utils/misc/guc.c                  |  33 ++++++
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/utils/catcache.h                  |  19 +++
 src/include/utils/plancache.h                 |   7 +-
 8 files changed, 413 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5b913f00c1..76745047af 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1617,6 +1617,44 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target">
+      <term><varname>syscache_memory_target</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_memory_target</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to which syscache is expanded
+        without pruning. The value defaults to 0, indicating that pruning is
+        always considered. After exceeding this size, syscache pruning is
+        considered according to
+        <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep
+        certain amount of syscache entries with intermittent usage, try
+        increase this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age">
+      <term><varname>syscache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the minimum amount of unused time in seconds after which a
+        syscache entry is considered for removal. -1 indicates that syscache
+        pruning is disabled entirely. The value defaults to 600 seconds
+        (<literal>10 minutes</literal>). Syscache entries that have not been
+        used for that duration can be removed to prevent syscache bloat. This
+        behavior is suppressed until the size of syscache exceeds
+        <xref linkend="guc-syscache-memory-target"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8e6aef332c..e4a4a5874c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -732,6 +732,9 @@ void
 SetCurrentStatementStartTimestamp(void)
 {
     stmtStartTimestamp = GetCurrentTimestamp();
+
+    /* Set this timestamp as the approximated current time */
+    SetCatCacheClock(stmtStartTimestamp);
 }
 
 /*
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 5ddbf6eab1..9f421cd242 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -71,9 +71,24 @@
 #define CACHE6_elog(a,b,c,d,e,f,g)
 #endif
 
+/*
+ * GUC variable to define the minimum hash size at which entry eviction is considered.
+ * This variable is shared among various cache mechanisms.
+ */
+int cache_memory_target = 0;
+
+/* GUC variable to define the minimum age of entries that will be considered to
+ * be evicted in seconds. This variable is shared among various cache
+ * mechanisms.
+ */
+int cache_prune_min_age = 600;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Timestamp used for any operation on caches. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -866,9 +881,130 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize the catcache reference clock if it hasn't been done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left alone for several reasons. We remove them if
+ * they are not accessed for a certain time to prevent catcache from
+ * bloating. Eviction uses an algorithm similar to buffer eviction, based on
+ * an access counter. Entries that are accessed several times can
+ * live longer than those that have had no access in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int            i;
+    int            nremoved = 0;
+    size_t        hash_size;
+#ifdef CATCACHE_STATS
+    /* These variables are only for debugging purpose */
+    int            ntotal = 0;
+    /*
+     * The nth element in nentries stores the number of cache entries that
+     * have lived unaccessed for the corresponding multiple (in ageclass) of
+     * cache_prune_min_age. The index of nremoved_entry is the value of the
+     * clock-sweep counter, which ranges from 0 to 2.
+     */
+    double        ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+    int            nentries[] = {0, 0, 0, 0, 0, 0};
+    int            nremoved_entry[3] = {0, 0, 0};
+    int            j;
+#endif
+
+    /* Return immediately if no pruning is wanted */
+    if (cache_prune_min_age < 0)
+        return false;
+
+    /*
+     * Return without pruning if the size of the hash is below the target.
+     * Since the bucket array is the dominant part, only it is considered.
+     */
+    hash_size = cp->cc_nbuckets * sizeof(dlist_head);
+    if (hash_size < (Size) cache_memory_target * 1024L)
+        return false;
+    
+    /* Search the whole hash for entries to remove */
+    for (i = 0; i < cp->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+
+            /*
+             * Calculate the duration from the time of the last access to the
+             * "current" time. Since catcacheclock is not advanced within a
+             * transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+#ifdef CATCACHE_STATS
+            /* count catcache entries for each age class */
+            ntotal++;
+            for (j = 0 ;
+                 ageclass[j] != 0.0 &&
+                     entry_age > cache_prune_min_age * ageclass[j] ;
+                 j++);
+            if (ageclass[j] == 0.0) j--;
+            nentries[j]++;
+#endif
+
+            /*
+             * Try to remove entries older than cache_prune_min_age seconds.
+             * Entries not accessed since the last pruning are removed after
+             * that interval, while entries accessed several times are removed
+             * only after being left alone for up to three times that duration.
+             * We don't try to shrink the buckets since pruning effectively
+             * caps catcache expansion in the long term.
+             */
+            if (entry_age > cache_prune_min_age)
+            {
+#ifdef CATCACHE_STATS
+                Assert (ct->naccess >= 0 && ct->naccess <= 2);
+                nremoved_entry[ct->naccess]++;
+#endif
+                if (ct->naccess > 0)
+                    ct->naccess--;
+                else
+                {
+                    if (!ct->c_list || ct->c_list->refcount == 0)
+                    {
+                        CatCacheRemoveCTup(cp, ct);
+                        nremoved++;
+                    }
+                }
+            }
+        }
+    }
+
+#ifdef CATCACHE_STATS
+    ereport(DEBUG1,
+            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d,
2:%d)",
+                     nremoved, ntotal,
+                     ageclass[0] * cache_prune_min_age, nentries[0],
+                     ageclass[1] * cache_prune_min_age, nentries[1],
+                     ageclass[2] * cache_prune_min_age, nentries[2],
+                     ageclass[3] * cache_prune_min_age, nentries[3],
+                     ageclass[4] * cache_prune_min_age, nentries[4],
+                     nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]),
+             errhidestmt(true)));
+#endif
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -1282,6 +1418,11 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /* Update access information for pruning */
+        if (ct->naccess < 2)
+            ct->naccess++;
+        ct->lastaccess = catcacheclock;
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1813,7 +1954,6 @@ ReleaseCatCacheList(CatCList *list)
         CatCacheRemoveCList(list->my_cache, list);
 }
 
-
 /*
  * CatalogCacheCreateEntry
  *        Create a new CatCTup entry, copying the given HeapTuple and other
@@ -1906,6 +2046,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
@@ -1913,10 +2055,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     CacheHdr->ch_ntup++;
 
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, first try to make room for the
+     * new entry by removing infrequently used entries.  If that fails, enlarge
+     * the bucket array instead.  Quite arbitrarily, we try this when the fill
+     * factor exceeds 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
 
     return ct;
diff --git a/src/backend/utils/cache/plancache.c b/src/backend/utils/cache/plancache.c
index 0ad3e3c736..701ead152c 100644
--- a/src/backend/utils/cache/plancache.c
+++ b/src/backend/utils/cache/plancache.c
@@ -63,12 +63,14 @@
 #include "storage/lmgr.h"
 #include "tcop/pquery.h"
 #include "tcop/utility.h"
+#include "utils/catcache.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
 #include "utils/resowner_private.h"
 #include "utils/rls.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/timestamp.h"
 
 
 /*
@@ -86,6 +88,12 @@
  * guarantee to save a CachedPlanSource without error.
  */
 static CachedPlanSource *first_saved_plan = NULL;
+static CachedPlanSource *last_saved_plan = NULL;
+static int                 num_saved_plans = 0;
+static TimestampTz         oldest_saved_plan = 0;
+
+/* GUC variables */
+int                         min_cached_plans = 1000;
 
 static void ReleaseGenericPlan(CachedPlanSource *plansource);
 static List *RevalidateCachedQuery(CachedPlanSource *plansource,
@@ -105,6 +113,7 @@ static TupleDesc PlanCacheComputeResultDesc(List *stmt_list);
 static void PlanCacheRelCallback(Datum arg, Oid relid);
 static void PlanCacheFuncCallback(Datum arg, int cacheid, uint32 hashvalue);
 static void PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue);
+static void PruneCachedPlan(void);
 
 
 /*
@@ -208,6 +217,8 @@ CreateCachedPlan(RawStmt *raw_parse_tree,
     plansource->generic_cost = -1;
     plansource->total_custom_cost = 0;
     plansource->num_custom_plans = 0;
+    plansource->last_access = GetCatCacheClock();
+    
 
     MemoryContextSwitchTo(oldcxt);
 
@@ -423,6 +434,28 @@ CompleteCachedPlan(CachedPlanSource *plansource,
     plansource->is_valid = true;
 }
 
+/* moves the plansource to the first in the list */
+static inline void
+MovePlansourceToFirst(CachedPlanSource *plansource)
+{
+    if (first_saved_plan != plansource)
+    {
+        /* delink this element */
+        if (plansource->next_saved)
+            plansource->next_saved->prev_saved = plansource->prev_saved;
+        if (plansource->prev_saved)
+            plansource->prev_saved->next_saved = plansource->next_saved;
+        if (last_saved_plan == plansource)
+            last_saved_plan = plansource->prev_saved;
+
+        /* insert at the beginning */
+        first_saved_plan->prev_saved = plansource;
+        plansource->next_saved = first_saved_plan;
+        plansource->prev_saved = NULL;
+        first_saved_plan = plansource;
+    }
+}
+
 /*
  * SaveCachedPlan: save a cached plan permanently
  *
@@ -470,6 +503,11 @@ SaveCachedPlan(CachedPlanSource *plansource)
      * Add the entry to the global list of cached plans.
      */
     plansource->next_saved = first_saved_plan;
+    if (first_saved_plan)
+        first_saved_plan->prev_saved = plansource;
+    else
+        last_saved_plan = plansource;
+    plansource->prev_saved = NULL;
     first_saved_plan = plansource;
 
     plansource->is_saved = true;
@@ -492,7 +530,11 @@ DropCachedPlan(CachedPlanSource *plansource)
     if (plansource->is_saved)
     {
         if (first_saved_plan == plansource)
+        {
             first_saved_plan = plansource->next_saved;
+            if (first_saved_plan)
+                first_saved_plan->prev_saved = NULL;
+        }
         else
         {
             CachedPlanSource *psrc;
@@ -502,10 +544,19 @@ DropCachedPlan(CachedPlanSource *plansource)
                 if (psrc->next_saved == plansource)
                 {
                     psrc->next_saved = plansource->next_saved;
+                    if (psrc->next_saved)
+                        psrc->next_saved->prev_saved = psrc;
                     break;
                 }
             }
         }
+
+        if (last_saved_plan == plansource)
+        {
+            last_saved_plan = plansource->prev_saved;
+            if (last_saved_plan)
+                last_saved_plan->next_saved = NULL;
+        }
         plansource->is_saved = false;
     }
 
@@ -537,6 +588,13 @@ ReleaseGenericPlan(CachedPlanSource *plansource)
         Assert(plan->magic == CACHEDPLAN_MAGIC);
         plansource->gplan = NULL;
         ReleaseCachedPlan(plan, false);
+
+        /* decrement "saved plans" counter */
+        if (plansource->is_saved)
+        {
+            Assert (num_saved_plans > 0);
+            num_saved_plans--;
+        }
     }
 }
 
@@ -1148,6 +1206,17 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams,
     if (useResOwner && !plansource->is_saved)
         elog(ERROR, "cannot apply ResourceOwner to non-saved cached plan");
 
+    /*
+     * set last-accessed timestamp and move this plan to the first of the list
+     */
+    if (plansource->is_saved)
+    {
+        plansource->last_access = GetCatCacheClock();
+
+        /* move this plan to the first of the list */
+        MovePlansourceToFirst(plansource);
+    }
+
     /* Make sure the querytree list is valid and we have parse-time locks */
     qlist = RevalidateCachedQuery(plansource, queryEnv);
 
@@ -1156,6 +1225,11 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams,
 
     if (!customplan)
     {
+        /* Prune cached plans if needed */
+        if (plansource->is_saved &&
+            min_cached_plans >= 0 && num_saved_plans > min_cached_plans)
+                PruneCachedPlan();
+
         if (CheckCachedPlan(plansource))
         {
             /* We want a generic plan, and we already have a valid one */
@@ -1168,6 +1242,11 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams,
             plan = BuildCachedPlan(plansource, qlist, NULL, queryEnv);
             /* Just make real sure plansource->gplan is clear */
             ReleaseGenericPlan(plansource);
+
+            /* count this new saved plan */
+            if (plansource->is_saved)
+                num_saved_plans++;
+
             /* Link the new generic plan into the plansource */
             plansource->gplan = plan;
             plan->refcount++;
@@ -1856,6 +1935,90 @@ PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue)
     ResetPlanCache();
 }
 
+/*
+ * PruneCachedPlan: remove the generic plans of "old" saved plans.
+ */
+static void
+PruneCachedPlan(void)
+{
+    CachedPlanSource *plansource;
+    TimestampTz          currclock = GetCatCacheClock();
+    long              age;
+    int                  us;
+    int                  nremoved = 0;
+
+    /* do nothing if not wanted */
+    if (cache_prune_min_age < 0 || num_saved_plans <= min_cached_plans)
+        return;
+
+    /* Fast check for oldest cache */
+    if (oldest_saved_plan > 0)
+    {
+        TimestampDifference(oldest_saved_plan, currclock, &age, &us);
+        if (age < cache_prune_min_age)
+            return;
+    }        
+
+    /* last plan is the oldest. */
+    for (plansource = last_saved_plan; plansource; plansource = plansource->prev_saved)
+    {
+        long    plan_age;
+        int        us;
+
+        Assert(plansource->magic == CACHEDPLANSOURCE_MAGIC);
+
+        /* we want to prune no more plans */
+        if (num_saved_plans <= min_cached_plans)
+            break;
+
+        /*
+         * Nothing to do if it has no generic plan; it will be moved to the
+         * front of the list below so that we don't visit it again next time.
+         */
+        if (!plansource->gplan)
+            continue;
+
+        /*
+         * Check age for pruning. Can exit immediately when finding a
+         * not-older element.
+         */
+        TimestampDifference(plansource->last_access, currclock, &plan_age, &us);
+        if (plan_age <= cache_prune_min_age)
+        {
+            /* this entry is the next oldest */
+            oldest_saved_plan = plansource->last_access;
+            break;
+        }
+
+        /*
+         * Here, remove generic plans of this plansource if it is not actually
+         * used and move it to the beginning of the list. Just update
+         * last_access and move it to the beginning if the plan is used.
+         */
+        if (plansource->gplan->refcount <= 1)
+        {
+            ReleaseGenericPlan(plansource);
+            nremoved++;
+        }
+
+        plansource->last_access = currclock;
+    }
+
+    /* move the "removed" plansrouces altogehter to the beginning of the list */
+    if (plansource != last_saved_plan && plansource)
+    {
+        plansource->next_saved->prev_saved = NULL;
+        first_saved_plan->prev_saved = last_saved_plan;
+         last_saved_plan->next_saved = first_saved_plan;
+        first_saved_plan = plansource->next_saved;
+        plansource->next_saved = NULL;
+        last_saved_plan = plansource;
+    }
+
+    if (nremoved > 0)
+        elog(DEBUG1, "plancache removed %d/%d", nremoved, num_saved_plans);
+}
+
 /*
  * ResetPlanCache: invalidate all cached plans.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b05fb209bb..e49346707d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -79,6 +79,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/memutils.h"
 #include "utils/pg_locale.h"
@@ -2105,6 +2106,38 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"cache_memory_target", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum syscache size to keep."),
+            gettext_noop("Cache is not pruned before exceeding this size."),
+            GUC_UNIT_KB
+        },
+        &cache_memory_target,
+        0, 0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum unused duration of cache entries before removal."),
+            gettext_noop("Cache entries that live unused for longer than this seconds are considered to be
removed."),
+            GUC_UNIT_S
+        },
+        &cache_prune_min_age,
+        600, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"min_cached_plans", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum number of cached plans kept on memory."),
+            gettext_noop("Timeout invalidation of plancache is not activated until the number of plancaches reaches
thisvalue. -1 means timeout invalidation is always active.")
 
+        },
+        &min_cached_plans,
+        1000, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9e39baf466..3f2760ef9d 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -126,6 +126,8 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#cache_memory_target = 0kB    # in kB
+#cache_prune_min_age = 600s    # -1 disables pruning
 #max_stack_depth = 2MB            # min 100kB
 #dynamic_shared_memory_type = posix    # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 7b22f9c7bc..599303be56 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -119,6 +120,8 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
+    int            naccess;        /* # of access to this entry, up to 2  */
+    TimestampTz    lastaccess;        /* approx. timestamp of the last usage */
 
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
@@ -189,6 +192,22 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLPMPORT'ed */
+extern int cache_prune_min_age;
+extern int cache_memory_target;
+
+/* to use as access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/*
+ * SetCatCacheClock - set timestamp for catcache access record
+ */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
diff --git a/src/include/utils/plancache.h b/src/include/utils/plancache.h
index ab20aa04b0..f3c5b2010d 100644
--- a/src/include/utils/plancache.h
+++ b/src/include/utils/plancache.h
@@ -110,11 +110,13 @@ typedef struct CachedPlanSource
     bool        is_valid;        /* is the query_list currently valid? */
     int            generation;        /* increments each time we create a plan */
     /* If CachedPlanSource has been saved, it is a member of a global list */
-    struct CachedPlanSource *next_saved;    /* list link, if so */
+    struct CachedPlanSource *prev_saved;    /* list prev link, if so */
+    struct CachedPlanSource *next_saved;    /* list next link, if so */
     /* State kept to help decide whether to use custom or generic plans: */
     double        generic_cost;    /* cost of generic plan, or -1 if not known */
     double        total_custom_cost;    /* total cost of custom plans so far */
     int            num_custom_plans;    /* number of plans included in total */
+    TimestampTz    last_access;    /* timestamp of the last usage */
 } CachedPlanSource;
 
 /*
@@ -143,6 +145,9 @@ typedef struct CachedPlan
     MemoryContext context;        /* context containing this CachedPlan */
 } CachedPlan;
 
+/* GUC variables */
+extern int min_cached_plans;
+extern int plancache_prune_min_age;
 
 extern void InitPlanCache(void);
 extern void ResetPlanCache(void);
-- 
2.16.3
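As a reading aid for the catcache part of the patch above, the per-entry decision in CatCacheCleanupOldEntries reduces to the rule sketched below. This is only a minimal sketch under the patch's assumptions (an access counter capped at 2 and an age already computed in seconds); catcache_entry_is_prunable is a hypothetical helper name, not a function in the patch.

    #include <stdbool.h>

    /*
     * Aging rule: an entry becomes a removal candidate only after its age
     * exceeds prune_min_age AND its access counter has already been
     * decremented back to zero by earlier sweeps.
     */
    static bool
    catcache_entry_is_prunable(int *naccess, long entry_age, long prune_min_age)
    {
        if (prune_min_age < 0 || entry_age <= prune_min_age)
            return false;           /* pruning disabled, or entry still young */

        if (*naccess > 0)
        {
            (*naccess)--;           /* recently used: grant another grace period */
            return false;
        }

        return true;                /* old and unused: removal candidate */
    }

In the actual patch the removal additionally requires that the entry is not referenced by a live CatCList, which the sketch leaves out.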

From 8994c6d038b72ff253ad24dda0f0da99e6916b05 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 12 Mar 2018 15:52:18 +0900
Subject: [PATCH 2/4] Introduce dynahash pruning

---
 src/backend/utils/hash/dynahash.c | 166 +++++++++++++++++++++++++++++++++-----
 src/include/utils/catcache.h      |  12 +++
 src/include/utils/hsearch.h       |  21 ++++-
 3 files changed, 179 insertions(+), 20 deletions(-)

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 785e0faffb..261f8d9577 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -88,6 +88,7 @@
 #include "access/xact.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
+#include "utils/catcache.h"
 #include "utils/dynahash.h"
 #include "utils/memutils.h"
 
@@ -184,6 +185,12 @@ struct HASHHDR
     long        ssize;            /* segment size --- must be power of 2 */
     int            sshift;            /* segment shift = log2(ssize) */
     int            nelem_alloc;    /* number of entries to allocate at once */
+    bool        prunable;        /* true if prunable */
+    HASH_PRUNE_CB    prune_cb;    /* function to call instead of just deleting */
+
+    /* These fields point to variables to control pruning */
+    int           *memory_target;    /* pointer to memory target value in kB */
+    int           *prune_min_age;    /* pointer to prune minimum age value in sec */
 
 #ifdef HASH_STATISTICS
 
@@ -227,16 +234,18 @@ struct HTAB
     int            sshift;            /* segment shift = log2(ssize) */
 };
 
+#define HASHELEMENT_SIZE(ctlp) MAXALIGN(ctlp->prunable ? sizeof(PRUNABLE_HASHELEMENT) : sizeof(HASHELEMENT))
+
 /*
  * Key (also entry) part of a HASHELEMENT
  */
-#define ELEMENTKEY(helem)  (((char *)(helem)) + MAXALIGN(sizeof(HASHELEMENT)))
+#define ELEMENTKEY(helem, ctlp)  (((char *)(helem)) + HASHELEMENT_SIZE(ctlp))
 
 /*
  * Obtain element pointer given pointer to key
  */
-#define ELEMENT_FROM_KEY(key)  \
-    ((HASHELEMENT *) (((char *) (key)) - MAXALIGN(sizeof(HASHELEMENT))))
+#define ELEMENT_FROM_KEY(key, ctlp)                                        \
+    ((HASHELEMENT *) (((char *) (key)) - HASHELEMENT_SIZE(ctlp)))
 
 /*
  * Fast MOD arithmetic, assuming that y is a power of 2 !
@@ -257,6 +266,7 @@ static HASHSEGMENT seg_alloc(HTAB *hashp);
 static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
+static bool prune_entries(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
 static void hdefault(HTAB *hashp);
 static int    choose_nelem_alloc(Size entrysize);
@@ -499,6 +509,29 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
         hctl->entrysize = info->entrysize;
     }
 
+    /*
+     * Set up pruning.
+     *
+     * We have two knobs to control pruning, and a hash table can share the
+     * corresponding syscache settings when no table-specific values are given.
+     *
+     */
+    if (flags & HASH_PRUNABLE)
+    {
+        hctl->prunable = true;
+        hctl->prune_cb = info->prune_cb;
+        if (info->memory_target)
+            hctl->memory_target = info->memory_target;
+        else
+            hctl->memory_target = &cache_memory_target;
+        if (info->prune_min_age)
+            hctl->prune_min_age = info->prune_min_age;
+        else
+            hctl->prune_min_age = &cache_prune_min_age;
+    }
+    else
+        hctl->prunable = false;
+
     /* make local copies of heavily-used constant fields */
     hashp->keysize = hctl->keysize;
     hashp->ssize = hctl->ssize;
@@ -984,7 +1017,7 @@ hash_search_with_hash_value(HTAB *hashp,
     while (currBucket != NULL)
     {
         if (currBucket->hashvalue == hashvalue &&
-            match(ELEMENTKEY(currBucket), keyPtr, keysize) == 0)
+            match(ELEMENTKEY(currBucket, hctl), keyPtr, keysize) == 0)
             break;
         prevBucketPtr = &(currBucket->link);
         currBucket = *prevBucketPtr;
@@ -997,6 +1030,17 @@ hash_search_with_hash_value(HTAB *hashp,
     if (foundPtr)
         *foundPtr = (bool) (currBucket != NULL);
 
+    /* Update access counter if needed */
+    if (hctl->prunable && currBucket &&
+        (action == HASH_FIND || action == HASH_ENTER))
+    {
+        PRUNABLE_HASHELEMENT *prunable_elm =
+            (PRUNABLE_HASHELEMENT *) currBucket;
+        if (prunable_elm->naccess < 2)
+            prunable_elm->naccess++;
+        prunable_elm->last_access = GetCatCacheClock();
+    }
+
     /*
      * OK, now what?
      */
@@ -1004,7 +1048,8 @@ hash_search_with_hash_value(HTAB *hashp,
     {
         case HASH_FIND:
             if (currBucket != NULL)
-                return (void *) ELEMENTKEY(currBucket);
+                return (void *) ELEMENTKEY(currBucket, hctl);
+
             return NULL;
 
         case HASH_REMOVE:
@@ -1033,7 +1078,7 @@ hash_search_with_hash_value(HTAB *hashp,
                  * element, because someone else is going to reuse it the next
                  * time something is added to the table
                  */
-                return (void *) ELEMENTKEY(currBucket);
+                return (void *) ELEMENTKEY(currBucket, hctl);
             }
             return NULL;
 
@@ -1045,7 +1090,7 @@ hash_search_with_hash_value(HTAB *hashp,
         case HASH_ENTER:
             /* Return existing element if found, else create one */
             if (currBucket != NULL)
-                return (void *) ELEMENTKEY(currBucket);
+                return (void *) ELEMENTKEY(currBucket, hctl);
 
             /* disallow inserts if frozen */
             if (hashp->frozen)
@@ -1075,8 +1120,18 @@ hash_search_with_hash_value(HTAB *hashp,
 
             /* copy key into record */
             currBucket->hashvalue = hashvalue;
-            hashp->keycopy(ELEMENTKEY(currBucket), keyPtr, keysize);
+            hashp->keycopy(ELEMENTKEY(currBucket, hctl), keyPtr, keysize);
 
+            /* set access counter */
+            if (hctl->prunable)
+            {
+                PRUNABLE_HASHELEMENT *prunable_elm =
+                    (PRUNABLE_HASHELEMENT *) currBucket;
+                if (prunable_elm->naccess < 2)
+                    prunable_elm->naccess++;
+                prunable_elm->last_access = GetCatCacheClock();
+            }
+            
             /*
              * Caller is expected to fill the data field on return.  DO NOT
              * insert any code that could possibly throw error here, as doing
@@ -1084,7 +1139,7 @@ hash_search_with_hash_value(HTAB *hashp,
              * caller's data structure.
              */
 
-            return (void *) ELEMENTKEY(currBucket);
+            return (void *) ELEMENTKEY(currBucket, hctl);
     }
 
     elog(ERROR, "unrecognized hash action code: %d", (int) action);
@@ -1116,7 +1171,7 @@ hash_update_hash_key(HTAB *hashp,
                      void *existingEntry,
                      const void *newKeyPtr)
 {
-    HASHELEMENT *existingElement = ELEMENT_FROM_KEY(existingEntry);
+    HASHELEMENT *existingElement = ELEMENT_FROM_KEY(existingEntry, hashp->hctl);
     HASHHDR    *hctl = hashp->hctl;
     uint32        newhashvalue;
     Size        keysize;
@@ -1200,7 +1255,7 @@ hash_update_hash_key(HTAB *hashp,
     while (currBucket != NULL)
     {
         if (currBucket->hashvalue == newhashvalue &&
-            match(ELEMENTKEY(currBucket), newKeyPtr, keysize) == 0)
+            match(ELEMENTKEY(currBucket, hctl), newKeyPtr, keysize) == 0)
             break;
         prevBucketPtr = &(currBucket->link);
         currBucket = *prevBucketPtr;
@@ -1234,7 +1289,7 @@ hash_update_hash_key(HTAB *hashp,
 
     /* copy new key into record */
     currBucket->hashvalue = newhashvalue;
-    hashp->keycopy(ELEMENTKEY(currBucket), newKeyPtr, keysize);
+    hashp->keycopy(ELEMENTKEY(currBucket, hctl), newKeyPtr, keysize);
 
     /* rest of record is untouched */
 
@@ -1388,8 +1443,8 @@ hash_seq_init(HASH_SEQ_STATUS *status, HTAB *hashp)
 void *
 hash_seq_search(HASH_SEQ_STATUS *status)
 {
-    HTAB       *hashp;
-    HASHHDR    *hctl;
+    HTAB       *hashp = status->hashp;
+    HASHHDR    *hctl = hashp->hctl;
     uint32        max_bucket;
     long        ssize;
     long        segment_num;
@@ -1404,15 +1459,13 @@ hash_seq_search(HASH_SEQ_STATUS *status)
         status->curEntry = curElem->link;
         if (status->curEntry == NULL)    /* end of this bucket */
             ++status->curBucket;
-        return (void *) ELEMENTKEY(curElem);
+        return (void *) ELEMENTKEY(curElem, hctl);
     }
 
     /*
      * Search for next nonempty bucket starting at curBucket.
      */
     curBucket = status->curBucket;
-    hashp = status->hashp;
-    hctl = hashp->hctl;
     ssize = hashp->ssize;
     max_bucket = hctl->max_bucket;
 
@@ -1458,7 +1511,7 @@ hash_seq_search(HASH_SEQ_STATUS *status)
     if (status->curEntry == NULL)    /* end of this bucket */
         ++curBucket;
     status->curBucket = curBucket;
-    return (void *) ELEMENTKEY(curElem);
+    return (void *) ELEMENTKEY(curElem, hctl);
 }
 
 void
@@ -1552,6 +1605,10 @@ expand_table(HTAB *hashp)
      */
     if ((uint32) new_bucket > hctl->high_mask)
     {
+        /* try pruning before expansion. return true on success */
+        if (hctl->prunable && prune_entries(hashp))
+            return true;
+
         hctl->low_mask = hctl->high_mask;
         hctl->high_mask = (uint32) new_bucket | hctl->low_mask;
     }
@@ -1594,6 +1651,77 @@ expand_table(HTAB *hashp)
     return true;
 }
 
+static bool
+prune_entries(HTAB *hashp)
+{
+    HASHHDR           *hctl = hashp->hctl;
+    HASH_SEQ_STATUS status;
+    void            *elm;
+    TimestampTz        currclock = GetCatCacheClock();
+    int                nall = 0,
+                    nremoved = 0;
+
+    Assert(hctl->prunable);
+
+    /* Return if pruning is currently disabled or not doable */
+    if (*hctl->prune_min_age < 0 || hashp->frozen || has_seq_scans(hashp))
+        return false;
+
+    /*
+     * We don't prune before reaching this size.  Only the bucket array size is
+     * considered, since it is the dominant part of the memory usage.
+     */
+    if (hctl->dsize * sizeof(HASHBUCKET) * hashp->ssize <
+        (Size) *hctl->memory_target * 1024L)
+        return false;
+
+    /* Ok, start pruning. we can use seq scan here. */
+    hash_seq_init(&status, hashp);
+    while ((elm = hash_seq_search(&status)) != NULL)
+    {
+        PRUNABLE_HASHELEMENT *helm =
+            (PRUNABLE_HASHELEMENT *)ELEMENT_FROM_KEY(elm, hctl);
+        long    entry_age;
+        int        us;
+
+        nall++;
+
+        TimestampDifference(helm->last_access, currclock, &entry_age, &us);
+
+        /*
+         * consider pruning if this entry has not been accessed for a certain
+         * time
+         */
+        if (entry_age > *hctl->prune_min_age)
+        {
+            /* Wait for the next chance if this is recently used */
+            if (helm->naccess > 0)
+                helm->naccess--;
+            else
+            {
+                /* just call it if callback is provided, remove otherwise */
+                if (hctl->prune_cb)
+                {
+                    if (hctl->prune_cb(hashp, (void *)elm))
+                        nremoved++;
+                }
+                else
+                {
+                    bool found;
+                    
+                    hash_search(hashp, elm, HASH_REMOVE, &found);
+                    Assert(found);
+                    nremoved++;
+                }
+            }
+        }
+    }
+
+    elog(DEBUG1, "removed %d/%d entries from hash \"%s\"",
+         nremoved, nall, hashp->tabname);
+
+    return nremoved > 0;
+}
 
 static bool
 dir_realloc(HTAB *hashp)
@@ -1667,7 +1795,7 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
         return false;
 
     /* Each element has a HASHELEMENT header plus user data. */
-    elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
+    elementSize = HASHELEMENT_SIZE(hctl) + MAXALIGN(hctl->entrysize);
 
     CurrentDynaHashCxt = hashp->hcxt;
     firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 599303be56..b3f73f53d2 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -208,6 +208,18 @@ SetCatCacheClock(TimestampTz ts)
     catcacheclock = ts;
 }
 
+/*
+ * GetCatCacheClock - get timestamp for catcache access record
+ *
+ * This clock is basically provided for catcache usage, but dynahash has a
+ * similar pruning mechanism and wants to use the same clock.
+ */
+static inline TimestampTz
+GetCatCacheClock(void)
+{
+    return catcacheclock;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 8357faac5a..6e9fa74a4f 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -13,7 +13,7 @@
  */
 #ifndef HSEARCH_H
 #define HSEARCH_H
-
+#include "datatype/timestamp.h"
 
 /*
  * Hash functions must have this signature.
@@ -47,6 +47,7 @@ typedef void *(*HashAllocFunc) (Size request);
  * HASHELEMENT is the private part of a hashtable entry.  The caller's data
  * follows the HASHELEMENT structure (on a MAXALIGN'd boundary).  The hash key
  * is expected to be at the start of the caller's hash entry data structure.
+ * If this hash is prunable, PRUNABLE_HASHELEMENT is used instead.
  */
 typedef struct HASHELEMENT
 {
@@ -54,12 +55,26 @@ typedef struct HASHELEMENT
     uint32        hashvalue;        /* hash function result for this entry */
 } HASHELEMENT;
 
+typedef struct PRUNABLE_HASHELEMENT
+{
+    struct HASHELEMENT *link;    /* link to next entry in same bucket */
+    uint32        hashvalue;        /* hash function result for this entry */
+    TimestampTz    last_access;    /* timestamp of last usage */
+    int            naccess;        /* takes 0 to 2, counted up when used */
+} PRUNABLE_HASHELEMENT;
+
 /* Hash table header struct is an opaque type known only within dynahash.c */
 typedef struct HASHHDR HASHHDR;
 
 /* Hash table control struct is an opaque type known only within dynahash.c */
 typedef struct HTAB HTAB;
 
+/*
+ * Hash pruning callback, called for an entry that is about to be pruned.
+ * It returns false if the entry should be kept.
+ */
+typedef bool (*HASH_PRUNE_CB)(HTAB *hashp, void *ent);
+
 /* Parameter data structure for hash_create */
 /* Only those fields indicated by hash_flags need be set */
 typedef struct HASHCTL
@@ -77,6 +92,9 @@ typedef struct HASHCTL
     HashAllocFunc alloc;        /* memory allocator */
     MemoryContext hcxt;            /* memory context to use for allocations */
     HASHHDR    *hctl;            /* location of header in shared mem */
+    HASH_PRUNE_CB    prune_cb;    /* pruning callback. see above. */
+    int           *memory_target;    /* pointer to memory target */
+    int           *prune_min_age;    /* pointer to prune minimum age */
 } HASHCTL;
 
 /* Flags to indicate which parameters are supplied */
@@ -94,6 +112,7 @@ typedef struct HASHCTL
 #define HASH_SHARED_MEM 0x0800    /* Hashtable is in shared memory */
 #define HASH_ATTACH        0x1000    /* Do not initialize hctl */
 #define HASH_FIXED_SIZE 0x2000    /* Initial size is a hard limit */
+#define HASH_PRUNABLE    0x4000    /* pruning setting */
 
 
 /* max_dsize value to indicate expansible directory */
-- 
2.16.3
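To show how a caller would use the HASH_PRUNABLE interface introduced above, here is a minimal usage sketch modelled on the relcache hunk in the next patch. MyEnt, my_prune_cb, my_prune_min_age and create_my_prunable_table are hypothetical names; only the HASHCTL fields, the HASH_PRUNABLE flag and hash_create() come from the patched hsearch.h. Note that when a prune_cb is supplied, the callback is expected to do the removal itself and report it by returning true, as the relcache callback does below.

    #include "postgres.h"
    #include "utils/hsearch.h"

    typedef struct MyEnt
    {
        Oid         key;            /* hash key, must come first */
        bool        pinned;         /* entry must not be pruned while true */
    } MyEnt;

    static int   my_prune_min_age = 300;    /* table-specific knob, in seconds */
    static HTAB *mytab = NULL;

    /* Pruning callback: remove the entry unless it is pinned. */
    static bool
    my_prune_cb(HTAB *hashp, void *ent)
    {
        MyEnt      *myent = (MyEnt *) ent;

        if (myent->pinned)
            return false;           /* keep it */

        hash_search(hashp, &myent->key, HASH_REMOVE, NULL);
        return true;                /* report that it was removed */
    }

    static void
    create_my_prunable_table(void)
    {
        HASHCTL     ctl;

        MemSet(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(Oid);
        ctl.entrysize = sizeof(MyEnt);
        ctl.prune_cb = my_prune_cb;
        ctl.prune_min_age = &my_prune_min_age;  /* NULL falls back to cache_prune_min_age */

        mytab = hash_create("My prunable cache", 128, &ctl,
                            HASH_ELEM | HASH_BLOBS | HASH_PRUNABLE);
    }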

From d2367a23911ff9d231dab80ec22108950bb3f9fc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 12 Mar 2018 17:31:43 +0900
Subject: [PATCH 3/4] Apply pruning to relcache

Implement relcache pruning using the prunable dynahash facility.
---
 src/backend/utils/cache/plancache.c | 163 ------------------------------------
 src/backend/utils/cache/relcache.c  |  25 +++++-
 src/backend/utils/misc/guc.c        |  10 ---
 src/include/utils/plancache.h       |   7 +-
 4 files changed, 25 insertions(+), 180 deletions(-)

diff --git a/src/backend/utils/cache/plancache.c b/src/backend/utils/cache/plancache.c
index 701ead152c..0ad3e3c736 100644
--- a/src/backend/utils/cache/plancache.c
+++ b/src/backend/utils/cache/plancache.c
@@ -63,14 +63,12 @@
 #include "storage/lmgr.h"
 #include "tcop/pquery.h"
 #include "tcop/utility.h"
-#include "utils/catcache.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
 #include "utils/resowner_private.h"
 #include "utils/rls.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
-#include "utils/timestamp.h"
 
 
 /*
@@ -88,12 +86,6 @@
  * guarantee to save a CachedPlanSource without error.
  */
 static CachedPlanSource *first_saved_plan = NULL;
-static CachedPlanSource *last_saved_plan = NULL;
-static int                 num_saved_plans = 0;
-static TimestampTz         oldest_saved_plan = 0;
-
-/* GUC variables */
-int                         min_cached_plans = 1000;
 
 static void ReleaseGenericPlan(CachedPlanSource *plansource);
 static List *RevalidateCachedQuery(CachedPlanSource *plansource,
@@ -113,7 +105,6 @@ static TupleDesc PlanCacheComputeResultDesc(List *stmt_list);
 static void PlanCacheRelCallback(Datum arg, Oid relid);
 static void PlanCacheFuncCallback(Datum arg, int cacheid, uint32 hashvalue);
 static void PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue);
-static void PruneCachedPlan(void);
 
 
 /*
@@ -217,8 +208,6 @@ CreateCachedPlan(RawStmt *raw_parse_tree,
     plansource->generic_cost = -1;
     plansource->total_custom_cost = 0;
     plansource->num_custom_plans = 0;
-    plansource->last_access = GetCatCacheClock();
-    
 
     MemoryContextSwitchTo(oldcxt);
 
@@ -434,28 +423,6 @@ CompleteCachedPlan(CachedPlanSource *plansource,
     plansource->is_valid = true;
 }
 
-/* moves the plansource to the first in the list */
-static inline void
-MovePlansourceToFirst(CachedPlanSource *plansource)
-{
-    if (first_saved_plan != plansource)
-    {
-        /* delink this element */
-        if (plansource->next_saved)
-            plansource->next_saved->prev_saved = plansource->prev_saved;
-        if (plansource->prev_saved)
-            plansource->prev_saved->next_saved = plansource->next_saved;
-        if (last_saved_plan == plansource)
-            last_saved_plan = plansource->prev_saved;
-
-        /* insert at the beginning */
-        first_saved_plan->prev_saved = plansource;
-        plansource->next_saved = first_saved_plan;
-        plansource->prev_saved = NULL;
-        first_saved_plan = plansource;
-    }
-}
-
 /*
  * SaveCachedPlan: save a cached plan permanently
  *
@@ -503,11 +470,6 @@ SaveCachedPlan(CachedPlanSource *plansource)
      * Add the entry to the global list of cached plans.
      */
     plansource->next_saved = first_saved_plan;
-    if (first_saved_plan)
-        first_saved_plan->prev_saved = plansource;
-    else
-        last_saved_plan = plansource;
-    plansource->prev_saved = NULL;
     first_saved_plan = plansource;
 
     plansource->is_saved = true;
@@ -530,11 +492,7 @@ DropCachedPlan(CachedPlanSource *plansource)
     if (plansource->is_saved)
     {
         if (first_saved_plan == plansource)
-        {
             first_saved_plan = plansource->next_saved;
-            if (first_saved_plan)
-                first_saved_plan->prev_saved = NULL;
-        }
         else
         {
             CachedPlanSource *psrc;
@@ -544,19 +502,10 @@ DropCachedPlan(CachedPlanSource *plansource)
                 if (psrc->next_saved == plansource)
                 {
                     psrc->next_saved = plansource->next_saved;
-                    if (psrc->next_saved)
-                        psrc->next_saved->prev_saved = psrc;
                     break;
                 }
             }
         }
-
-        if (last_saved_plan == plansource)
-        {
-            last_saved_plan = plansource->prev_saved;
-            if (last_saved_plan)
-                last_saved_plan->next_saved = NULL;
-        }
         plansource->is_saved = false;
     }
 
@@ -588,13 +537,6 @@ ReleaseGenericPlan(CachedPlanSource *plansource)
         Assert(plan->magic == CACHEDPLAN_MAGIC);
         plansource->gplan = NULL;
         ReleaseCachedPlan(plan, false);
-
-        /* decrement "saved plans" counter */
-        if (plansource->is_saved)
-        {
-            Assert (num_saved_plans > 0);
-            num_saved_plans--;
-        }
     }
 }
 
@@ -1206,17 +1148,6 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams,
     if (useResOwner && !plansource->is_saved)
         elog(ERROR, "cannot apply ResourceOwner to non-saved cached plan");
 
-    /*
-     * set last-accessed timestamp and move this plan to the first of the list
-     */
-    if (plansource->is_saved)
-    {
-        plansource->last_access = GetCatCacheClock();
-
-        /* move this plan to the first of the list */
-        MovePlansourceToFirst(plansource);
-    }
-
     /* Make sure the querytree list is valid and we have parse-time locks */
     qlist = RevalidateCachedQuery(plansource, queryEnv);
 
@@ -1225,11 +1156,6 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams,
 
     if (!customplan)
     {
-        /* Prune cached plans if needed */
-        if (plansource->is_saved &&
-            min_cached_plans >= 0 && num_saved_plans > min_cached_plans)
-                PruneCachedPlan();
-
         if (CheckCachedPlan(plansource))
         {
             /* We want a generic plan, and we already have a valid one */
@@ -1242,11 +1168,6 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams,
             plan = BuildCachedPlan(plansource, qlist, NULL, queryEnv);
             /* Just make real sure plansource->gplan is clear */
             ReleaseGenericPlan(plansource);
-
-            /* count this new saved plan */
-            if (plansource->is_saved)
-                num_saved_plans++;
-
             /* Link the new generic plan into the plansource */
             plansource->gplan = plan;
             plan->refcount++;
@@ -1935,90 +1856,6 @@ PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue)
     ResetPlanCache();
 }
 
-/*
- * PruneCachedPlan: remove the generic plans of "old" saved plans.
- */
-static void
-PruneCachedPlan(void)
-{
-    CachedPlanSource *plansource;
-    TimestampTz          currclock = GetCatCacheClock();
-    long              age;
-    int                  us;
-    int                  nremoved = 0;
-
-    /* do nothing if not wanted */
-    if (cache_prune_min_age < 0 || num_saved_plans <= min_cached_plans)
-        return;
-
-    /* Fast check for oldest cache */
-    if (oldest_saved_plan > 0)
-    {
-        TimestampDifference(oldest_saved_plan, currclock, &age, &us);
-        if (age < cache_prune_min_age)
-            return;
-    }        
-
-    /* last plan is the oldest. */
-    for (plansource = last_saved_plan; plansource; plansource = plansource->prev_saved)
-    {
-        long    plan_age;
-        int        us;
-
-        Assert(plansource->magic == CACHEDPLANSOURCE_MAGIC);
-
-        /* we want to prune no more plans */
-        if (num_saved_plans <= min_cached_plans)
-            break;
-
-        /*
-         * Nothing to do if it has no generic plan; it will be moved to the
-         * front of the list below so that we don't visit it again next time.
-         */
-        if (!plansource->gplan)
-            continue;
-
-        /*
-         * Check age for pruning. Can exit immediately when finding a
-         * not-older element.
-         */
-        TimestampDifference(plansource->last_access, currclock, &plan_age, &us);
-        if (plan_age <= cache_prune_min_age)
-        {
-            /* this entry is the next oldest */
-            oldest_saved_plan = plansource->last_access;
-            break;
-        }
-
-        /*
-         * Here, remove generic plans of this plansource if it is not actually
-         * used and move it to the beginning of the list. Just update
-         * last_access and move it to the beginning if the plan is used.
-         */
-        if (plansource->gplan->refcount <= 1)
-        {
-            ReleaseGenericPlan(plansource);
-            nremoved++;
-        }
-
-        plansource->last_access = currclock;
-    }
-
-    /* move the "removed" plansrouces altogehter to the beginning of the list */
-    if (plansource != last_saved_plan && plansource)
-    {
-        plansource->next_saved->prev_saved = NULL;
-        first_saved_plan->prev_saved = last_saved_plan;
-         last_saved_plan->next_saved = first_saved_plan;
-        first_saved_plan = plansource->next_saved;
-        plansource->next_saved = NULL;
-        last_saved_plan = plansource;
-    }
-
-    if (nremoved > 0)
-        elog(DEBUG1, "plancache removed %d/%d", nremoved, num_saved_plans);
-}
-
 /*
  * ResetPlanCache: invalidate all cached plans.
  */
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 6125421d39..19502978cc 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3442,6 +3442,26 @@ RelationSetNewRelfilenode(Relation relation, char persistence,
 
 #define INITRELCACHESIZE        400
 
+/* callback function for hash pruning */
+static bool
+relcache_prune_cb(HTAB *hashp, void *ent)
+{
+    RelIdCacheEnt  *relent = (RelIdCacheEnt *) ent;
+    Relation        relation;
+
+    /* this relation is requested to be removed.  */
+    RelationIdCacheLookup(relent->reloid, relation);
+
+    /* don't remove if currently in use */
+    if (!RelationHasReferenceCountZero(relation))
+        return false;
+
+    /* otherwise we can forget it unconditionally */
+    RelationClearRelation(relation, false);
+
+    return true;
+}
+
 void
 RelationCacheInitialize(void)
 {
@@ -3459,8 +3479,11 @@ RelationCacheInitialize(void)
     MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(RelIdCacheEnt);
+
+    /* use the same pruning settings as syscache */
+    ctl.prune_cb = relcache_prune_cb;
     RelationIdCache = hash_create("Relcache by OID", INITRELCACHESIZE,
-                                  &ctl, HASH_ELEM | HASH_BLOBS);
+                                  &ctl, HASH_ELEM | HASH_BLOBS | HASH_PRUNABLE);
 
     /*
      * relation mapper needs to be initialized too
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e49346707d..d89654cf8a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2128,16 +2128,6 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"min_cached_plans", PGC_USERSET, RESOURCES_MEM,
-            gettext_noop("Sets the minimum number of cached plans kept on memory."),
-            gettext_noop("Timeout invalidation of plancache is not activated until the number of plancaches reaches
thisvalue. -1 means timeout invalidation is always active.")
 
-        },
-        &min_cached_plans,
-        1000, -1, INT_MAX,
-        NULL, NULL, NULL
-    },
-
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/include/utils/plancache.h b/src/include/utils/plancache.h
index f3c5b2010d..ab20aa04b0 100644
--- a/src/include/utils/plancache.h
+++ b/src/include/utils/plancache.h
@@ -110,13 +110,11 @@ typedef struct CachedPlanSource
     bool        is_valid;        /* is the query_list currently valid? */
     int            generation;        /* increments each time we create a plan */
     /* If CachedPlanSource has been saved, it is a member of a global list */
-    struct CachedPlanSource *prev_saved;    /* list prev link, if so */
-    struct CachedPlanSource *next_saved;    /* list next link, if so */
+    struct CachedPlanSource *next_saved;    /* list link, if so */
     /* State kept to help decide whether to use custom or generic plans: */
     double        generic_cost;    /* cost of generic plan, or -1 if not known */
     double        total_custom_cost;    /* total cost of custom plans so far */
     int            num_custom_plans;    /* number of plans included in total */
-    TimestampTz    last_access;    /* timestamp of the last usage */
 } CachedPlanSource;
 
 /*
@@ -145,9 +143,6 @@ typedef struct CachedPlan
     MemoryContext context;        /* context containing this CachedPlan */
 } CachedPlan;
 
-/* GUC variables */
-extern int min_cached_plans;
-extern int plancache_prune_min_age;
 
 extern void InitPlanCache(void);
 extern void ResetPlanCache(void);
-- 
2.16.3

From 775a952b51ca4fc597683b80a20380edd1af1328 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 3 Jul 2018 09:05:32 +0900
Subject: [PATCH 4/4] Generic plan removal of PlanCacheSource.

We cannot remove saved cached plans themselves while pruning, since they
are pointed to from other places. We can, however, release the generic
plan of each saved plan. The behavior is controlled by two additional GUC
variables: min_cached_plans tells how many saved plans to keep unpruned,
and cache_prune_min_age tells how long a generic plan may stay unused
before it is pruned.
---
 src/backend/utils/cache/plancache.c | 163 ++++++++++++++++++++++++++++++++++++
 src/backend/utils/misc/guc.c        |  10 +++
 src/include/utils/plancache.h       |   7 +-
 3 files changed, 179 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/cache/plancache.c b/src/backend/utils/cache/plancache.c
index 0ad3e3c736..701ead152c 100644
--- a/src/backend/utils/cache/plancache.c
+++ b/src/backend/utils/cache/plancache.c
@@ -63,12 +63,14 @@
 #include "storage/lmgr.h"
 #include "tcop/pquery.h"
 #include "tcop/utility.h"
+#include "utils/catcache.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
 #include "utils/resowner_private.h"
 #include "utils/rls.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/timestamp.h"
 
 
 /*
@@ -86,6 +88,12 @@
  * guarantee to save a CachedPlanSource without error.
  */
 static CachedPlanSource *first_saved_plan = NULL;
+static CachedPlanSource *last_saved_plan = NULL;
+static int                 num_saved_plans = 0;
+static TimestampTz         oldest_saved_plan = 0;
+
+/* GUC variables */
+int                         min_cached_plans = 1000;
 
 static void ReleaseGenericPlan(CachedPlanSource *plansource);
 static List *RevalidateCachedQuery(CachedPlanSource *plansource,
@@ -105,6 +113,7 @@ static TupleDesc PlanCacheComputeResultDesc(List *stmt_list);
 static void PlanCacheRelCallback(Datum arg, Oid relid);
 static void PlanCacheFuncCallback(Datum arg, int cacheid, uint32 hashvalue);
 static void PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue);
+static void PruneCachedPlan(void);
 
 
 /*
@@ -208,6 +217,8 @@ CreateCachedPlan(RawStmt *raw_parse_tree,
     plansource->generic_cost = -1;
     plansource->total_custom_cost = 0;
     plansource->num_custom_plans = 0;
+    plansource->last_access = GetCatCacheClock();
+    
 
     MemoryContextSwitchTo(oldcxt);
 
@@ -423,6 +434,28 @@ CompleteCachedPlan(CachedPlanSource *plansource,
     plansource->is_valid = true;
 }
 
+/* moves the plansource to the first in the list */
+static inline void
+MovePlansourceToFirst(CachedPlanSource *plansource)
+{
+    if (first_saved_plan != plansource)
+    {
+        /* delink this element */
+        if (plansource->next_saved)
+            plansource->next_saved->prev_saved = plansource->prev_saved;
+        if (plansource->prev_saved)
+            plansource->prev_saved->next_saved = plansource->next_saved;
+        if (last_saved_plan == plansource)
+            last_saved_plan = plansource->prev_saved;
+
+        /* insert at the beginning */
+        first_saved_plan->prev_saved = plansource;
+        plansource->next_saved = first_saved_plan;
+        plansource->prev_saved = NULL;
+        first_saved_plan = plansource;
+    }
+}
+
 /*
  * SaveCachedPlan: save a cached plan permanently
  *
@@ -470,6 +503,11 @@ SaveCachedPlan(CachedPlanSource *plansource)
      * Add the entry to the global list of cached plans.
      */
     plansource->next_saved = first_saved_plan;
+    if (first_saved_plan)
+        first_saved_plan->prev_saved = plansource;
+    else
+        last_saved_plan = plansource;
+    plansource->prev_saved = NULL;
     first_saved_plan = plansource;
 
     plansource->is_saved = true;
@@ -492,7 +530,11 @@ DropCachedPlan(CachedPlanSource *plansource)
     if (plansource->is_saved)
     {
         if (first_saved_plan == plansource)
+        {
             first_saved_plan = plansource->next_saved;
+            if (first_saved_plan)
+                first_saved_plan->prev_saved = NULL;
+        }
         else
         {
             CachedPlanSource *psrc;
@@ -502,10 +544,19 @@ DropCachedPlan(CachedPlanSource *plansource)
                 if (psrc->next_saved == plansource)
                 {
                     psrc->next_saved = plansource->next_saved;
+                    if (psrc->next_saved)
+                        psrc->next_saved->prev_saved = psrc;
                     break;
                 }
             }
         }
+
+        if (last_saved_plan == plansource)
+        {
+            last_saved_plan = plansource->prev_saved;
+            if (last_saved_plan)
+                last_saved_plan->next_saved = NULL;
+        }
         plansource->is_saved = false;
     }
 
@@ -537,6 +588,13 @@ ReleaseGenericPlan(CachedPlanSource *plansource)
         Assert(plan->magic == CACHEDPLAN_MAGIC);
         plansource->gplan = NULL;
         ReleaseCachedPlan(plan, false);
+
+        /* decrement "saved plans" counter */
+        if (plansource->is_saved)
+        {
+            Assert (num_saved_plans > 0);
+            num_saved_plans--;
+        }
     }
 }
 
@@ -1148,6 +1206,17 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams,
     if (useResOwner && !plansource->is_saved)
         elog(ERROR, "cannot apply ResourceOwner to non-saved cached plan");
 
+    /*
+     * set last-accessed timestamp and move this plan to the first of the list
+     */
+    if (plansource->is_saved)
+    {
+        plansource->last_access = GetCatCacheClock();
+
+        /* move this plan to the first of the list */
+        MovePlansourceToFirst(plansource);
+    }
+
     /* Make sure the querytree list is valid and we have parse-time locks */
     qlist = RevalidateCachedQuery(plansource, queryEnv);
 
@@ -1156,6 +1225,11 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams,
 
     if (!customplan)
     {
+        /* Prune cached plans if needed */
+        if (plansource->is_saved &&
+            min_cached_plans >= 0 && num_saved_plans > min_cached_plans)
+                PruneCachedPlan();
+
         if (CheckCachedPlan(plansource))
         {
             /* We want a generic plan, and we already have a valid one */
@@ -1168,6 +1242,11 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams,
             plan = BuildCachedPlan(plansource, qlist, NULL, queryEnv);
             /* Just make real sure plansource->gplan is clear */
             ReleaseGenericPlan(plansource);
+
+            /* count this new saved plan */
+            if (plansource->is_saved)
+                num_saved_plans++;
+
             /* Link the new generic plan into the plansource */
             plansource->gplan = plan;
             plan->refcount++;
@@ -1856,6 +1935,90 @@ PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue)
     ResetPlanCache();
 }
 
+/*
+ * PruneCachedPlan: remove the generic plans of "old" saved plans.
+ */
+static void
+PruneCachedPlan(void)
+{
+    CachedPlanSource *plansource;
+    TimestampTz          currclock = GetCatCacheClock();
+    long              age;
+    int                  us;
+    int                  nremoved = 0;
+
+    /* do nothing if not wanted */
+    if (cache_prune_min_age < 0 || num_saved_plans <= min_cached_plans)
+        return;
+
+    /* Fast check for oldest cache */
+    if (oldest_saved_plan > 0)
+    {
+        TimestampDifference(oldest_saved_plan, currclock, &age, &us);
+        if (age < cache_prune_min_age)
+            return;
+    }        
+
+    /* last plan is the oldest. */
+    for (plansource = last_saved_plan; plansource; plansource = plansource->prev_saved)
+    {
+        long    plan_age;
+        int        us;
+
+        Assert(plansource->magic == CACHEDPLANSOURCE_MAGIC);
+
+        /* we want to prune no more plans */
+        if (num_saved_plans <= min_cached_plans)
+            break;
+
+        /*
+         * Nothing to do if it has no generic plan; it will be moved to the
+         * front of the list below so that we don't visit it again next time.
+         */
+        if (!plansource->gplan)
+            continue;
+
+        /*
+         * Check age for pruning. Can exit immediately when finding a
+         * not-older element.
+         */
+        TimestampDifference(plansource->last_access, currclock, &plan_age, &us);
+        if (plan_age <= cache_prune_min_age)
+        {
+            /* this entry is the next oldest */
+            oldest_saved_plan = plansource->last_access;
+            break;
+        }
+
+        /*
+         * Here, remove generic plans of this plansource if it is not actually
+         * used and move it to the beginning of the list. Just update
+         * last_access and move it to the beginning if the plan is used.
+         */
+        if (plansource->gplan->refcount <= 1)
+        {
+            ReleaseGenericPlan(plansource);
+            nremoved++;
+        }
+
+        plansource->last_access = currclock;
+    }
+
+    /* move the "removed" plansrouces altogehter to the beginning of the list */
+    if (plansource != last_saved_plan && plansource)
+    {
+        plansource->next_saved->prev_saved = NULL;
+        first_saved_plan->prev_saved = last_saved_plan;
+         last_saved_plan->next_saved = first_saved_plan;
+        first_saved_plan = plansource->next_saved;
+        plansource->next_saved = NULL;
+        last_saved_plan = plansource;
+    }
+
+    if (nremoved > 0)
+        elog(DEBUG1, "plancache removed %d/%d", nremoved, num_saved_plans);
+}
+
 /*
  * ResetPlanCache: invalidate all cached plans.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d89654cf8a..e49346707d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2128,6 +2128,16 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"min_cached_plans", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum number of cached plans kept on memory."),
+            gettext_noop("Timeout invalidation of plancache is not activated until the number of plancaches reaches
thisvalue. -1 means timeout invalidation is always active.")
 
+        },
+        &min_cached_plans,
+        1000, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/include/utils/plancache.h b/src/include/utils/plancache.h
index ab20aa04b0..f3c5b2010d 100644
--- a/src/include/utils/plancache.h
+++ b/src/include/utils/plancache.h
@@ -110,11 +110,13 @@ typedef struct CachedPlanSource
     bool        is_valid;        /* is the query_list currently valid? */
     int            generation;        /* increments each time we create a plan */
     /* If CachedPlanSource has been saved, it is a member of a global list */
-    struct CachedPlanSource *next_saved;    /* list link, if so */
+    struct CachedPlanSource *prev_saved;    /* list prev link, if so */
+    struct CachedPlanSource *next_saved;    /* list next link, if so */
     /* State kept to help decide whether to use custom or generic plans: */
     double        generic_cost;    /* cost of generic plan, or -1 if not known */
     double        total_custom_cost;    /* total cost of custom plans so far */
     int            num_custom_plans;    /* number of plans included in total */
+    TimestampTz    last_access;    /* timestamp of the last usage */
 } CachedPlanSource;
 
 /*
@@ -143,6 +145,9 @@ typedef struct CachedPlan
     MemoryContext context;        /* context containing this CachedPlan */
 } CachedPlan;
 
+/* GUC variables */
+extern int min_cached_plans;
+extern int plancache_prune_min_age;
 
 extern void InitPlanCache(void);
 extern void ResetPlanCache(void);
-- 
2.16.3


Re: Protect syscache from bloating with negative cache entries

От
Alvaro Herrera
Дата:
On 2018-Jul-02, Andrew Dunstan wrote:

> Andres suggested back in March (and again privately to me) that given how
> much this has changed from the original this CF item should be marked
> Returned With Feedback and the current patchset submitted as a new item.
> 
> Does anyone object to that course of action?

If doing that makes the "CF count" reset back to one for the new
submission, then I object to that course of action.  If we really
think this item does not belong into this commitfest, lets punt it to
the next one.  However, it seems rather strange to do so this early in
the cycle.  Is there really no small item that could be cherry-picked
from this series to be committed standalone?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Protect syscache from bloating with negative cache entries

От
Andres Freund
Дата:
Hi,

On 2018-07-02 21:50:36 -0400, Alvaro Herrera wrote:
> On 2018-Jul-02, Andrew Dunstan wrote:
> 
> > Andres suggested back in March (and again privately to me) that given how
> > much this has changed from the original this CF item should be marked
> > Returned With Feedback and the current patchset submitted as a new item.
> > 
> > Does anyone object to that course of action?
> 
> If doing that makes the "CF count" reset back to one for the new
> submission, then I object to that course of action.  If we really
> think this item does not belong into this commitfest, lets punt it to
> the next one.  However, it seems rather strange to do so this early in
> the cycle.  Is there really no small item that could be cherry-picked
> from this series to be committed standalone?

Well, I think it should just have been RWFed last cycle. It got plenty
of feedback. So it doesn't seem that strange to me, not to include it in
the "mop-up" CF? Either way, I don't feel strongly about it, I just know
that I won't have energy for the topic in this CF.

Greetings,

Andres Freund


RE: Protect syscache from bloating with negative cache entries

От
"Ideriha, Takeshi"
Дата:
Hi, 

>Subject: Re: Protect syscache from bloating with negative cache entries
>
>Hello. The previous v4 patchset was just broken.

>Somehow the 0004 was merged into the 0003 and applying 0004 results in failure. I
>removed 0004 part from the 0003 and rebased and repost it.

I have some questions about syscache and relcache pruning
though they may be discussed at upper thread or out of point.

Can I confirm about catcache pruning?
syscache_memory_target is the max figure per CatCache.
(Any CatCache has the same max value.)
So the total max size of catalog caches is estimated to be around, or
slightly more than, the number of SysCache arrays times syscache_memory_target.
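
(As a rough worked example of that estimate, assuming on the order of 80
SysCache members as in the current sources: with syscache_memory_target set
to 1MB, the soft ceiling would be somewhere around 80 x 1MB = 80MB per
backend, not counting entries added before pruning kicks in.)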

If that is correct, I think writing down the above estimation in the document
would help DB administrators estimate memory usage.
In my impression, the current description might lead to the misunderstanding
that syscache_memory_target is the total size of the catalog caches.

Related to the above, I just thought that changing syscache_memory_target per CatCache
would make memory usage more efficient.
Though I haven't checked whether each system catalog cache's memory usage actually varies largely,
the pg_class cache might need more memory than others, while others might need less.
But it would be difficult for users to check each CatCache's memory usage and tune it,
because right now postgresql doesn't provide a handy way to check them.
Another option is that users only specify the total memory target size and postgres
dynamically changes each CatCache's memory target size according to a certain metric
(which still seems difficult and expensive to develop relative to the benefit).
What do you think about this?

+       /*                                                                           
+        * Set up pruning.                                                           
+        *                                                                           
+        * We have two knobs to control pruning and a hash can share them of         
+        * syscache.                                                                 
+        *                                                                           
+        */                                                                          
+       if (flags & HASH_PRUNABLE)                                                   
+       {                                                                            
+               hctl->prunable = true;                                               
+               hctl->prune_cb = info->prune_cb;                                     
+               if (info->memory_target)                                             
+                       hctl->memory_target = info->memory_target;                   
+               else                                                                 
+                       hctl->memory_target = &cache_memory_target;                  
+               if (info->prune_min_age)                                             
+                       hctl->prune_min_age = info->prune_min_age;                   
+               else                                                                 
+                       hctl->prune_min_age = &cache_prune_min_age;                  
+       }                                                                            
+       else                                                                         
+               hctl->prunable = false;

As you commented here, guc variable syscache_memory_target and
syscache_prune_min_age are used for both syscache and relcache (HTAB), right?

Do syscache and relcache have the similar amount of memory usage?
If not, I'm thinking that introducing separate guc variable would be fine.
So as syscache_prune_min_age.

Regards,
====================
Takeshi Ideriha
Fujitsu Limited




Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
Hello. Thank you for looking this.

At Wed, 12 Sep 2018 05:16:52 +0000, "Ideriha, Takeshi" <ideriha.takeshi@jp.fujitsu.com> wrote in
<4E72940DA2BF16479384A86D54D0988A6F197012@G01JPEXMBKW04>
> Hi, 
> 
> >Subject: Re: Protect syscache from bloating with negative cache entries
> >
> >Hello. The previous v4 patchset was just broken.
> 
> >Somehow the 0004 was merged into the 0003 and applying 0004 results in failure. I
> >removed 0004 part from the 0003 and rebased and repost it.
> 
> I have some questions about syscache and relcache pruning
> though they may be discussed at upper thread or out of point.
> 
> Can I confirm about catcache pruning?
> syscache_memory_target is the max figure per CatCache.
> (Any CatCache has the same max value.)
> So the total max size of catalog caches is estimated around or 
> slightly more than # of SysCache array times syscache_memory_target.

Right.

> If correct, I'm thinking writing down the above estimation to the document 
> would help db administrators with estimation of memory usage.
> Current description might lead misunderstanding that syscache_memory_target
> is the total size of catalog cache in my impression.

Honestly I'm not sure that is the right design. However, I don't
think providing such a formula to users helps them, since they
don't know exactly how many CatCaches and their brethren live in their
server, it is a soft limit, and finally only a few or just one
catalog can reach the limit.

The current design is based on the assumption that we would have
only one extremely-growable cache in a given use case.

> Related to the above I just thought changing sysycache_memory_target per CatCache
> would make memory usage more efficient.

We could easily have per-cache settings in CatCache, but how do
we provide the knobs for them? I can only come up with solutions
that seem like too much for that.

> Though I haven't checked if there's a case that each system catalog cache memory usage varies largely,
> pg_class cache might need more memory than others and others might need less.
> But it would be difficult for users to check each CatCache memory usage and tune it
> because right now postgresql hasn't provided a handy way to check them.

I supposed that this would be used without such a means. Someone
who suffers syscache bloat can just set this GUC to avoid the
bloat. End.
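
For instance, something like the following in the suffering backend would be
enough (a minimal sketch using the GUC names as they appear in the attached
PoC; the documentation hunk spells them syscache_*, and the values here are
arbitrary):

=# SET cache_prune_min_age = '300s';
=# SET cache_memory_target = '10MB';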

Apart from that, in the current patch syscache_memory_target is
not exact at all in the first place, to avoid the overhead of counting
the correct size. The major difference comes from the size of the
cache tuple itself. But I came to think that is too much to omit.

As a *PoC*, in the attached patch (which applies to current
master), size of CTups are counted as the catcache size.

It also provides a pg_catcache_size system view just to give a
rough idea of how such a view looks. I'll consider this more, but
do you have any opinion on it?

=# select relid::regclass, indid::regclass, size from pg_syscache_sizes order by size desc;
          relid          |                   indid                   |  size  
-------------------------+-------------------------------------------+--------
 pg_class                | pg_class_oid_index                        | 131072
 pg_class                | pg_class_relname_nsp_index                | 131072
 pg_cast                 | pg_cast_source_target_index               |   5504
 pg_operator             | pg_operator_oprname_l_r_n_index           |   4096
 pg_statistic            | pg_statistic_relid_att_inh_index          |   2048
 pg_proc                 | pg_proc_proname_args_nsp_index            |   2048
..
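
A rough per-backend total could then be taken with something like the
following (a sketch assuming the pg_syscache_sizes view from the attached
patch):

=# SELECT pg_size_pretty(sum(size)) FROM pg_syscache_sizes;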


> Another option is that users only specify the total memory target size and postgres 
> dynamically change each CatCache memory target size according to a certain metric.
> (, which still seems difficult and expensive to develop per benefit)
> What do you think about this?

Given that few caches bloat at once, its effect is not so
different from the current design.

> As you commented here, guc variable syscache_memory_target and
> syscache_prune_min_age are used for both syscache and relcache (HTAB), right?

Right, just not to add knobs for unclear reasons. Since ...

> Do syscache and relcache have the similar amount of memory usage?

They may be different, but it would not make much difference in
the case of cache bloat.

> If not, I'm thinking that introducing separate guc variable would be fine.
> So as syscache_prune_min_age.

I implemented it that way so that it is easily replaceable in case,
but I'm not sure separating them makes a significant difference.

Thanks for the opinion, I'll put consideration on this more.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index bee4afbe4e..6a00141fc9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1617,6 +1617,44 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target">
+      <term><varname>syscache_memory_target</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_memory_target</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to which syscache is expanded
+        without pruning. The value defaults to 0, indicating that pruning is
+        always considered. After exceeding this size, syscache pruning is
+        considered according to
+        <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep
+        certain amount of syscache entries with intermittent usage, try
+        increase this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age">
+      <term><varname>syscache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the minimum amount of unused time in seconds after which a
+        syscache entry is considered for removal. -1 indicates that syscache
+        pruning is disabled entirely. The value defaults to 600 seconds
+        (<literal>10 minutes</literal>). Syscache entries that are not
+        used for this duration can be removed to prevent syscache bloat. This
+        behavior is suppressed until the size of syscache exceeds
+        <xref linkend="guc-syscache-memory-target"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 875be180fe..df4256466c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -713,6 +713,9 @@ void
 SetCurrentStatementStartTimestamp(void)
 {
     stmtStartTimestamp = GetCurrentTimestamp();
+
+    /* Set this timestamp as the approximated current time */
+    SetCatCacheClock(stmtStartTimestamp);
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 7251552419..1a1acd9bc7 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -938,6 +938,11 @@ REVOKE ALL ON pg_subscription FROM public;
 GRANT SELECT (subdbid, subname, subowner, subenabled, subslotname, subpublications)
     ON pg_subscription TO public;
 
+-- XXXXXXXXXXXXXXXXXXXXXX
+CREATE VIEW pg_syscache_sizes AS
+  SELECT *
+  FROM pg_get_syscache_sizes();
+
 
 --
 -- We have a few function definitions in here, too.
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 5ddbf6eab1..aafdc4f8f2 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -71,9 +71,24 @@
 #define CACHE6_elog(a,b,c,d,e,f,g)
 #endif
 
+/*
+ * GUC variable to define the minimum hash size at which to consider entry eviction.
+ * This variable is shared among various cache mechanisms.
+ */
+int cache_memory_target = 0;
+
+/* GUC variable to define the minimum age of entries that will be considered to
+ * be evicted in seconds. This variable is shared among various cache
+ * mechanisms.
+ */
+int cache_prune_min_age = 600;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Timestamp used for any operation on caches. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -498,6 +513,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
         CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys,
                          cache->cc_keyno, ct->keys);
 
+    cache->cc_tupsize -= ct->size;
     pfree(ct);
 
     --cache->cc_ntup;
@@ -849,6 +865,7 @@ InitCatCache(int id,
     cp->cc_nkeys = nkeys;
     for (i = 0; i < nkeys; ++i)
         cp->cc_keyno[i] = key[i];
+    cp->cc_tupsize = 0;
 
     /*
      * new cache is initialized as far as we can go for now. print some
@@ -866,9 +883,129 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize the catcache reference clock if it hasn't been done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left alone for several reasons. We remove them if
+ * they are not accessed for a certain time to prevent catcache from
+ * bloating. The eviction is performed with an algorithm similar to buffer
+ * eviction, using an access counter. Entries that are accessed several times can
+ * live longer than those that have had no access in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int            i;
+    int            nremoved = 0;
+    size_t        hash_size;
+#ifdef CATCACHE_STATS
+    /* These variables are only for debugging purpose */
+    int            ntotal = 0;
+    /*
+     * The nth element of nentries stores the number of cache entries that
+     * have lived unaccessed for the corresponding multiple of
+     * cache_prune_min_age in ageclass. The index into nremoved_entry is the
+     * value of the clock-sweep counter, which ranges from 0 up to 2.
+     */
+    double        ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+    int            nentries[] = {0, 0, 0, 0, 0, 0};
+    int            nremoved_entry[3] = {0, 0, 0};
+    int            j;
+#endif
+
+    /* Return immediately if no pruning is wanted */
+    if (cache_prune_min_age < 0)
+        return false;
+
+    /*
+     * Return without pruning if the size of the hash is below the target.
+     */
+    hash_size = cp->cc_nbuckets * sizeof(dlist_head);
+    if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L)
+        return false;
+    
+    /* Search the whole hash for entries to remove */
+    for (i = 0; i < cp->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+
+            /*
+             * Calculate the duration from the time of the last access to the
+             * "current" time. Since catcacheclock is not advanced within a
+             * transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+#ifdef CATCACHE_STATS
+            /* count catcache entries for each age class */
+            ntotal++;
+            for (j = 0 ;
+                 ageclass[j] != 0.0 &&
+                     entry_age > cache_prune_min_age * ageclass[j] ;
+                 j++);
+            if (ageclass[j] == 0.0) j--;
+            nentries[j]++;
+#endif
+
+            /*
+             * Try to remove entries older than cache_prune_min_age seconds.
+             * Entries that have not been accessed since the last pruning are
+             * removed after that many seconds, while entries that have been
+             * accessed several times are removed only after being left alone
+             * for up to three times that duration. We don't try to shrink the
+             * buckets since pruning effectively
+             * caps catcache expansion in the long term.
+             */
+            if (entry_age > cache_prune_min_age)
+            {
+#ifdef CATCACHE_STATS
+                Assert (ct->naccess >= 0 && ct->naccess <= 2);
+                nremoved_entry[ct->naccess]++;
+#endif
+                if (ct->naccess > 0)
+                    ct->naccess--;
+                else
+                {
+                    if (!ct->c_list || ct->c_list->refcount == 0)
+                    {
+                        CatCacheRemoveCTup(cp, ct);
+                        nremoved++;
+                    }
+                }
+            }
+        }
+    }
+
+#ifdef CATCACHE_STATS
+    ereport(DEBUG1,
+            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d,
2:%d)",
+                     nremoved, ntotal,
+                     ageclass[0] * cache_prune_min_age, nentries[0],
+                     ageclass[1] * cache_prune_min_age, nentries[1],
+                     ageclass[2] * cache_prune_min_age, nentries[2],
+                     ageclass[3] * cache_prune_min_age, nentries[3],
+                     ageclass[4] * cache_prune_min_age, nentries[4],
+                     nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]),
+             errhidestmt(true)));
+#endif
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -1282,6 +1419,11 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /* Update access information for pruning */
+        if (ct->naccess < 2)
+            ct->naccess++;
+        ct->lastaccess = catcacheclock;
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1813,7 +1955,6 @@ ReleaseCatCacheList(CatCList *list)
         CatCacheRemoveCList(list->my_cache, list);
 }
 
-
 /*
  * CatalogCacheCreateEntry
  *        Create a new CatCTup entry, copying the given HeapTuple and other
@@ -1827,11 +1968,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     CatCTup    *ct;
     HeapTuple    dtp;
     MemoryContext oldcxt;
+    int            tupsize = 0;
 
     /* negative entries have no tuple associated */
     if (ntp)
     {
         int            i;
 
         Assert(!negative);
 
@@ -1850,13 +1993,14 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
         /* Allocate memory for CatCTup and the cached tuple in one go */
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
 
-        ct = (CatCTup *) palloc(sizeof(CatCTup) +
-                                MAXIMUM_ALIGNOF + dtp->t_len);
+        tupsize = sizeof(CatCTup) +    MAXIMUM_ALIGNOF + dtp->t_len;
+        ct = (CatCTup *) palloc(tupsize);
         ct->tuple.t_len = dtp->t_len;
         ct->tuple.t_self = dtp->t_self;
         ct->tuple.t_tableOid = dtp->t_tableOid;
         ct->tuple.t_data = (HeapTupleHeader)
             MAXALIGN(((char *) ct) + sizeof(CatCTup));
+        ct->size = tupsize;
         /* copy tuple contents */
         memcpy((char *) ct->tuple.t_data,
                (const char *) dtp->t_data,
@@ -1884,8 +2028,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     {
         Assert(negative);
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
-        ct = (CatCTup *) palloc(sizeof(CatCTup));
-
+        tupsize = sizeof(CatCTup);
+        ct = (CatCTup *) palloc(tupsize);
         /*
          * Store keys - they'll point into separately allocated memory if not
          * by-value.
@@ -1906,17 +2050,24 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
+    ct->size = tupsize;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
     cache->cc_ntup++;
     CacheHdr->ch_ntup++;
+    cache->cc_tupsize += tupsize;
 
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, try cleanup by removing
+     * infrequently used entries to make room for the new entry. If that
+     * fails, enlarge the bucket array instead.  Quite arbitrarily, we try
+     * this when fill factor > 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
 
     return ct;
@@ -2118,3 +2269,9 @@ PrintCatCacheListLeakWarning(CatCList *list)
          list->my_cache->cc_relname, list->my_cache->id,
          list, list->refcount);
 }
+
+int
+CatCacheGetSize(CatCache *cache)
+{
+    return cache->cc_tupsize + cache->cc_nbuckets * sizeof(dlist_head);
+}
diff --git a/src/backend/utils/cache/plancache.c b/src/backend/utils/cache/plancache.c
index 7271b5880b..490cb8ec8a 100644
--- a/src/backend/utils/cache/plancache.c
+++ b/src/backend/utils/cache/plancache.c
@@ -63,12 +63,14 @@
 #include "storage/lmgr.h"
 #include "tcop/pquery.h"
 #include "tcop/utility.h"
+#include "utils/catcache.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
 #include "utils/resowner_private.h"
 #include "utils/rls.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/timestamp.h"
 
 
 /*
@@ -86,6 +88,12 @@
  * guarantee to save a CachedPlanSource without error.
  */
 static CachedPlanSource *first_saved_plan = NULL;
+static CachedPlanSource *last_saved_plan = NULL;
+static int                 num_saved_plans = 0;
+static TimestampTz         oldest_saved_plan = 0;
+
+/* GUC variables */
+int                         min_cached_plans = 1000;
 
 static void ReleaseGenericPlan(CachedPlanSource *plansource);
 static List *RevalidateCachedQuery(CachedPlanSource *plansource,
@@ -105,6 +113,7 @@ static TupleDesc PlanCacheComputeResultDesc(List *stmt_list);
 static void PlanCacheRelCallback(Datum arg, Oid relid);
 static void PlanCacheFuncCallback(Datum arg, int cacheid, uint32 hashvalue);
 static void PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue);
+static void PruneCachedPlan(void);
 
 /* GUC parameter */
 int    plan_cache_mode;
@@ -210,6 +219,8 @@ CreateCachedPlan(RawStmt *raw_parse_tree,
     plansource->generic_cost = -1;
     plansource->total_custom_cost = 0;
     plansource->num_custom_plans = 0;
+    plansource->last_access = GetCatCacheClock();
+    
 
     MemoryContextSwitchTo(oldcxt);
 
@@ -425,6 +436,28 @@ CompleteCachedPlan(CachedPlanSource *plansource,
     plansource->is_valid = true;
 }
 
+/* moves the plansource to the first in the list */
+static inline void
+MovePlansourceToFirst(CachedPlanSource *plansource)
+{
+    if (first_saved_plan != plansource)
+    {
+        /* delink this element */
+        if (plansource->next_saved)
+            plansource->next_saved->prev_saved = plansource->prev_saved;
+        if (plansource->prev_saved)
+            plansource->prev_saved->next_saved = plansource->next_saved;
+        if (last_saved_plan == plansource)
+            last_saved_plan = plansource->prev_saved;
+
+        /* insert at the beginning */
+        first_saved_plan->prev_saved = plansource;
+        plansource->next_saved = first_saved_plan;
+        plansource->prev_saved = NULL;
+        first_saved_plan = plansource;
+    }
+}
+
 /*
  * SaveCachedPlan: save a cached plan permanently
  *
@@ -472,6 +505,11 @@ SaveCachedPlan(CachedPlanSource *plansource)
      * Add the entry to the global list of cached plans.
      */
     plansource->next_saved = first_saved_plan;
+    if (first_saved_plan)
+        first_saved_plan->prev_saved = plansource;
+    else
+        last_saved_plan = plansource;
+    plansource->prev_saved = NULL;
     first_saved_plan = plansource;
 
     plansource->is_saved = true;
@@ -494,7 +532,11 @@ DropCachedPlan(CachedPlanSource *plansource)
     if (plansource->is_saved)
     {
         if (first_saved_plan == plansource)
+        {
             first_saved_plan = plansource->next_saved;
+            if (first_saved_plan)
+                first_saved_plan->prev_saved = NULL;
+        }
         else
         {
             CachedPlanSource *psrc;
@@ -504,10 +546,19 @@ DropCachedPlan(CachedPlanSource *plansource)
                 if (psrc->next_saved == plansource)
                 {
                     psrc->next_saved = plansource->next_saved;
+                    if (psrc->next_saved)
+                        psrc->next_saved->prev_saved = psrc;
                     break;
                 }
             }
         }
+
+        if (last_saved_plan == plansource)
+        {
+            last_saved_plan = plansource->prev_saved;
+            if (last_saved_plan)
+                last_saved_plan->next_saved = NULL;
+        }
         plansource->is_saved = false;
     }
 
@@ -539,6 +590,13 @@ ReleaseGenericPlan(CachedPlanSource *plansource)
         Assert(plan->magic == CACHEDPLAN_MAGIC);
         plansource->gplan = NULL;
         ReleaseCachedPlan(plan, false);
+
+        /* decrement "saved plans" counter */
+        if (plansource->is_saved)
+        {
+            Assert (num_saved_plans > 0);
+            num_saved_plans--;
+        }
     }
 }
 
@@ -1156,6 +1214,17 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams,
     if (useResOwner && !plansource->is_saved)
         elog(ERROR, "cannot apply ResourceOwner to non-saved cached plan");
 
+    /*
+     * set last-accessed timestamp and move this plan to the first of the list
+     */
+    if (plansource->is_saved)
+    {
+        plansource->last_access = GetCatCacheClock();
+
+        /* move this plan to the first of the list */
+        MovePlansourceToFirst(plansource);
+    }
+
     /* Make sure the querytree list is valid and we have parse-time locks */
     qlist = RevalidateCachedQuery(plansource, queryEnv);
 
@@ -1164,6 +1233,11 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams,
 
     if (!customplan)
     {
+        /* Prune cached plans if needed */
+        if (plansource->is_saved &&
+            min_cached_plans >= 0 && num_saved_plans > min_cached_plans)
+                PruneCachedPlan();
+
         if (CheckCachedPlan(plansource))
         {
             /* We want a generic plan, and we already have a valid one */
@@ -1176,6 +1250,11 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams,
             plan = BuildCachedPlan(plansource, qlist, NULL, queryEnv);
             /* Just make real sure plansource->gplan is clear */
             ReleaseGenericPlan(plansource);
+
+            /* count this new saved plan */
+            if (plansource->is_saved)
+                num_saved_plans++;
+
             /* Link the new generic plan into the plansource */
             plansource->gplan = plan;
             plan->refcount++;
@@ -1864,6 +1943,90 @@ PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue)
     ResetPlanCache();
 }
 
+/*
+ * PrunePlanCache: removes generic plan of "old" saved plans.
+ */
+static void
+PruneCachedPlan(void)
+{
+    CachedPlanSource *plansource;
+    TimestampTz          currclock = GetCatCacheClock();
+    long              age;
+    int                  us;
+    int                  nremoved = 0;
+
+    /* do nothing if not wanted */
+    if (cache_prune_min_age < 0 || num_saved_plans <= min_cached_plans)
+        return;
+
+    /* Fast check for oldest cache */
+    if (oldest_saved_plan > 0)
+    {
+        TimestampDifference(oldest_saved_plan, currclock, &age, &us);
+        if (age < cache_prune_min_age)
+            return;
+    }        
+
+    /* last plan is the oldest. */
+    for (plansource = last_saved_plan; plansource; plansource = plansource->prev_saved)
+    {
+        long    plan_age;
+        int        us;
+
+        Assert(plansource->magic == CACHEDPLANSOURCE_MAGIC);
+
+        /* we want to prune no more plans */
+        if (num_saved_plans <= min_cached_plans)
+            break;
+
+        /*
+         * Nothing to do if it no longer has a gplan; it will be moved to the
+         * beginning of the list along with the others so that we don't see it next time
+         */
+        if (!plansource->gplan)
+            continue;
+
+        /*
+         * Check age for pruning. Can exit immediately when finding a
+         * not-older element.
+         */
+        TimestampDifference(plansource->last_access, currclock, &plan_age, &us);
+        if (plan_age <= cache_prune_min_age)
+        {
+            /* this entry is the next oldest */
+            oldest_saved_plan = plansource->last_access;
+            break;
+        }
+
+        /*
+         * Here, remove the generic plan of this plansource if it is not
+         * actually used, and move it to the beginning of the list. If the
+         * plan is in use, just update last_access and move it to the beginning.
+         */
+        if (plansource->gplan->refcount <= 1)
+        {
+            ReleaseGenericPlan(plansource);
+            nremoved++;
+        }
+
+        plansource->last_access = currclock;
+    }
+
+    /* move the "removed" plansources altogether to the beginning of the list */
+    if (plansource != last_saved_plan && plansource)
+    {
+        plansource->next_saved->prev_saved = NULL;
+        first_saved_plan->prev_saved = last_saved_plan;
+         last_saved_plan->next_saved = first_saved_plan;
+        first_saved_plan = plansource->next_saved;
+        plansource->next_saved = NULL;
+        last_saved_plan = plansource;
+    }
+
+    if (nremoved > 0)
+        elog(DEBUG1, "plancache removed %d/%d", nremoved, num_saved_plans);
+}
+
 /*
  * ResetPlanCache: invalidate all cached plans.
  */
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index 2b381782a3..9cdb75afb8 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -73,9 +73,14 @@
 #include "catalog/pg_ts_template.h"
 #include "catalog/pg_type.h"
 #include "catalog/pg_user_mapping.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "nodes/execnodes.h"
 #include "utils/rel.h"
 #include "utils/catcache.h"
 #include "utils/syscache.h"
+#include "utils/tuplestore.h"
+#include "utils/fmgrprotos.h"
 
 
 /*---------------------------------------------------------------------------
@@ -1530,6 +1535,64 @@ RelationSupportsSysCache(Oid relid)
 }
 
 
+/*
+ * rough size of this syscache
+ */
+Datum
+pg_get_syscache_sizes(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SYSCACHE_SIZE 3
+    ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+    TupleDesc    tupdesc;
+    Tuplestorestate *tupstore;
+    MemoryContext per_query_ctx;
+    MemoryContext oldcontext;
+    int    cacheId;
+
+    if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("set-valued function called in context that cannot accept a set")));
+    if (!(rsinfo->allowedModes & SFRM_Materialize))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("materialize mode required, but it is not " \
+                        "allowed in this context")));
+
+    /* Build a tuple descriptor for our result type */
+    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+        elog(ERROR, "return type must be a row type");
+    
+    per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+    oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+    tupstore = tuplestore_begin_heap(true, false, work_mem);
+    rsinfo->returnMode = SFRM_Materialize;
+    rsinfo->setResult = tupstore;
+    rsinfo->setDesc = tupdesc;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++)
+    {
+        Datum values[PG_GET_SYSCACHE_SIZE];
+        bool nulls[PG_GET_SYSCACHE_SIZE];
+        int i;
+
+        memset(nulls, 0, sizeof(nulls));
+
+        i = 0;
+        values[i++] = cacheinfo[cacheId].reloid;
+        values[i++] = cacheinfo[cacheId].indoid;
+        values[i++] = Int64GetDatum(CatCacheGetSize(SysCache[cacheId]));
+        tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+    }
+
+    tuplestore_donestoring(tupstore);
+
+    return (Datum) 0;
+}
+
 /*
  * OID comparator for pg_qsort
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 0625eff219..3154574f62 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -79,6 +79,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/memutils.h"
@@ -2113,6 +2114,38 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"cache_memory_target", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum syscache size to keep."),
+            gettext_noop("Cache is not pruned before exceeding this size."),
+            GUC_UNIT_KB
+        },
+        &cache_memory_target,
+        0, 0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum unused duration of cache entries before removal."),
+            gettext_noop("Cache entries that live unused for longer than this seconds are considered to be
removed."),
+            GUC_UNIT_S
+        },
+        &cache_prune_min_age,
+        600, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"min_cached_plans", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum number of cached plans kept on memory."),
+            gettext_noop("Timeout invalidation of plancache is not activated until the number of plancaches reaches
thisvalue. -1 means timeout invalidation is always active.")
 
+        },
+        &min_cached_plans,
+        1000, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 7486d20a34..917d7cb5cf 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -126,6 +126,8 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#cache_memory_target = 0kB    # in kB
+#cache_prune_min_age = 600s    # -1 disables pruning
 #max_stack_depth = 2MB            # min 100kB
 #dynamic_shared_memory_type = posix    # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 860571440a..c0bfcc9f70 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -9800,6 +9800,15 @@
   proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
  proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}',
   prosrc => 'pg_get_replication_slots' },
+{ oid => '3423',
+  descr => 'syscache size',
+  proname => 'pg_get_syscache_sizes', prorows => '100', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', prorettype => 'record',
+  proargtypes => '',
+  proallargtypes => '{oid,oid,int8}',
+  proargmodes => '{o,o,o}',
+  proargnames => '{relid,indid,size}',
+  prosrc => 'pg_get_syscache_sizes' },
 { oid => '3786', descr => 'set up a logical replication slot',
   proname => 'pg_create_logical_replication_slot', provolatile => 'v',
   proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool',
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 7b22f9c7bc..9c326d6af6 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,7 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
+    int            cc_tupsize;        /* total size in bytes of the catcache tuples */
 
     /*
      * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,7 +121,9 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
-
+    int            naccess;        /* # of accesses to this entry, up to 2 */
+    TimestampTz    lastaccess;        /* approx. timestamp of the last usage */
+    int            size;            /* palloc'ed size of this tuple */
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
      * catcache is list-searched with varying numbers of keys, we may have to
@@ -189,6 +193,28 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLPMPORT'ed */
+extern int cache_prune_min_age;
+extern int cache_memory_target;
+
+/* to use as access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/*
+ * SetCatCacheClock - set timestamp for catcache access record
+ */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
+static inline TimestampTz
+GetCatCacheClock(void)
+{
+    return catcacheclock;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
@@ -227,5 +253,6 @@ extern void PrepareToInvalidateCacheTuple(Relation relation,
 
 extern void PrintCatCacheLeakWarning(HeapTuple tuple);
 extern void PrintCatCacheListLeakWarning(CatCList *list);
+extern int CatCacheGetSize(CatCache *cache);
 
 #endif                            /* CATCACHE_H */
diff --git a/src/include/utils/plancache.h b/src/include/utils/plancache.h
index 5fc7903a06..338b3470b7 100644
--- a/src/include/utils/plancache.h
+++ b/src/include/utils/plancache.h
@@ -110,11 +110,13 @@ typedef struct CachedPlanSource
     bool        is_valid;        /* is the query_list currently valid? */
     int            generation;        /* increments each time we create a plan */
     /* If CachedPlanSource has been saved, it is a member of a global list */
-    struct CachedPlanSource *next_saved;    /* list link, if so */
+    struct CachedPlanSource *prev_saved;    /* list prev link, if so */
+    struct CachedPlanSource *next_saved;    /* list next link, if so */
     /* State kept to help decide whether to use custom or generic plans: */
     double        generic_cost;    /* cost of generic plan, or -1 if not known */
     double        total_custom_cost;    /* total cost of custom plans so far */
     int            num_custom_plans;    /* number of plans included in total */
+    TimestampTz    last_access;    /* timestamp of the last usage */
 } CachedPlanSource;
 
 /*
@@ -143,6 +145,9 @@ typedef struct CachedPlan
     MemoryContext context;        /* context containing this CachedPlan */
 } CachedPlan;
 
+/* GUC variables */
+extern int min_cached_plans;
+extern int plancache_prune_min_age;
 
 extern void InitPlanCache(void);
 extern void ResetPlanCache(void);

RE: Protect syscache from bloating with negative cache entries

От
"Ideriha, Takeshi"
Дата:
Hi, thank you for the explanation.

>From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
>>
>> Can I confirm about catcache pruning?
>> syscache_memory_target is the max figure per CatCache.
>> (Any CatCache has the same max value.) So the total max size of
>> catalog caches is estimated around or slightly more than # of SysCache
>> array times syscache_memory_target.
>
>Right.
>
>> If correct, I'm thinking writing down the above estimation to the
>> document would help db administrators with estimation of memory usage.
>> Current description might lead misunderstanding that
>> syscache_memory_target is the total size of catalog cache in my impression.
>
>Honestly I'm not sure that is the right design. Howerver, I don't think providing such
>formula to users helps users, since they don't know exactly how many CatCaches and
>brothres live in their server and it is a soft limit, and finally only few or just one catalogs
>can reach the limit.

Yeah, I agree that that kind of formula is not suited for the document.
But if users don't know how many catcaches and the like are used by postgres,
then how about changing syscache_memory_target into a total soft limit for the catcaches,
rather than a size limit for each individual catcache? Internally syscache_memory_target could
be divided by the number of SysCaches and do its work. The total amount would be
easier to understand for users who don't know the detailed contents of the catalog caches.

Or if users can tell how many and what kind of catcaches exist, for instance by using
the system view you provided in the previous email, the current design looks good to me.
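
(As a rough illustration of the "total" idea: a single 100MB target split
evenly across roughly 80 SysCaches would come out to about 1.25MB per cache;
the number of caches here is only an assumption based on current sources.)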

>The current design based on the assumption that we would have only one
>extremely-growable cache in one use case.
>
>> Related to the above I just thought changing sysycache_memory_target
>> per CatCache would make memory usage more efficient.
>
>We could easily have per-cache settings in CatCache, but how do we provide the knobs
>for them? I can guess only too much solutions for that.
Agreed.

>> Though I haven't checked if there's a case that each system catalog
>> cache memory usage varies largely, pg_class cache might need more memory than
>others and others might need less.
>> But it would be difficult for users to check each CatCache memory
>> usage and tune it because right now postgresql hasn't provided a handy way to
>check them.
>
>I supposed that this is used without such a means. Someone suffers syscache bloat
>just can set this GUC to avoid the bloat. End.
Yeah, I had misunderstood the purpose.

>Apart from that, in the current patch, syscache_memory_target is not exact at all in
>the first place to avoid overhead to count the correct size. The major difference comes
>from the size of cache tuple itself. But I came to think it is too much to omit.
>
>As a *PoC*, in the attached patch (which applies to current master), size of CTups are
>counted as the catcache size.
>
>It also provides pg_catcache_size system view just to give a rough idea of how such
>view looks. I'll consider more on that but do you have any opinion on this?
>
>=# select relid::regclass, indid::regclass, size from pg_syscache_sizes order by size
>desc;
>          relid          |                   indid                   |  size
>-------------------------+-------------------------------------------+--------
> pg_class                | pg_class_oid_index                        | 131072
> pg_class                | pg_class_relname_nsp_index                | 131072
> pg_cast                 | pg_cast_source_target_index               |   5504
> pg_operator             | pg_operator_oprname_l_r_n_index           |   4096
> pg_statistic            | pg_statistic_relid_att_inh_index          |   2048
> pg_proc                 | pg_proc_proname_args_nsp_index            |   2048
>..

Great! I like this view.
One extreme idea would be adding all the members printed by CatCachePrintStats(),
which is only enabled with -DCATCACHE_STATS at the moment.
All of those members seem too much for customers who just want to change the cache limit size,
but some of them may be useful; for example, cc_hits would indicate that the current
cache limit size is too small.

>> Another option is that users only specify the total memory target size
>> and postgres dynamically change each CatCache memory target size according to a
>certain metric.
>> (, which still seems difficult and expensive to develop per benefit)
>> What do you think about this?
>
>Given that few caches bloat at once, it's effect is not so different from the current
>design.
Yes agreed.

>> As you commented here, guc variable syscache_memory_target and
>> syscache_prune_min_age are used for both syscache and relcache (HTAB), right?
>
>Right, just not to add knobs for unclear reasons. Since ...
>
>> Do syscache and relcache have the similar amount of memory usage?
>
>They may be different but would make not so much in the case of cache bloat.
>> If not, I'm thinking that introducing separate guc variable would be fine.
>> So as syscache_prune_min_age.
>
>I implemented that so that it is easily replaceable in case, but I'm not sure separating
>them makes significant difference..
Maybe I was overthinking it by mixing in my own development.

Regards,
Takeshi Ideriha



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
Hello. Thank you for the comment.

At Thu, 4 Oct 2018 04:27:04 +0000, "Ideriha, Takeshi" <ideriha.takeshi@jp.fujitsu.com> wrote in
<4E72940DA2BF16479384A86D54D0988A6F1BCB6F@G01JPEXMBKW04>
> >As a *PoC*, in the attached patch (which applies to current master), size of CTups are
> >counted as the catcache size.
> >
> >It also provides pg_catcache_size system view just to give a rough idea of how such
> >view looks. I'll consider more on that but do you have any opinion on this?
> >
...
> Great! I like this view.
> One of the extreme idea would be adding all the members printed by CatCachePrintStats(), 
> which is only enabled with -DCATCACHE_STATS at this moment. 
> All of the members seems too much for customers who tries to change the cache limit size
> But it may be some of the members are useful because for example cc_hits would indicate that current 
> cache limit size is too small.

The attached introduces four features below. (But the features on
relcache and plancache are omitted).

1. syscache stats collector (in 0002)

Records syscache status, consisting of the same columns as above plus
"ageclass" information. We could somehow trigger a stats report with a
signal, but we don't want to take/send/write the statistics in a signal
handler. Instead, it is turned on by setting
track_syscache_usage_interval to a positive number of milliseconds.
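
For local experimentation it can also be enabled directly in the session,
for example (a sketch assuming the GUC added in 0002; 10000 is arbitrary):

=# SET track_syscache_usage_interval = 10000;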

2. pg_stat_syscache view.  (in 0002)

This view shows catcache statistics. Statistics are taken only on
the backends where syscache tracking is active.

>  pid  | application_name |    relname     |            cache_name             |   size   |        ageclass         |         nentries
>------+------------------+----------------+-----------------------------------+----------+-------------------------+---------------------------
>  9984 | psql             | pg_statistic   | pg_statistic_relid_att_inh_index  | 12676096 | {30,60,600,1200,1800,0} | {17660,17310,55870,0,0,0}

Age class is the basis of the catcache truncation mechanism and shows
the distribution based on elapsed time since last access. As I
didn't come up with a more appropriate way, it is represented as two
arrays.  Ageclass stores the maximum age for each class in
seconds. Nentries holds the entry counts corresponding to the same
element in ageclass. In the above example,

     age class  : # of entries in the cache
   up to   30s  : 17660
   up to   60s  : 17310
   up to  600s  : 55870
   up to 1200s  : 0
   up to 1800s  : 0
   more longer  : 0

 The ageclass values are the {0, 0.05, 0.1, 1, 2, 3} multiples of
 cache_prune_min_age on the backend.
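
 To read the distribution for one cache, the two array columns can be
 unnested side by side, for example (a sketch against the view above,
 assuming both columns are arrays as shown):

 =# SELECT a.upto, a.n
      FROM pg_stat_syscache,
           unnest(ageclass, nentries) AS a(upto, n)
     WHERE cache_name = 'pg_statistic_relid_att_inh_index'::regclass;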

3. non-transactional GUC setting (in 0003)

It allows a GUC variable set with the action
GUC_ACTION_NONXACT (the name requires consideration) to survive beyond a
rollback. It is required for remote GUC setting to work
sanely. Without the feature, a remotely-set value within a transaction
would disappear on rollback. The only local interface for
the NONXACT action is set_config(name, value, is_local=false,
is_nonxact = true). pg_set_backend_guc() below works on this
feature.
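
For example, within the local session this looks like the following (a
sketch of the proposed four-argument interface, not the existing
three-argument set_config):

=# SELECT set_config('track_syscache_usage_interval', '10000', false, true);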

4. pg_set_backend_guc() function.

Of course syscache statistics recording consumes a significant
amount of time, so it usually cannot be left turned on. On the other
hand, since this feature is controlled by a GUC, we would need to grab
the active client connection to turn the feature on or off (but we
cannot). Instead, I provided a means to change GUC variables in
another backend.

pg_set_backend_guc(pid, name, value) sets the GUC variable "name"
on the backend "pid" to "value".



With the above tools, we can inspect the catcache statistics of a
seemingly bloated process.

A. Find a bloated process pid using ps or something.

B. Turn on syscache stats on the process.
  =# select pg_set_backend_guc(9984, 'track_syscache_usage_interval', '10000');

C. Examine the statistics.

=# select pid, relname, cache_name, size from pg_stat_syscache order by size desc limit 3;
 pid  |   relname    |            cache_name            |   size   
------+--------------+----------------------------------+----------
 9984 | pg_statistic | pg_statistic_relid_att_inh_index | 32154112
 9984 | pg_cast      | pg_cast_source_target_index      |     4096
 9984 | pg_operator  | pg_operator_oprname_l_r_n_index  |     4096


=# select * from pg_stat_syscache where cache_name = 'pg_statistic_relid_att_inh_index'::regclass;
-[ RECORD 1 ]---------------------------------
pid         | 9984
relname     | pg_statistic
cache_name  | pg_statistic_relid_att_inh_index
size        | 11026176
ntuples     | 77950
searches    | 77950
hits        | 0
neg_hits    | 0
ageclass    | {30,60,600,1200,1800,0}
nentries    | {17630,16950,43370,0,0,0}
last_update | 2018-10-17 15:58:19.738164+09


> >> Another option is that users only specify the total memory target size
> >> and postgres dynamically change each CatCache memory target size according to a
> >certain metric.
> >> (, which still seems difficult and expensive to develop per benefit)
> >> What do you think about this?
> >
> >Given that few caches bloat at once, it's effect is not so different from the current
> >design.
> Yes agreed.
> 
> >> As you commented here, guc variable syscache_memory_target and
> >> syscache_prune_min_age are used for both syscache and relcache (HTAB), right?
> >
> >Right, just not to add knobs for unclear reasons. Since ...
> >
> >> Do syscache and relcache have the similar amount of memory usage?
> >
> >They may be different but would make not so much in the case of cache bloat.
> >> If not, I'm thinking that introducing separate guc variable would be fine.
> >> So as syscache_prune_min_age.
> >
> >I implemented that so that it is easily replaceable in case, but I'm not sure separating
> >them makes significant difference..
> Maybe I was overthinking mixing my development. 

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 4125f38c439d305797907bb95e5a35c7f869244e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 16 Oct 2018 13:04:30 +0900
Subject: [PATCH 1/3] Remove entries that haven't been used for a certain time

Catcache entries can be left alone for several reasons. It is not
desirable that they eat up memory. This patch adds consideration of
removing entries that haven't been used for a certain time before
enlarging the hash array.
---
 doc/src/sgml/config.sgml                      |  38 ++++++
 src/backend/access/transam/xact.c             |   5 +
 src/backend/utils/cache/catcache.c            | 166 ++++++++++++++++++++++++--
 src/backend/utils/misc/guc.c                  |  23 ++++
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/utils/catcache.h                  |  28 ++++-
 6 files changed, 254 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7554cba3f9..c3133d742b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1618,6 +1618,44 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target">
+      <term><varname>syscache_memory_target</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_memory_target</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory up to which the syscache can
+        grow without pruning. The value defaults to 0, meaning that pruning
+        is always considered. After this size is exceeded, syscache pruning
+        is considered according to
+        <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep a
+        certain amount of syscache entries with intermittent usage, try
+        increasing this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age">
+      <term><varname>syscache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the minimum time in seconds that a syscache entry must
+        remain unused before it is considered for removal. -1 disables
+        syscache pruning entirely. The value defaults to 600 seconds
+        (<literal>10 minutes</literal>). Syscache entries that have not been
+        used for this duration can be removed to prevent syscache bloat. This
+        behavior is suppressed until the size of the syscache exceeds
+        <xref linkend="guc-syscache-memory-target"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8c1621d949..083b6dc7aa 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -733,7 +733,12 @@ void
 SetCurrentStatementStartTimestamp(void)
 {
     if (!IsParallelWorker())
+    {
         stmtStartTimestamp = GetCurrentTimestamp();
+
+        /* Set this timestamp as the approximate current time */
+        SetCatCacheClock(stmtStartTimestamp);
+    }
     else
         Assert(stmtStartTimestamp != 0);
 }
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 5ddbf6eab1..9be463311d 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -71,9 +71,24 @@
 #define CACHE6_elog(a,b,c,d,e,f,g)
 #endif
 
+/*
+ * GUC variable to define the minimum cache size at which entry eviction is considered.
+ * This variable is shared among various cache mechanisms.
+ */
+int cache_memory_target = 0;
+
+/* GUC variable to define the minimum age in seconds of entries that will be
+ * considered for eviction. This variable is shared among various cache
+ * mechanisms.
+ */
+int cache_prune_min_age = 600;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Timestamp used for any operation on caches. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -498,6 +513,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
         CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys,
                          cache->cc_keyno, ct->keys);
 
+    cache->cc_tupsize -= ct->size;
     pfree(ct);
 
     --cache->cc_ntup;
@@ -849,6 +865,7 @@ InitCatCache(int id,
     cp->cc_nkeys = nkeys;
     for (i = 0; i < nkeys; ++i)
         cp->cc_keyno[i] = key[i];
+    cp->cc_tupsize = 0;
 
     /*
      * new cache is initialized as far as we can go for now. print some
@@ -866,9 +883,129 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize the catcache reference clock if not done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left unused for long periods. To prevent the
+ * catcache from bloating, we remove entries that have not been accessed for
+ * a certain time. The eviction uses an algorithm similar to buffer eviction,
+ * based on an access counter: entries that have been accessed several times
+ * can live longer than those that have had no access in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int            i;
+    int            nremoved = 0;
+    size_t        hash_size;
+#ifdef CATCACHE_STATS
+    /* These variables are only for debugging purpose */
+    int            ntotal = 0;
+    /*
+     * The nth element of nentries stores the number of cache entries that
+     * have lived unaccessed for the corresponding multiple (in ageclass) of
+     * cache_prune_min_age. The index of nremoved_entry is the value of the
+     * clock-sweep counter, which takes from 0 up to 2.
+     */
+    double        ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+    int            nentries[] = {0, 0, 0, 0, 0, 0};
+    int            nremoved_entry[3] = {0, 0, 0};
+    int            j;
+#endif
+
+    /* Return immediately if no pruning is wanted */
+    if (cache_prune_min_age < 0)
+        return false;
+
+    /*
+     * Return without pruning if the size of the hash is below the target.
+     */
+    hash_size = cp->cc_nbuckets * sizeof(dlist_head);
+    if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L)
+        return false;
+    
+    /* Search the whole hash for entries to remove */
+    for (i = 0; i < cp->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+
+            /*
+             * Calculate the duration from the time of the last access to the
+             * "current" time. Since catcacheclock is not advanced within a
+             * transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+#ifdef CATCACHE_STATS
+            /* count catcache entries for each age class */
+            ntotal++;
+            for (j = 0 ;
+                 ageclass[j] != 0.0 &&
+                     entry_age > cache_prune_min_age * ageclass[j] ;
+                 j++);
+            if (ageclass[j] == 0.0) j--;
+            nentries[j]++;
+#endif
+
+            /*
+             * Try to remove entries older than cache_prune_min_age seconds.
+             * Entries that have not been accessed since the last pruning are
+             * removed after that duration, while entries that have been
+             * accessed several times are left alone for up to three times the
+             * duration. We don't try to shrink buckets since pruning effectively
+             * caps catcache expansion in the long term.
+             */
+            if (entry_age > cache_prune_min_age)
+            {
+#ifdef CATCACHE_STATS
+                Assert (ct->naccess >= 0 && ct->naccess <= 2);
+                nremoved_entry[ct->naccess]++;
+#endif
+                if (ct->naccess > 0)
+                    ct->naccess--;
+                else
+                {
+                    if (!ct->c_list || ct->c_list->refcount == 0)
+                    {
+                        CatCacheRemoveCTup(cp, ct);
+                        nremoved++;
+                    }
+                }
+            }
+        }
+    }
+
+#ifdef CATCACHE_STATS
+    ereport(DEBUG1,
+            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)",
+                     nremoved, ntotal,
+                     ageclass[0] * cache_prune_min_age, nentries[0],
+                     ageclass[1] * cache_prune_min_age, nentries[1],
+                     ageclass[2] * cache_prune_min_age, nentries[2],
+                     ageclass[3] * cache_prune_min_age, nentries[3],
+                     ageclass[4] * cache_prune_min_age, nentries[4],
+                     nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]),
+             errhidestmt(true)));
+#endif
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -1282,6 +1419,11 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /* Update access information for pruning */
+        if (ct->naccess < 2)
+            ct->naccess++;
+        ct->lastaccess = catcacheclock;
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1827,11 +1969,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     CatCTup    *ct;
     HeapTuple    dtp;
     MemoryContext oldcxt;
+    int            tupsize = 0;
 
     /* negative entries have no tuple associated */
     if (ntp)
     {
         int            i;
+        int            tupsize;
 
         Assert(!negative);
 
@@ -1850,13 +1994,14 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
         /* Allocate memory for CatCTup and the cached tuple in one go */
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
 
-        ct = (CatCTup *) palloc(sizeof(CatCTup) +
-                                MAXIMUM_ALIGNOF + dtp->t_len);
+        tupsize = sizeof(CatCTup) +    MAXIMUM_ALIGNOF + dtp->t_len;
+        ct = (CatCTup *) palloc(tupsize);
         ct->tuple.t_len = dtp->t_len;
         ct->tuple.t_self = dtp->t_self;
         ct->tuple.t_tableOid = dtp->t_tableOid;
         ct->tuple.t_data = (HeapTupleHeader)
             MAXALIGN(((char *) ct) + sizeof(CatCTup));
+        ct->size = tupsize;
         /* copy tuple contents */
         memcpy((char *) ct->tuple.t_data,
                (const char *) dtp->t_data,
@@ -1884,8 +2029,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     {
         Assert(negative);
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
-        ct = (CatCTup *) palloc(sizeof(CatCTup));
-
+        tupsize = sizeof(CatCTup);
+        ct = (CatCTup *) palloc(tupsize);
         /*
          * Store keys - they'll point into separately allocated memory if not
          * by-value.
@@ -1906,17 +2051,24 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
+    ct->size = tupsize;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
     cache->cc_ntup++;
     CacheHdr->ch_ntup++;
+    cache->cc_tupsize += tupsize;
 
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, try cleanup by removing
+     * infrequently used entries to make a room for the new entry. If it
+     * failed, enlarge the bucket array instead.  Quite arbitrarily, we try
+     * this when fill factor > 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
 
     return ct;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2317e8be6b..1a49d576fa 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -80,6 +80,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/memutils.h"
@@ -2113,6 +2114,28 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"cache_memory_target", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum syscache size to keep."),
+            gettext_noop("Cache is not pruned before exceeding this size."),
+            GUC_UNIT_KB
+        },
+        &cache_memory_target,
+        0, 0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum unused duration of cache entries before removal."),
+            gettext_noop("Cache entries that live unused for longer than this seconds are considered to be
removed."),
+            GUC_UNIT_S
+        },
+        &cache_prune_min_age,
+        600, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4e61bc6521..c59dd898ac 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -126,6 +126,8 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#cache_memory_target = 0kB    # in kB
+#cache_prune_min_age = 600s    # -1 disables pruning
 #max_stack_depth = 2MB            # min 100kB
 #dynamic_shared_memory_type = posix    # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 7b22f9c7bc..ace4178619 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,7 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
+    int            cc_tupsize;        /* total amount of catcache tuples */
 
     /*
      * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,7 +121,9 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
-
+    int            naccess;        /* # of access to this entry, up to 2  */
+    TimestampTz    lastaccess;        /* approx. timestamp of the last usage */
+    int            size;            /* palloc'ed size of this tuple */
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
      * catcache is list-searched with varying numbers of keys, we may have to
@@ -189,6 +193,28 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLPMPORT'ed */
+extern int cache_prune_min_age;
+extern int cache_memory_target;
+
+/* to use as access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/*
+ * SetCatCacheClock - set timestamp for catcache access record
+ */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
+static inline TimestampTz
+GetCatCacheClock(void)
+{
+    return catcacheclock;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.16.3

From c565131cf9db0d0b6e475a101ec247bbfc2df8ab Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 16 Oct 2018 15:48:28 +0900
Subject: [PATCH 2/3] Syscache usage tracking feature.

Collects syscache usage statistics and shows them using the view
pg_stat_syscache. The feature is controlled by the GUC variable
track_syscache_usage_interval.
---
 doc/src/sgml/config.sgml                      |  15 ++
 src/backend/catalog/system_views.sql          |  18 +++
 src/backend/postmaster/pgstat.c               | 206 ++++++++++++++++++++++++--
 src/backend/tcop/postgres.c                   |  23 +++
 src/backend/utils/adt/pgstatfuncs.c           | 136 +++++++++++++++++
 src/backend/utils/cache/catcache.c            | 115 ++++++++++----
 src/backend/utils/cache/syscache.c            |  24 +++
 src/backend/utils/init/globals.c              |   1 +
 src/backend/utils/init/postinit.c             |  11 ++
 src/backend/utils/misc/guc.c                  |  10 ++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/catalog/pg_proc.dat               |   9 ++
 src/include/miscadmin.h                       |   1 +
 src/include/pgstat.h                          |   7 +-
 src/include/utils/catcache.h                  |   9 +-
 src/include/utils/syscache.h                  |  19 +++
 src/include/utils/timeout.h                   |   1 +
 17 files changed, 562 insertions(+), 44 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c3133d742b..976a505205 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6106,6 +6106,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-syscache-usage-interval" xreflabel="track_syscache_usage_interval">
+      <term><varname>track_syscache_usage_interval</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>track_syscache_usage_interval</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the interval, in milliseconds, at which system cache usage
+        statistics are collected. The default is 0, which disables collection.
+        Only superusers can change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-io-timing" xreflabel="track_io_timing">
       <term><varname>track_io_timing</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a03b005f73..6cd19c8ecb 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -903,6 +903,23 @@ CREATE VIEW pg_stat_progress_vacuum AS
     FROM pg_stat_get_progress_info('VACUUM') AS S
         LEFT JOIN pg_database D ON S.datid = D.oid;
 
+CREATE VIEW pg_stat_syscache AS
+    SELECT
+        S.pid                AS pid,
+        S.relid::regclass    AS relname,
+        S.indid::regclass    AS cache_name,
+        S.size                AS size,
+        S.ntup                AS ntuples,
+        S.searches            AS searches,
+        S.hits                AS hits,
+        S.neg_hits            AS neg_hits,
+        S.ageclass            AS ageclass,
+        S.nentries            AS nentries,
+        S.last_update        AS last_update
+    FROM pg_stat_activity A
+    JOIN LATERAL (SELECT A.pid, * FROM pg_get_syscache_stats(A.pid)) S
+        ON (A.pid = S.pid);
+
 CREATE VIEW pg_user_mappings AS
     SELECT
         U.oid       AS umid,
@@ -1176,6 +1193,7 @@ GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_get_syscache_stats(int) TO pg_monitor;
 
 GRANT pg_read_all_settings TO pg_monitor;
 GRANT pg_read_all_stats TO pg_monitor;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8a5b2b3b42..572d181b75 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -66,6 +66,7 @@
 #include "utils/ps_status.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 #include "utils/tqual.h"
 
@@ -125,6 +126,7 @@
 bool        pgstat_track_activities = false;
 bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
+int            pgstat_track_syscache_usage_interval = 0;
 int            pgstat_track_activity_query_size = 1024;
 
 /* ----------
@@ -237,6 +239,11 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
+/* bitmap symbols to specify target file types to remove */
+#define PGSTAT_REMFILE_DBSTAT    1        /* remove only database stats files */
+#define PGSTAT_REMFILE_SYSCACHE    2        /* remove only syscache stats files */
+#define PGSTAT_REMFILE_ALL        3        /* remove both type of files */
+
 /*
  * Info about current "snapshot" of stats file
  */
@@ -631,10 +638,13 @@ startup_failed:
 }
 
 /*
- * subroutine for pgstat_reset_all
+ * remove stats files
+ *
+ * clean up stats files in the specified directory. target is one of
+ * PGSTAT_REMFILE_DBSTAT/SYSCACHE/ALL and restricts which files to remove.
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+pgstat_reset_remove_files(const char *directory, int target)
 {
     DIR           *dir;
     struct dirent *entry;
@@ -645,25 +655,39 @@ pgstat_reset_remove_files(const char *directory)
     {
         int            nchars;
         Oid            tmp_oid;
+        int            filetype = 0;
 
         /*
          * Skip directory entries that don't match the file names we write.
          * See get_dbstat_filename for the database-specific pattern.
          */
         if (strncmp(entry->d_name, "global.", 7) == 0)
+        {
+            filetype = PGSTAT_REMFILE_DBSTAT;
             nchars = 7;
+        }
         else
         {
+            char head[2];
+            
             nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
+            (void) sscanf(entry->d_name, "%c%c_%u.%n",
+                          head, head + 1, &tmp_oid, &nchars);
+
             /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
+            if (nchars < 3 || !isdigit(entry->d_name[3]))
                 continue;
+
+            if  (strncmp(head, "db", 2) == 0)
+                filetype = PGSTAT_REMFILE_DBSTAT;
+            else if (strncmp(head, "cc", 2) == 0)
+                filetype = PGSTAT_REMFILE_SYSCACHE;
         }
 
+        /* skip if this is not a target */
+        if ((filetype & target) == 0)
+            continue;
+
         if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
             strcmp(entry->d_name + nchars, "stat") != 0)
             continue;
@@ -684,8 +708,9 @@ pgstat_reset_remove_files(const char *directory)
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_ALL);
+    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY,
+                              PGSTAT_REMFILE_ALL);
 }
 
 #ifdef EXEC_BACKEND
@@ -4290,6 +4315,9 @@ PgstatCollectorMain(int argc, char *argv[])
     pgStatRunningInCollector = true;
     pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
 
+    /* Remove left-over syscache stats files */
+    pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_SYSCACHE);
+
     /*
      * Loop to process messages until we get SIGQUIT or detect ungraceful
      * death of our parent postmaster.
@@ -6380,3 +6408,163 @@ pgstat_clip_activity(const char *raw_activity)
 
     return activity;
 }
+
+/*
+ * return the filename for a syscache stat file; filename is the output
+ * buffer, of length len.
+ */
+void
+pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid,
+                                 char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/cc_%u.%s",
+                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
+                       pgstat_stat_directory,
+                       backendid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
+}
+
+/*
+ * pgstat_write_syscache_stats() -
+ *        Write the syscache statistics files.
+ *
+ * If 'force' is false, this function skips writing a file and returns the
+ * time remaining in the current interval, in milliseconds. If 'force' is true,
+ * it writes a file regardless of the remaining time and resets the interval.
+ */
+long
+pgstat_write_syscache_stats(bool force)
+{
+    static TimestampTz last_report = 0;
+    TimestampTz now;
+    long elapsed;
+    long secs;
+    int     usecs;
+    int    cacheId;
+    FILE    *fpout;
+    char    statfile[MAXPGPATH];
+    char    tmpfile[MAXPGPATH];
+
+    /* Return if we don't want it */
+    if (!force && pgstat_track_syscache_usage_interval <= 0)
+        return 0;
+
+    
+    /* Check the elapsed time against the reporting interval */
+    now = GetCurrentTransactionStopTimestamp();
+    TimestampDifference(last_report, now, &secs, &usecs);
+    elapsed = secs * 1000 + usecs / 1000;
+
+    if (!force && elapsed < pgstat_track_syscache_usage_interval)
+    {
+        /* not yet the time, inform the remaining time to the caller */
+        return pgstat_track_syscache_usage_interval - elapsed;
+    }
+
+    /* now write the file */
+    last_report = now;
+
+    pgstat_get_syscachestat_filename(false, true,
+                                     MyBackendId, tmpfile, MAXPGPATH);
+    pgstat_get_syscachestat_filename(false, false,
+                                     MyBackendId, statfile, MAXPGPATH);
+
+    /*
+     * This function can be called from ProcessInterrupts(). Hold interrupts
+     * to avoid recursive entry.
+     */
+    HOLD_INTERRUPTS();
+
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        /*
+         * Failure writing this file is not critical. Just skip this time and
+         * tell caller to wait for the next interval.
+         */
+        RESUME_INTERRUPTS();
+        return pgstat_track_syscache_usage_interval;
+    }
+
+    /* write out every catcache stats */
+    for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++)
+    {
+        SysCacheStats *stats;
+        
+        stats = SysCacheGetStats(cacheId);
+        Assert (stats);
+
+        /* write error is checked later using ferror() */
+        fputc('T', fpout);
+        (void)fwrite(&cacheId, sizeof(int), 1, fpout);
+        (void)fwrite(&last_report, sizeof(TimestampTz), 1, fpout);
+        (void)fwrite(stats, sizeof(*stats), 1, fpout);
+    }
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write syscache statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close syscache statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename syscache statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+
+    RESUME_INTERRUPTS();
+    return 0;
+}
+
+/*
+ * GUC assignment callback for track_syscache_usage_interval.
+ *
+ * Write a statistics file immediately when syscache statistics tracking is
+ * turned on, and remove it as soon as tracking is turned off.
+ */
+void
+pgstat_track_syscache_assign_hook(int newval, void *extra)
+{
+    if (newval > 0)
+    {
+        /*
+         * Immediately create a stats file. It's safe since we're not in the
+         * middle of accessing the syscache.
+         */
+        pgstat_write_syscache_stats(true);
+    }
+    else
+    {
+        /* Turned off, immediately remove the statsfile */
+        char    fname[MAXPGPATH];
+
+        pgstat_get_syscachestat_filename(false, false, MyBackendId,
+                                         fname, MAXPGPATH);
+        unlink(fname);        /* we don't care about the result */
+    }
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index e4c6e3d406..c68b857c0e 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3121,6 +3121,12 @@ ProcessInterrupts(void)
 
     }
 
+    if (IdleSyscacheStatsUpdateTimeoutPending)
+    {
+        IdleSyscacheStatsUpdateTimeoutPending = false;
+        pgstat_write_syscache_stats(true);
+    }
+
     if (ParallelMessagePending)
         HandleParallelMessages();
 }
@@ -3697,6 +3703,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_catcache_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4137,9 +4144,19 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long timeout;
+
                 ProcessCompletedNotifies();
                 pgstat_report_stat(false);
 
+                timeout = pgstat_write_syscache_stats(false);
+
+                if (timeout > 0)
+                {
+                    disable_idle_catcache_update_timeout = true;
+                    enable_timeout_after(IDLE_CATCACHE_UPDATE_TIMEOUT,
+                                         timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4182,6 +4199,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_catcache_update_timeout)
+        {
+            disable_timeout(IDLE_CATCACHE_UPDATE_TIMEOUT, false);
+            disable_idle_catcache_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index e95e347184..27df8cf825 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -14,6 +14,8 @@
  */
 #include "postgres.h"
 
+#include <sys/stat.h>
+
 #include "access/htup_details.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
@@ -28,6 +30,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/inet.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
@@ -1882,3 +1885,136 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     PG_RETURN_DATUM(HeapTupleGetDatum(
                                       heap_form_tuple(tupdesc, values, nulls)));
 }
+
+Datum
+pgstat_get_syscache_stats(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SYSCACHE_SIZE 10
+    int                    pid     = PG_GETARG_INT32(0);
+    ReturnSetInfo       *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+    TupleDesc            tupdesc;
+    Tuplestorestate    *tupstore;
+    MemoryContext        per_query_ctx;
+    MemoryContext        oldcontext;
+    PgBackendStatus       *beentry;
+    int                    beid;
+    char                fname[MAXPGPATH];
+    FILE                  *fpin;
+    char c;
+
+    if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("set-valued function called in context that cannot accept a set")));
+    if (!(rsinfo->allowedModes & SFRM_Materialize))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("materialize mode required, but it is not " \
+                        "allowed in this context")));
+
+    /* Build a tuple descriptor for our result type */
+    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+        elog(ERROR, "return type must be a row type");
+    
+
+    per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+
+    oldcontext = MemoryContextSwitchTo(per_query_ctx);
+    tupstore = tuplestore_begin_heap(true, false, work_mem);
+    rsinfo->returnMode = SFRM_Materialize;
+    rsinfo->setResult = tupstore;
+    rsinfo->setDesc = tupdesc;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    /* find beentry for given pid*/
+    beentry = NULL;
+    for (beid = 1;
+         (beentry = pgstat_fetch_stat_beentry(beid)) &&
+             beentry->st_procpid != pid ;
+         beid++);
+
+    /*
+     * We silently return an empty result on failure or insufficient privileges.
+     */
+    if (!beentry ||
+        (!has_privs_of_role(GetUserId(), beentry->st_userid) &&
+         !is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS)))
+        goto no_data;
+
+    pgstat_get_syscachestat_filename(false, false, beid, fname, MAXPGPATH);
+
+    if ((fpin = AllocateFile(fname, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(WARNING,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            fname)));
+        /* also return empty on no statistics file */
+        goto no_data;
+    }
+
+    /* read the statistics file into tuplestore */
+    while ((c = fgetc(fpin)) == 'T')
+    {
+        TimestampTz last_update;
+        SysCacheStats stats;
+        int cacheid;
+        Datum values[PG_GET_SYSCACHE_SIZE];
+        bool nulls[PG_GET_SYSCACHE_SIZE];
+        Datum datums[SYSCACHE_STATS_NAGECLASSES];
+        ArrayType *arr;
+        int i, j;
+
+        fread(&cacheid, sizeof(int), 1, fpin);
+        fread(&last_update, sizeof(TimestampTz), 1, fpin);
+        if (fread(&stats, 1, sizeof(stats), fpin) != sizeof(stats))
+        {
+            ereport(WARNING, 
+                    (errmsg("corrupted syscache statistics file \"%s\"",
+                            fname)));
+            goto no_data;
+        }
+
+        memset(nulls, 0, sizeof(nulls));
+
+        i = 0;
+        values[i++] = ObjectIdGetDatum(stats.reloid);
+        values[i++] = ObjectIdGetDatum(stats.indoid);
+        values[i++] = Int64GetDatum(stats.size);
+        values[i++] = Int64GetDatum(stats.ntuples);
+        values[i++] = Int64GetDatum(stats.nsearches);
+        values[i++] = Int64GetDatum(stats.nhits);
+        values[i++] = Int64GetDatum(stats.nneg_hits);
+
+        for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++)
+            datums[j] = Int32GetDatum((int32) stats.ageclasses[j]);
+
+        arr = construct_array(datums, SYSCACHE_STATS_NAGECLASSES,
+                              INT4OID, sizeof(int32), true, 'i');
+        values[i++] = PointerGetDatum(arr);
+
+        for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++)
+            datums[j] = Int32GetDatum((int32) stats.nclass_entries[j]);
+        arr = construct_array(datums, SYSCACHE_STATS_NAGECLASSES,
+                              INT4OID, sizeof(int32), true, 'i');
+        values[i++] = PointerGetDatum(arr);
+
+        values[i++] = TimestampTzGetDatum(last_update);
+
+        Assert (i == PG_GET_SYSCACHE_SIZE);
+
+        tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+    }
+
+    /* check for the end of file. abandon the result if file is broken */
+    if (c != 'E' || fgetc(fpin) != EOF)
+        tuplestore_clear(tupstore);
+
+    FreeFile(fpin);
+
+no_data:
+    tuplestore_donestoring(tupstore);
+    return (Datum) 0;
+}
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 9be463311d..31e19541a6 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -89,6 +89,10 @@ static CatCacheHeader *CacheHdr = NULL;
 /* Timestamp used for any operation on caches. */
 TimestampTz    catcacheclock = 0;
 
+/* age classes for pruning */
+static double ageclass[SYSCACHE_STATS_NAGECLASSES]
+    = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -627,9 +631,7 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue)
             else
                 CatCacheRemoveCTup(cache, ct);
             CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated");
-#ifdef CATCACHE_STATS
             cache->cc_invals++;
-#endif
             /* could be multiple matches, so keep looking! */
         }
     }
@@ -705,9 +707,7 @@ ResetCatalogCache(CatCache *cache)
             }
             else
                 CatCacheRemoveCTup(cache, ct);
-#ifdef CATCACHE_STATS
             cache->cc_invals++;
-#endif
         }
     }
 }
@@ -914,10 +914,11 @@ CatCacheCleanupOldEntries(CatCache *cp)
      * cache_prune_min_age. The index of nremoved_entry is the value of the
      * clock-sweep counter, which takes from 0 up to 2.
      */
-    double        ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
-    int            nentries[] = {0, 0, 0, 0, 0, 0};
+    int            nentries[SYSCACHE_STATS_NAGECLASSES] = {0, 0, 0, 0, 0, 0};
     int            nremoved_entry[3] = {0, 0, 0};
     int            j;
+
+    Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0);
 #endif
 
     /* Return immediately if no pruning is wanted */
@@ -931,7 +932,11 @@ CatCacheCleanupOldEntries(CatCache *cp)
     if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L)
         return false;
     
-    /* Search the whole hash for entries to remove */
+    /*
+     * Search the whole hash for entries to remove. This is quite a
+     * time-consuming task during a catcache lookup, but acceptable since we
+     * are about to expand the hash table anyway.
+     */
     for (i = 0; i < cp->cc_nbuckets; i++)
     {
         dlist_mutable_iter iter;
@@ -944,21 +949,21 @@ CatCacheCleanupOldEntries(CatCache *cp)
 
 
             /*
-             * Calculate the duration from the time of the last access to the
-             * "current" time. Since catcacheclock is not advanced within a
-             * transaction, the entries that are accessed within the current
-             * transaction won't be pruned.
+             * Calculate the duration from the time of the last access to
+             * the "current" time. Since catcacheclock is not advanced within
+             * a transaction, the entries that are accessed within the current
+             * transaction always get 0 as the result.
              */
             TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
 
 #ifdef CATCACHE_STATS
             /* count catcache entries for each age class */
             ntotal++;
-            for (j = 0 ;
-                 ageclass[j] != 0.0 &&
-                     entry_age > cache_prune_min_age * ageclass[j] ;
-                 j++);
-            if (ageclass[j] == 0.0) j--;
+
+            j = 0;
+            while (j < SYSCACHE_STATS_NAGECLASSES - 1 &&
+                   entry_age > cache_prune_min_age * ageclass[j])
+                j++;
             nentries[j]++;
 #endif
 
@@ -991,14 +996,17 @@ CatCacheCleanupOldEntries(CatCache *cp)
     }
 
 #ifdef CATCACHE_STATS
+    StaticAssertStmt(SYSCACHE_STATS_NAGECLASSES == 6,
+                     "number of syscache age class must be 6");
     ereport(DEBUG1,
-            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)",
+            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d, rest:%d) naccessed(0:%d,1:%d, 2:%d)",
                      nremoved, ntotal,
                      ageclass[0] * cache_prune_min_age, nentries[0],
                      ageclass[1] * cache_prune_min_age, nentries[1],
                      ageclass[2] * cache_prune_min_age, nentries[2],
                      ageclass[3] * cache_prune_min_age, nentries[3],
                      ageclass[4] * cache_prune_min_age, nentries[4],
+                     nentries[5],
                      nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]),
              errhidestmt(true)));
 #endif
@@ -1375,9 +1383,7 @@ SearchCatCacheInternal(CatCache *cache,
     if (unlikely(cache->cc_tupdesc == NULL))
         CatalogCacheInitializeCache(cache);
 
-#ifdef CATCACHE_STATS
     cache->cc_searches++;
-#endif
 
     /* Initialize local parameter array */
     arguments[0] = v1;
@@ -1437,9 +1443,7 @@ SearchCatCacheInternal(CatCache *cache,
             CACHE3_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d",
                         cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
             cache->cc_hits++;
-#endif
 
             return &ct->tuple;
         }
@@ -1448,9 +1452,7 @@ SearchCatCacheInternal(CatCache *cache,
             CACHE3_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d",
                         cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
             cache->cc_neg_hits++;
-#endif
 
             return NULL;
         }
@@ -1578,9 +1580,7 @@ SearchCatCacheMiss(CatCache *cache,
     CACHE3_elog(DEBUG2, "SearchCatCache(%s): put in bucket %d",
                 cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
     cache->cc_newloads++;
-#endif
 
     return &ct->tuple;
 }
@@ -1691,9 +1691,7 @@ SearchCatCacheList(CatCache *cache,
 
     Assert(nkeys > 0 && nkeys < cache->cc_nkeys);
 
-#ifdef CATCACHE_STATS
     cache->cc_lsearches++;
-#endif
 
     /* Initialize local parameter array */
     arguments[0] = v1;
@@ -1750,9 +1748,7 @@ SearchCatCacheList(CatCache *cache,
         CACHE2_elog(DEBUG2, "SearchCatCacheList(%s): found list",
                     cache->cc_relname);
 
-#ifdef CATCACHE_STATS
         cache->cc_lhits++;
-#endif
 
         return cl;
     }
@@ -2270,3 +2266,64 @@ PrintCatCacheListLeakWarning(CatCList *list)
          list->my_cache->cc_relname, list->my_cache->id,
          list, list->refcount);
 }
+
+/*
+ * CatCacheGetStats - fill in SysCacheStats struct.
+ *
+ * This is a support routine for SysCacheGetStats and fills in most of the
+ * result. The classification here uses the same criteria as
+ * CatCacheCleanupOldEntries().
+ */
+void
+CatCacheGetStats(CatCache *cache, SysCacheStats *stats)
+{
+    int    i, j;
+
+    Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0);
+
+    /* fill in the stats struct */
+    stats->size = cache->cc_tupsize + cache->cc_nbuckets * sizeof(dlist_head);
+    stats->ntuples = cache->cc_ntup;
+    stats->nsearches = cache->cc_searches;
+    stats->nhits = cache->cc_hits;
+    stats->nneg_hits = cache->cc_neg_hits;
+
+    /* cache_prune_min_age can be changed within a session, so fill it in every time */
+    for (i = 0 ; i < SYSCACHE_STATS_NAGECLASSES ; i++)
+        stats->ageclasses[i] = (int) (cache_prune_min_age * ageclass[i]);
+
+    /*
+     * The nth element of nclass_entries stores the number of cache entries
+     * that have lived unaccessed for the corresponding multiple (in ageclass)
+     * of cache_prune_min_age.
+     */
+    memset(stats->nclass_entries, 0, sizeof(int) * SYSCACHE_STATS_NAGECLASSES);
+
+    /* Scan the whole hash */
+    for (i = 0; i < cache->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cache->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+            /*
+             * Calculate the duration from the time of the last access to
+             * the "current" time. Since catcacheclock is not advanced within
+             * a transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+            j = 0;
+            while (j < SYSCACHE_STATS_NAGECLASSES - 1 &&
+                   entry_age > stats->ageclasses[j])
+                j++;
+
+            stats->nclass_entries[j]++;
+        }
+    }
+}
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index 2b381782a3..9800bfda34 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -20,6 +20,9 @@
  */
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "catalog/indexing.h"
@@ -1529,6 +1532,27 @@ RelationSupportsSysCache(Oid relid)
     return false;
 }
 
+/*
+ * SysCacheGetStats - returns stats of specified syscache
+ *
+ * This routine returns the address of its local static memory.
+ */
+SysCacheStats *
+SysCacheGetStats(int cacheId)
+{
+    static SysCacheStats stats;
+
+    Assert(cacheId >=0 && cacheId < SysCacheSize);
+
+    memset(&stats, 0, sizeof(stats));
+
+    stats.reloid = cacheinfo[cacheId].reloid;
+    stats.indoid = cacheinfo[cacheId].indoid;
+
+    CatCacheGetStats(SysCache[cacheId], &stats);
+
+    return &stats;
+}
 
 /*
  * OID comparator for pg_qsort
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 5971310aab..234ae3e157 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 4f1d2a0d28..000f402a03 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleSyscacheStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_CATCACHE_UPDATE_TIMEOUT,
+                        IdleSyscacheStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1239,6 +1242,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleSyscacheStatsUpdateTimeoutHandler(void)
+{
+    IdleSyscacheStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1a49d576fa..c4a1616136 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3077,6 +3077,16 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"track_syscache_usage_interval", PGC_SUSET, STATS_COLLECTOR,
+            gettext_noop("Sets the interval between syscache usage collection, in milliseconds. Zero disables syscache
usagetracking."),
 
+            NULL
+        },
+        &pgstat_track_syscache_usage_interval,
+        0, 0, INT_MAX / 2,
+        NULL, NULL, NULL
+    },
+
     {
         {"gin_pending_list_limit", PGC_USERSET, CLIENT_CONN_STATEMENT,
             gettext_noop("Sets the maximum size of the pending list for GIN index."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c59dd898ac..9b3ccc5e5b 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -514,6 +514,7 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
+#track_syscache_usage_interval = 0    # zero disables tracking
 #stats_temp_directory = 'pg_stat_tmp'
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index cff58ed2d8..86c84c7cf4 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -9603,6 +9603,15 @@
   proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
  proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}',
   prosrc => 'pg_get_replication_slots' },
+{ oid => '3423',
+  descr => 'syscache statistics',
+  proname => 'pg_get_syscache_stats', prorows => '100', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', prorettype => 'record',
+  proargtypes => 'int4',
+  proallargtypes => '{int4,oid,oid,int8,int8,int8,int8,int8,_int4,_int4,timestamptz}',
+  proargmodes => '{i,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,nentries,last_update}',
+  prosrc => 'pgstat_get_syscache_stats' },
 { oid => '3786', descr => 'set up a logical replication slot',
   proname => 'pg_create_logical_replication_slot', provolatile => 'v',
   proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 69f356f8cd..c056d9a39f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -81,6 +81,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d59c24ae23..b64bc499e4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1133,6 +1133,7 @@ extern bool pgstat_track_activities;
 extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
+extern int    pgstat_track_syscache_usage_interval;
 extern char *pgstat_stat_directory;
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
@@ -1217,7 +1218,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
-
+extern void pgstat_get_syscachestat_filename(bool permanent,
+                    bool tempname, int backendid, char *filename, int len);
 /* ----------
  * pgstat_report_wait_start() -
  *
@@ -1352,5 +1354,6 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
-
+extern long pgstat_write_syscache_stats(bool force);
+extern void pgstat_track_syscache_assign_hook(int newval, void *extra);
 #endif                            /* PGSTAT_H */
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index ace4178619..721948b4cc 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -65,10 +65,8 @@ typedef struct catcache
     int            cc_tupsize;        /* total amount of catcache tuples */
 
     /*
-     * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
-     * doesn't break ABI for other modules
+     * Statistics entries
      */
-#ifdef CATCACHE_STATS
     long        cc_searches;    /* total # searches against this cache */
     long        cc_hits;        /* # of matches against existing entry */
     long        cc_neg_hits;    /* # of matches against negative entry */
@@ -81,7 +79,6 @@ typedef struct catcache
     long        cc_invals;        /* # of entries invalidated from cache */
     long        cc_lsearches;    /* total # list-searches */
     long        cc_lhits;        /* # of matches against existing lists */
-#endif
 } CatCache;
 
 
@@ -254,4 +251,8 @@ extern void PrepareToInvalidateCacheTuple(Relation relation,
 extern void PrintCatCacheLeakWarning(HeapTuple tuple);
 extern void PrintCatCacheListLeakWarning(CatCList *list);
 
+/* defined in syscache.h */
+typedef struct syscachestats SysCacheStats;
+extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats);
+
 #endif                            /* CATCACHE_H */
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index 4f333586ee..0cd7cc4394 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -112,6 +112,24 @@ enum SysCacheIdentifier
 #define SysCacheSize (USERMAPPINGUSERSERVER + 1)
 };
 
+#define SYSCACHE_STATS_NAGECLASSES 6
+/* Struct for catcache tracking information */
+typedef struct syscachestats
+{
+    Oid        reloid;            /* target relation */
+    Oid        indoid;            /* index */
+    size_t    size;            /* size of the catcache */
+    int        ntuples;        /* number of tuples resides in the catcache */
+    int        nsearches;        /* number of searches */
+    int        nhits;            /* number of cache hits */
+    int        nneg_hits;        /* number of negative cache hits */
+    /* age classes in seconds */
+    int        ageclasses[SYSCACHE_STATS_NAGECLASSES];
+    /* number of tuples fall into the corresponding age class */
+    int        nclass_entries[SYSCACHE_STATS_NAGECLASSES];
+} SysCacheStats;
+
+
 extern void InitCatalogCache(void);
 extern void InitCatalogCachePhase2(void);
 
@@ -164,6 +182,7 @@ extern void SysCacheInvalidate(int cacheId, uint32 hashValue);
 extern bool RelationInvalidatesSnapshotsOnly(Oid relid);
 extern bool RelationHasSysCache(Oid relid);
 extern bool RelationSupportsSysCache(Oid relid);
+extern SysCacheStats *SysCacheGetStats(int cacheId);
 
 /*
  * The use of the macros below rather than direct calls to the corresponding
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index dcc7307c16..e2a9c33f14 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_CATCACHE_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.16.3
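
(A usage sketch for the statistics feature in 0002, assuming the patch is
applied.  track_syscache_usage_interval is PGC_SUSET and counted in
milliseconds, so a superuser is needed:)

ALTER SYSTEM SET track_syscache_usage_interval = 10000;  -- 0 (default) disables it
SELECT pg_reload_conf();

-- largest syscaches across all backends, through the view added by 0002
SELECT pid, relname, cache_name, size, ntuples, hits, neg_hits, last_update
  FROM pg_stat_syscache
 ORDER BY size DESC
 LIMIT 5;

-- or read a single backend directly through the underlying function
SELECT * FROM pg_get_syscache_stats(pg_backend_pid());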

From 2a7a6744c61a61a8dac2fb54f948b96d58141778 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 16 Oct 2018 21:31:22 +0900
Subject: [PATCH 3/3] Remote GUC setting feature and non-xact GUC config.

This adds two features at once (they will be split later).

One is a non-transactional GUC setting feature. It allows a GUC variable
set with the action GUC_ACTION_NONXACT (the name requires consideration)
to survive beyond a rollback. This is required for remote GUC setting to
work sanely: without it, a remotely-set value within a transaction would
disappear as part of the rollback. The only local interface for the
NONXACT action is set_config(name, value, is_local=false,
is_nonxact=true).

The second is a remote GUC setting feature. It uses ProcSignal to notify
the target server.
---
 doc/src/sgml/config.sgml             |   4 +
 doc/src/sgml/func.sgml               |  30 ++
 src/backend/catalog/system_views.sql |   7 +-
 src/backend/postmaster/pgstat.c      |   3 +
 src/backend/storage/ipc/ipci.c       |   2 +
 src/backend/storage/ipc/procsignal.c |   4 +
 src/backend/tcop/postgres.c          |  10 +
 src/backend/utils/misc/README        |  26 +-
 src/backend/utils/misc/guc.c         | 619 +++++++++++++++++++++++++++++++++--
 src/include/catalog/pg_proc.dat      |  10 +-
 src/include/pgstat.h                 |   3 +-
 src/include/storage/procsignal.h     |   3 +
 src/include/utils/guc.h              |  13 +-
 src/include/utils/guc_tables.h       |   5 +-
 src/test/regress/expected/guc.out    | 223 +++++++++++++
 src/test/regress/expected/rules.out  |  26 +-
 src/test/regress/sql/guc.sql         |  88 +++++
 17 files changed, 1027 insertions(+), 49 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 976a505205..34f7a08bae 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -281,6 +281,10 @@ UPDATE pg_settings SET setting = reset_val WHERE name = 'configuration_parameter
      </listitem>
     </itemizedlist>
 
+    <para>
+     Values in other sessions can also be set using the SQL
+     function <function>pg_set_backend_config</function>.
+    </para>
    </sect2>
 
    <sect2>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5193df3366..b97f4e5daa 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -18657,6 +18657,20 @@ SELECT collation for ('foo' COLLATE "de_DE");
        <entry><type>text</type></entry>
        <entry>set parameter and return new value</entry>
       </row>
+      <row>
+       <entry>
+        <indexterm>
+         <primary>pg_set_backend_config</primary>
+        </indexterm>
+        <literal><function>pg_set_backend_config(
+                            <parameter>process_id</parameter>,
+                            <parameter>setting_name</parameter>,
+                            <parameter>new_value</parameter>)
+                            </function></literal>
+       </entry>
+       <entry><type>bool</type></entry>
+       <entry>set parameter on another session</entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
@@ -18711,6 +18725,22 @@ SELECT set_config('log_statement_stats', 'off', false);
 ------------
  off
 (1 row)
+</programlisting>
+   </para>
+
+   <para>
+    <function>pg_set_backend_config</function> sets the parameter
+    <parameter>setting_name</parameter> to
+    <parameter>new_value</parameter> in the session with PID
+    <parameter>process_id</parameter>. The setting is always session-local;
+    the function returns true on success.  An example:
+<programlisting>
+SELECT pg_set_backend_config(2134, 'work_mem', '16MB');
+
+pg_set_backend_config
+------------
+ t
+(1 row)
 </programlisting>
    </para>
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 6cd19c8ecb..6403a461e7 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -474,7 +474,7 @@ CREATE VIEW pg_settings AS
 CREATE RULE pg_settings_u AS
     ON UPDATE TO pg_settings
     WHERE new.name = old.name DO
-    SELECT set_config(old.name, new.setting, 'f');
+    SELECT set_config(old.name, new.setting, 'f', 'f');
 
 CREATE RULE pg_settings_n AS
     ON UPDATE TO pg_settings
@@ -1044,6 +1044,11 @@ CREATE OR REPLACE FUNCTION pg_stop_backup (
   RETURNS SETOF record STRICT VOLATILE LANGUAGE internal as 'pg_stop_backup_v2'
   PARALLEL RESTRICTED;
 
+CREATE OR REPLACE FUNCTION set_config (
+        setting_name text, new_value text, is_local boolean, is_nonxact boolean DEFAULT false)
+        RETURNS text STRICT VOLATILE LANGUAGE internal AS 'set_config_by_name'
+        PARALLEL UNSAFE;
+
 -- legacy definition for compatibility with 9.3
 CREATE OR REPLACE FUNCTION
   json_populate_record(base anyelement, from_json json, use_json_as_text boolean DEFAULT false)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 572d181b75..80c60eefca 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3708,6 +3708,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
         case WAIT_EVENT_SYNC_REP:
             event_name = "SyncRep";
             break;
+        case WAIT_EVENT_REMOTE_GUC:
+            event_name = "RemoteGUC";
+            break;
             /* no default case, so that compiler will warn */
     }
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0c86a581c0..03d526d12d 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -150,6 +150,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
         size = add_size(size, BackendRandomShmemSize());
+        size = add_size(size, GucShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -270,6 +271,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
     SyncScanShmemInit();
     AsyncShmemInit();
     BackendRandomShmemInit();
+    GucShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index b0dd7d1b37..b897c36bae 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -27,6 +27,7 @@
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "tcop/tcopprot.h"
+#include "utils/guc.h"
 
 
 /*
@@ -292,6 +293,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
     if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
         RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
+    if (CheckProcSignal(PROCSIG_REMOTE_GUC))
+        HandleRemoteGucSetInterrupt();
+
     SetLatch(MyLatch);
 
     latch_sigusr1_handler();
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index c68b857c0e..feee7bdbb1 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3129,6 +3129,10 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    /* We don't want to change GUC variables while running a query */
+    if (RemoteGucChangePending && DoingCommandRead)
+        HandleGucRemoteChanges();
 }
 
 
@@ -4165,6 +4169,12 @@ PostgresMain(int argc, char *argv[],
             send_ready_for_query = false;
         }
 
+        /*
+         * (2.5) Process pending work.
+         */
+        if (RemoteGucChangePending)
+            HandleGucRemoteChanges();
+
         /*
          * (2) Allow asynchronous signals to be executed immediately if they
          * come in while we are waiting for client input. (This must be
diff --git a/src/backend/utils/misc/README b/src/backend/utils/misc/README
index 6e294386f7..42ae6c1a8f 100644
--- a/src/backend/utils/misc/README
+++ b/src/backend/utils/misc/README
@@ -169,10 +169,14 @@ Entry to a function with a SET option:
 Plain SET command:
 
     If no stack entry of current level:
-        Push new stack entry w/prior value and state SET
+        Push new stack entry w/prior value and state SET or
+        push new stack entry w/o value and state NONXACT.
     else if stack entry's state is SAVE, SET, or LOCAL:
         change stack state to SET, don't change saved value
         (here we are forgetting effects of prior set action)
+    else if stack entry's state is NONXACT:
+        change stack state to NONXACT_SET, set the current value to
+        prior.
     else (entry must have state SET+LOCAL):
         discard its masked value, change state to SET
         (here we are forgetting effects of prior SET and SET LOCAL)
@@ -185,13 +189,20 @@ SET LOCAL command:
     else if stack entry's state is SAVE or LOCAL or SET+LOCAL:
         no change to stack entry
         (in SAVE case, SET LOCAL will be forgotten at func exit)
+    else if stack entry's state is NONXACT:
+        set current value to both prior and masked slots. set state
+        NONXACT+LOCAL.
     else (entry must have state SET):
         put current active into its masked slot, set state SET+LOCAL
     Now set new value.
 
+Setting by NONXACT action (no command exists):
+    Always blow away the existing stack, then create a new NONXACT entry.
+
 Transaction or subtransaction abort:
 
-    Pop stack entries, restoring prior value, until top < subxact depth
+    Pop stack entries, restoring prior value unless the stack entry's
+    state is NONXACT, until top < subxact depth
 
 Transaction or subtransaction commit (incl. successful function exit):
 
@@ -199,9 +210,9 @@ Transaction or subtransaction commit (incl. successful function exit):
 
         if entry's state is SAVE:
             pop, restoring prior value
-        else if level is 1 and entry's state is SET+LOCAL:
+        else if level is 1 and entry's state is SET+LOCAL or NONXACT+LOCAL:
             pop, restoring *masked* value
-        else if level is 1 and entry's state is SET:
+        else if level is 1 and entry's state is SET or NONXACT+SET:
             pop, discarding old value
         else if level is 1 and entry's state is LOCAL:
             pop, restoring prior value
@@ -210,9 +221,9 @@ Transaction or subtransaction commit (incl. successful function exit):
         else
             merge entries of level N-1 and N as specified below
 
-The merged entry will have level N-1 and prior = older prior, so easiest
-to keep older entry and free newer.  There are 12 possibilities since
-we already handled level N state = SAVE:
+The merged entry will have level N-1 and prior = older prior, so
+easiest to keep older entry and free newer.  Disregarding NONXACT,
+there are 12 possibilities since we already handled level N state = SAVE:
 
 N-1        N
 
@@ -232,6 +243,7 @@ SET+LOCAL    SET        discard top prior and second masked, state SET
 SET+LOCAL    LOCAL        discard top prior, no change to stack entry
 SET+LOCAL    SET+LOCAL    discard top prior, copy masked, state S+L
 
+(TODO: states involving NONXACT)
 
 RESET is executed like a SET, but using the reset_val as the desired new
 value.  (We do not provide a RESET LOCAL command, but SET LOCAL TO DEFAULT
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c4a1616136..2eed732d2a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -202,6 +202,37 @@ static ConfigVariable *ProcessConfigFileInternal(GucContext context,
                           bool applySettings, int elevel);
 
 
+/* Enum and struct to command GUC setting to another backend */
+typedef enum
+{
+    REMGUC_VACANT,
+    REMGUC_REQUEST,
+    REMGUC_INPROCESS,
+    REMGUC_DONE,
+    REMGUC_CANCELING,
+    REMGUC_CANCELED,
+} remote_guc_status;
+
+#define GUC_REMOTE_MAX_VALUE_LEN  1024        /* an arbitrary value */
+#define GUC_REMOTE_CANCEL_TIMEOUT 5000        /* in milliseconds */
+
+typedef struct
+{
+    remote_guc_status     state;
+    char name[NAMEDATALEN];
+    char value[GUC_REMOTE_MAX_VALUE_LEN];
+    int     sourcepid;
+    int     targetpid;
+    Oid     userid;
+    bool success;
+    volatile Latch *sender_latch;
+    LWLock    lock;
+} GucRemoteSetting;
+
+static GucRemoteSetting *remote_setting;
+
+volatile bool RemoteGucChangePending = false;
+
 /*
  * Options for enum values defined in this module.
  *
@@ -3084,7 +3115,7 @@ static struct config_int ConfigureNamesInt[] =
         },
         &pgstat_track_syscache_usage_interval,
         0, 0, INT_MAX / 2,
-        NULL, NULL, NULL
+        NULL, &pgstat_track_syscache_assign_hook, NULL
     },
 
     {
@@ -4491,7 +4522,6 @@ discard_stack_value(struct config_generic *gconf, config_var_value *val)
     set_extra_field(gconf, &(val->extra), NULL);
 }
 
-
 /*
  * Fetch the sorted array pointer (exported for help_config.c's use ONLY)
  */
@@ -5283,6 +5313,22 @@ push_old_value(struct config_generic *gconf, GucAction action)
 
     /* Do we already have a stack entry of the current nest level? */
     stack = gconf->stack;
+
+    /* A NONXACT action makes the existing stack useless */
+    if (action == GUC_ACTION_NONXACT)
+    {
+        while (stack)
+        {
+            GucStack *prev = stack->prev;
+
+            discard_stack_value(gconf, &stack->prior);
+            discard_stack_value(gconf, &stack->masked);
+            pfree(stack);
+            stack = prev;
+        }
+        stack = gconf->stack = NULL;
+    }
+
     if (stack && stack->nest_level >= GUCNestLevel)
     {
         /* Yes, so adjust its state if necessary */
@@ -5290,28 +5336,63 @@ push_old_value(struct config_generic *gconf, GucAction action)
         switch (action)
         {
             case GUC_ACTION_SET:
-                /* SET overrides any prior action at same nest level */
-                if (stack->state == GUC_SET_LOCAL)
+                if (stack->state == GUC_NONXACT)
                 {
-                    /* must discard old masked value */
-                    discard_stack_value(gconf, &stack->masked);
+                    /* NONXACT rolls back to the current value */
+                    stack->scontext = gconf->scontext;
+                    set_stack_value(gconf, &stack->prior);
+                    stack->state = GUC_NONXACT_SET;
                 }
-                stack->state = GUC_SET;
+                else 
+                {
+                    /* SET overrides other prior actions at same nest level */
+                    if (stack->state == GUC_SET_LOCAL)
+                    {
+                        /* must discard old masked value */
+                        discard_stack_value(gconf, &stack->masked);
+                    }
+                    stack->state = GUC_SET;
+                }
+
                 break;
+
             case GUC_ACTION_LOCAL:
                 if (stack->state == GUC_SET)
                 {
-                    /* SET followed by SET LOCAL, remember SET's value */
+                    /* SET followed by SET LOCAL, remember its value */
                     stack->masked_scontext = gconf->scontext;
                     set_stack_value(gconf, &stack->masked);
                     stack->state = GUC_SET_LOCAL;
                 }
+                else if (stack->state == GUC_NONXACT)
+                {
+                    /*
+                     * NONXACT followed by SET LOCAL, both prior and masked
+                     * are set to the current value
+                     */
+                    stack->scontext = gconf->scontext;
+                    set_stack_value(gconf, &stack->prior);
+                    stack->masked_scontext = stack->scontext;
+                    stack->masked = stack->prior;
+                    stack->state = GUC_NONXACT_LOCAL;
+                }
+                else if (stack->state == GUC_NONXACT_SET)
+                {
+                    /* NONXACT_SET followed by SET LOCAL, set masked */
+                    stack->masked_scontext = gconf->scontext;
+                    set_stack_value(gconf, &stack->masked);
+                    stack->state = GUC_NONXACT_LOCAL;
+                }
                 /* in all other cases, no change to stack entry */
                 break;
             case GUC_ACTION_SAVE:
                 /* Could only have a prior SAVE of same variable */
                 Assert(stack->state == GUC_SAVE);
                 break;
+
+            case GUC_ACTION_NONXACT:
+                Assert(false);
+                break;
         }
         Assert(guc_dirty);        /* must be set already */
         return;
@@ -5327,6 +5408,7 @@ push_old_value(struct config_generic *gconf, GucAction action)
 
     stack->prev = gconf->stack;
     stack->nest_level = GUCNestLevel;
+        
     switch (action)
     {
         case GUC_ACTION_SET:
@@ -5338,10 +5420,15 @@ push_old_value(struct config_generic *gconf, GucAction action)
         case GUC_ACTION_SAVE:
             stack->state = GUC_SAVE;
             break;
+        case GUC_ACTION_NONXACT:
+            stack->state = GUC_NONXACT;
+            break;
     }
     stack->source = gconf->source;
     stack->scontext = gconf->scontext;
-    set_stack_value(gconf, &stack->prior);
+
+    if (action != GUC_ACTION_NONXACT)
+        set_stack_value(gconf, &stack->prior);
 
     gconf->stack = stack;
 
@@ -5436,22 +5523,31 @@ AtEOXact_GUC(bool isCommit, int nestLevel)
              * stack entries to avoid leaking memory.  If we do set one of
              * those flags, unused fields will be cleaned up after restoring.
              */
-            if (!isCommit)        /* if abort, always restore prior value */
-                restorePrior = true;
+            if (!isCommit)
+            {
+                /* GUC_NONXACT doesn't roll back */
+                if (stack->state != GUC_NONXACT)
+                    restorePrior = true;
+            }
             else if (stack->state == GUC_SAVE)
                 restorePrior = true;
             else if (stack->nest_level == 1)
             {
                 /* transaction commit */
-                if (stack->state == GUC_SET_LOCAL)
+                if (stack->state == GUC_SET_LOCAL ||
+                    stack->state == GUC_NONXACT_LOCAL)
                     restoreMasked = true;
-                else if (stack->state == GUC_SET)
+                else if (stack->state == GUC_SET ||
+                         stack->state == GUC_NONXACT_SET)
                 {
                     /* we keep the current active value */
                     discard_stack_value(gconf, &stack->prior);
                 }
-                else            /* must be GUC_LOCAL */
+                else if (stack->state != GUC_NONXACT)
+                {
+                    /* must be GUC_LOCAL */
                     restorePrior = true;
+                }
             }
             else if (prev == NULL ||
                      prev->nest_level < stack->nest_level - 1)
@@ -5473,11 +5569,27 @@ AtEOXact_GUC(bool isCommit, int nestLevel)
                         break;
 
                     case GUC_SET:
-                        /* next level always becomes SET */
-                        discard_stack_value(gconf, &stack->prior);
-                        if (prev->state == GUC_SET_LOCAL)
+                        if (prev->state == GUC_SET ||
+                            prev->state == GUC_NONXACT_SET)
+                        {
+                            discard_stack_value(gconf, &stack->prior);
+                        }
+                        else if (prev->state == GUC_NONXACT)
+                        {
+                            prev->scontext = stack->scontext;
+                            prev->prior = stack->prior;
+                            prev->state = GUC_NONXACT_SET;
+                        }
+                        else if (prev->state == GUC_SET_LOCAL ||
+                                 prev->state == GUC_NONXACT_LOCAL)
+                        {
+                            discard_stack_value(gconf, &stack->prior);
                             discard_stack_value(gconf, &prev->masked);
-                        prev->state = GUC_SET;
+                            if (prev->state == GUC_SET_LOCAL)
+                                prev->state = GUC_SET;
+                            else
+                                prev->state = GUC_NONXACT_SET;
+                        }
                         break;
 
                     case GUC_LOCAL:
@@ -5488,6 +5600,16 @@ AtEOXact_GUC(bool isCommit, int nestLevel)
                             prev->masked = stack->prior;
                             prev->state = GUC_SET_LOCAL;
                         }
+                        else if (prev->state == GUC_NONXACT)
+                        {
+                            prev->prior = stack->masked;
+                            prev->scontext = stack->masked_scontext;
+                            prev->masked = stack->masked;
+                            prev->masked_scontext = stack->masked_scontext;
+                            discard_stack_value(gconf, &stack->prior);
+                            discard_stack_value(gconf, &stack->masked);
+                            prev->state = GUC_NONXACT_SET;
+                        }
                         else
                         {
                             /* else just forget this stack level */
@@ -5496,15 +5618,32 @@ AtEOXact_GUC(bool isCommit, int nestLevel)
                         break;
 
                     case GUC_SET_LOCAL:
-                        /* prior state at this level no longer wanted */
-                        discard_stack_value(gconf, &stack->prior);
-                        /* copy down the masked state */
-                        prev->masked_scontext = stack->masked_scontext;
-                        if (prev->state == GUC_SET_LOCAL)
-                            discard_stack_value(gconf, &prev->masked);
-                        prev->masked = stack->masked;
-                        prev->state = GUC_SET_LOCAL;
+                        if (prev->state == GUC_NONXACT)
+                        {
+                            prev->prior = stack->prior;
+                            prev->masked = stack->prior;
+                            discard_stack_value(gconf, &stack->prior);
+                            discard_stack_value(gconf, &stack->masked);
+                            prev->state = GUC_NONXACT_SET;
+                        }
+                        else if (prev->state != GUC_NONXACT_SET)
+                        {
+                            /* prior state at this level no longer wanted */
+                            discard_stack_value(gconf, &stack->prior);
+                            /* copy down the masked state */
+                            prev->masked_scontext = stack->masked_scontext;
+                            if (prev->state == GUC_SET_LOCAL)
+                                discard_stack_value(gconf, &prev->masked);
+                            prev->masked = stack->masked;
+                            prev->state = GUC_SET_LOCAL;
+                        }
                         break;
+                    case GUC_NONXACT:
+                    case GUC_NONXACT_SET:
+                    case GUC_NONXACT_LOCAL:
+                        Assert(false);
+                        break;
+                        
                 }
             }
 
@@ -7785,7 +7924,8 @@ set_config_by_name(PG_FUNCTION_ARGS)
     char       *name;
     char       *value;
     char       *new_value;
-    bool        is_local;
+    int            set_action = GUC_ACTION_SET;
+
 
     if (PG_ARGISNULL(0))
         ereport(ERROR,
@@ -7805,18 +7945,27 @@ set_config_by_name(PG_FUNCTION_ARGS)
      * Get the desired state of is_local. Default to false if provided value
      * is NULL
      */
-    if (PG_ARGISNULL(2))
-        is_local = false;
-    else
-        is_local = PG_GETARG_BOOL(2);
+    if (!PG_ARGISNULL(2) && PG_GETARG_BOOL(2))
+        set_action = GUC_ACTION_LOCAL;
+
+    /*
+     * Get the desired state of is_nonxact. Default to false if provided value
+     * is NULL
+     */
+    if (!PG_ARGISNULL(3) && PG_GETARG_BOOL(3))
+    {
+        if (set_action == GUC_ACTION_LOCAL)
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                     errmsg("only one of is_local and is_nonxact can be true")));
+        set_action = GUC_ACTION_NONXACT;
+    }
 
     /* Note SET DEFAULT (argstring == NULL) is equivalent to RESET */
     (void) set_config_option(name,
                              value,
                              (superuser() ? PGC_SUSET : PGC_USERSET),
-                             PGC_S_SESSION,
-                             is_local ? GUC_ACTION_LOCAL : GUC_ACTION_SET,
-                             true, 0, false);
+                             PGC_S_SESSION, set_action, true, 0, false);
 
     /* get the new current value */
     new_value = GetConfigOptionByName(name, NULL, false);
@@ -7825,7 +7974,6 @@ set_config_by_name(PG_FUNCTION_ARGS)
     PG_RETURN_TEXT_P(cstring_to_text(new_value));
 }
 
-
 /*
  * Common code for DefineCustomXXXVariable subroutines: allocate the
  * new variable's config struct and fill in generic fields.
@@ -8024,6 +8172,13 @@ reapply_stacked_values(struct config_generic *variable,
                                          WARNING, false);
                 break;
 
+            case GUC_NONXACT:
+                (void) set_config_option(name, curvalue,
+                                         curscontext, cursource,
+                                         GUC_ACTION_NONXACT, true,
+                                         WARNING, false);
+                break;
+
             case GUC_LOCAL:
                 (void) set_config_option(name, curvalue,
                                          curscontext, cursource,
@@ -8043,6 +8198,33 @@ reapply_stacked_values(struct config_generic *variable,
                                          GUC_ACTION_LOCAL, true,
                                          WARNING, false);
                 break;
+
+            case GUC_NONXACT_SET:
+                /* first, apply the masked value as SET */
+                (void) set_config_option(name, stack->masked.val.stringval,
+                                         stack->masked_scontext, PGC_S_SESSION,
+                                         GUC_ACTION_NONXACT, true,
+                                         WARNING, false);
+                /* then apply the current value as LOCAL */
+                (void) set_config_option(name, curvalue,
+                                         curscontext, cursource,
+                                         GUC_ACTION_SET, true,
+                                         WARNING, false);
+                break;
+
+            case GUC_NONXACT_LOCAL:
+                /* first, apply the masked value as SET */
+                (void) set_config_option(name, stack->masked.val.stringval,
+                                         stack->masked_scontext, PGC_S_SESSION,
+                                         GUC_ACTION_NONXACT, true,
+                                         WARNING, false);
+                /* then apply the current value as LOCAL */
+                (void) set_config_option(name, curvalue,
+                                         curscontext, cursource,
+                                         GUC_ACTION_LOCAL, true,
+                                         WARNING, false);
+                break;
+
         }
 
         /* If we successfully made a stack entry, adjust its nest level */
@@ -10021,6 +10203,373 @@ GUCArrayReset(ArrayType *array)
     return newarray;
 }
 
+Size
+GucShmemSize(void)
+{
+    Size size;
+
+    size = sizeof(GucRemoteSetting);
+
+    return size;
+}
+
+void
+GucShmemInit(void)
+{
+    Size    size;
+    bool    found;
+
+    size = sizeof(GucRemoteSetting);
+    remote_setting = (GucRemoteSetting *)
+        ShmemInitStruct("GUC remote setting", size, &found);
+
+    if (!found)
+    {
+        MemSet(remote_setting, 0, size);
+        LWLockInitialize(&remote_setting->lock, LWLockNewTrancheId());
+    }
+
+    LWLockRegisterTranche(remote_setting->lock.tranche, "guc_remote");
+}
+
+/*
+ * set_backend_config: SQL callable function to set GUC variable of remote
+ * session.
+ */
+Datum
+set_backend_config(PG_FUNCTION_ARGS)
+{
+    int        pid   = PG_GETARG_INT32(0);
+    char   *name  = text_to_cstring(PG_GETARG_TEXT_P(1));
+    char   *value = text_to_cstring(PG_GETARG_TEXT_P(2));
+    TimestampTz    cancel_start;
+    PgBackendStatus *beentry;
+    int beid;
+    int rc;
+
+    if (strlen(name) >= NAMEDATALEN)
+        ereport(ERROR,
+                (errcode(ERRCODE_NAME_TOO_LONG),
+                 errmsg("name of GUC variable is too long")));
+    if (strlen(value) >= GUC_REMOTE_MAX_VALUE_LEN)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("value is too long"),
+                 errdetail("Maximum acceptable length of value is %d",
+                     GUC_REMOTE_MAX_VALUE_LEN - 1)));
+
+    /* find beentry for given pid */
+    beentry = NULL;
+    for (beid = 1;
+         (beentry = pgstat_fetch_stat_beentry(beid)) &&
+             beentry->st_procpid != pid ;
+         beid++);
+
+    /*
+     * This is also checked by SendProcSignal, but do it here to emit an
+     * appropriate error message.
+     */
+    if (!beentry)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("process PID %d not found", pid)));
+
+    /* allow only client backends */
+    if (beentry->st_backendType != B_BACKEND)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("not a client backend")));
+    
+    /*
+     * Wait if someone else is sending a request. We need to wait with a
+     * timeout since the current user of the struct does not wake us up.
+     */
+    LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+    while (remote_setting->state != REMGUC_VACANT)
+    {
+        LWLockRelease(&remote_setting->lock);
+        rc = WaitLatch(&MyProc->procLatch,
+                       WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+                       200, PG_WAIT_ACTIVITY);
+
+        if (rc & WL_POSTMASTER_DEATH)
+            return (Datum) BoolGetDatum(false);
+
+        CHECK_FOR_INTERRUPTS();
+
+        LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+    }
+
+    /* my turn, send a request */
+    Assert(remote_setting->state == REMGUC_VACANT);
+
+    remote_setting->state = REMGUC_REQUEST;
+    remote_setting->sourcepid = MyProcPid;
+    remote_setting->targetpid = pid;
+    remote_setting->userid = GetUserId();
+
+    strncpy(remote_setting->name, name, NAMEDATALEN);
+    remote_setting->name[NAMEDATALEN - 1] = 0;
+    strncpy(remote_setting->value, value, GUC_REMOTE_MAX_VALUE_LEN);
+    remote_setting->value[GUC_REMOTE_MAX_VALUE_LEN - 1] = 0;
+    remote_setting->sender_latch = MyLatch;
+
+    LWLockRelease(&remote_setting->lock);
+
+    if (SendProcSignal(pid, PROCSIG_REMOTE_GUC, InvalidBackendId) < 0)
+    {
+        remote_setting->state = REMGUC_VACANT;
+        ereport(ERROR,
+                (errmsg("could not signal backend with PID %d: %m", pid)));
+    }
+
+    /*
+     * This request is processed only while the peer is idle, so it may take
+     * a long time before we get a response.
+     */
+    LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+    while (remote_setting->state != REMGUC_DONE)
+    {
+        LWLockRelease(&remote_setting->lock);
+        rc = WaitLatch(&MyProc->procLatch,
+                       WL_LATCH_SET | WL_POSTMASTER_DEATH,
+                       -1, PG_WAIT_ACTIVITY);
+
+        /* we don't care about the state in this case */
+        if (rc & WL_POSTMASTER_DEATH)
+            return (Datum) BoolGetDatum(false);
+
+        LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+
+        /* get out if we got a query cancel request */
+        if (QueryCancelPending)
+            break;
+    }
+
+    /*
+     * Cancel the request if possible. We cannot cancel the request if the
+     * peer has already processed it. We check the request status rather than
+     * QueryCancelPending so that that case is handled properly.
+     */
+    if (remote_setting->state == REMGUC_REQUEST)
+    {
+        Assert(QueryCancelPending);
+
+        remote_setting->state = REMGUC_CANCELING;
+        LWLockRelease(&remote_setting->lock);
+
+        if (SendProcSignal(pid,
+                           PROCSIG_REMOTE_GUC, InvalidBackendId) < 0)
+        {
+            remote_setting->state = REMGUC_VACANT;
+            ereport(ERROR,
+                    (errmsg("could not signal backend with PID %d: %m",
+                            pid)));
+        }
+
+        /* Peer must respond shortly, don't sleep for a long time. */
+        
+        cancel_start = GetCurrentTimestamp();
+
+        LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+        while (remote_setting->state != REMGUC_CANCELED &&
+               !TimestampDifferenceExceeds(cancel_start, GetCurrentTimestamp(),
+                                           GUC_REMOTE_CANCEL_TIMEOUT))
+        {
+            LWLockRelease(&remote_setting->lock);
+            rc = WaitLatch(&MyProc->procLatch,
+                           WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+                           GUC_REMOTE_CANCEL_TIMEOUT, PG_WAIT_ACTIVITY);
+
+            /* we don't care about the state in this case */
+            if (rc & WL_POSTMASTER_DEATH)
+                return (Datum) BoolGetDatum(false);
+
+            LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+        }
+
+        if (remote_setting->state != REMGUC_CANCELED)
+        {
+            remote_setting->state = REMGUC_VACANT;
+            ereport(ERROR, (errmsg("failed cancelling remote GUC request")));
+        }
+
+        remote_setting->state = REMGUC_VACANT;
+        LWLockRelease(&remote_setting->lock);
+
+        ereport(INFO,
+                (errmsg("remote GUC change request to PID %d is canceled",
+                              pid)));
+
+        return (Datum) BoolGetDatum(false);
+    }
+
+    Assert (remote_setting->state == REMGUC_DONE);
+
+    /* ereport may not return on query cancel, so reset the state before it */
+    remote_setting->state = REMGUC_VACANT;
+
+    if (QueryCancelPending)
+        ereport(INFO,
+                (errmsg("remote GUC change request to PID %d already completed",
+                        pid)));
+                
+    if (!remote_setting->success)
+        ereport(ERROR,
+                (errmsg("%s", remote_setting->value)));
+
+    LWLockRelease(&remote_setting->lock);
+
+    return (Datum) BoolGetDatum(true);
+}
+
+
+void
+HandleRemoteGucSetInterrupt(void)
+{
+    LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+
+    /* check if any request is being sent to me */
+    if (remote_setting->targetpid == MyProcPid)
+    {
+        switch (remote_setting->state)
+        {
+        case REMGUC_REQUEST:
+            InterruptPending = true;
+            RemoteGucChangePending = true;
+            break;
+        case REMGUC_CANCELING:
+            InterruptPending = true;
+            RemoteGucChangePending = true;
+            remote_setting->state = REMGUC_CANCELED;
+            SetLatch(remote_setting->sender_latch);
+            break;
+        default:
+            break;
+        }
+    }
+    LWLockRelease(&remote_setting->lock);
+}
+
+void
+HandleGucRemoteChanges(void)
+{
+    MemoryContext currentcxt = CurrentMemoryContext;
+    bool    canceling = false;
+    bool    process_request = true;
+    int        saveInterruptHoldoffCount = 0;
+    int        saveQueryCancelHoldoffCount = 0;
+
+    RemoteGucChangePending = false;
+    LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+
+    /* skip if this request is no longer for me */
+    if (remote_setting->targetpid != MyProcPid)
+        process_request = false;
+    else
+    {
+        switch (remote_setting->state)
+        {
+        case REMGUC_REQUEST:
+            remote_setting->state = REMGUC_INPROCESS;
+            break;
+        case REMGUC_CANCELING:
+            /*
+             * This request has already been canceled, but we entered this
+             * function before receiving the signal. Cancel the request here.
+             */
+            remote_setting->state = REMGUC_CANCELED;
+            remote_setting->success = false;
+            canceling = true;
+            break;
+        case REMGUC_VACANT:
+        case REMGUC_CANCELED:
+        case REMGUC_INPROCESS:
+        case REMGUC_DONE:
+            /* Just ignore the cases */
+            process_request = false;
+            break;
+        }
+    }
+
+    LWLockRelease(&remote_setting->lock);
+
+    if (!process_request)
+        return;
+
+    if (canceling)
+    {
+        SetLatch(remote_setting->sender_latch);
+        return;
+    }
+
+
+    /* Okay, actually modify variable */
+    remote_setting->success = true;
+
+    PG_TRY();
+    {
+        bool     has_privilege;
+        bool     is_superuser;
+        bool end_transaction = false;
+        /*
+         * XXXX: ERROR resets the following variables but we don't want that.
+         */
+        saveInterruptHoldoffCount = InterruptHoldoffCount;
+        saveQueryCancelHoldoffCount = QueryCancelHoldoffCount;
+
+        /* superuser_arg requires a transaction */
+        if (!IsTransactionState())
+        {
+            StartTransactionCommand();
+            end_transaction  = true;
+        }
+        is_superuser = superuser_arg(remote_setting->userid);
+        has_privilege = is_superuser ||
+            has_privs_of_role(remote_setting->userid, GetUserId());
+
+        if (end_transaction)
+            CommitTransactionCommand();
+
+        if (!has_privilege)
+            elog(ERROR, "role %u is not allowed to set GUC variables on the session with PID %d",
+                 remote_setting->userid, MyProcPid);
+        
+        (void) set_config_option(remote_setting->name, remote_setting->value,
+                                 is_superuser ? PGC_SUSET : PGC_USERSET,
+                                 PGC_S_SESSION, GUC_ACTION_NONXACT,
+                                 true, ERROR, false);
+    }
+    PG_CATCH();
+    {
+        ErrorData *errdata;
+        MemoryContextSwitchTo(currentcxt);
+        errdata = CopyErrorData();
+        remote_setting->success = false;
+        strncpy(remote_setting->value, errdata->message,
+                GUC_REMOTE_MAX_VALUE_LEN);
+        remote_setting->value[GUC_REMOTE_MAX_VALUE_LEN - 1] = 0;
+        FlushErrorState();
+
+        /* restore the saved value */
+        InterruptHoldoffCount = saveInterruptHoldoffCount ;
+        QueryCancelHoldoffCount = saveQueryCancelHoldoffCount;
+        
+    }
+    PG_END_TRY();
+
+    ereport(LOG,
+            (errmsg("GUC variable \"%s\" is changed to \"%s\" by request from another backend with PID %d",
+                    remote_setting->name, remote_setting->value,
+                    remote_setting->sourcepid)));
+
+    LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+    remote_setting->state = REMGUC_DONE;
+    LWLockRelease(&remote_setting->lock);
+
+    SetLatch(remote_setting->sender_latch);
+}
+
 /*
  * Validate a proposed option setting for GUCArrayAdd/Delete/Reset.
  *
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 86c84c7cf4..cf1c37aa9e 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5638,8 +5638,8 @@
   proargtypes => 'text bool', prosrc => 'show_config_by_name_missing_ok' },
 { oid => '2078', descr => 'SET X as a function',
   proname => 'set_config', proisstrict => 'f', provolatile => 'v',
-  proparallel => 'u', prorettype => 'text', proargtypes => 'text text bool',
-  prosrc => 'set_config_by_name' },
+  proparallel => 'u', prorettype => 'text',
+  proargtypes => 'text text bool bool', prosrc => 'set_config_by_name' },
 { oid => '2084', descr => 'SHOW ALL as a function',
   proname => 'pg_show_all_settings', prorows => '1000', proretset => 't',
   provolatile => 's', prorettype => 'record', proargtypes => '',
@@ -9612,6 +9612,12 @@
   proargmodes => '{i,o,o,o,o,o,o,o,o,o,o}',
   proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,nentries,last_update}',
   prosrc => 'pgstat_get_syscache_stats' },
+{ oid => '3424',
+  descr => 'set config of another backend',
+  proname => 'pg_set_backend_config', proisstrict => 'f',
+  proretset => 'f', provolatile => 'v', proparallel => 'u',
+  prorettype => 'bool', proargtypes => 'int4 text text',
+  prosrc => 'set_backend_config' },
 { oid => '3786', descr => 'set up a logical replication slot',
   proname => 'pg_create_logical_replication_slot', provolatile => 'v',
   proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b64bc499e4..4e341c93ed 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -832,7 +832,8 @@ typedef enum
     WAIT_EVENT_REPLICATION_ORIGIN_DROP,
     WAIT_EVENT_REPLICATION_SLOT_DROP,
     WAIT_EVENT_SAFE_SNAPSHOT,
-    WAIT_EVENT_SYNC_REP
+    WAIT_EVENT_SYNC_REP,
+    WAIT_EVENT_REMOTE_GUC
 } WaitEventIPC;
 
 /* ----------
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 6db0d69b71..4ad4927d3d 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -42,6 +42,9 @@ typedef enum
     PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
     PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
 
+    /* Remote GUC setting */
+    PROCSIG_REMOTE_GUC,
+
     NUM_PROCSIGNALS                /* Must be last! */
 } ProcSignalReason;
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index f462eabe59..1766e64165 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -193,7 +193,8 @@ typedef enum
     /* Types of set_config_option actions */
     GUC_ACTION_SET,                /* regular SET command */
     GUC_ACTION_LOCAL,            /* SET LOCAL command */
-    GUC_ACTION_SAVE                /* function SET option, or temp assignment */
+    GUC_ACTION_SAVE,            /* function SET option, or temp assignment */
+    GUC_ACTION_NONXACT            /* non-transactional setting */
 } GucAction;
 
 #define GUC_QUALIFIER_SEPARATOR '.'
@@ -269,6 +270,8 @@ extern int    tcp_keepalives_idle;
 extern int    tcp_keepalives_interval;
 extern int    tcp_keepalives_count;
 
+extern volatile bool RemoteGucChangePending;
+
 #ifdef TRACE_SORT
 extern bool trace_sort;
 #endif
@@ -276,6 +279,11 @@ extern bool trace_sort;
 /*
  * Functions exported by guc.c
  */
+extern Size GucShmemSize(void);
+extern void GucShmemInit(void);
+extern Datum set_backend_config(PG_FUNCTION_ARGS);
+extern void HandleRemoteGucSetInterrupt(void);
+extern void HandleGucRemoteChanges(void);
 extern void SetConfigOption(const char *name, const char *value,
                 GucContext context, GucSource source);
 
@@ -395,6 +403,9 @@ extern Size EstimateGUCStateSpace(void);
 extern void SerializeGUCState(Size maxsize, char *start_address);
 extern void RestoreGUCState(void *gucstate);
 
+/* Remote GUC setting */
+extern void HandleGucRemoteChanges(void);
+
 /* Support for messages reported from GUC check hooks */
 
 extern PGDLLIMPORT char *GUC_check_errmsg_string;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 668d9efd35..7a2396d2f5 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -113,7 +113,10 @@ typedef enum
     GUC_SAVE,                    /* entry caused by function SET option */
     GUC_SET,                    /* entry caused by plain SET command */
     GUC_LOCAL,                    /* entry caused by SET LOCAL command */
-    GUC_SET_LOCAL                /* entry caused by SET then SET LOCAL */
+    GUC_NONXACT,                /* entry caused by non-transactional ops */
+    GUC_SET_LOCAL,                /* entry caused by SET then SET LOCAL */
+    GUC_NONXACT_SET,            /* entry caused by NONXACT then SET */
+    GUC_NONXACT_LOCAL            /* entry caused by NONXACT then (SET)LOCAL */
 } GucStackState;
 
 typedef struct guc_stack
diff --git a/src/test/regress/expected/guc.out b/src/test/regress/expected/guc.out
index 43ac5f5f11..2c074705c7 100644
--- a/src/test/regress/expected/guc.out
+++ b/src/test/regress/expected/guc.out
@@ -476,6 +476,229 @@ SELECT '2006-08-13 12:34:56'::timestamptz;
  2006-08-13 12:34:56-07
 (1 row)
 
+-- NONXACT followed by SET, SET LOCAL through COMMIT
+BEGIN;
+SELECT set_config('work_mem', '128kB', false, true); -- NONXACT
+ set_config 
+------------
+ 128kB
+(1 row)
+
+SET work_mem to '256kB';
+SET LOCAL work_mem to '512kB';
+SHOW work_mem;    -- must see 512kB
+ work_mem 
+----------
+ 512kB
+(1 row)
+
+COMMIT;
+SHOW work_mem;    -- must see 256kB
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+-- NONXACT followed by SET, SET LOCAL through ROLLBACK
+BEGIN;
+SELECT set_config('work_mem', '128kB', false, true); -- NONXACT
+ set_config 
+------------
+ 128kB
+(1 row)
+
+SET work_mem to '256kB';
+SET LOCAL work_mem to '512kB';
+SHOW work_mem;    -- must see 512kB
+ work_mem 
+----------
+ 512kB
+(1 row)
+
+ROLLBACK;
+SHOW work_mem;    -- must see 128kB
+ work_mem 
+----------
+ 128kB
+(1 row)
+
+-- SET, SET LOCAL followed by NONXACT through COMMIT
+BEGIN;
+SET work_mem to '256kB';
+SET LOCAL work_mem to '512kB';
+SELECT set_config('work_mem', '128kB', false, true); -- NONXACT
+ set_config 
+------------
+ 128kB
+(1 row)
+
+SHOW work_mem;    -- must see 128kB
+ work_mem 
+----------
+ 128kB
+(1 row)
+
+COMMIT;
+SHOW work_mem;    -- must see 128kB
+ work_mem 
+----------
+ 128kB
+(1 row)
+
+-- SET, SET LOCAL followed by NONXACT through ROLLBACK
+BEGIN;
+SET work_mem to '256kB';
+SET LOCAL work_mem to '512kB';
+SELECT set_config('work_mem', '128kB', false, true); -- NONXACT
+ set_config 
+------------
+ 128kB
+(1 row)
+
+SHOW work_mem;    -- must see 128kB
+ work_mem 
+----------
+ 128kB
+(1 row)
+
+ROLLBACK;
+SHOW work_mem;    -- must see 128kB
+ work_mem 
+----------
+ 128kB
+(1 row)
+
+-- NONXACT and SAVEPOINT
+SET work_mem TO '64kB';
+BEGIN;
+SET work_mem TO '128kB';
+SAVEPOINT a;
+SELECT set_config('work_mem', '256kB', false, true); -- NONXACT
+ set_config 
+------------
+ 256kB
+(1 row)
+
+SHOW work_mem;
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+SET LOCAL work_mem TO '384kB';
+RELEASE SAVEPOINT a;
+SHOW work_mem; -- will see 384kB
+ work_mem 
+----------
+ 384kB
+(1 row)
+
+COMMIT;
+SHOW work_mem; -- will see 256kB
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+--
+SET work_mem TO '64kB';
+BEGIN;
+SET work_mem TO '128kB';
+SAVEPOINT a;
+SELECT set_config('work_mem', '256kB', false, true); -- NONXACT
+ set_config 
+------------
+ 256kB
+(1 row)
+
+SHOW work_mem;
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+SET LOCAL work_mem TO '384kB';
+ROLLBACK TO SAVEPOINT a;
+SHOW work_mem; -- will see 256kB
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+ROLLBACK;
+SHOW work_mem; -- will see 256kB
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+--
+SET work_mem TO '64kB';
+BEGIN;
+SET work_mem TO '128kB';
+SET LOCAL work_mem TO '384kB';
+SAVEPOINT a;
+SELECT set_config('work_mem', '256kB', false, true); -- NONXACT
+ set_config 
+------------
+ 256kB
+(1 row)
+
+SHOW work_mem;
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+SET LOCAL work_mem TO '384kB';
+RELEASE SAVEPOINT a;
+SHOW work_mem; -- will see 384kB
+ work_mem 
+----------
+ 384kB
+(1 row)
+
+ROLLBACK;
+SHOW work_mem; -- will see 256kB
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+--
+SET work_mem TO '64kB';
+BEGIN;
+SET work_mem TO '128kB';
+SET LOCAL work_mem TO '384kB';
+SAVEPOINT a;
+SELECT set_config('work_mem', '256kB', false, true); -- NONXACT
+ set_config 
+------------
+ 256kB
+(1 row)
+
+SHOW work_mem;
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+SET LOCAL work_mem TO '384kB';
+ROLLBACK TO SAVEPOINT a;
+SHOW work_mem; -- will see 256kB
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+COMMIT;
+SHOW work_mem; -- will see 256kB
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+SET work_mem TO DEFAULT;
 --
 -- Test RESET.  We use datestyle because the reset value is forced by
 -- pg_regress, so it doesn't depend on the installation's configuration.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 735dd37acf..3569edc22d 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1918,6 +1918,30 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
     pg_stat_all_tables.autoanalyze_count
    FROM pg_stat_all_tables
  WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
 
+pg_stat_syscache| SELECT s.pid,
+    (s.relid)::regclass AS relname,
+    (s.indid)::regclass AS cache_name,
+    s.size,
+    s.ntup AS ntuples,
+    s.searches,
+    s.hits,
+    s.neg_hits,
+    s.ageclass,
+    s.nentries,
+    s.last_update
+   FROM (pg_stat_activity a
+     JOIN LATERAL ( SELECT a.pid,
+            pg_get_syscache_stats.relid,
+            pg_get_syscache_stats.indid,
+            pg_get_syscache_stats.size,
+            pg_get_syscache_stats.ntup,
+            pg_get_syscache_stats.searches,
+            pg_get_syscache_stats.hits,
+            pg_get_syscache_stats.neg_hits,
+            pg_get_syscache_stats.ageclass,
+            pg_get_syscache_stats.nentries,
+            pg_get_syscache_stats.last_update
+           FROM pg_get_syscache_stats(a.pid) pg_get_syscache_stats(relid, indid, size, ntup, searches, hits, neg_hits, ageclass, nentries, last_update)) s ON ((a.pid = s.pid)));
 
 pg_stat_user_functions| SELECT p.oid AS funcid,
     n.nspname AS schemaname,
     p.proname AS funcname,
@@ -2349,7 +2373,7 @@ pg_settings|pg_settings_n|CREATE RULE pg_settings_n AS
     ON UPDATE TO pg_catalog.pg_settings DO INSTEAD NOTHING;
 pg_settings|pg_settings_u|CREATE RULE pg_settings_u AS
     ON UPDATE TO pg_catalog.pg_settings
-   WHERE (new.name = old.name) DO  SELECT set_config(old.name, new.setting, false) AS set_config;
+   WHERE (new.name = old.name) DO  SELECT set_config(old.name, new.setting, false, false) AS set_config;
 rtest_emp|rtest_emp_del|CREATE RULE rtest_emp_del AS
     ON DELETE TO public.rtest_emp DO  INSERT INTO rtest_emplog (ename, who, action, newsal, oldsal)
   VALUES (old.ename, CURRENT_USER, 'fired'::bpchar, '$0.00'::money, old.salary);
diff --git a/src/test/regress/sql/guc.sql b/src/test/regress/sql/guc.sql
index 23e5029780..2fb23caafe 100644
--- a/src/test/regress/sql/guc.sql
+++ b/src/test/regress/sql/guc.sql
@@ -133,6 +133,94 @@ SHOW vacuum_cost_delay;
 SHOW datestyle;
 SELECT '2006-08-13 12:34:56'::timestamptz;
 
+-- NONXACT followed by SET, SET LOCAL through COMMIT
+BEGIN;
+SELECT set_config('work_mem', '128kB', false, true); -- NONXACT
+SET work_mem to '256kB';
+SET LOCAL work_mem to '512kB';
+SHOW work_mem;    -- must see 512kB
+COMMIT;
+SHOW work_mem;    -- must see 256kB
+
+-- NONXACT followed by SET, SET LOCAL through ROLLBACK
+BEGIN;
+SELECT set_config('work_mem', '128kB', false, true); -- NONXACT
+SET work_mem to '256kB';
+SET LOCAL work_mem to '512kB';
+SHOW work_mem;    -- must see 512kB
+ROLLBACK;
+SHOW work_mem;    -- must see 128kB
+
+-- SET, SET LOCAL followed by NONXACT through COMMIT
+BEGIN;
+SET work_mem to '256kB';
+SET LOCAL work_mem to '512kB';
+SELECT set_config('work_mem', '128kB', false, true); -- NONXACT
+SHOW work_mem;    -- must see 128kB
+COMMIT;
+SHOW work_mem;    -- must see 128kB
+
+-- SET, SET LOCAL followed by NONXACT through ROLLBACK
+BEGIN;
+SET work_mem to '256kB';
+SET LOCAL work_mem to '512kB';
+SELECT set_config('work_mem', '128kB', false, true); -- NONXACT
+SHOW work_mem;    -- must see 128kB
+ROLLBACK;
+SHOW work_mem;    -- must see 128kB
+
+-- NONXACT and SAVEPOINT
+SET work_mem TO '64kB';
+BEGIN;
+SET work_mem TO '128kB';
+SAVEPOINT a;
+SELECT set_config('work_mem', '256kB', false, true); -- NONXACT
+SHOW work_mem;
+SET LOCAL work_mem TO '384kB';
+RELEASE SAVEPOINT a;
+SHOW work_mem; -- will see 384kB
+COMMIT;
+SHOW work_mem; -- will see 256kB
+--
+SET work_mem TO '64kB';
+BEGIN;
+SET work_mem TO '128kB';
+SAVEPOINT a;
+SELECT set_config('work_mem', '256kB', false, true); -- NONXACT
+SHOW work_mem;
+SET LOCAL work_mem TO '384kB';
+ROLLBACK TO SAVEPOINT a;
+SHOW work_mem; -- will see 256kB
+ROLLBACK;
+SHOW work_mem; -- will see 256kB
+--
+SET work_mem TO '64kB';
+BEGIN;
+SET work_mem TO '128kB';
+SET LOCAL work_mem TO '384kB';
+SAVEPOINT a;
+SELECT set_config('work_mem', '256kB', false, true); -- NONXACT
+SHOW work_mem;
+SET LOCAL work_mem TO '384kB';
+RELEASE SAVEPOINT a;
+SHOW work_mem; -- will see 384kB
+ROLLBACK;
+SHOW work_mem; -- will see 256kB
+--
+SET work_mem TO '64kB';
+BEGIN;
+SET work_mem TO '128kB';
+SET LOCAL work_mem TO '384kB';
+SAVEPOINT a;
+SELECT set_config('work_mem', '256kB', false, true); -- NONXACT
+SHOW work_mem;
+SET LOCAL work_mem TO '384kB';
+ROLLBACK TO SAVEPOINT a;
+SHOW work_mem; -- will see 256kB
+COMMIT;
+SHOW work_mem; -- will see 256kB
+
+SET work_mem TO DEFAULT;
 --
 -- Test RESET.  We use datestyle because the reset value is forced by
 -- pg_regress, so it doesn't depend on the installation's configuration.
-- 
2.16.3


RE: Protect syscache from bloating with negative cache entries

From
"Ideriha, Takeshi"
Date:
Hello, thank you for updating the patch.


>From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
>At Thu, 4 Oct 2018 04:27:04 +0000, "Ideriha, Takeshi"
><ideriha.takeshi@jp.fujitsu.com> wrote in
><4E72940DA2BF16479384A86D54D0988A6F1BCB6F@G01JPEXMBKW04>
>> >As a *PoC*, in the attached patch (which applies to current master),
>> >size of CTups are counted as the catcache size.
>> >
>> >It also provides pg_catcache_size system view just to give a rough
>> >idea of how such view looks. I'll consider more on that but do you have any opinion
>on this?
>> >
>...
>> Great! I like this view.
>> One of the extreme idea would be adding all the members printed by
>> CatCachePrintStats(), which is only enabled with -DCATCACHE_STATS at this
>moment.
>> All of the members seems too much for customers who tries to change
>> the cache limit size But it may be some of the members are useful
>> because for example cc_hits would indicate that current cache limit size is too small.
>
>The attached introduces four features below. (But the features on relcache and
>plancache are omitted).
I haven't looked into the code but I'm going to do it later.

Right now it seems to me that focusing on catalog cache invalidation and its
stats is the quickest route to getting this feature committed.

>1. syscache stats collector (in 0002)
>
>Records syscache status consists of the same columns above and "ageclass"
>information. We could somehow triggering a stats report with signal but we don't want
>take/send/write the statistics in signal handler. Instead, it is turned on by setting
>track_syscache_usage_interval to a positive number in milliseconds.

I agree. Ageclass is important for tweaking prune_min_age.
Collecting stats at every change would be too heavy.

>2. pg_stat_syscache view.  (in 0002)
>
>This view shows catcache statistics. Statistics is taken only on the backends where
>syscache tracking is active.
>
>>  pid  | application_name |    relname     |            cache_name
>|   size   |        ageclass         |         nentries
>>
>------+------------------+----------------+-----------------------------------+----------
>+-------------------------+---------------------------
>>  9984 | psql             | pg_statistic   | pg_statistic_relid_att_inh_index  |
>12676096 | {30,60,600,1200,1800,0} | {17660,17310,55870,0,0,0}
>
>Age class is the basis of catcache truncation mechanism and shows the distribution
>based on elapsed time since last access. As I didn't came up an appropriate way, it is
>represented as two arrays.  Ageclass stores maximum age for each class in seconds.
>Nentries holds entry numbers correnponding to the same element in ageclass. In the
>above example,
>
>     age class  : # of entries in the cache
>   up to   30s  : 17660
>   up to   60s  : 17310
>   up to  600s  : 55870
>   up to 1200s  : 0
>   up to 1800s  : 0
>   more longer  : 0
>
> The ageclass is {0, 0.05, 0.1, 1, 2, 3}th multiples of  cache_prune_min_age on the
>backend.

I just thought that the pair of ageclass and nentries could be represented as
json or as a multi-dimensional array, but they all carry the same information
and can be converted into each other with some functions, so I'm not sure
which representation is better.
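
For what it's worth, here is a minimal sketch of such a conversion done on the
query side, assuming the pg_stat_syscache view from the 0002 patch with its
parallel int[] columns "ageclass" and "nentries" (multi-argument unnest pairs
the array elements positionally, and jsonb_object_agg folds them into one map):

    SELECT s.pid, s.cache_name,
           jsonb_object_agg(a.age_limit, a.n) AS age_distribution
      FROM pg_stat_syscache s,
           LATERAL unnest(s.ageclass, s.nentries) AS a(age_limit, n)
     GROUP BY s.pid, s.cache_name;

Either way, one representation can be derived from the other, so the choice
seems mostly cosmetic.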

>3. non-transactional GUC setting (in 0003)
>
>It allows setting GUC variable set by the action GUC_ACTION_NONXACT(the name
>requires condieration) survive beyond rollback. It is required by remote guc setting to
>work sanely. Without the feature a remote-set value within a trasction will disappear
>involved in rollback. The only local interface for the NONXACT action is
>set_config(name, value, is_local=false, is_nonxact = true). pg_set_backend_guc()
>below works on this feature.

TBH, I'm not familiar with this area and I may be missing something.
In order to change another backend's GUC value, is ignoring transactional
behavior always necessary? If the transaction in which the GUC was set fails
and is rolled back, and the error is reported, I thought simply retrying the
operation would be enough.
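
Just to make sure I understand the scenario the NONXACT action is meant to
cover, here is a hypothetical two-session timeline (the PID and the parameter
are only examples):

    -- session A (target backend, PID 9984) happens to be inside a transaction
    BEGIN;
    -- ... long-running work ...

    -- session B (administrator) pushes a setting into session A
    SELECT pg_set_backend_config(9984, 'track_syscache_usage_interval', '10000');

    -- session A later aborts its transaction
    ROLLBACK;
    SHOW track_syscache_usage_interval;
    -- with a plain (transactional) GUC_ACTION_SET the remotely assigned value
    -- would have been discarded by the rollback above; with GUC_ACTION_NONXACT
    -- it survives

That is my understanding of why a remotely set value needs to survive the
target's rollback.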

>4. pg_set_backend_guc() function.
>
>Of course syscache statistics recording consumes significant amount of time so it
>cannot be turned on usually. On the other hand since this feature is turned on by GUC,
>it is needed to grab the active client connection to turn on/off the feature(but we
>cannot). Instead, I provided a means to change GUC variables in another backend.
>
>pg_set_backend_guc(pid, name, value) sets the GUC variable "name"
>on the backend "pid" to "value".
>
>
>
>With the above tools, we can inspect catcache statistics of seemingly bloated process.
>
>A. Find a bloated process pid using ps or something.
>
>B. Turn on syscache stats on the process.
>  =# select pg_set_backend_guc(9984, 'track_syscache_usage_interval', '10000');
>
>C. Examine the statitics.
>
>=# select pid, relname, cache_name, size from pg_stat_syscache order by size desc
>limit 3;
> pid  |   relname    |            cache_name            |   size
>------+--------------+----------------------------------+----------
> 9984 | pg_statistic | pg_statistic_relid_att_inh_index | 32154112
> 9984 | pg_cast      | pg_cast_source_target_index      |     4096
> 9984 | pg_operator  | pg_operator_oprname_l_r_n_index  |     4096
>
>
>=# select * from pg_stat_syscache where cache_name =
>'pg_statistic_relid_att_inh_index'::regclass;
>-[ RECORD 1 ]---------------------------------
>pid         | 9984
>relname     | pg_statistic
>cache_name  | pg_statistic_relid_att_inh_index
>size        | 11026176
>ntuples     | 77950
>searches    | 77950
>hits        | 0
>neg_hits    | 0
>ageclass    | {30,60,600,1200,1800,0}
>nentries    | {17630,16950,43370,0,0,0}
>last_update | 2018-10-17 15:58:19.738164+09

The output of this view seems good to me.

I can imagine this use case. Does the use case of setting the GUC locally never happen?
I mean, can the setting be changed locally?
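
For example, something like the plain session-local form (just a sketch, if
this GUC may be set that way):

=# set track_syscache_usage_interval = '10000';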

Regards,
Takeshi Ideriha

RE: Protect syscache from bloating with negative cache entries

От
"Ideriha, Takeshi"
Дата:
>From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
>I haven't looked into the code but I'm going to do it later.

Hi, I've taken a look at the 0001 patch. I'll review the rest of the patches later.

        if (!IsParallelWorker())                                                                                  
+       {                                                                                                         
                stmtStartTimestamp = GetCurrentTimestamp();                                                       
+                                                                                                                 
+               /* Set this timestamp as aproximated current time */                                              
+               SetCatCacheClock(stmtStartTimestamp);                                                             
+       }                                                                                                         
        else    
  
Just a confirmation.
At first I thought that catcacheclock is not updated when a parallel worker is active.
But in that case catcacheclock is updated by the parent, so no problem occurs.

+       int                     tupsize = 0;

        /* negative entries have no tuple associated */
        if (ntp)
        {
                int                     i;
+               int                     tupsize;

+               ct->size = tupsize;

@@ -1906,17 +2051,24 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
        ct->dead = false;
        ct->negative = negative;
        ct->hash_value = hashValue;
+       ct->naccess = 0;
+       ct->lastaccess = catcacheclock;
+       ct->size = tupsize;

tupsize is declared twice, inside and outside of the if scope, but it doesn't seem you need to do so.
And ct->size = tupsize is executed twice, once in the if block and once outside of the if-else block.

+static inline TimestampTz
+GetCatCacheClock(void)

This function is not called by anyone in this version of the patch. In the previous version it was called by plancache.
Will further patches focus only on catcache? In that case this one can be removed.

There are some typos.
+       int                     size;                   /* palloc'ed size off this tuple */ 
typo: off->of

+               /* Set this timestamp as aproximated current time */
typo: aproximated->approximated

+ * GUC variable to define the minimum size of hash to cosider entry eviction.
typo: cosider -> consider

+       /* initilize catcache reference clock if haven't done yet */ 
typo:initilize -> initialize

Regards,
Takeshi Ideriha

Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
Thank you for reviewing.

At Thu, 15 Nov 2018 11:02:10 +0000, "Ideriha, Takeshi" <ideriha.takeshi@jp.fujitsu.com> wrote in
<4E72940DA2BF16479384A86D54D0988A6F1F4165@G01JPEXMBKW04>
> Hello, thank you for updating the patch.
> 
> 
> >From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> >At Thu, 4 Oct 2018 04:27:04 +0000, "Ideriha, Takeshi"
> ><ideriha.takeshi@jp.fujitsu.com> wrote in
> ><4E72940DA2BF16479384A86D54D0988A6F1BCB6F@G01JPEXMBKW04>
> >> >As a *PoC*, in the attached patch (which applies to current master),
> >> >size of CTups are counted as the catcache size.
> >> >
> >> >It also provides pg_catcache_size system view just to give a rough
> >> >idea of how such view looks. I'll consider more on that but do you have any opinion
> >on this?
> >> >
> >...
> >> Great! I like this view.
> >> One of the extreme idea would be adding all the members printed by
> >> CatCachePrintStats(), which is only enabled with -DCATCACHE_STATS at this
> >moment.
> >> All of the members seems too much for customers who tries to change
> >> the cache limit size But it may be some of the members are useful
> >> because for example cc_hits would indicate that current cache limit size is too small.
> >
> >The attached introduces four features below. (But the features on relcache and
> >plancache are omitted).
> I haven't looked into the code but I'm going to do it later.
> 
> Right now it seems to me that focusing on catalog cache invalidation and its stats is a quick route
> to committing this feature.
> 
> >1. syscache stats collector (in 0002)
> >
> >Records syscache status consists of the same columns above and "ageclass"
> >information. We could somehow triggering a stats report with signal but we don't want
> >take/send/write the statistics in signal handler. Instead, it is turned on by setting
> >track_syscache_usage_interval to a positive number in milliseconds.
> 
> Agreed. Ageclass is important for tweaking prune_min_age.
> Collecting stats at every stats change is heavy.
> 
> >2. pg_stat_syscache view.  (in 0002)
> >
> >This view shows catcache statistics. Statistics is taken only on the backends where
> >syscache tracking is active.
> >
> >>  pid  | application_name |    relname     |            cache_name             |   size   |        ageclass         |         nentries
> >> ------+------------------+----------------+-----------------------------------+----------+-------------------------+---------------------------
> >>  9984 | psql             | pg_statistic   | pg_statistic_relid_att_inh_index  | 12676096 | {30,60,600,1200,1800,0} | {17660,17310,55870,0,0,0}
> >
> >Age class is the basis of catcache truncation mechanism and shows the distribution
> >based on elapsed time since last access. As I didn't came up an appropriate way, it is
> >represented as two arrays.  Ageclass stores maximum age for each class in seconds.
> >Nentries holds entry numbers correnponding to the same element in ageclass. In the
> >above example,
> >
> >     age class  : # of entries in the cache
> >   up to   30s  : 17660
> >   up to   60s  : 17310
> >   up to  600s  : 55870
> >   up to 1200s  : 0
> >   up to 1800s  : 0
> >   more longer  : 0
> >
> > The ageclass is {0, 0.05, 0.1, 1, 2, 3}th multiples of  cache_prune_min_age on the
> >backend.
> 
> I just thought that the pair of ageclass and nentries could be represented as
> JSON or as a multi-dimensional array, but in essence they are all the same and can be
> converted into each other using some functions. So I'm not sure which representation is better.

A multi-dimensional array in any style sounds reasonable. Maybe
an array is preferable in system views as it is a more basic type
than JSON. In the attached, it looks like the following:

=# select * from pg_stat_syscache  where ntuples > 100;
-[ RECORD 1 ]--------------------------------------------------
pid         | 1817
relname     | pg_class
cache_name  | pg_class_oid_index
size        | 2048
ntuples     | 189
searches    | 1620
hits        | 1431
neg_hits    | 0
ageclass    | {{30,189},{60,0},{600,0},{1200,0},{1800,0},{0,0}}
last_update | 2018-11-27 19:22:00.74026+09
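
If the old two-column style is wanted, the two-dimensional array can be
expanded on the fly. A rough sketch (assuming array subscripts start at 1,
as constructed here):

=# select pid, cache_name,
          ageclass[i][1] as age_upto_sec,
          ageclass[i][2] as nentries
     from pg_stat_syscache, generate_subscripts(ageclass, 1) as i
    where cache_name = 'pg_class_oid_index'::regclass;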


> >3. non-transactional GUC setting (in 0003)
> >
> >It allows setting GUC variable set by the action GUC_ACTION_NONXACT(the name
> >requires condieration) survive beyond rollback. It is required by remote guc setting to
> >work sanely. Without the feature a remote-set value within a trasction will disappear
> >involved in rollback. The only local interface for the NONXACT action is
> >set_config(name, value, is_local=false, is_nonxact = true). pg_set_backend_guc()
> >below works on this feature.
> 
> TBH, I'm not familiar with this area and I may be missing something.
> In order to change another backend's GUC value,
> is ignoring the transactional behavior always necessary? If the transaction that sets
> the GUC fails and is rolled back, and the error message is supposed to be reported, I thought
> just retrying the transaction would be enough.

The target backend can be running frequent transactions.  The
invoking backend cannot know whether the remote change happened
during a transaction, nor whether that transaction, if any, was
committed or aborted; no error message is sent to the invoking
backend.  We could wait for the end of a transaction but that
doesn't work with long transactions.

Maybe we don't need this feature in the GUC system itself, but adding
another, similar mechanism doesn't seem reasonable either. This would
be useful for some other tracking features.
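
For illustration, the local form of the NONXACT action mentioned above
(the fourth argument being the proposed is_nonxact flag from 0003) would
look like:

=# select set_config('track_syscache_usage_interval', '10000', false, true);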


> >4. pg_set_backend_guc() function.
> >
> >Of course syscache statistics recording consumes significant amount of time so it
> >cannot be turned on usually. On the other hand since this feature is turned on by GUC,
> >it is needed to grab the active client connection to turn on/off the feature(but we
> >cannot). Instead, I provided a means to change GUC variables in another backend.
> >
> >pg_set_backend_guc(pid, name, value) sets the GUC variable "name"
> >on the backend "pid" to "value".
> >
> >
> >
> >With the above tools, we can inspect catcache statistics of seemingly bloated process.
> >
> >A. Find a bloated process pid using ps or something.
> >
> >B. Turn on syscache stats on the process.
> >  =# select pg_set_backend_guc(9984, 'track_syscache_usage_interval', '10000');
> >
> >C. Examine the statitics.
> >
> >=# select pid, relname, cache_name, size from pg_stat_syscache order by size desc
> >limit 3;
> > pid  |   relname    |            cache_name            |   size
> >------+--------------+----------------------------------+----------
> > 9984 | pg_statistic | pg_statistic_relid_att_inh_index | 32154112
> > 9984 | pg_cast      | pg_cast_source_target_index      |     4096
> > 9984 | pg_operator  | pg_operator_oprname_l_r_n_index  |     4096
> >
> >
> >=# select * from pg_stat_syscache where cache_name =
> >'pg_statistic_relid_att_inh_index'::regclass;
> >-[ RECORD 1 ]---------------------------------
> >pid         | 9984
> >relname     | pg_statistic
> >cache_name  | pg_statistic_relid_att_inh_index
> >size        | 11026176
> >ntuples     | 77950
> >searches    | 77950
> >hits        | 0
> >neg_hits    | 0
> >ageclass    | {30,60,600,1200,1800,0}
> >nentries    | {17630,16950,43370,0,0,0}
> >last_update | 2018-10-17 15:58:19.738164+09
> 
> The output of this view seems good to me.
> 
> I can imagine this use case. Does the use case of setting the GUC locally never happen?
> I mean, can the setting be changed locally?

Syscache grows through the life of a backend/session. No other
client can connect to it at the same time. So the variable
must be set at the start of a backend using ALTER USER/DATABASE,
or the client itself is obliged to deliberately turn on the
feature at a convenient time. I suppose that in most use cases
one wants to turn on this feature after seeing that another
session is eating more and more memory.
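
For example (just a sketch; the role and database names here are arbitrary):

=# alter role monitor_user set track_syscache_usage_interval = '10000';
=# alter database appdb set track_syscache_usage_interval = '10000';

Either form takes effect only for newly started sessions, which is why
pg_set_backend_guc() is provided for a session that is already running.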

The attached is the rebased version that has the multi-dimensional
ageclass.

Thank you for the comments in your following mail, but sorry, I'll
address them later.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
From 647334b5cb15926db460560c2e1cedbf33715a73 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 16 Oct 2018 13:04:30 +0900
Subject: [PATCH 1/3] Remove entries that haven't been used for a certain time

Catcache entries can be left alone for several reasons. It is not
desirable that they eat up memory. With this patch, removal of entries
that haven't been used for a certain time is considered before
enlarging the hash array.
---
 doc/src/sgml/config.sgml                      |  38 ++++++
 src/backend/access/transam/xact.c             |   5 +
 src/backend/utils/cache/catcache.c            | 166 ++++++++++++++++++++++++--
 src/backend/utils/misc/guc.c                  |  23 ++++
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/utils/catcache.h                  |  28 ++++-
 6 files changed, 254 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index db1a2d4e74..4f4654120e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1662,6 +1662,44 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target">
+      <term><varname>syscache_memory_target</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_memory_target</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to which syscache is expanded
+        without pruning. The value defaults to 0, indicating that pruning is
+        always considered. After exceeding this size, syscache pruning is
+        considered according to
+        <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep
+        certain amount of syscache entries with intermittent usage, try
+        increase this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age">
+      <term><varname>syscache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the minimum amount of unused time in seconds at which a
+        syscache entry is considered to be removed. -1 disables syscache
+        pruning entirely. The value defaults to 600 seconds
+        (<literal>10 minutes</literal>). The syscache entries that are not
+        used for the duration can be removed to prevent syscache bloat. This
+        behavior is suppressed until the size of syscache exceeds
+        <xref linkend="guc-syscache-memory-target"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d967400384..71ae0daf17 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -733,7 +733,12 @@ void
 SetCurrentStatementStartTimestamp(void)
 {
     if (!IsParallelWorker())
+    {
         stmtStartTimestamp = GetCurrentTimestamp();
+
+        /* Set this timestamp as aproximated current time */
+        SetCatCacheClock(stmtStartTimestamp);
+    }
     else
         Assert(stmtStartTimestamp != 0);
 }
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index b31fd5acea..09d5a9a520 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -71,9 +71,24 @@
 #define CACHE6_elog(a,b,c,d,e,f,g)
 #endif
 
+/*
+ * GUC variable to define the minimum size of hash to cosider entry eviction.
+ * This variable is shared among various cache mechanisms.
+ */
+int cache_memory_target = 0;
+
+/* GUC variable to define the minimum age of entries that will be cosidered to
+ * be evicted in seconds. This variable is shared among various cache
+ * mechanisms.
+ */
+int cache_prune_min_age = 600;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Timestamp used for any operation on caches. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -490,6 +505,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
         CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys,
                          cache->cc_keyno, ct->keys);
 
+    cache->cc_tupsize -= ct->size;
     pfree(ct);
 
     --cache->cc_ntup;
@@ -841,6 +857,7 @@ InitCatCache(int id,
     cp->cc_nkeys = nkeys;
     for (i = 0; i < nkeys; ++i)
         cp->cc_keyno[i] = key[i];
+    cp->cc_tupsize = 0;
 
     /*
      * new cache is initialized as far as we can go for now. print some
@@ -858,9 +875,129 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initilize catcache reference clock if haven't done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left alone for several reasons. We remove them if
+ * they are not accessed for a certain time to prevent catcache from
+ * bloating. The eviction is performed with an algorithm similar to buffer
+ * eviction, using an access counter. Entries that are accessed several times can
+ * live longer than those that have had no access in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int            i;
+    int            nremoved = 0;
+    size_t        hash_size;
+#ifdef CATCACHE_STATS
+    /* These variables are only for debugging purpose */
+    int            ntotal = 0;
+    /*
+     * nth element in nentries stores the number of cache entries that have
+     * lived unaccessed for corresponding multiple in ageclass of
+     * cache_prune_min_age. The index of nremoved_entry is the value of the
+     * clock-sweep counter, which takes from 0 up to 2.
+     */
+    double        ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+    int            nentries[] = {0, 0, 0, 0, 0, 0};
+    int            nremoved_entry[3] = {0, 0, 0};
+    int            j;
+#endif
+
+    /* Return immediately if no pruning is wanted */
+    if (cache_prune_min_age < 0)
+        return false;
+
+    /*
+     * Return without pruning if the size of the hash is below the target.
+     */
+    hash_size = cp->cc_nbuckets * sizeof(dlist_head);
+    if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L)
+        return false;
+    
+    /* Search the whole hash for entries to remove */
+    for (i = 0; i < cp->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+
+            /*
+             * Calculate the duration from the time of the last access to the
+             * "current" time. Since catcacheclock is not advanced within a
+             * transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+#ifdef CATCACHE_STATS
+            /* count catcache entries for each age class */
+            ntotal++;
+            for (j = 0 ;
+                 ageclass[j] != 0.0 &&
+                     entry_age > cache_prune_min_age * ageclass[j] ;
+                 j++);
+            if (ageclass[j] == 0.0) j--;
+            nentries[j]++;
+#endif
+
+            /*
+             * Try to remove entries older than cache_prune_min_age seconds.
+             * Entries that have not been accessed since the last pruning are
+             * removed after that duration, and entries that have been accessed
+             * several times are removed after being left alone for up to three
+             * times that duration. We don't try to shrink buckets since
+             * pruning effectively caps catcache expansion in the long term.
+             */
+            if (entry_age > cache_prune_min_age)
+            {
+#ifdef CATCACHE_STATS
+                Assert (ct->naccess >= 0 && ct->naccess <= 2);
+                nremoved_entry[ct->naccess]++;
+#endif
+                if (ct->naccess > 0)
+                    ct->naccess--;
+                else
+                {
+                    if (!ct->c_list || ct->c_list->refcount == 0)
+                    {
+                        CatCacheRemoveCTup(cp, ct);
+                        nremoved++;
+                    }
+                }
+            }
+        }
+    }
+
+#ifdef CATCACHE_STATS
+    ereport(DEBUG1,
+            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)",
+                     nremoved, ntotal,
+                     ageclass[0] * cache_prune_min_age, nentries[0],
+                     ageclass[1] * cache_prune_min_age, nentries[1],
+                     ageclass[2] * cache_prune_min_age, nentries[2],
+                     ageclass[3] * cache_prune_min_age, nentries[3],
+                     ageclass[4] * cache_prune_min_age, nentries[4],
+                     nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]),
+             errhidestmt(true)));
+#endif
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -1274,6 +1411,11 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /* Update access information for pruning */
+        if (ct->naccess < 2)
+            ct->naccess++;
+        ct->lastaccess = catcacheclock;
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1819,11 +1961,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     CatCTup    *ct;
     HeapTuple    dtp;
     MemoryContext oldcxt;
+    int            tupsize = 0;
 
     /* negative entries have no tuple associated */
     if (ntp)
     {
         int            i;
+        int            tupsize;
 
         Assert(!negative);
 
@@ -1842,13 +1986,14 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
         /* Allocate memory for CatCTup and the cached tuple in one go */
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
 
-        ct = (CatCTup *) palloc(sizeof(CatCTup) +
-                                MAXIMUM_ALIGNOF + dtp->t_len);
+        tupsize = sizeof(CatCTup) +    MAXIMUM_ALIGNOF + dtp->t_len;
+        ct = (CatCTup *) palloc(tupsize);
         ct->tuple.t_len = dtp->t_len;
         ct->tuple.t_self = dtp->t_self;
         ct->tuple.t_tableOid = dtp->t_tableOid;
         ct->tuple.t_data = (HeapTupleHeader)
             MAXALIGN(((char *) ct) + sizeof(CatCTup));
+        ct->size = tupsize;
         /* copy tuple contents */
         memcpy((char *) ct->tuple.t_data,
                (const char *) dtp->t_data,
@@ -1876,8 +2021,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     {
         Assert(negative);
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
-        ct = (CatCTup *) palloc(sizeof(CatCTup));
-
+        tupsize = sizeof(CatCTup);
+        ct = (CatCTup *) palloc(tupsize);
         /*
          * Store keys - they'll point into separately allocated memory if not
          * by-value.
@@ -1898,17 +2043,24 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
+    ct->size = tupsize;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
     cache->cc_ntup++;
     CacheHdr->ch_ntup++;
+    cache->cc_tupsize += tupsize;
 
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, try cleanup by removing
+     * infrequently used entries to make a room for the new entry. If it
+     * failed, enlarge the bucket array instead.  Quite arbitrarily, we try
+     * this when fill factor > 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
 
     return ct;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 6497393c03..28af4c8795 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -80,6 +80,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/memutils.h"
@@ -2167,6 +2168,28 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"cache_memory_target", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum syscache size to keep."),
+            gettext_noop("Cache is not pruned before exceeding this size."),
+            GUC_UNIT_KB
+        },
+        &cache_memory_target,
+        0, 0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum unused duration of cache entries before removal."),
+            gettext_noop("Cache entries that live unused for longer than this seconds are considered to be removed."),
+            GUC_UNIT_S
+        },
+        &cache_prune_min_age,
+        600, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ee9ec6a120..6bc1fc3e61 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -128,6 +128,8 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#cache_memory_target = 0kB    # in kB
+#cache_prune_min_age = 600s    # -1 disables pruning
 #max_stack_depth = 2MB            # min 100kB
 #dynamic_shared_memory_type = posix    # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 7b22f9c7bc..ace4178619 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,7 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
+    int            cc_tupsize;        /* total amount of catcache tuples */
 
     /*
      * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,7 +121,9 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
-
+    int            naccess;        /* # of access to this entry, up to 2  */
+    TimestampTz    lastaccess;        /* approx. timestamp of the last usage */
+    int            size;            /* palloc'ed size off this tuple */
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
      * catcache is list-searched with varying numbers of keys, we may have to
@@ -189,6 +193,28 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLPMPORT'ed */
+extern int cache_prune_min_age;
+extern int cache_memory_target;
+
+/* to use as access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/*
+ * SetCatCacheClock - set timestamp for catcache access record
+ */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
+static inline TimestampTz
+GetCatCacheClock(void)
+{
+    return catcacheclock;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.16.3

From 0b16d3cbfb6957e61c484fdb2794c49d69d78c9c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 16 Oct 2018 15:48:28 +0900
Subject: [PATCH 2/3] Syscache usage tracking feature.

Collects syscache usage statistics and shows them using the view
pg_stat_syscache. The feature is controlled by the GUC variable
track_syscache_usage_interval.
---
 doc/src/sgml/config.sgml                      |  15 ++
 src/backend/catalog/system_views.sql          |  17 +++
 src/backend/postmaster/pgstat.c               | 206 ++++++++++++++++++++++++--
 src/backend/tcop/postgres.c                   |  23 +++
 src/backend/utils/adt/pgstatfuncs.c           | 134 +++++++++++++++++
 src/backend/utils/cache/catcache.c            | 115 ++++++++++----
 src/backend/utils/cache/syscache.c            |  24 +++
 src/backend/utils/init/globals.c              |   1 +
 src/backend/utils/init/postinit.c             |  11 ++
 src/backend/utils/misc/guc.c                  |  10 ++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/catalog/pg_proc.dat               |   9 ++
 src/include/miscadmin.h                       |   1 +
 src/include/pgstat.h                          |   7 +-
 src/include/utils/catcache.h                  |   9 +-
 src/include/utils/syscache.h                  |  19 +++
 src/include/utils/timeout.h                   |   1 +
 17 files changed, 559 insertions(+), 44 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4f4654120e..b8a91d954d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6634,6 +6634,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-syscache-usage-interval" xreflabel="track_syscache_usage_interval">
+      <term><varname>track_syscache_usage_interval</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>track_syscache_usage_interval</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the interval to collect system cache usage statistics in
+        milliseconds. This parameter is 0 by default, which means disabled.
+        Only superusers can change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-io-timing" xreflabel="track_io_timing">
       <term><varname>track_io_timing</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 715995dd88..4f7e12463e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -903,6 +903,22 @@ CREATE VIEW pg_stat_progress_vacuum AS
     FROM pg_stat_get_progress_info('VACUUM') AS S
         LEFT JOIN pg_database D ON S.datid = D.oid;
 
+CREATE VIEW pg_stat_syscache AS
+    SELECT
+        S.pid                AS pid,
+        S.relid::regclass    AS relname,
+        S.indid::regclass    AS cache_name,
+        S.size                AS size,
+        S.ntup                AS ntuples,
+        S.searches            AS searches,
+        S.hits                AS hits,
+        S.neg_hits            AS neg_hits,
+        S.ageclass            AS ageclass,
+        S.last_update        AS last_update
+    FROM pg_stat_activity A
+    JOIN LATERAL (SELECT A.pid, * FROM pg_get_syscache_stats(A.pid)) S
+        ON (A.pid = S.pid);
+
 CREATE VIEW pg_user_mappings AS
     SELECT
         U.oid       AS umid,
@@ -1182,6 +1198,7 @@ GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_get_syscache_stats(int) TO pg_monitor;
 
 GRANT pg_read_all_settings TO pg_monitor;
 GRANT pg_read_all_stats TO pg_monitor;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8676088e57..fc50f10cbb 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -66,6 +66,7 @@
 #include "utils/ps_status.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 #include "utils/tqual.h"
 
@@ -125,6 +126,7 @@
 bool        pgstat_track_activities = false;
 bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
+int            pgstat_track_syscache_usage_interval = 0;
 int            pgstat_track_activity_query_size = 1024;
 
 /* ----------
@@ -237,6 +239,11 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
+/* bitmap symbols to specify which stats file types to remove */
+#define PGSTAT_REMFILE_DBSTAT    1        /* remove only database stats files */
+#define PGSTAT_REMFILE_SYSCACHE    2        /* remove only syscache stats files */
+#define PGSTAT_REMFILE_ALL        3        /* remove both type of files */
+
 /*
  * Info about current "snapshot" of stats file
  */
@@ -631,10 +638,13 @@ startup_failed:
 }
 
 /*
- * subroutine for pgstat_reset_all
+ * remove stats files
+ *
+ * Clean up stats files in the specified directory. target is one of
+ * PGSTAT_REMFILE_DBSTAT/SYSCACHE/ALL and restricts which files to remove.
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+pgstat_reset_remove_files(const char *directory, int target)
 {
     DIR           *dir;
     struct dirent *entry;
@@ -645,25 +655,39 @@ pgstat_reset_remove_files(const char *directory)
     {
         int            nchars;
         Oid            tmp_oid;
+        int            filetype = 0;
 
         /*
          * Skip directory entries that don't match the file names we write.
          * See get_dbstat_filename for the database-specific pattern.
          */
         if (strncmp(entry->d_name, "global.", 7) == 0)
+        {
+            filetype = PGSTAT_REMFILE_DBSTAT;
             nchars = 7;
+        }
         else
         {
+            char head[2];
+            
             nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
+            (void) sscanf(entry->d_name, "%c%c_%u.%n",
+                          head, head + 1, &tmp_oid, &nchars);
+
             /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
+            if (nchars < 3 || !isdigit(entry->d_name[3]))
                 continue;
+
+            if  (strncmp(head, "db", 2) == 0)
+                filetype = PGSTAT_REMFILE_DBSTAT;
+            else if (strncmp(head, "cc", 2) == 0)
+                filetype = PGSTAT_REMFILE_SYSCACHE;
         }
 
+        /* skip if this is not a target */
+        if ((filetype & target) == 0)
+            continue;
+
         if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
             strcmp(entry->d_name + nchars, "stat") != 0)
             continue;
@@ -684,8 +708,9 @@ pgstat_reset_remove_files(const char *directory)
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_ALL);
+    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY,
+                              PGSTAT_REMFILE_ALL);
 }
 
 #ifdef EXEC_BACKEND
@@ -4286,6 +4311,9 @@ PgstatCollectorMain(int argc, char *argv[])
     pgStatRunningInCollector = true;
     pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
 
+    /* Remove left-over syscache stats files */
+    pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_SYSCACHE);
+
     /*
      * Loop to process messages until we get SIGQUIT or detect ungraceful
      * death of our parent postmaster.
@@ -6376,3 +6404,163 @@ pgstat_clip_activity(const char *raw_activity)
 
     return activity;
 }
+
+/*
+ * return the filename for a syscache stat file; filename is the output
+ * buffer, of length len.
+ */
+void
+pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid,
+                                 char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/cc_%u.%s",
+                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
+                       pgstat_stat_directory,
+                       backendid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
+}
+
+/*
+ * pgstat_write_syscache_stats() -
+ *        Write the syscache statistics files.
+ *
+ * If 'force' is false, this function skips writing a file and returns the
+ * time remaining in the current interval in milliseconds. If 'force' is true,
+ * it writes a file regardless of the remaining time and resets the interval.
+ */
+long
+pgstat_write_syscache_stats(bool force)
+{
+    static TimestampTz last_report = 0;
+    TimestampTz now;
+    long elapsed;
+    long secs;
+    int     usecs;
+    int    cacheId;
+    FILE    *fpout;
+    char    statfile[MAXPGPATH];
+    char    tmpfile[MAXPGPATH];
+
+    /* Return if we don't want it */
+    if (!force && pgstat_track_syscache_usage_interval <= 0)
+        return 0;
+
+    
+    /* Check against the interval */
+    now = GetCurrentTransactionStopTimestamp();
+    TimestampDifference(last_report, now, &secs, &usecs);
+    elapsed = secs * 1000 + usecs / 1000;
+
+    if (!force && elapsed < pgstat_track_syscache_usage_interval)
+    {
+        /* not yet the time, inform the remaining time to the caller */
+        return pgstat_track_syscache_usage_interval - elapsed;
+    }
+
+    /* now write the file */
+    last_report = now;
+
+    pgstat_get_syscachestat_filename(false, true,
+                                     MyBackendId, tmpfile, MAXPGPATH);
+    pgstat_get_syscachestat_filename(false, false,
+                                     MyBackendId, statfile, MAXPGPATH);
+
+    /*
+     * This function can be called from ProcessInterrupts(). Inhibit recursive
+     * interrupts to avoid recursive entry.
+     */
+    HOLD_INTERRUPTS();
+
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        /*
+         * Failure writing this file is not critical. Just skip this time and
+         * tell caller to wait for the next interval.
+         */
+        RESUME_INTERRUPTS();
+        return pgstat_track_syscache_usage_interval;
+    }
+
+    /* write out every catcache stats */
+    for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++)
+    {
+        SysCacheStats *stats;
+        
+        stats = SysCacheGetStats(cacheId);
+        Assert (stats);
+
+        /* write error is checked later using ferror() */
+        fputc('T', fpout);
+        (void)fwrite(&cacheId, sizeof(int), 1, fpout);
+        (void)fwrite(&last_report, sizeof(TimestampTz), 1, fpout);
+        (void)fwrite(stats, sizeof(*stats), 1, fpout);
+    }
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write syscache statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close syscache statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename syscache statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+
+    RESUME_INTERRUPTS();
+    return 0;
+}
+
+/*
+ * GUC assignment callback for track_syscache_usage_interval.
+ *
+ * Make a statistics file immediately when syscache statistics is turned
+ * on. Remove it as soon as it is turned off as well.
+ */
+void
+pgstat_track_syscache_assign_hook(int newval, void *extra)
+{
+    if (newval > 0)
+    {
+        /*
+         * Immediately create a stats file. It's safe since we're not in the
+         * midst of accessing syscache.
+         */
+        pgstat_write_syscache_stats(true);
+    }
+    else
+    {
+        /* Turned off, immediately remove the statsfile */
+        char    fname[MAXPGPATH];
+
+        pgstat_get_syscachestat_filename(false, false, MyBackendId,
+                                         fname, MAXPGPATH);
+        unlink(fname);        /* don't care of the result */
+    }
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index a3b9757565..f2573fecbd 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3144,6 +3144,12 @@ ProcessInterrupts(void)
 
     }
 
+    if (IdleSyscacheStatsUpdateTimeoutPending)
+    {
+        IdleSyscacheStatsUpdateTimeoutPending = false;
+        pgstat_write_syscache_stats(true);
+    }
+
     if (ParallelMessagePending)
         HandleParallelMessages();
 }
@@ -3720,6 +3726,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_catcache_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4160,9 +4167,19 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long timeout;
+
                 ProcessCompletedNotifies();
                 pgstat_report_stat(false);
 
+                timeout = pgstat_write_syscache_stats(false);
+
+                if (timeout > 0)
+                {
+                    disable_idle_catcache_update_timeout = true;
+                    enable_timeout_after(IDLE_CATCACHE_UPDATE_TIMEOUT,
+                                         timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4205,6 +4222,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_catcache_update_timeout)
+        {
+            disable_timeout(IDLE_CATCACHE_UPDATE_TIMEOUT, false);
+            disable_idle_catcache_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index f955f1912a..68e713f254 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -14,6 +14,8 @@
  */
 #include "postgres.h"
 
+#include <sys/stat.h>
+
 #include "access/htup_details.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
@@ -28,6 +30,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/inet.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
@@ -1882,3 +1885,134 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     PG_RETURN_DATUM(HeapTupleGetDatum(
                                       heap_form_tuple(tupdesc, values, nulls)));
 }
+
+Datum
+pgstat_get_syscache_stats(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SYSCACHE_SIZE 9
+    int                    pid     = PG_GETARG_INT32(0);
+    ReturnSetInfo       *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+    TupleDesc            tupdesc;
+    Tuplestorestate    *tupstore;
+    MemoryContext        per_query_ctx;
+    MemoryContext        oldcontext;
+    PgBackendStatus       *beentry;
+    int                    beid;
+    char                fname[MAXPGPATH];
+    FILE                  *fpin;
+    char c;
+
+    if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("set-valued function called in context that cannot accept a set")));
+    if (!(rsinfo->allowedModes & SFRM_Materialize))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("materialize mode required, but it is not " \
+                        "allowed in this context")));
+
+    /* Build a tuple descriptor for our result type */
+    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+        elog(ERROR, "return type must be a row type");
+    
+
+    per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+
+    oldcontext = MemoryContextSwitchTo(per_query_ctx);
+    tupstore = tuplestore_begin_heap(true, false, work_mem);
+    rsinfo->returnMode = SFRM_Materialize;
+    rsinfo->setResult = tupstore;
+    rsinfo->setDesc = tupdesc;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    /* find beentry for given pid*/
+    beentry = NULL;
+    for (beid = 1;
+         (beentry = pgstat_fetch_stat_beentry(beid)) &&
+             beentry->st_procpid != pid ;
+         beid++);
+
+    /*
+     * we silently return empty result on failure or insufficient privileges
+     */
+    if (!beentry ||
+        (!has_privs_of_role(GetUserId(), beentry->st_userid) &&
+         !is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS)))
+        goto no_data;
+
+    pgstat_get_syscachestat_filename(false, false, beid, fname, MAXPGPATH);
+
+    if ((fpin = AllocateFile(fname, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(WARNING,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            fname)));
+        /* also return empty on no statistics file */
+        goto no_data;
+    }
+
+    /* read the statistics file into tuplestore */
+    while ((c = fgetc(fpin)) == 'T')
+    {
+        TimestampTz last_update;
+        SysCacheStats stats;
+        int cacheid;
+        Datum values[PG_GET_SYSCACHE_SIZE];
+        bool nulls[PG_GET_SYSCACHE_SIZE] = {0};
+        Datum datums[SYSCACHE_STATS_NAGECLASSES * 2];
+        bool arrnulls[SYSCACHE_STATS_NAGECLASSES * 2] = {0};
+        int    dims[] = {SYSCACHE_STATS_NAGECLASSES, 2};
+        int lbs[] = {1, 1};
+        ArrayType *arr;
+        int i, j;
+
+        fread(&cacheid, sizeof(int), 1, fpin);
+        fread(&last_update, sizeof(TimestampTz), 1, fpin);
+        if (fread(&stats, 1, sizeof(stats), fpin) != sizeof(stats))
+        {
+            ereport(WARNING, 
+                    (errmsg("corrupted syscache statistics file \"%s\"",
+                            fname)));
+            goto no_data;
+        }
+
+        i = 0;
+        values[i++] = ObjectIdGetDatum(stats.reloid);
+        values[i++] = ObjectIdGetDatum(stats.indoid);
+        values[i++] = Int64GetDatum(stats.size);
+        values[i++] = Int64GetDatum(stats.ntuples);
+        values[i++] = Int64GetDatum(stats.nsearches);
+        values[i++] = Int64GetDatum(stats.nhits);
+        values[i++] = Int64GetDatum(stats.nneg_hits);
+
+        for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++)
+        {
+            datums[j * 2] = Int32GetDatum((int32) stats.ageclasses[j]);
+            datums[j * 2 + 1] = Int32GetDatum((int32) stats.nclass_entries[j]);
+        }            
+
+        arr = construct_md_array(datums, arrnulls, 2, dims, lbs,
+                              INT4OID, sizeof(int32), true, 'i');
+        values[i++] = PointerGetDatum(arr);
+
+        values[i++] = TimestampTzGetDatum(last_update);
+
+        Assert (i == PG_GET_SYSCACHE_SIZE);
+
+        tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+    }
+
+    /* check for the end of file. abandon the result if file is broken */
+    if (c != 'E' || fgetc(fpin) != EOF)
+        tuplestore_clear(tupstore);
+
+    FreeFile(fpin);
+
+no_data:
+    tuplestore_donestoring(tupstore);
+    return (Datum) 0;
+}
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 09d5a9a520..50288d444c 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -89,6 +89,10 @@ static CatCacheHeader *CacheHdr = NULL;
 /* Timestamp used for any operation on caches. */
 TimestampTz    catcacheclock = 0;
 
+/* age classes for pruning */
+static double ageclass[SYSCACHE_STATS_NAGECLASSES]
+    = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -619,9 +623,7 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue)
             else
                 CatCacheRemoveCTup(cache, ct);
             CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated");
-#ifdef CATCACHE_STATS
             cache->cc_invals++;
-#endif
             /* could be multiple matches, so keep looking! */
         }
     }
@@ -697,9 +699,7 @@ ResetCatalogCache(CatCache *cache)
             }
             else
                 CatCacheRemoveCTup(cache, ct);
-#ifdef CATCACHE_STATS
             cache->cc_invals++;
-#endif
         }
     }
 }
@@ -906,10 +906,11 @@ CatCacheCleanupOldEntries(CatCache *cp)
      * cache_prune_min_age. The index of nremoved_entry is the value of the
      * clock-sweep counter, which takes from 0 up to 2.
      */
-    double        ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
-    int            nentries[] = {0, 0, 0, 0, 0, 0};
+    int            nentries[SYSCACHE_STATS_NAGECLASSES] = {0, 0, 0, 0, 0, 0};
     int            nremoved_entry[3] = {0, 0, 0};
     int            j;
+
+    Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0);
 #endif
 
     /* Return immediately if no pruning is wanted */
@@ -923,7 +924,11 @@ CatCacheCleanupOldEntries(CatCache *cp)
     if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L)
         return false;
     
-    /* Search the whole hash for entries to remove */
+    /*
+     * Search the whole hash for entries to remove. This is quite a
+     * time-consuming task during catcache lookup, but acceptable since we
+     * are now going to expand the hash table.
+     */
     for (i = 0; i < cp->cc_nbuckets; i++)
     {
         dlist_mutable_iter iter;
@@ -936,21 +941,21 @@ CatCacheCleanupOldEntries(CatCache *cp)
 
 
             /*
-             * Calculate the duration from the time of the last access to the
-             * "current" time. Since catcacheclock is not advanced within a
-             * transaction, the entries that are accessed within the current
-             * transaction won't be pruned.
+             * Calculate the duration from the time of the last access to
+             * the "current" time. Since catcacheclock is not advanced within
+             * a transaction, the entries that are accessed within the current
+             * transaction always get 0 as the result.
              */
             TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
 
 #ifdef CATCACHE_STATS
             /* count catcache entries for each age class */
             ntotal++;
-            for (j = 0 ;
-                 ageclass[j] != 0.0 &&
-                     entry_age > cache_prune_min_age * ageclass[j] ;
-                 j++);
-            if (ageclass[j] == 0.0) j--;
+
+            j = 0;
+            while (j < SYSCACHE_STATS_NAGECLASSES - 1 &&
+                   entry_age > cache_prune_min_age * ageclass[j])
+                j++;
             nentries[j]++;
 #endif
 
@@ -983,14 +988,17 @@ CatCacheCleanupOldEntries(CatCache *cp)
     }
 
 #ifdef CATCACHE_STATS
+    StaticAssertStmt(SYSCACHE_STATS_NAGECLASSES == 6,
+                     "number of syscache age class must be 6");
     ereport(DEBUG1,
-            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)",
+            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d, rest:%d) naccessed(0:%d,1:%d, 2:%d)",
 
                      nremoved, ntotal,
                      ageclass[0] * cache_prune_min_age, nentries[0],
                      ageclass[1] * cache_prune_min_age, nentries[1],
                      ageclass[2] * cache_prune_min_age, nentries[2],
                      ageclass[3] * cache_prune_min_age, nentries[3],
                      ageclass[4] * cache_prune_min_age, nentries[4],
+                     nentries[5],
                      nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]),
              errhidestmt(true)));
 #endif
@@ -1367,9 +1375,7 @@ SearchCatCacheInternal(CatCache *cache,
     if (unlikely(cache->cc_tupdesc == NULL))
         CatalogCacheInitializeCache(cache);
 
-#ifdef CATCACHE_STATS
     cache->cc_searches++;
-#endif
 
     /* Initialize local parameter array */
     arguments[0] = v1;
@@ -1429,9 +1435,7 @@ SearchCatCacheInternal(CatCache *cache,
             CACHE3_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d",
                         cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
             cache->cc_hits++;
-#endif
 
             return &ct->tuple;
         }
@@ -1440,9 +1444,7 @@ SearchCatCacheInternal(CatCache *cache,
             CACHE3_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d",
                         cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
             cache->cc_neg_hits++;
-#endif
 
             return NULL;
         }
@@ -1570,9 +1572,7 @@ SearchCatCacheMiss(CatCache *cache,
     CACHE3_elog(DEBUG2, "SearchCatCache(%s): put in bucket %d",
                 cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
     cache->cc_newloads++;
-#endif
 
     return &ct->tuple;
 }
@@ -1683,9 +1683,7 @@ SearchCatCacheList(CatCache *cache,
 
     Assert(nkeys > 0 && nkeys < cache->cc_nkeys);
 
-#ifdef CATCACHE_STATS
     cache->cc_lsearches++;
-#endif
 
     /* Initialize local parameter array */
     arguments[0] = v1;
@@ -1742,9 +1740,7 @@ SearchCatCacheList(CatCache *cache,
         CACHE2_elog(DEBUG2, "SearchCatCacheList(%s): found list",
                     cache->cc_relname);
 
-#ifdef CATCACHE_STATS
         cache->cc_lhits++;
-#endif
 
         return cl;
     }
@@ -2252,3 +2248,64 @@ PrintCatCacheListLeakWarning(CatCList *list)
          list->my_cache->cc_relname, list->my_cache->id,
          list, list->refcount);
 }
+
+/*
+ * CatCacheGetStats - fill in SysCacheStats struct.
+ *
+ * This is a support routine for SysCacheGetStats and fills in most of the
+ * result. The classification here is based on the same criteria as
+ * CatCacheCleanupOldEntries().
+ */
+void
+CatCacheGetStats(CatCache *cache, SysCacheStats *stats)
+{
+    int    i, j;
+
+    Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0);
+
+    /* fill in the stats struct */
+    stats->size = cache->cc_tupsize + cache->cc_nbuckets * sizeof(dlist_head);
+    stats->ntuples = cache->cc_ntup;
+    stats->nsearches = cache->cc_searches;
+    stats->nhits = cache->cc_hits;
+    stats->nneg_hits = cache->cc_neg_hits;
+
+    /* cache_prune_min_age can be changed within a session, so fill this in every time */
+    for (i = 0 ; i < SYSCACHE_STATS_NAGECLASSES ; i++)
+        stats->ageclasses[i] = (int) (cache_prune_min_age * ageclass[i]);
+
+    /*
+     * The n-th element of nclass_entries stores the number of cache entries
+     * that have lived unaccessed for the corresponding multiple (ageclass) of
+     * cache_prune_min_age.
+     */
+    memset(stats->nclass_entries, 0, sizeof(int) * SYSCACHE_STATS_NAGECLASSES);
+
+    /* Scan the whole hash */
+    for (i = 0; i < cache->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cache->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+            /*
+             * Calculate the duration from the last access to the "current"
+             * time. Since catcacheclock is not advanced within a transaction,
+             * entries accessed within the current transaction always get 0 as
+             * the result.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
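+            /* classify into the first age class whose threshold the entry does not exceed */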
+            j = 0;
+            while (j < SYSCACHE_STATS_NAGECLASSES - 1 &&
+                   entry_age > stats->ageclasses[j])
+                j++;
+
+            stats->nclass_entries[j]++;
+        }
+    }
+}
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index c26808a833..c06ab2a798 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -20,6 +20,9 @@
  */
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "catalog/indexing.h"
@@ -1534,6 +1537,27 @@ RelationSupportsSysCache(Oid relid)
     return false;
 }
 
+/*
+ * SysCacheGetStats - returns stats of specified syscache
+ *
+ * This routine returns the address of its local static memory.
+ */
+SysCacheStats *
+SysCacheGetStats(int cacheId)
+{
+    static SysCacheStats stats;
+
+    Assert(cacheId >=0 && cacheId < SysCacheSize);
+
+    memset(&stats, 0, sizeof(stats));
+
+    stats.reloid = cacheinfo[cacheId].reloid;
+    stats.indoid = cacheinfo[cacheId].indoid;
+
+    CatCacheGetStats(SysCache[cacheId], &stats);
+
+    return &stats;
+}
 
 /*
  * OID comparator for pg_qsort
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index c6939779b9..5d2276b90c 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index b636b1e262..0f57e1a91f 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleSyscacheStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_CATCACHE_UPDATE_TIMEOUT,
+                        IdleSyscacheStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1239,6 +1242,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleSyscacheStatsUpdateTimeoutHandler(void)
+{
+    IdleSyscacheStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 28af4c8795..ba0e65f6fb 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3130,6 +3130,16 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"track_syscache_usage_interval", PGC_SUSET, STATS_COLLECTOR,
+            gettext_noop("Sets the interval between syscache usage collection, in milliseconds. Zero disables syscache
usagetracking."),
 
+            NULL
+        },
+        &pgstat_track_syscache_usage_interval,
+        0, 0, INT_MAX / 2,
+        NULL, NULL, NULL
+    },
+
     {
         {"gin_pending_list_limit", PGC_USERSET, CLIENT_CONN_STATEMENT,
             gettext_noop("Sets the maximum size of the pending list for GIN index."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 6bc1fc3e61..e36ab26bd7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -552,6 +552,7 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
+#track_syscache_usage_interval = 0    # zero disables tracking
 #stats_temp_directory = 'pg_stat_tmp'
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 034a41eb55..4de9fdee44 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -9616,6 +9616,15 @@
   proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
   proargnames =>
'{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}',
   prosrc => 'pg_get_replication_slots' },
+{ oid => '3425',
+  descr => 'syscache statistics',
+  proname => 'pg_get_syscache_stats', prorows => '100', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', prorettype => 'record',
+  proargtypes => 'int4',
+  proallargtypes => '{int4,oid,oid,int8,int8,int8,int8,int8,_int4,timestamptz}',
+  proargmodes => '{i,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}',
+  prosrc => 'pgstat_get_syscache_stats' },
 { oid => '3786', descr => 'set up a logical replication slot',
   proname => 'pg_create_logical_replication_slot', provolatile => 'v',
   proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index d6b32c070c..15f4d23f0c 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f1c10d16b8..20add5052c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1134,6 +1134,7 @@ extern bool pgstat_track_activities;
 extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
+extern int    pgstat_track_syscache_usage_interval;
 extern char *pgstat_stat_directory;
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
@@ -1218,7 +1219,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
-
+extern void pgstat_get_syscachestat_filename(bool permanent,
+                    bool tempname, int backendid, char *filename, int len);
 /* ----------
  * pgstat_report_wait_start() -
  *
@@ -1353,5 +1355,6 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
-
+extern long pgstat_write_syscache_stats(bool force);
+extern void pgstat_track_syscache_assign_hook(int newval, void *extra);
 #endif                            /* PGSTAT_H */
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index ace4178619..721948b4cc 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -65,10 +65,8 @@ typedef struct catcache
     int            cc_tupsize;        /* total amount of catcache tuples */
 
     /*
-     * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
-     * doesn't break ABI for other modules
+     * Statistics entries
      */
-#ifdef CATCACHE_STATS
     long        cc_searches;    /* total # searches against this cache */
     long        cc_hits;        /* # of matches against existing entry */
     long        cc_neg_hits;    /* # of matches against negative entry */
@@ -81,7 +79,6 @@ typedef struct catcache
     long        cc_invals;        /* # of entries invalidated from cache */
     long        cc_lsearches;    /* total # list-searches */
     long        cc_lhits;        /* # of matches against existing lists */
-#endif
 } CatCache;
 
 
@@ -254,4 +251,8 @@ extern void PrepareToInvalidateCacheTuple(Relation relation,
 extern void PrintCatCacheLeakWarning(HeapTuple tuple);
 extern void PrintCatCacheListLeakWarning(CatCList *list);
 
+/* defined in syscache.h */
+typedef struct syscachestats SysCacheStats;
+extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats);
+
 #endif                            /* CATCACHE_H */
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index 6f290c7214..c6b10850a9 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -112,6 +112,24 @@ enum SysCacheIdentifier
 #define SysCacheSize (USERMAPPINGUSERSERVER + 1)
 };
 
+#define SYSCACHE_STATS_NAGECLASSES 6
+/* Struct for catcache tracking information */
+typedef struct syscachestats
+{
+    Oid        reloid;            /* target relation */
+    Oid        indoid;            /* index */
+    size_t    size;            /* size of the catcache */
+    int        ntuples;        /* number of tuples residing in the catcache */
+    int        nsearches;        /* number of searches */
+    int        nhits;            /* number of cache hits */
+    int        nneg_hits;        /* number of negative cache hits */
+    /* age classes in seconds */
+    int        ageclasses[SYSCACHE_STATS_NAGECLASSES];
+    /* number of tuples falling into the corresponding age class */
+    int        nclass_entries[SYSCACHE_STATS_NAGECLASSES];
+} SysCacheStats;
+
+
 extern void InitCatalogCache(void);
 extern void InitCatalogCachePhase2(void);
 
@@ -164,6 +182,7 @@ extern void SysCacheInvalidate(int cacheId, uint32 hashValue);
 extern bool RelationInvalidatesSnapshotsOnly(Oid relid);
 extern bool RelationHasSysCache(Oid relid);
 extern bool RelationSupportsSysCache(Oid relid);
+extern SysCacheStats *SysCacheGetStats(int cacheId);
 
 /*
  * The use of the macros below rather than direct calls to the corresponding
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index dcc7307c16..e2a9c33f14 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_CATCACHE_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.16.3

From 50662c1d37e70c1b357ecebab261275e286ca49a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 16 Oct 2018 21:31:22 +0900
Subject: [PATCH 3/3] Remote GUC setting feature and non-xact GUC config.

This adds two features at once (to be split later).

One is a non-transactional GUC setting feature. A GUC variable set with
the action GUC_ACTION_NONXACT (the name needs consideration) survives
beyond a rollback. This is required for remote GUC setting to work
sanely; without it, a value set remotely within a transaction would
disappear when that transaction rolls back. The only local interface
for the NONXACT action is set_config(name, value, is_local=false,
is_nonxact=true).

The second is the remote GUC setting feature. It uses ProcSignal to
notify the target backend.
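
As a rough usage sketch (illustrative only; the regression tests
included below exercise the same interfaces):

    -- Set work_mem non-transactionally in the current session;
    -- the new value survives a later ROLLBACK.
    SELECT set_config('work_mem', '128kB', false, true);

    -- Ask the backend with PID 2134 to change its work_mem.
    SELECT pg_set_backend_config(2134, 'work_mem', '16MB');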
---
 doc/src/sgml/config.sgml             |   4 +
 doc/src/sgml/func.sgml               |  30 ++
 src/backend/catalog/system_views.sql |   7 +-
 src/backend/postmaster/pgstat.c      |   3 +
 src/backend/storage/ipc/ipci.c       |   2 +
 src/backend/storage/ipc/procsignal.c |   4 +
 src/backend/tcop/postgres.c          |  10 +
 src/backend/utils/misc/README        |  26 +-
 src/backend/utils/misc/guc.c         | 619 +++++++++++++++++++++++++++++++++--
 src/include/catalog/pg_proc.dat      |  10 +-
 src/include/pgstat.h                 |   3 +-
 src/include/storage/procsignal.h     |   3 +
 src/include/utils/guc.h              |  13 +-
 src/include/utils/guc_tables.h       |   5 +-
 src/test/regress/expected/guc.out    | 223 +++++++++++++
 src/test/regress/expected/rules.out  |  26 +-
 src/test/regress/sql/guc.sql         |  88 +++++
 17 files changed, 1027 insertions(+), 49 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b8a91d954d..029642cddb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -281,6 +281,10 @@ UPDATE pg_settings SET setting = reset_val WHERE name = 'configuration_parameter
      </listitem>
     </itemizedlist>
 
+    <para>
+     Values of parameters in other sessions can also be set using the SQL
+     function <function>pg_set_backend_config</function>.
+    </para>
    </sect2>
 
    <sect2>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 09c77db045..f3e4c8f592 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -18694,6 +18694,20 @@ SELECT collation for ('foo' COLLATE "de_DE");
        <entry><type>text</type></entry>
        <entry>set parameter and return new value</entry>
       </row>
+      <row>
+       <entry>
+        <indexterm>
+         <primary>pg_set_backend_config</primary>
+        </indexterm>
+        <literal><function>pg_set_backend_config(
+                            <parameter>process_id</parameter>,
+                            <parameter>setting_name</parameter>,
+                            <parameter>new_value</parameter>)
+                            </function></literal>
+       </entry>
+       <entry><type>bool</type></entry>
+       <entry>set parameter on another session</entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
@@ -18748,6 +18762,22 @@ SELECT set_config('log_statement_stats', 'off', false);
 ------------
  off
 (1 row)
+</programlisting>
+   </para>
+
+   <para>
+    <function>pg_set_backend_config</function> sets the parameter
+    <parameter>setting_name</parameter> to
+    <parameter>new_value</parameter> on the other session with PID
+    <parameter>process_id</parameter>. The setting is always session-local, and
+    the function returns true on success.  An example:
+<programlisting>
+SELECT pg_set_backend_config(2134, 'work_mem', '16MB');
+
+pg_set_backend_config
+------------
+ t
+(1 row)
 </programlisting>
    </para>
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 4f7e12463e..642b7e28d4 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -474,7 +474,7 @@ CREATE VIEW pg_settings AS
 CREATE RULE pg_settings_u AS
     ON UPDATE TO pg_settings
     WHERE new.name = old.name DO
-    SELECT set_config(old.name, new.setting, 'f');
+    SELECT set_config(old.name, new.setting, 'f', 'f');
 
 CREATE RULE pg_settings_n AS
     ON UPDATE TO pg_settings
@@ -1048,6 +1048,11 @@ CREATE OR REPLACE FUNCTION
   RETURNS boolean STRICT VOLATILE LANGUAGE INTERNAL AS 'pg_promote'
   PARALLEL SAFE;
 
+CREATE OR REPLACE FUNCTION set_config (
+        setting_name text, new_value text, is_local boolean, is_nonxact boolean DEFAULT false)
+        RETURNS text STRICT VOLATILE LANGUAGE internal AS 'set_config_by_name'
+        PARALLEL UNSAFE;
+
 -- legacy definition for compatibility with 9.3
 CREATE OR REPLACE FUNCTION
   json_populate_record(base anyelement, from_json json, use_json_as_text boolean DEFAULT false)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index fc50f10cbb..a1e21f2696 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3707,6 +3707,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
         case WAIT_EVENT_SYNC_REP:
             event_name = "SyncRep";
             break;
+        case WAIT_EVENT_REMOTE_GUC:
+            event_name = "RemoteGUC";
+            break;
             /* no default case, so that compiler will warn */
     }
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0c86a581c0..03d526d12d 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -150,6 +150,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
         size = add_size(size, BackendRandomShmemSize());
+        size = add_size(size, GucShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -270,6 +271,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
     SyncScanShmemInit();
     AsyncShmemInit();
     BackendRandomShmemInit();
+    GucShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index b0dd7d1b37..b897c36bae 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -27,6 +27,7 @@
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "tcop/tcopprot.h"
+#include "utils/guc.h"
 
 
 /*
@@ -292,6 +293,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
     if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
         RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
+    if (CheckProcSignal(PROCSIG_REMOTE_GUC))
+        HandleRemoteGucSetInterrupt();
+
     SetLatch(MyLatch);
 
     latch_sigusr1_handler();
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index f2573fecbd..a891935528 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3152,6 +3152,10 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    /* We don't want to change GUC variables while running a query */
+    if (RemoteGucChangePending && DoingCommandRead)
+        HandleGucRemoteChanges();
 }
 
 
@@ -4188,6 +4192,12 @@ PostgresMain(int argc, char *argv[],
             send_ready_for_query = false;
         }
 
+        /*
+         * (2.5) Process some pending work.
+         */
+        if (RemoteGucChangePending)
+            HandleGucRemoteChanges();
+
         /*
          * (2) Allow asynchronous signals to be executed immediately if they
          * come in while we are waiting for client input. (This must be
diff --git a/src/backend/utils/misc/README b/src/backend/utils/misc/README
index 6e294386f7..42ae6c1a8f 100644
--- a/src/backend/utils/misc/README
+++ b/src/backend/utils/misc/README
@@ -169,10 +169,14 @@ Entry to a function with a SET option:
 Plain SET command:
 
     If no stack entry of current level:
-        Push new stack entry w/prior value and state SET
+        Push new stack entry w/prior value and state SET or
+        push new stack entry w/o value and state NONXACT.
     else if stack entry's state is SAVE, SET, or LOCAL:
         change stack state to SET, don't change saved value
         (here we are forgetting effects of prior set action)
+    else if stack entry's state is NONXACT:
+        change stack state to NONXACT_SET, saving the current value
+        as prior.
     else (entry must have state SET+LOCAL):
         discard its masked value, change state to SET
         (here we are forgetting effects of prior SET and SET LOCAL)
@@ -185,13 +189,20 @@ SET LOCAL command:
     else if stack entry's state is SAVE or LOCAL or SET+LOCAL:
         no change to stack entry
         (in SAVE case, SET LOCAL will be forgotten at func exit)
+    else if stack entry's state is NONXACT:
+        store the current value in both the prior and masked slots,
+        set state NONXACT+LOCAL.
     else (entry must have state SET):
         put current active into its masked slot, set state SET+LOCAL
     Now set new value.
 
+Setting by NONXACT action (no command exists):
+    Always blow away the existing stack, then create a new NONXACT entry.
+
 Transaction or subtransaction abort:
 
-    Pop stack entries, restoring prior value, until top < subxact depth
+    Pop stack entries, restoring prior value unless the stack entry's
+    state is NONXACT, until top < subxact depth
 
 Transaction or subtransaction commit (incl. successful function exit):
 
@@ -199,9 +210,9 @@ Transaction or subtransaction commit (incl. successful function exit):
 
         if entry's state is SAVE:
             pop, restoring prior value
-        else if level is 1 and entry's state is SET+LOCAL:
+        else if level is 1 and entry's state is SET+LOCAL or NONXACT+LOCAL:
             pop, restoring *masked* value
-        else if level is 1 and entry's state is SET:
+        else if level is 1 and entry's state is SET or NONXACT+SET:
             pop, discarding old value
         else if level is 1 and entry's state is LOCAL:
             pop, restoring prior value
@@ -210,9 +221,9 @@ Transaction or subtransaction commit (incl. successful function exit):
         else
             merge entries of level N-1 and N as specified below
 
-The merged entry will have level N-1 and prior = older prior, so easiest
-to keep older entry and free newer.  There are 12 possibilities since
-we already handled level N state = SAVE:
+The merged entry will have level N-1 and prior = older prior, so
+easiest to keep older entry and free newer.  Disregarding NONXACT,
+there are 12 possibilities since we already handled level N state = SAVE:
 
 N-1        N
 
@@ -232,6 +243,7 @@ SET+LOCAL    SET        discard top prior and second masked, state SET
 SET+LOCAL    LOCAL        discard top prior, no change to stack entry
 SET+LOCAL    SET+LOCAL    discard top prior, copy masked, state S+L
 
+(TODO: states involving NONXACT)
 
 RESET is executed like a SET, but using the reset_val as the desired new
 value.  (We do not provide a RESET LOCAL command, but SET LOCAL TO DEFAULT
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ba0e65f6fb..15c6e2889d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -216,6 +216,37 @@ static ConfigVariable *ProcessConfigFileInternal(GucContext context,
                           bool applySettings, int elevel);
 
 
+/* Enum and struct to command GUC setting to another backend */
+typedef enum
+{
+    REMGUC_VACANT,
+    REMGUC_REQUEST,
+    REMGUC_INPROCESS,
+    REMGUC_DONE,
+    REMGUC_CANCELING,
+    REMGUC_CANCELED,
+} remote_guc_status;
+
+#define GUC_REMOTE_MAX_VALUE_LEN  1024        /* an arbitrary value */
+#define GUC_REMOTE_CANCEL_TIMEOUT 5000        /* in milliseconds */
+
+typedef struct
+{
+    remote_guc_status     state;
+    char name[NAMEDATALEN];
+    char value[GUC_REMOTE_MAX_VALUE_LEN];
+    int     sourcepid;
+    int     targetpid;
+    Oid     userid;
+    bool success;
+    volatile Latch *sender_latch;
+    LWLock    lock;
+} GucRemoteSetting;
+
+static GucRemoteSetting *remote_setting;
+
+volatile bool RemoteGucChangePending = false;
+
 /*
  * Options for enum values defined in this module.
  *
@@ -3137,7 +3168,7 @@ static struct config_int ConfigureNamesInt[] =
         },
         &pgstat_track_syscache_usage_interval,
         0, 0, INT_MAX / 2,
-        NULL, NULL, NULL
+        NULL, &pgstat_track_syscache_assign_hook, NULL
     },
 
     {
@@ -4695,7 +4726,6 @@ discard_stack_value(struct config_generic *gconf, config_var_value *val)
     set_extra_field(gconf, &(val->extra), NULL);
 }
 
-
 /*
  * Fetch the sorted array pointer (exported for help_config.c's use ONLY)
  */
@@ -5487,6 +5517,22 @@ push_old_value(struct config_generic *gconf, GucAction action)
 
     /* Do we already have a stack entry of the current nest level? */
     stack = gconf->stack;
+
+    /* A NONXACT action makes the existing stack useless */
+    if (action == GUC_ACTION_NONXACT)
+    {
+        while (stack)
+        {
+            GucStack *prev = stack->prev;
+
+            discard_stack_value(gconf, &stack->prior);
+            discard_stack_value(gconf, &stack->masked);
+            pfree(stack);
+            stack = prev;
+        }
+        stack = gconf->stack = NULL;
+    }
+
     if (stack && stack->nest_level >= GUCNestLevel)
     {
         /* Yes, so adjust its state if necessary */
@@ -5494,28 +5540,63 @@ push_old_value(struct config_generic *gconf, GucAction action)
         switch (action)
         {
             case GUC_ACTION_SET:
-                /* SET overrides any prior action at same nest level */
-                if (stack->state == GUC_SET_LOCAL)
+                if (stack->state == GUC_NONXACT)
                 {
-                    /* must discard old masked value */
-                    discard_stack_value(gconf, &stack->masked);
+                    /* NONXACT rolls back to the current value */
+                    stack->scontext = gconf->scontext;
+                    set_stack_value(gconf, &stack->prior);
+                    stack->state = GUC_NONXACT_SET;
                 }
-                stack->state = GUC_SET;
+                else 
+                {
+                    /* SET overrides other prior actions at same nest level */
+                    if (stack->state == GUC_SET_LOCAL)
+                    {
+                        /* must discard old masked value */
+                        discard_stack_value(gconf, &stack->masked);
+                    }
+                    stack->state = GUC_SET;
+                }
+
                 break;
+
             case GUC_ACTION_LOCAL:
                 if (stack->state == GUC_SET)
                 {
-                    /* SET followed by SET LOCAL, remember SET's value */
+                    /* SET followed by SET LOCAL, remember its value */
                     stack->masked_scontext = gconf->scontext;
                     set_stack_value(gconf, &stack->masked);
                     stack->state = GUC_SET_LOCAL;
                 }
+                else if (stack->state == GUC_NONXACT)
+                {
+                    /*
+                     * NONXACT followed by SET LOCAL, both prior and masked
+                     * are set to the current value
+                     */
+                    stack->scontext = gconf->scontext;
+                    set_stack_value(gconf, &stack->prior);
+                    stack->masked_scontext = stack->scontext;
+                    stack->masked = stack->prior;
+                    stack->state = GUC_NONXACT_LOCAL;
+                }
+                else if (stack->state == GUC_NONXACT_SET)
+                {
+                    /* NONXACT_SET followed by SET LOCAL, set masked */
+                    stack->masked_scontext = gconf->scontext;
+                    set_stack_value(gconf, &stack->masked);
+                    stack->state = GUC_NONXACT_LOCAL;
+                }
                 /* in all other cases, no change to stack entry */
                 break;
             case GUC_ACTION_SAVE:
                 /* Could only have a prior SAVE of same variable */
                 Assert(stack->state == GUC_SAVE);
                 break;
+
+            case GUC_ACTION_NONXACT:
+                Assert(false);
+                break;
         }
         Assert(guc_dirty);        /* must be set already */
         return;
@@ -5531,6 +5612,7 @@ push_old_value(struct config_generic *gconf, GucAction action)
 
     stack->prev = gconf->stack;
     stack->nest_level = GUCNestLevel;
+        
     switch (action)
     {
         case GUC_ACTION_SET:
@@ -5542,10 +5624,15 @@ push_old_value(struct config_generic *gconf, GucAction action)
         case GUC_ACTION_SAVE:
             stack->state = GUC_SAVE;
             break;
+        case GUC_ACTION_NONXACT:
+            stack->state = GUC_NONXACT;
+            break;
     }
     stack->source = gconf->source;
     stack->scontext = gconf->scontext;
-    set_stack_value(gconf, &stack->prior);
+
+    if (action != GUC_ACTION_NONXACT)
+        set_stack_value(gconf, &stack->prior);
 
     gconf->stack = stack;
 
@@ -5640,22 +5727,31 @@ AtEOXact_GUC(bool isCommit, int nestLevel)
              * stack entries to avoid leaking memory.  If we do set one of
              * those flags, unused fields will be cleaned up after restoring.
              */
-            if (!isCommit)        /* if abort, always restore prior value */
-                restorePrior = true;
+            if (!isCommit)
+            {
+                /* GUC_NONXACT doesn't roll back */
+                if (stack->state != GUC_NONXACT)
+                    restorePrior = true;
+            }
             else if (stack->state == GUC_SAVE)
                 restorePrior = true;
             else if (stack->nest_level == 1)
             {
                 /* transaction commit */
-                if (stack->state == GUC_SET_LOCAL)
+                if (stack->state == GUC_SET_LOCAL ||
+                    stack->state == GUC_NONXACT_LOCAL)
                     restoreMasked = true;
-                else if (stack->state == GUC_SET)
+                else if (stack->state == GUC_SET ||
+                         stack->state == GUC_NONXACT_SET)
                 {
                     /* we keep the current active value */
                     discard_stack_value(gconf, &stack->prior);
                 }
-                else            /* must be GUC_LOCAL */
+                else if (stack->state != GUC_NONXACT)
+                {
+                    /* must be GUC_LOCAL */
                     restorePrior = true;
+                }
             }
             else if (prev == NULL ||
                      prev->nest_level < stack->nest_level - 1)
@@ -5677,11 +5773,27 @@ AtEOXact_GUC(bool isCommit, int nestLevel)
                         break;
 
                     case GUC_SET:
-                        /* next level always becomes SET */
-                        discard_stack_value(gconf, &stack->prior);
-                        if (prev->state == GUC_SET_LOCAL)
+                        if (prev->state == GUC_SET ||
+                            prev->state == GUC_NONXACT_SET)
+                        {
+                            discard_stack_value(gconf, &stack->prior);
+                        }
+                        else if (prev->state == GUC_NONXACT)
+                        {
+                            prev->scontext = stack->scontext;
+                            prev->prior = stack->prior;
+                            prev->state = GUC_NONXACT_SET;
+                        }
+                        else if (prev->state == GUC_SET_LOCAL ||
+                                 prev->state == GUC_NONXACT_LOCAL)
+                        {
+                            discard_stack_value(gconf, &stack->prior);
                             discard_stack_value(gconf, &prev->masked);
-                        prev->state = GUC_SET;
+                            if (prev->state == GUC_SET_LOCAL)
+                                prev->state = GUC_SET;
+                            else
+                                prev->state = GUC_NONXACT_SET;
+                        }
                         break;
 
                     case GUC_LOCAL:
@@ -5692,6 +5804,16 @@ AtEOXact_GUC(bool isCommit, int nestLevel)
                             prev->masked = stack->prior;
                             prev->state = GUC_SET_LOCAL;
                         }
+                        else if (prev->state == GUC_NONXACT)
+                        {
+                            prev->prior = stack->masked;
+                            prev->scontext = stack->masked_scontext;
+                            prev->masked = stack->masked;
+                            prev->masked_scontext = stack->masked_scontext;
+                            discard_stack_value(gconf, &stack->prior);
+                            discard_stack_value(gconf, &stack->masked);
+                            prev->state = GUC_NONXACT_SET;
+                        }
                         else
                         {
                             /* else just forget this stack level */
@@ -5700,15 +5822,32 @@ AtEOXact_GUC(bool isCommit, int nestLevel)
                         break;
 
                     case GUC_SET_LOCAL:
-                        /* prior state at this level no longer wanted */
-                        discard_stack_value(gconf, &stack->prior);
-                        /* copy down the masked state */
-                        prev->masked_scontext = stack->masked_scontext;
-                        if (prev->state == GUC_SET_LOCAL)
-                            discard_stack_value(gconf, &prev->masked);
-                        prev->masked = stack->masked;
-                        prev->state = GUC_SET_LOCAL;
+                        if (prev->state == GUC_NONXACT)
+                        {
+                            prev->prior = stack->prior;
+                            prev->masked = stack->prior;
+                            discard_stack_value(gconf, &stack->prior);
+                            discard_stack_value(gconf, &stack->masked);
+                            prev->state = GUC_NONXACT_SET;
+                        }
+                        else if (prev->state != GUC_NONXACT_SET)
+                        {
+                            /* prior state at this level no longer wanted */
+                            discard_stack_value(gconf, &stack->prior);
+                            /* copy down the masked state */
+                            prev->masked_scontext = stack->masked_scontext;
+                            if (prev->state == GUC_SET_LOCAL)
+                                discard_stack_value(gconf, &prev->masked);
+                            prev->masked = stack->masked;
+                            prev->state = GUC_SET_LOCAL;
+                        }
                         break;
+                    case GUC_NONXACT:
+                    case GUC_NONXACT_SET:
+                    case GUC_NONXACT_LOCAL:
+                        Assert(false);
+                        break;
+                        
                 }
             }
 
@@ -7989,7 +8128,8 @@ set_config_by_name(PG_FUNCTION_ARGS)
     char       *name;
     char       *value;
     char       *new_value;
-    bool        is_local;
+    int            set_action = GUC_ACTION_SET;
+
 
     if (PG_ARGISNULL(0))
         ereport(ERROR,
@@ -8009,18 +8149,27 @@ set_config_by_name(PG_FUNCTION_ARGS)
      * Get the desired state of is_local. Default to false if provided value
      * is NULL
      */
-    if (PG_ARGISNULL(2))
-        is_local = false;
-    else
-        is_local = PG_GETARG_BOOL(2);
+    if (!PG_ARGISNULL(2) && PG_GETARG_BOOL(2))
+        set_action = GUC_ACTION_LOCAL;
+
+    /*
+     * Get the desired state of is_nonxact. Default to false if provided value
+     * is NULL
+     */
+    if (!PG_ARGISNULL(3) && PG_GETARG_BOOL(3))
+    {
+        if (set_action == GUC_ACTION_LOCAL)
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                     errmsg("Only one of is_local and is_nonxact can be true")));
+        set_action = GUC_ACTION_NONXACT;
+    }
 
     /* Note SET DEFAULT (argstring == NULL) is equivalent to RESET */
     (void) set_config_option(name,
                              value,
                              (superuser() ? PGC_SUSET : PGC_USERSET),
-                             PGC_S_SESSION,
-                             is_local ? GUC_ACTION_LOCAL : GUC_ACTION_SET,
-                             true, 0, false);
+                             PGC_S_SESSION, set_action, true, 0, false);
 
     /* get the new current value */
     new_value = GetConfigOptionByName(name, NULL, false);
@@ -8029,7 +8178,6 @@ set_config_by_name(PG_FUNCTION_ARGS)
     PG_RETURN_TEXT_P(cstring_to_text(new_value));
 }
 
-
 /*
  * Common code for DefineCustomXXXVariable subroutines: allocate the
  * new variable's config struct and fill in generic fields.
@@ -8228,6 +8376,13 @@ reapply_stacked_values(struct config_generic *variable,
                                          WARNING, false);
                 break;
 
+            case GUC_NONXACT:
+                (void) set_config_option(name, curvalue,
+                                         curscontext, cursource,
+                                         GUC_ACTION_NONXACT, true,
+                                         WARNING, false);
+                break;
+
             case GUC_LOCAL:
                 (void) set_config_option(name, curvalue,
                                          curscontext, cursource,
@@ -8247,6 +8402,33 @@ reapply_stacked_values(struct config_generic *variable,
                                          GUC_ACTION_LOCAL, true,
                                          WARNING, false);
                 break;
+
+            case GUC_NONXACT_SET:
+                /* first, apply the masked value as SET */
+                (void) set_config_option(name, stack->masked.val.stringval,
+                                         stack->masked_scontext, PGC_S_SESSION,
+                                         GUC_ACTION_NONXACT, true,
+                                         WARNING, false);
+                /* then apply the current value as LOCAL */
+                (void) set_config_option(name, curvalue,
+                                         curscontext, cursource,
+                                         GUC_ACTION_SET, true,
+                                         WARNING, false);
+                break;
+
+            case GUC_NONXACT_LOCAL:
+                /* first, apply the masked value as SET */
+                (void) set_config_option(name, stack->masked.val.stringval,
+                                         stack->masked_scontext, PGC_S_SESSION,
+                                         GUC_ACTION_NONXACT, true,
+                                         WARNING, false);
+                /* then apply the current value as LOCAL */
+                (void) set_config_option(name, curvalue,
+                                         curscontext, cursource,
+                                         GUC_ACTION_LOCAL, true,
+                                         WARNING, false);
+                break;
+
         }
 
         /* If we successfully made a stack entry, adjust its nest level */
@@ -10225,6 +10407,373 @@ GUCArrayReset(ArrayType *array)
     return newarray;
 }
 
+Size
+GucShmemSize(void)
+{
+    Size size;
+
+    size = sizeof(GucRemoteSetting);
+
+    return size;
+}
+
+void
+GucShmemInit(void)
+{
+    Size    size;
+    bool    found;
+
+    size = sizeof(GucRemoteSetting);
+    remote_setting = (GucRemoteSetting *)
+        ShmemInitStruct("GUC remote setting", size, &found);
+
+    if (!found)
+    {
+        MemSet(remote_setting, 0, size);
+        LWLockInitialize(&remote_setting->lock, LWLockNewTrancheId());
+    }
+
+    LWLockRegisterTranche(remote_setting->lock.tranche, "guc_remote");
+}
+
+/*
+ * set_backend_config: SQL callable function to set GUC variable of remote
+ * session.
+ */
+Datum
+set_backend_config(PG_FUNCTION_ARGS)
+{
+    int        pid   = PG_GETARG_INT32(0);
+    char   *name  = text_to_cstring(PG_GETARG_TEXT_P(1));
+    char   *value = text_to_cstring(PG_GETARG_TEXT_P(2));
+    TimestampTz    cancel_start;
+    PgBackendStatus *beentry;
+    int beid;
+    int rc;
+
+    if (strlen(name) >= NAMEDATALEN)
+        ereport(ERROR,
+                (errcode(ERRCODE_NAME_TOO_LONG),
+                 errmsg("name of GUC variable is too long")));
+    if (strlen(value) >= GUC_REMOTE_MAX_VALUE_LEN)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("value is too long"),
+                 errdetail("Maximum acceptable length of value is %d",
+                     GUC_REMOTE_MAX_VALUE_LEN - 1)));
+
+    /* find beentry for given pid */
+    beentry = NULL;
+    for (beid = 1;
+         (beentry = pgstat_fetch_stat_beentry(beid)) &&
+             beentry->st_procpid != pid ;
+         beid++);
+
+    /*
+     * This will be checked out by SendProcSignal but do here to emit
+     * appropriate message message.
+     */
+    if (!beentry)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("process PID %d not found", pid)));
+
+    /* allow only client backends */
+    if (beentry->st_backendType != B_BACKEND)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("not a client backend")));
+    
+    /*
+     * Wait if someone is sending a request. We need to wait with timeout
+     * since the current user of the struct doesn't wake me up.
+     */
+    LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+    while (remote_setting->state != REMGUC_VACANT)
+    {
+        LWLockRelease(&remote_setting->lock);
+        rc = WaitLatch(&MyProc->procLatch,
+                       WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+                       200, PG_WAIT_ACTIVITY);
+
+        if (rc & WL_POSTMASTER_DEATH)
+            return (Datum) BoolGetDatum(false);
+
+        CHECK_FOR_INTERRUPTS();
+
+        LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+    }
+
+    /* my turn, send a request */
+    Assert(remote_setting->state == REMGUC_VACANT);
+
+    remote_setting->state = REMGUC_REQUEST;
+    remote_setting->sourcepid = MyProcPid;
+    remote_setting->targetpid = pid;
+    remote_setting->userid = GetUserId();
+
+    strncpy(remote_setting->name, name, NAMEDATALEN);
+    remote_setting->name[NAMEDATALEN - 1] = 0;
+    strncpy(remote_setting->value, value, GUC_REMOTE_MAX_VALUE_LEN);
+    remote_setting->value[GUC_REMOTE_MAX_VALUE_LEN - 1] = 0;
+    remote_setting->sender_latch = MyLatch;
+
+    LWLockRelease(&remote_setting->lock);
+
+    if (SendProcSignal(pid, PROCSIG_REMOTE_GUC, InvalidBackendId) < 0)
+    {
+        remote_setting->state = REMGUC_VACANT;
+        ereport(ERROR,
+                (errmsg("could not signal backend with PID %d: %m", pid)));
+    }
+
+    /*
+     * This request is processed only while idle time of peer so it may take a
+     * long time before we get a response.
+     */
+    LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+    while (remote_setting->state != REMGUC_DONE)
+    {
+        LWLockRelease(&remote_setting->lock);
+        rc = WaitLatch(&MyProc->procLatch,
+                       WL_LATCH_SET | WL_POSTMASTER_DEATH,
+                       -1, PG_WAIT_ACTIVITY);
+
+        /* we don't care about the state in this case */
+        if (rc & WL_POSTMASTER_DEATH)
+            return (Datum) BoolGetDatum(false);
+
+        LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+
+        /* get out if we got a query cancel request */
+        if (QueryCancelPending)
+            break;
+    }
+
+    /*
+     * Cancel the request if possible. We cannot cancel it if the peer has
+     * already processed it. We check the request status rather than
+     * QueryCancelPending so that that case is handled properly.
+     */
+    if (remote_setting->state == REMGUC_REQUEST)
+    {
+        Assert(QueryCancelPending);
+
+        remote_setting->state = REMGUC_CANCELING;
+        LWLockRelease(&remote_setting->lock);
+
+        if (SendProcSignal(pid,
+                           PROCSIG_REMOTE_GUC, InvalidBackendId) < 0)
+        {
+            remote_setting->state = REMGUC_VACANT;
+            ereport(ERROR,
+                    (errmsg("could not signal backend with PID %d: %m",
+                            pid)));
+        }
+
+        /* Peer must respond shortly, don't sleep for a long time. */
+        
+        cancel_start = GetCurrentTimestamp();
+
+        LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+        while (remote_setting->state != REMGUC_CANCELED &&
+               !TimestampDifferenceExceeds(cancel_start, GetCurrentTimestamp(),
+                                           GUC_REMOTE_CANCEL_TIMEOUT))
+        {
+            LWLockRelease(&remote_setting->lock);
+            rc = WaitLatch(&MyProc->procLatch,
+                           WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+                           GUC_REMOTE_CANCEL_TIMEOUT, PG_WAIT_ACTIVITY);
+
+            /* we don't care about the state in this case */
+            if (rc & WL_POSTMASTER_DEATH)
+                return (Datum) BoolGetDatum(false);
+
+            LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+        }
+
+        if (remote_setting->state != REMGUC_CANCELED)
+        {
+            remote_setting->state = REMGUC_VACANT;
+            ereport(ERROR, (errmsg("failed cancelling remote GUC request")));
+        }
+
+        remote_setting->state = REMGUC_VACANT;
+        LWLockRelease(&remote_setting->lock);
+
+        ereport(INFO,
+                (errmsg("remote GUC change request to PID %d is canceled",
+                              pid)));
+
+        return (Datum) BoolGetDatum(false);
+    }
+
+    Assert (remote_setting->state == REMGUC_DONE);
+
+    /* ereport exits on query cancel, we need this before that */
+    remote_setting->state = REMGUC_VACANT;
+
+    if (QueryCancelPending)
+        ereport(INFO,
+                (errmsg("remote GUC change request to PID %d already completed",
+                        pid)));
+                
+    if (!remote_setting->success)
+        ereport(ERROR,
+                (errmsg("%s", remote_setting->value)));
+
+    LWLockRelease(&remote_setting->lock);
+
+    return (Datum) BoolGetDatum(true);
+}
+
+
+void
+HandleRemoteGucSetInterrupt(void)
+{
+    LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+
+    /* check if any request is being sent to me */
+    if (remote_setting->targetpid == MyProcPid)
+    {
+        switch (remote_setting->state)
+        {
+        case REMGUC_REQUEST:
+            InterruptPending = true;
+            RemoteGucChangePending = true;
+            break;
+        case REMGUC_CANCELING:
+            InterruptPending = true;
+            RemoteGucChangePending = true;
+            remote_setting->state = REMGUC_CANCELED;
+            SetLatch(remote_setting->sender_latch);
+            break;
+        default:
+            break;
+        }
+    }
+    LWLockRelease(&remote_setting->lock);
+}
+
+void
+HandleGucRemoteChanges(void)
+{
+    MemoryContext currentcxt = CurrentMemoryContext;
+    bool    canceling = false;
+    bool    process_request = true;
+    int        saveInterruptHoldoffCount = 0;
+    int        saveQueryCancelHoldoffCount = 0;
+
+    RemoteGucChangePending = false;
+    LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+
+    /* skip if this request is no longer for me */
+    if (remote_setting->targetpid != MyProcPid)
+        process_request = false;
+    else
+    {
+        switch (remote_setting->state)
+        {
+        case REMGUC_REQUEST:
+            remote_setting->state = REMGUC_INPROCESS;
+            break;
+        case REMGUC_CANCELING:
+            /*
+             * This request has already been canceled, but we entered this
+             * function before receiving the signal. Cancel the request here.
+             */
+            remote_setting->state = REMGUC_CANCELED;
+            remote_setting->success = false;
+            canceling = true;
+            break;
+        case REMGUC_VACANT:
+        case REMGUC_CANCELED:
+        case REMGUC_INPROCESS:
+        case REMGUC_DONE:
+            /* Just ignore the cases */
+            process_request = false;
+            break;
+        }
+    }
+
+    LWLockRelease(&remote_setting->lock);
+
+    if (!process_request)
+        return;
+
+    if (canceling)
+    {
+        SetLatch(remote_setting->sender_latch);
+        return;
+    }
+
+
+    /* Okay, actually modify variable */
+    remote_setting->success = true;
+
+    PG_TRY();
+    {
+        bool     has_privilege;
+        bool     is_superuser;
+        bool end_transaction = false;
+        /*
+         * XXXX: ERROR resets the following variables but we don't want that.
+         */
+        saveInterruptHoldoffCount = InterruptHoldoffCount;
+        saveQueryCancelHoldoffCount = QueryCancelHoldoffCount;
+
+        /* superuser_arg requires a transaction */
+        if (!IsTransactionState())
+        {
+            StartTransactionCommand();
+            end_transaction  = true;
+        }
+        is_superuser = superuser_arg(remote_setting->userid);
+        has_privilege = is_superuser ||
+            has_privs_of_role(remote_setting->userid, GetUserId());
+
+        if (end_transaction)
+            CommitTransactionCommand();
+
+        if (!has_privilege)
+            elog(ERROR, "role %u is not allowed to set GUC variables on the session with PID %d",
+                 remote_setting->userid, MyProcPid);
+        
+        (void) set_config_option(remote_setting->name, remote_setting->value,
+                                 is_superuser ? PGC_SUSET : PGC_USERSET,
+                                 PGC_S_SESSION, GUC_ACTION_NONXACT,
+                                 true, ERROR, false);
+    }
+    PG_CATCH();
+    {
+        ErrorData *errdata;
+        MemoryContextSwitchTo(currentcxt);
+        errdata = CopyErrorData();
+        remote_setting->success = false;
+        strncpy(remote_setting->value, errdata->message,
+                GUC_REMOTE_MAX_VALUE_LEN);
+        remote_setting->value[GUC_REMOTE_MAX_VALUE_LEN - 1] = 0;
+        FlushErrorState();
+
+        /* restore the saved value */
+        InterruptHoldoffCount = saveInterruptHoldoffCount ;
+        QueryCancelHoldoffCount = saveQueryCancelHoldoffCount;
+        
+    }
+    PG_END_TRY();
+
+    ereport(LOG,
+            (errmsg("GUC variable \"%s\" is changed to \"%s\" by request from another backend with PID %d",
+                    remote_setting->name, remote_setting->value,
+                    remote_setting->sourcepid)));
+
+    LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+    remote_setting->state = REMGUC_DONE;
+    LWLockRelease(&remote_setting->lock);
+
+    SetLatch(remote_setting->sender_latch);
+}
+
 /*
  * Validate a proposed option setting for GUCArrayAdd/Delete/Reset.
  *
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4de9fdee44..62a64db022 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5647,8 +5647,8 @@
   proargtypes => 'text bool', prosrc => 'show_config_by_name_missing_ok' },
 { oid => '2078', descr => 'SET X as a function',
   proname => 'set_config', proisstrict => 'f', provolatile => 'v',
-  proparallel => 'u', prorettype => 'text', proargtypes => 'text text bool',
-  prosrc => 'set_config_by_name' },
+  proparallel => 'u', prorettype => 'text',
+  proargtypes => 'text text bool bool', prosrc => 'set_config_by_name' },
 { oid => '2084', descr => 'SHOW ALL as a function',
   proname => 'pg_show_all_settings', prorows => '1000', proretset => 't',
   provolatile => 's', prorettype => 'record', proargtypes => '',
@@ -9625,6 +9625,12 @@
   proargmodes => '{i,o,o,o,o,o,o,o,o,o}',
   proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}',
   prosrc => 'pgstat_get_syscache_stats' },
+{ oid => '3424',
+  descr => 'set config of another backend',
+  proname => 'pg_set_backend_config', proisstrict => 'f',
+  proretset => 'f', provolatile => 'v', proparallel => 'u',
+  prorettype => 'bool', proargtypes => 'int4 text text',
+  prosrc => 'set_backend_config' },
 { oid => '3786', descr => 'set up a logical replication slot',
   proname => 'pg_create_logical_replication_slot', provolatile => 'v',
   proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 20add5052c..198fa42f80 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -833,7 +833,8 @@ typedef enum
     WAIT_EVENT_REPLICATION_ORIGIN_DROP,
     WAIT_EVENT_REPLICATION_SLOT_DROP,
     WAIT_EVENT_SAFE_SNAPSHOT,
-    WAIT_EVENT_SYNC_REP
+    WAIT_EVENT_SYNC_REP,
+    WAIT_EVENT_REMOTE_GUC
 } WaitEventIPC;
 
 /* ----------
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 6db0d69b71..4ad4927d3d 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -42,6 +42,9 @@ typedef enum
     PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
     PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
 
+    /* Remote GUC setting */
+    PROCSIG_REMOTE_GUC,
+
     NUM_PROCSIGNALS                /* Must be last! */
 } ProcSignalReason;
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index df2e556b02..0f3498fc6d 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -193,7 +193,8 @@ typedef enum
     /* Types of set_config_option actions */
     GUC_ACTION_SET,                /* regular SET command */
     GUC_ACTION_LOCAL,            /* SET LOCAL command */
-    GUC_ACTION_SAVE                /* function SET option, or temp assignment */
+    GUC_ACTION_SAVE,            /* function SET option, or temp assignment */
+    GUC_ACTION_NONXACT            /* non-transactional setting */
 } GucAction;
 
 #define GUC_QUALIFIER_SEPARATOR '.'
@@ -268,6 +269,8 @@ extern int    tcp_keepalives_idle;
 extern int    tcp_keepalives_interval;
 extern int    tcp_keepalives_count;
 
+extern volatile bool RemoteGucChangePending;
+
 #ifdef TRACE_SORT
 extern bool trace_sort;
 #endif
@@ -275,6 +278,11 @@ extern bool trace_sort;
 /*
  * Functions exported by guc.c
  */
+extern Size GucShmemSize(void);
+extern void GucShmemInit(void);
+extern Datum set_backend_config(PG_FUNCTION_ARGS);
+extern void HandleRemoteGucSetInterrupt(void);
+extern void HandleGucRemoteChanges(void);
 extern void SetConfigOption(const char *name, const char *value,
                 GucContext context, GucSource source);
 
@@ -394,6 +402,9 @@ extern Size EstimateGUCStateSpace(void);
 extern void SerializeGUCState(Size maxsize, char *start_address);
 extern void RestoreGUCState(void *gucstate);
 
+/* Remote GUC setting */
+extern void HandleGucRemoteChanges(void);
+
 /* Support for messages reported from GUC check hooks */
 
 extern PGDLLIMPORT char *GUC_check_errmsg_string;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 6f9fdb6a5f..4980a01c97 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -115,7 +115,10 @@ typedef enum
     GUC_SAVE,                    /* entry caused by function SET option */
     GUC_SET,                    /* entry caused by plain SET command */
     GUC_LOCAL,                    /* entry caused by SET LOCAL command */
-    GUC_SET_LOCAL                /* entry caused by SET then SET LOCAL */
+    GUC_NONXACT,                /* entry caused by non-transactional ops */
+    GUC_SET_LOCAL,                /* entry caused by SET then SET LOCAL */
+    GUC_NONXACT_SET,            /* entry caused by NONXACT then SET */
+    GUC_NONXACT_LOCAL            /* entry caused by NONXACT then (SET)LOCAL */
 } GucStackState;
 
 typedef struct guc_stack
diff --git a/src/test/regress/expected/guc.out b/src/test/regress/expected/guc.out
index 43ac5f5f11..2c074705c7 100644
--- a/src/test/regress/expected/guc.out
+++ b/src/test/regress/expected/guc.out
@@ -476,6 +476,229 @@ SELECT '2006-08-13 12:34:56'::timestamptz;
  2006-08-13 12:34:56-07
 (1 row)
 
+-- NONXACT followed by SET, SET LOCAL through COMMIT
+BEGIN;
+SELECT set_config('work_mem', '128kB', false, true); -- NONXACT
+ set_config 
+------------
+ 128kB
+(1 row)
+
+SET work_mem to '256kB';
+SET LOCAL work_mem to '512kB';
+SHOW work_mem;    -- must see 512kB
+ work_mem 
+----------
+ 512kB
+(1 row)
+
+COMMIT;
+SHOW work_mem;    -- must see 256kB
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+-- NONXACT followed by SET, SET LOCAL through ROLLBACK
+BEGIN;
+SELECT set_config('work_mem', '128kB', false, true); -- NONXACT
+ set_config 
+------------
+ 128kB
+(1 row)
+
+SET work_mem to '256kB';
+SET LOCAL work_mem to '512kB';
+SHOW work_mem;    -- must see 512kB
+ work_mem 
+----------
+ 512kB
+(1 row)
+
+ROLLBACK;
+SHOW work_mem;    -- must see 128kB
+ work_mem 
+----------
+ 128kB
+(1 row)
+
+-- SET, SET LOCAL followed by NONXACT through COMMIT
+BEGIN;
+SET work_mem to '256kB';
+SET LOCAL work_mem to '512kB';
+SELECT set_config('work_mem', '128kB', false, true); -- NONXACT
+ set_config 
+------------
+ 128kB
+(1 row)
+
+SHOW work_mem;    -- must see 128kB
+ work_mem 
+----------
+ 128kB
+(1 row)
+
+COMMIT;
+SHOW work_mem;    -- must see 128kB
+ work_mem 
+----------
+ 128kB
+(1 row)
+
+-- SET, SET LOCAL followed by NONXACT through ROLLBACK
+BEGIN;
+SET work_mem to '256kB';
+SET LOCAL work_mem to '512kB';
+SELECT set_config('work_mem', '128kB', false, true); -- NONXACT
+ set_config 
+------------
+ 128kB
+(1 row)
+
+SHOW work_mem;    -- must see 128kB
+ work_mem 
+----------
+ 128kB
+(1 row)
+
+ROLLBACK;
+SHOW work_mem;    -- must see 128kB
+ work_mem 
+----------
+ 128kB
+(1 row)
+
+-- NONXACT and SAVEPOINT
+SET work_mem TO '64kB';
+BEGIN;
+SET work_mem TO '128kB';
+SAVEPOINT a;
+SELECT set_config('work_mem', '256kB', false, true); -- NONXACT
+ set_config 
+------------
+ 256kB
+(1 row)
+
+SHOW work_mem;
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+SET LOCAL work_mem TO '384kB';
+RELEASE SAVEPOINT a;
+SHOW work_mem; -- will see 384kB
+ work_mem 
+----------
+ 384kB
+(1 row)
+
+COMMIT;
+SHOW work_mem; -- will see 256kB
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+--
+SET work_mem TO '64kB';
+BEGIN;
+SET work_mem TO '128kB';
+SAVEPOINT a;
+SELECT set_config('work_mem', '256kB', false, true); -- NONXACT
+ set_config 
+------------
+ 256kB
+(1 row)
+
+SHOW work_mem;
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+SET LOCAL work_mem TO '384kB';
+ROLLBACK TO SAVEPOINT a;
+SHOW work_mem; -- will see 256kB
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+ROLLBACK;
+SHOW work_mem; -- will see 256kB
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+--
+SET work_mem TO '64kB';
+BEGIN;
+SET work_mem TO '128kB';
+SET LOCAL work_mem TO '384kB';
+SAVEPOINT a;
+SELECT set_config('work_mem', '256kB', false, true); -- NONXACT
+ set_config 
+------------
+ 256kB
+(1 row)
+
+SHOW work_mem;
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+SET LOCAL work_mem TO '384kB';
+RELEASE SAVEPOINT a;
+SHOW work_mem; -- will see 384kB
+ work_mem 
+----------
+ 384kB
+(1 row)
+
+ROLLBACK;
+SHOW work_mem; -- will see 256kB
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+--
+SET work_mem TO '64kB';
+BEGIN;
+SET work_mem TO '128kB';
+SET LOCAL work_mem TO '384kB';
+SAVEPOINT a;
+SELECT set_config('work_mem', '256kB', false, true); -- NONXACT
+ set_config 
+------------
+ 256kB
+(1 row)
+
+SHOW work_mem;
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+SET LOCAL work_mem TO '384kB';
+ROLLBACK TO SAVEPOINT a;
+SHOW work_mem; -- will see 256kB
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+COMMIT;
+SHOW work_mem; -- will see 256kB
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+SET work_mem TO DEFAULT;
 --
 -- Test RESET.  We use datestyle because the reset value is forced by
 -- pg_regress, so it doesn't depend on the installation's configuration.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 735dd37acf..3569edc22d 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1918,6 +1918,30 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
     pg_stat_all_tables.autoanalyze_count
    FROM pg_stat_all_tables
  WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
 
+pg_stat_syscache| SELECT s.pid,
+    (s.relid)::regclass AS relname,
+    (s.indid)::regclass AS cache_name,
+    s.size,
+    s.ntup AS ntuples,
+    s.searches,
+    s.hits,
+    s.neg_hits,
+    s.ageclass,
+    s.nentries,
+    s.last_update
+   FROM (pg_stat_activity a
+     JOIN LATERAL ( SELECT a.pid,
+            pg_get_syscache_stats.relid,
+            pg_get_syscache_stats.indid,
+            pg_get_syscache_stats.size,
+            pg_get_syscache_stats.ntup,
+            pg_get_syscache_stats.searches,
+            pg_get_syscache_stats.hits,
+            pg_get_syscache_stats.neg_hits,
+            pg_get_syscache_stats.ageclass,
+            pg_get_syscache_stats.nentries,
+            pg_get_syscache_stats.last_update
+           FROM pg_get_syscache_stats(a.pid) pg_get_syscache_stats(relid, indid, size, ntup, searches, hits, neg_hits, ageclass, nentries, last_update)) s ON ((a.pid = s.pid)));
 
 pg_stat_user_functions| SELECT p.oid AS funcid,
     n.nspname AS schemaname,
     p.proname AS funcname,
@@ -2349,7 +2373,7 @@ pg_settings|pg_settings_n|CREATE RULE pg_settings_n AS
     ON UPDATE TO pg_catalog.pg_settings DO INSTEAD NOTHING;
 pg_settings|pg_settings_u|CREATE RULE pg_settings_u AS
     ON UPDATE TO pg_catalog.pg_settings
-   WHERE (new.name = old.name) DO  SELECT set_config(old.name, new.setting, false) AS set_config;
+   WHERE (new.name = old.name) DO  SELECT set_config(old.name, new.setting, false, false) AS set_config;
 rtest_emp|rtest_emp_del|CREATE RULE rtest_emp_del AS
     ON DELETE TO public.rtest_emp DO  INSERT INTO rtest_emplog (ename, who, action, newsal, oldsal)
   VALUES (old.ename, CURRENT_USER, 'fired'::bpchar, '$0.00'::money, old.salary);
diff --git a/src/test/regress/sql/guc.sql b/src/test/regress/sql/guc.sql
index 23e5029780..2fb23caafe 100644
--- a/src/test/regress/sql/guc.sql
+++ b/src/test/regress/sql/guc.sql
@@ -133,6 +133,94 @@ SHOW vacuum_cost_delay;
 SHOW datestyle;
 SELECT '2006-08-13 12:34:56'::timestamptz;
 
+-- NONXACT followed by SET, SET LOCAL through COMMIT
+BEGIN;
+SELECT set_config('work_mem', '128kB', false, true); -- NONXACT
+SET work_mem to '256kB';
+SET LOCAL work_mem to '512kB';
+SHOW work_mem;    -- must see 512kB
+COMMIT;
+SHOW work_mem;    -- must see 256kB
+
+-- NONXACT followed by SET, SET LOCAL through ROLLBACK
+BEGIN;
+SELECT set_config('work_mem', '128kB', false, true); -- NONXACT
+SET work_mem to '256kB';
+SET LOCAL work_mem to '512kB';
+SHOW work_mem;    -- must see 512kB
+ROLLBACK;
+SHOW work_mem;    -- must see 128kB
+
+-- SET, SET LOCAL followed by NONXACT through COMMIT
+BEGIN;
+SET work_mem to '256kB';
+SET LOCAL work_mem to '512kB';
+SELECT set_config('work_mem', '128kB', false, true); -- NONXACT
+SHOW work_mem;    -- must see 128kB
+COMMIT;
+SHOW work_mem;    -- must see 128kB
+
+-- SET, SET LOCAL followed by NONXACT through ROLLBACK
+BEGIN;
+SET work_mem to '256kB';
+SET LOCAL work_mem to '512kB';
+SELECT set_config('work_mem', '128kB', false, true); -- NONXACT
+SHOW work_mem;    -- must see 128kB
+ROLLBACK;
+SHOW work_mem;    -- must see 128kB
+
+-- NONXACT and SAVEPOINT
+SET work_mem TO '64kB';
+BEGIN;
+SET work_mem TO '128kB';
+SAVEPOINT a;
+SELECT set_config('work_mem', '256kB', false, true); -- NONXACT
+SHOW work_mem;
+SET LOCAL work_mem TO '384kB';
+RELEASE SAVEPOINT a;
+SHOW work_mem; -- will see 384kB
+COMMIT;
+SHOW work_mem; -- will see 256kB
+--
+SET work_mem TO '64kB';
+BEGIN;
+SET work_mem TO '128kB';
+SAVEPOINT a;
+SELECT set_config('work_mem', '256kB', false, true); -- NONXACT
+SHOW work_mem;
+SET LOCAL work_mem TO '384kB';
+ROLLBACK TO SAVEPOINT a;
+SHOW work_mem; -- will see 256kB
+ROLLBACK;
+SHOW work_mem; -- will see 256kB
+--
+SET work_mem TO '64kB';
+BEGIN;
+SET work_mem TO '128kB';
+SET LOCAL work_mem TO '384kB';
+SAVEPOINT a;
+SELECT set_config('work_mem', '256kB', false, true); -- NONXACT
+SHOW work_mem;
+SET LOCAL work_mem TO '384kB';
+RELEASE SAVEPOINT a;
+SHOW work_mem; -- will see 384kB
+ROLLBACK;
+SHOW work_mem; -- will see 256kB
+--
+SET work_mem TO '64kB';
+BEGIN;
+SET work_mem TO '128kB';
+SET LOCAL work_mem TO '384kB';
+SAVEPOINT a;
+SELECT set_config('work_mem', '256kB', false, true); -- NONXACT
+SHOW work_mem;
+SET LOCAL work_mem TO '384kB';
+ROLLBACK TO SAVEPOINT a;
+SHOW work_mem; -- will see 256kB
+COMMIT;
+SHOW work_mem; -- will see 256kB
+
+SET work_mem TO DEFAULT;
 --
 -- Test RESET.  We use datestyle because the reset value is forced by
 -- pg_regress, so it doesn't depend on the installation's configuration.
-- 
2.16.3


Re: Protect syscache from bloating with negative cache entries

From
Dmitry Dolgov
Date:
> On Tue, Nov 27, 2018 at 11:40 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
> The attached is the rebased version that has a multidimensional
> ageclass.

Thank you,

Just for your information, cfbot complains about this patch because:

pgstatfuncs.c: In function ‘pgstat_get_syscache_stats’:
pgstatfuncs.c:1973:8: error: ignoring return value of ‘fread’,
declared with attribute warn_unused_result [-Werror=unused-result]
   fread(&cacheid, sizeof(int), 1, fpin);
        ^
pgstatfuncs.c:1974:8: error: ignoring return value of ‘fread’,
declared with attribute warn_unused_result [-Werror=unused-result]
   fread(&last_update, sizeof(TimestampTz), 1, fpin);
        ^

I'm moving it to the next CF as "Waiting on author", since, as far as I
understood, you want to address more comments from the reviewer.


RE: Protect syscache from bloating with negative cache entries

From
"Ideriha, Takeshi"
Date:
Hello,

Sorry for the delay.
The detailed comments for the source code will be provided later.

>> I just thought that the pair of ageclass and nentries can be
>> represented as JSON or a multi-dimensional array, but virtually they are
>> all the same and can be converted to each other using some functions. So I'm
>> not sure which representation is better.
>
>A multidimensional array in any style sounds reasonable. Maybe an array is preferable in
>system views, as it is a more basic type than JSON. In the attached, it looks like the following:
>
>=# select * from pg_stat_syscache  where ntuples > 100; -[ RECORD
>1 ]--------------------------------------------------
>pid         | 1817
>relname     | pg_class
>cache_name  | pg_class_oid_index
>size        | 2048
>ntuples     | 189
>searches    | 1620
>hits        | 1431
>neg_hits    | 0
>ageclass    | {{30,189},{60,0},{600,0},{1200,0},{1800,0},{0,0}}
>last_update | 2018-11-27 19:22:00.74026+09

Thanks, cool. That seems better to me.
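
For example (just a sketch on my side, untested), the two-dimensional ageclass
column could be unpacked into one row per age class, assuming the
pg_stat_syscache view from the 0002 patch keeps the representation shown above:

-- one row per (age-class upper bound, entry count) pair
SELECT pid, relname, cache_name,
       ageclass[i][1] AS age_seconds,   -- upper bound of the age class (seconds)
       ageclass[i][2] AS nentries       -- entries currently in that class
FROM pg_stat_syscache,
     generate_subscripts(ageclass, 1) AS i;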

>
>> >3. non-transactional GUC setting (in 0003)
>> >
>> >It allows a GUC variable set by the action
>> >GUC_ACTION_NONXACT (the name needs consideration) to survive beyond
>> >rollback. It is required for remote GUC setting to work sanely.
>> >Without the feature, a remotely set value within a transaction would
>> >disappear when the transaction is rolled back. The only local interface for the
>> >NONXACT action is set_config(name, value, is_local=false, is_nonxact=true).
>> >pg_set_backend_guc() below works on this feature.
>>
>> TBH, I'm not familiar with this area and I may be missing something.
>> In order to change another backend's GUC value, is ignoring
>> transactional behavior always necessary? When the transaction doing the GUC
>> setting fails and is rolled back, if the error message is supposed to
>> be reported, I thought just retrying the transaction would be enough.
>
>The target backend can be running frequent transactions.  The invoking backend cannot
>know whether the remote change happened during a transaction, nor whether that
>transaction, if any, was committed or aborted; no error message is sent to the invoking backend.
>We could wait for the end of a transaction, but that doesn't work with long transactions.
>
>Maybe we don't need the feature in the GUC system, but adding another, similar mechanism
>doesn't seem reasonable. This would also be useful for some other tracking features.

Thank you for the clarification. 


>> >4. pg_set_backend_guc() function.
>> >
>> >Of course syscache statistics recording consumes a significant amount
>> >of time, so it cannot usually be kept turned on. On the other hand, since
>> >this feature is controlled by a GUC, we would need to grab the active
>> >client connection to turn the feature on or off (but we cannot). Instead, I provided
>> >a means to change GUC variables in another backend.
>> >
>> >pg_set_backend_guc(pid, name, value) sets the GUC variable "name"
>> >on the backend "pid" to "value".
>> >
>> >
>> >
>> >With the above tools, we can inspect catcache statistics of a seemingly
>> >bloated process.
>> >
>> >A. Find a bloated process pid using ps or something.
>> >
>> >B. Turn on syscache stats on the process.
>> >  =# select pg_set_backend_guc(9984, 'track_syscache_usage_interval',
>> >'10000');
>> >
>> >C. Examine the statistics.
>> >
>> >=# select pid, relname, cache_name, size from pg_stat_syscache order
>> >by size desc limit 3;
>> > pid  |   relname    |            cache_name            |   size
>> >------+--------------+----------------------------------+----------
>> > 9984 | pg_statistic | pg_statistic_relid_att_inh_index | 32154112
>> > 9984 | pg_cast      | pg_cast_source_target_index      |     4096
>> > 9984 | pg_operator  | pg_operator_oprname_l_r_n_index  |     4096
>> >
>> >
>> >=# select * from pg_stat_syscache where cache_name =
>> >'pg_statistic_relid_att_inh_index'::regclass;
>> >-[ RECORD 1 ]---------------------------------
>> >pid         | 9984
>> >relname     | pg_statistic
>> >cache_name  | pg_statistic_relid_att_inh_index
>> >size        | 11026176
>> >ntuples     | 77950
>> >searches    | 77950
>> >hits        | 0
>> >neg_hits    | 0
>> >ageclass    | {30,60,600,1200,1800,0}
>> >nentries    | {17630,16950,43370,0,0,0}
>> >last_update | 2018-10-17 15:58:19.738164+09
>>
>> The output of this view seems good to me.
>>
>> I can imagine this use case. Does the use case of setting the GUC locally never happen?
>> I mean, can the setting be changed locally?
>
>Syscache grows throughout the life of a backend/session. No other client can connect to
>it at the same time. So the variable must be set at the start of a backend using ALTER
>USER/DATABASE, or the client itself is obliged to deliberately turn on the feature at a
>convenient time. I suppose that in most use cases one wants to turn on this feature
>after seeing that another session is eating more and more memory.
>
>The attached is the rebased version that has a multidimensional ageclass.

Thank you! That's convenient.
How about splitting the non-transactional GUC and remote GUC setting features into another commitfest entry?
I'm planning to review the 0001 and 0002 patches in more detail and hopefully turn them to 'ready for committer',
and review the remote GUC feature later.
Related to the feature split, why have you discarded pruning of the relcache and plancache?
Personally I want the relcache one as well as the catcache, because regarding memory bloat there is some
correlation between them.

Regards,
Takeshi Ideriha



RE: Protect syscache from bloating with negative cache entries

From
"Ideriha, Takeshi"
Date:
>From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
>The detailed comments for the source code will be provided later.

Hi, 
I'm adding some comments on the 0001 and 0002 patches.

[0001 patch]

+			/*
+			 * Calculate the duration from the time of the last access to the
+			 * "current" time. Since catcacheclock is not advanced within a
+			 * transaction, the entries that are accessed within the current
+			 * transaction won't be pruned.
+			 */
+			TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);

+			/*
+			 * Try to remove entries older than cache_prune_min_age seconds.
+			 */
+			if (entry_age > cache_prune_min_age)


Can you change this comparison between entry_age and cache_prune_min_age
to "entry_age >= cache_prune_min_age"?
That is, when cache_prune_min_age is 0, I'd like even entries that were accessed within
the current transaction to be pruned.
Some of my customers want to keep memory usage below a certain limit as strictly as possible.
Such a strict user would set cache_prune_min_age to 0 and would not want to exceed the memory
target even within a transaction.
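
For such users the setting would look like the sketch below (illustration only;
cache_prune_min_age is the GUC added by the 0001 patch, and with the ">="
comparison a value of 0 would allow pruning even entries that were accessed
in the current transaction):

-- strict setting: prune aggressively, accepting extra lookups
SET cache_prune_min_age = 0;
SHOW cache_prune_min_age;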

I put some miscellaneous comments about the 0001 patch in a previous email, so please take a look at those as well.

[0002 patch]
I haven't looked into every detail but here are some comments.

Maybe you would also need to add some sentences to this page:
https://www.postgresql.org/docs/current/monitoring-stats.html

+pgstat_get_syscache_stats(PG_FUNCTION_ARGS)
A function name like 'pg_stat_XXX' would better match the surrounding code.

When applying patch I found trailing whitespace warning:
../patch/horiguchi_cache/v6_stats/0002-Syscache-usage-tracking-feature.patch:157: trailing whitespace.

../patch/horiguchi_cache/v6_stats/0002-Syscache-usage-tracking-feature.patch:256: trailing whitespace.

../patch/horiguchi_cache/v6_stats/0002-Syscache-usage-tracking-feature.patch:301: trailing whitespace.

../patch/horiguchi_cache/v6_stats/0002-Syscache-usage-tracking-feature.patch:483: trailing whitespace.

../patch/horiguchi_cache/v6_stats/0002-Syscache-usage-tracking-feature.patch:539: trailing whitespace.
    

Regards,
Takeshi Ideriha



Re: Protect syscache from bloating with negative cache entries

From
Tom Lane
Date:
I'm really disappointed by the direction this thread is going in.
The latest patches add an enormous amount of mechanism, and user-visible
complexity, to do something that we learned was a bad idea decades ago.
Putting a limit on the size of the syscaches doesn't accomplish anything
except to add cycles if your cache working set is below the limit, or
make performance fall off a cliff if it's above the limit.  I don't think
there's any reason to believe that making it more complicated will avoid
that problem.

What does seem promising is something similar to Horiguchi-san's
original patches all the way back at

https://www.postgresql.org/message-id/20161219.201505.11562604.horiguchi.kyotaro@lab.ntt.co.jp

That is, identify usage patterns in which we tend to fill the caches with
provably no-longer-useful entries, and improve those particular cases.
Horiguchi-san identified one such case in that message: negative entries
in the STATRELATTINH cache, caused by the planner probing for stats that
aren't there, and then not cleared when the relevant table gets dropped
(since, by definition, they don't match any pg_statistic entry that gets
deleted).  We saw another recent report of the same problem at

https://www.postgresql.org/message-id/flat/2114009259.1866365.1544469996900%40mail.yahoo.com

so I'd been thinking about ways to fix that case in particular.  I came
up with a fix that I think is simpler and a bit more efficient than
what Horiguchi-san proposed originally: rather than trying to reverse-
engineer what to do in low-level cache callbacks, let's have the catalog
manipulation code explicitly send out invalidation commands when the
relevant situations arise.  In the attached, heap.c's RemoveStatistics
sends out an sinval message commanding deletion of negative STATRELATTINH
entries that match the OID of the table being deleted.  We could use the
same infrastructure to clean out dead RELNAMENSP entries after a schema
deletion, as per Horiguchi-san's second original suggestion; although
I haven't done so here because I'm not really convinced that that's got
an attractive cost-benefit ratio.  (In both my patch and Horiguchi-san's,
we have to traverse all entries in the affected cache, so sending out one
of these messages is potentially not real cheap.)
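
For concreteness, the usage pattern at issue is roughly the following (a sketch,
not meant as a test case): each planning of a query against a fresh table probes
pg_statistic for columns that have no stats yet, leaving a negative STATRELATTINH
entry behind, and dropping the table does not currently remove it.

do $$
begin
  for i in 1 .. 10000 loop
    execute format('create temp table t%s (a int)', i);
    -- planning this probes pg_statistic for column "a" and caches the miss
    execute format('select count(*) from t%s where a = 1', i);
    execute format('drop table t%s', i);
  end loop;
end $$;

With 0002 applied, the sinval message issued by RemoveStatistics flushes those
negative entries as each table is dropped.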

To do this we need to adjust the representation of sinval messages so
that we can have two different kinds of messages that include a cache ID.
Fortunately, because there's padding space available, that's not costly.
0001 below is a simple refactoring patch that converts the message type
ID into a plain enum field that's separate from the cache ID if any.
(I'm inclined to apply this whether or not people like 0002: it makes
the code clearer, more maintainable, and probably a shade faster thanks
to replacing an if-then-else chain with a switch.)  Then 0002 adds the
feature of an sinval message type saying "delete negative entries in
cache X that have OID Y in key column Z", and teaches RemoveStatistics
to use that.

Thoughts?

            regards, tom lane

diff --git a/src/backend/access/rmgrdesc/standbydesc.c b/src/backend/access/rmgrdesc/standbydesc.c
index c295358..a7f367c 100644
*** a/src/backend/access/rmgrdesc/standbydesc.c
--- b/src/backend/access/rmgrdesc/standbydesc.c
*************** standby_desc_invalidations(StringInfo bu
*** 111,131 ****
      {
          SharedInvalidationMessage *msg = &msgs[i];

!         if (msg->id >= 0)
!             appendStringInfo(buf, " catcache %d", msg->id);
!         else if (msg->id == SHAREDINVALCATALOG_ID)
!             appendStringInfo(buf, " catalog %u", msg->cat.catId);
!         else if (msg->id == SHAREDINVALRELCACHE_ID)
!             appendStringInfo(buf, " relcache %u", msg->rc.relId);
!         /* not expected, but print something anyway */
!         else if (msg->id == SHAREDINVALSMGR_ID)
!             appendStringInfoString(buf, " smgr");
!         /* not expected, but print something anyway */
!         else if (msg->id == SHAREDINVALRELMAP_ID)
!             appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
!         else if (msg->id == SHAREDINVALSNAPSHOT_ID)
!             appendStringInfo(buf, " snapshot %u", msg->sn.relId);
!         else
!             appendStringInfo(buf, " unrecognized id %d", msg->id);
      }
  }
--- 111,141 ----
      {
          SharedInvalidationMessage *msg = &msgs[i];

!         switch ((SharedInvalMsgType) msg->id)
!         {
!             case SharedInvalCatcache:
!                 appendStringInfo(buf, " catcache %d", msg->cc.cacheId);
!                 break;
!             case SharedInvalCatalog:
!                 appendStringInfo(buf, " catalog %u", msg->cat.catId);
!                 break;
!             case SharedInvalRelcache:
!                 appendStringInfo(buf, " relcache %u", msg->rc.relId);
!                 break;
!             case SharedInvalSmgr:
!                 /* not expected, but print something anyway */
!                 appendStringInfoString(buf, " smgr");
!                 break;
!             case SharedInvalRelmap:
!                 /* not expected, but print something anyway */
!                 appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
!                 break;
!             case SharedInvalSnapshot:
!                 appendStringInfo(buf, " snapshot %u", msg->sn.relId);
!                 break;
!             default:
!                 appendStringInfo(buf, " unrecognized id %d", msg->id);
!                 break;
!         }
      }
  }
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 80d7a76..5bc08b0 100644
*** a/src/backend/utils/cache/inval.c
--- b/src/backend/utils/cache/inval.c
*************** AddCatcacheInvalidationMessage(Invalidat
*** 340,346 ****
      SharedInvalidationMessage msg;

      Assert(id < CHAR_MAX);
!     msg.cc.id = (int8) id;
      msg.cc.dbId = dbId;
      msg.cc.hashValue = hashValue;

--- 340,347 ----
      SharedInvalidationMessage msg;

      Assert(id < CHAR_MAX);
!     msg.cc.id = SharedInvalCatcache;
!     msg.cc.cacheId = (int8) id;
      msg.cc.dbId = dbId;
      msg.cc.hashValue = hashValue;

*************** AddCatalogInvalidationMessage(Invalidati
*** 367,373 ****
  {
      SharedInvalidationMessage msg;

!     msg.cat.id = SHAREDINVALCATALOG_ID;
      msg.cat.dbId = dbId;
      msg.cat.catId = catId;
      /* check AddCatcacheInvalidationMessage() for an explanation */
--- 368,374 ----
  {
      SharedInvalidationMessage msg;

!     msg.cat.id = SharedInvalCatalog;
      msg.cat.dbId = dbId;
      msg.cat.catId = catId;
      /* check AddCatcacheInvalidationMessage() for an explanation */
*************** AddRelcacheInvalidationMessage(Invalidat
*** 391,403 ****
       * don't need to add individual ones when it is present.
       */
      ProcessMessageList(hdr->rclist,
!                        if (msg->rc.id == SHAREDINVALRELCACHE_ID &&
                             (msg->rc.relId == relId ||
                              msg->rc.relId == InvalidOid))
                         return);

      /* OK, add the item */
!     msg.rc.id = SHAREDINVALRELCACHE_ID;
      msg.rc.dbId = dbId;
      msg.rc.relId = relId;
      /* check AddCatcacheInvalidationMessage() for an explanation */
--- 392,404 ----
       * don't need to add individual ones when it is present.
       */
      ProcessMessageList(hdr->rclist,
!                        if (msg->rc.id == SharedInvalRelcache &&
                             (msg->rc.relId == relId ||
                              msg->rc.relId == InvalidOid))
                         return);

      /* OK, add the item */
!     msg.rc.id = SharedInvalRelcache;
      msg.rc.dbId = dbId;
      msg.rc.relId = relId;
      /* check AddCatcacheInvalidationMessage() for an explanation */
*************** AddSnapshotInvalidationMessage(Invalidat
*** 418,429 ****
      /* Don't add a duplicate item */
      /* We assume dbId need not be checked because it will never change */
      ProcessMessageList(hdr->rclist,
!                        if (msg->sn.id == SHAREDINVALSNAPSHOT_ID &&
                             msg->sn.relId == relId)
                         return);

      /* OK, add the item */
!     msg.sn.id = SHAREDINVALSNAPSHOT_ID;
      msg.sn.dbId = dbId;
      msg.sn.relId = relId;
      /* check AddCatcacheInvalidationMessage() for an explanation */
--- 419,430 ----
      /* Don't add a duplicate item */
      /* We assume dbId need not be checked because it will never change */
      ProcessMessageList(hdr->rclist,
!                        if (msg->sn.id == SharedInvalSnapshot &&
                             msg->sn.relId == relId)
                         return);

      /* OK, add the item */
!     msg.sn.id = SharedInvalSnapshot;
      msg.sn.dbId = dbId;
      msg.sn.relId = relId;
      /* check AddCatcacheInvalidationMessage() for an explanation */
*************** RegisterSnapshotInvalidation(Oid dbId, O
*** 553,629 ****
  void
  LocalExecuteInvalidationMessage(SharedInvalidationMessage *msg)
  {
!     if (msg->id >= 0)
      {
!         if (msg->cc.dbId == MyDatabaseId || msg->cc.dbId == InvalidOid)
!         {
!             InvalidateCatalogSnapshot();

!             SysCacheInvalidate(msg->cc.id, msg->cc.hashValue);

!             CallSyscacheCallbacks(msg->cc.id, msg->cc.hashValue);
!         }
!     }
!     else if (msg->id == SHAREDINVALCATALOG_ID)
!     {
!         if (msg->cat.dbId == MyDatabaseId || msg->cat.dbId == InvalidOid)
!         {
!             InvalidateCatalogSnapshot();

!             CatalogCacheFlushCatalog(msg->cat.catId);

!             /* CatalogCacheFlushCatalog calls CallSyscacheCallbacks as needed */
!         }
!     }
!     else if (msg->id == SHAREDINVALRELCACHE_ID)
!     {
!         if (msg->rc.dbId == MyDatabaseId || msg->rc.dbId == InvalidOid)
!         {
!             int            i;

!             if (msg->rc.relId == InvalidOid)
!                 RelationCacheInvalidate();
!             else
!                 RelationCacheInvalidateEntry(msg->rc.relId);

!             for (i = 0; i < relcache_callback_count; i++)
!             {
!                 struct RELCACHECALLBACK *ccitem = relcache_callback_list + i;

!                 ccitem->function(ccitem->arg, msg->rc.relId);
              }
!         }
!     }
!     else if (msg->id == SHAREDINVALSMGR_ID)
!     {
!         /*
!          * We could have smgr entries for relations of other databases, so no
!          * short-circuit test is possible here.
!          */
!         RelFileNodeBackend rnode;

!         rnode.node = msg->sm.rnode;
!         rnode.backend = (msg->sm.backend_hi << 16) | (int) msg->sm.backend_lo;
!         smgrclosenode(rnode);
!     }
!     else if (msg->id == SHAREDINVALRELMAP_ID)
!     {
!         /* We only care about our own database and shared catalogs */
!         if (msg->rm.dbId == InvalidOid)
!             RelationMapInvalidate(true);
!         else if (msg->rm.dbId == MyDatabaseId)
!             RelationMapInvalidate(false);
!     }
!     else if (msg->id == SHAREDINVALSNAPSHOT_ID)
!     {
!         /* We only care about our own database and shared catalogs */
!         if (msg->rm.dbId == InvalidOid)
!             InvalidateCatalogSnapshot();
!         else if (msg->rm.dbId == MyDatabaseId)
!             InvalidateCatalogSnapshot();
      }
-     else
-         elog(FATAL, "unrecognized SI message ID: %d", msg->id);
  }

  /*
--- 554,633 ----
  void
  LocalExecuteInvalidationMessage(SharedInvalidationMessage *msg)
  {
!     switch ((SharedInvalMsgType) msg->id)
      {
!         case SharedInvalCatcache:
!             if (msg->cc.dbId == MyDatabaseId || msg->cc.dbId == InvalidOid)
!             {
!                 InvalidateCatalogSnapshot();

!                 SysCacheInvalidate(msg->cc.cacheId, msg->cc.hashValue);

!                 CallSyscacheCallbacks(msg->cc.cacheId, msg->cc.hashValue);
!             }
!             break;
!         case SharedInvalCatalog:
!             if (msg->cat.dbId == MyDatabaseId || msg->cat.dbId == InvalidOid)
!             {
!                 InvalidateCatalogSnapshot();

!                 CatalogCacheFlushCatalog(msg->cat.catId);

!                 /*
!                  * CatalogCacheFlushCatalog calls CallSyscacheCallbacks as
!                  * needed
!                  */
!             }
!             break;
!         case SharedInvalRelcache:
!             if (msg->rc.dbId == MyDatabaseId || msg->rc.dbId == InvalidOid)
!             {
!                 int            i;

!                 if (msg->rc.relId == InvalidOid)
!                     RelationCacheInvalidate();
!                 else
!                     RelationCacheInvalidateEntry(msg->rc.relId);

!                 for (i = 0; i < relcache_callback_count; i++)
!                 {
!                     struct RELCACHECALLBACK *ccitem = relcache_callback_list + i;

!                     ccitem->function(ccitem->arg, msg->rc.relId);
!                 }
              }
!             break;
!         case SharedInvalSmgr:
!             {
!                 /*
!                  * We could have smgr entries for relations of other
!                  * databases, so no short-circuit test is possible here.
!                  */
!                 RelFileNodeBackend rnode;

!                 rnode.node = msg->sm.rnode;
!                 rnode.backend = (msg->sm.backend_hi << 16) | (int) msg->sm.backend_lo;
!                 smgrclosenode(rnode);
!                 break;
!             }
!         case SharedInvalRelmap:
!             /* We only care about our own database and shared catalogs */
!             if (msg->rm.dbId == InvalidOid)
!                 RelationMapInvalidate(true);
!             else if (msg->rm.dbId == MyDatabaseId)
!                 RelationMapInvalidate(false);
!             break;
!         case SharedInvalSnapshot:
!             /* We only care about our own database and shared catalogs */
!             if (msg->rm.dbId == InvalidOid)
!                 InvalidateCatalogSnapshot();
!             else if (msg->rm.dbId == MyDatabaseId)
!                 InvalidateCatalogSnapshot();
!             break;
!         default:
!             elog(FATAL, "unrecognized SI message ID: %d", msg->id);
!             break;
      }
  }

  /*
*************** CacheInvalidateSmgr(RelFileNodeBackend r
*** 1351,1357 ****
  {
      SharedInvalidationMessage msg;

!     msg.sm.id = SHAREDINVALSMGR_ID;
      msg.sm.backend_hi = rnode.backend >> 16;
      msg.sm.backend_lo = rnode.backend & 0xffff;
      msg.sm.rnode = rnode.node;
--- 1355,1361 ----
  {
      SharedInvalidationMessage msg;

!     msg.sm.id = SharedInvalSmgr;
      msg.sm.backend_hi = rnode.backend >> 16;
      msg.sm.backend_lo = rnode.backend & 0xffff;
      msg.sm.rnode = rnode.node;
*************** CacheInvalidateRelmap(Oid databaseId)
*** 1381,1387 ****
  {
      SharedInvalidationMessage msg;

!     msg.rm.id = SHAREDINVALRELMAP_ID;
      msg.rm.dbId = databaseId;
      /* check AddCatcacheInvalidationMessage() for an explanation */
      VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));
--- 1385,1391 ----
  {
      SharedInvalidationMessage msg;

!     msg.rm.id = SharedInvalRelmap;
      msg.rm.dbId = databaseId;
      /* check AddCatcacheInvalidationMessage() for an explanation */
      VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));
diff --git a/src/include/storage/sinval.h b/src/include/storage/sinval.h
index 635acda..d0d9ece 100644
*** a/src/include/storage/sinval.h
--- b/src/include/storage/sinval.h
***************
*** 28,36 ****
   *    * invalidate the mapped-relation mapping for a given database
   *    * invalidate any saved snapshot that might be used to scan a given relation
   * More types could be added if needed.  The message type is identified by
!  * the first "int8" field of the message struct.  Zero or positive means a
!  * specific-catcache inval message (and also serves as the catcache ID field).
!  * Negative values identify the other message types, as per codes below.
   *
   * Catcache inval events are initially driven by detecting tuple inserts,
   * updates and deletions in system catalogs (see CacheInvalidateHeapTuple).
--- 28,34 ----
   *    * invalidate the mapped-relation mapping for a given database
   *    * invalidate any saved snapshot that might be used to scan a given relation
   * More types could be added if needed.  The message type is identified by
!  * the first "int8" field of the message struct.
   *
   * Catcache inval events are initially driven by detecting tuple inserts,
   * updates and deletions in system catalogs (see CacheInvalidateHeapTuple).
***************
*** 57,71 ****
   * sent immediately when the underlying file change is made.
   */

  typedef struct
  {
!     int8        id;                /* cache ID --- must be first */
      Oid            dbId;            /* database ID, or 0 if a shared relation */
      uint32        hashValue;        /* hash value of key for this catcache */
  } SharedInvalCatcacheMsg;

- #define SHAREDINVALCATALOG_ID    (-1)
-
  typedef struct
  {
      int8        id;                /* type field --- must be first */
--- 55,78 ----
   * sent immediately when the underlying file change is made.
   */

+ typedef enum SharedInvalMsgType
+ {
+     SharedInvalCatcache,
+     SharedInvalCatalog,
+     SharedInvalRelcache,
+     SharedInvalSmgr,
+     SharedInvalRelmap,
+     SharedInvalSnapshot
+ } SharedInvalMsgType;
+
  typedef struct
  {
!     int8        id;                /* type field --- must be first */
!     int8        cacheId;        /* cache ID */
      Oid            dbId;            /* database ID, or 0 if a shared relation */
      uint32        hashValue;        /* hash value of key for this catcache */
  } SharedInvalCatcacheMsg;

  typedef struct
  {
      int8        id;                /* type field --- must be first */
*************** typedef struct
*** 73,80 ****
      Oid            catId;            /* ID of catalog whose contents are invalid */
  } SharedInvalCatalogMsg;

- #define SHAREDINVALRELCACHE_ID    (-2)
-
  typedef struct
  {
      int8        id;                /* type field --- must be first */
--- 80,85 ----
*************** typedef struct
*** 82,89 ****
      Oid            relId;            /* relation ID, or 0 if whole relcache */
  } SharedInvalRelcacheMsg;

- #define SHAREDINVALSMGR_ID        (-3)
-
  typedef struct
  {
      /* note: field layout chosen to pack into 16 bytes */
--- 87,92 ----
*************** typedef struct
*** 93,108 ****
      RelFileNode rnode;            /* spcNode, dbNode, relNode */
  } SharedInvalSmgrMsg;

- #define SHAREDINVALRELMAP_ID    (-4)
-
  typedef struct
  {
      int8        id;                /* type field --- must be first */
      Oid            dbId;            /* database ID, or 0 for shared catalogs */
  } SharedInvalRelmapMsg;

- #define SHAREDINVALSNAPSHOT_ID    (-5)
-
  typedef struct
  {
      int8        id;                /* type field --- must be first */
--- 96,107 ----
diff --git a/src/backend/access/rmgrdesc/standbydesc.c b/src/backend/access/rmgrdesc/standbydesc.c
index a7f367c..79823c2 100644
*** a/src/backend/access/rmgrdesc/standbydesc.c
--- b/src/backend/access/rmgrdesc/standbydesc.c
*************** standby_desc_invalidations(StringInfo bu
*** 113,120 ****

          switch ((SharedInvalMsgType) msg->id)
          {
!             case SharedInvalCatcache:
!                 appendStringInfo(buf, " catcache %d", msg->cc.cacheId);
                  break;
              case SharedInvalCatalog:
                  appendStringInfo(buf, " catalog %u", msg->cat.catId);
--- 113,123 ----

          switch ((SharedInvalMsgType) msg->id)
          {
!             case SharedInvalCatcacheHash:
!                 appendStringInfo(buf, " catcache %d by hash", msg->cch.cacheId);
!                 break;
!             case SharedInvalCatcacheOid:
!                 appendStringInfo(buf, " catcache %d by OID", msg->cco.cacheId);
                  break;
              case SharedInvalCatalog:
                  appendStringInfo(buf, " catalog %u", msg->cat.catId);
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 472285d..ebf4321 100644
*** a/src/backend/catalog/heap.c
--- b/src/backend/catalog/heap.c
*************** RemoveStatistics(Oid relid, AttrNumber a
*** 3025,3030 ****
--- 3025,3048 ----

      systable_endscan(scan);

+     /*
+      * Aside from removing the catalog entries, issue sinval messages to
+      * remove any negative catcache entries for stats that weren't present.
+      * (Positive entries will get flushed as a consequence of deleting the
+      * catalog entries.)  Without this, repeatedly creating and dropping temp
+      * tables tends to lead to catcache bloat, since any negative catcache
+      * entries created by planner lookups won't get dropped.
+      *
+      * We only bother with this for the whole-table case, since (a) it's less
+      * likely to be a problem for DROP COLUMN, and (b) the sinval
+      * infrastructure only supports matching an OID cache key column.
+      * (Alternatively, we could issue the sinval message always, accepting the
+      * collateral damage of losing negative catcache entries for other columns
+      * to be sure we get rid of entries for this one.)
+      */
+     if (attnum == 0)
+         CacheInvalidateCatcacheByOid(STATRELATTINH, false, 1, relid);
+
      heap_close(pgstatistic, RowExclusiveLock);
  }

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 8152f7e..81c01f6 100644
*** a/src/backend/utils/cache/catcache.c
--- b/src/backend/utils/cache/catcache.c
*************** CatCacheRemoveCList(CatCache *cache, Cat
*** 540,546 ****


  /*
!  *    CatCacheInvalidate
   *
   *    Invalidate entries in the specified cache, given a hash value.
   *
--- 540,546 ----


  /*
!  *    CatCacheInvalidateByHash
   *
   *    Invalidate entries in the specified cache, given a hash value.
   *
*************** CatCacheRemoveCList(CatCache *cache, Cat
*** 558,569 ****
   *    This routine is only quasi-public: it should only be used by inval.c.
   */
  void
! CatCacheInvalidate(CatCache *cache, uint32 hashValue)
  {
      Index        hashIndex;
      dlist_mutable_iter iter;

!     CACHE1_elog(DEBUG2, "CatCacheInvalidate: called");

      /*
       * We don't bother to check whether the cache has finished initialization
--- 558,569 ----
   *    This routine is only quasi-public: it should only be used by inval.c.
   */
  void
! CatCacheInvalidateByHash(CatCache *cache, uint32 hashValue)
  {
      Index        hashIndex;
      dlist_mutable_iter iter;

!     CACHE1_elog(DEBUG2, "CatCacheInvalidateByHash: called");

      /*
       * We don't bother to check whether the cache has finished initialization
*************** CatCacheInvalidate(CatCache *cache, uint
*** 603,609 ****
              }
              else
                  CatCacheRemoveCTup(cache, ct);
!             CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated");
  #ifdef CATCACHE_STATS
              cache->cc_invals++;
  #endif
--- 603,609 ----
              }
              else
                  CatCacheRemoveCTup(cache, ct);
!             CACHE1_elog(DEBUG2, "CatCacheInvalidateByHash: invalidated");
  #ifdef CATCACHE_STATS
              cache->cc_invals++;
  #endif
*************** CatCacheInvalidate(CatCache *cache, uint
*** 612,617 ****
--- 612,683 ----
      }
  }

+ /*
+  *    CatCacheInvalidateByOid
+  *
+  *    Invalidate negative entries in the specified cache, given a target OID.
+  *
+  *    We delete negative cache entries that have that OID value in column ckey.
+  *    While we could also examine positive entries, there's no need to do so in
+  *    current usage: any relevant positive entries should have been flushed by
+  *    CatCacheInvalidateByHash calls due to deletions of those catalog entries.
+  *
+  *    This routine is only quasi-public: it should only be used by inval.c.
+  */
+ void
+ CatCacheInvalidateByOid(CatCache *cache, int ckey, Oid oid)
+ {
+     dlist_mutable_iter iter;
+     int            i;
+
+     CACHE1_elog(DEBUG2, "CatCacheInvalidateByOid: called");
+
+     /* If the cache hasn't finished initialization, there's nothing to do */
+     if (cache->cc_tupdesc == NULL)
+         return;
+
+     /* Assert that an OID column has been targeted */
+     Assert(TupleDescAttr(cache->cc_tupdesc,
+                          cache->cc_keyno[ckey - 1] - 1)->atttypid == OIDOID);
+
+     /*
+      * There seems no need to flush CatCLists; removal of negative entries
+      * shouldn't affect the validity of searches.
+      */
+
+     /*
+      * Scan the whole cache for matches
+      */
+     for (i = 0; i < cache->cc_nbuckets; i++)
+     {
+         dlist_head *bucket = &cache->cc_bucket[i];
+
+         dlist_foreach_modify(iter, bucket)
+         {
+             CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+
+             /* We only care about live negative entries */
+             if (ct->dead || !ct->negative)
+                 continue;
+             /* Negative entries won't be in clists */
+             Assert(ct->c_list == NULL);
+
+             if (oid == DatumGetObjectId(ct->keys[ckey - 1]))
+             {
+                 if (ct->refcount > 0)
+                     ct->dead = true;
+                 else
+                     CatCacheRemoveCTup(cache, ct);
+                 CACHE1_elog(DEBUG2, "CatCacheInvalidateByOid: invalidated");
+ #ifdef CATCACHE_STATS
+                 cache->cc_invals++;
+ #endif
+                 /* could be multiple matches, so keep looking! */
+             }
+         }
+     }
+ }
+
  /* ----------------------------------------------------------------
   *                       public functions
   * ----------------------------------------------------------------
*************** CatCacheCopyKeys(TupleDesc tupdesc, int
*** 1995,2001 ****
   *    the specified relation, find all catcaches it could be in, compute the
   *    correct hash value for each such catcache, and call the specified
   *    function to record the cache id and hash value in inval.c's lists.
!  *    SysCacheInvalidate will be called later, if appropriate,
   *    using the recorded information.
   *
   *    For an insert or delete, tuple is the target tuple and newtuple is NULL.
--- 2061,2067 ----
   *    the specified relation, find all catcaches it could be in, compute the
   *    correct hash value for each such catcache, and call the specified
   *    function to record the cache id and hash value in inval.c's lists.
!  *    SysCacheInvalidateByHash will be called later, if appropriate,
   *    using the recorded information.
   *
   *    For an insert or delete, tuple is the target tuple and newtuple is NULL.
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 5bc08b0..168a97d 100644
*** a/src/backend/utils/cache/inval.c
--- b/src/backend/utils/cache/inval.c
*************** AppendInvalidationMessageList(Invalidati
*** 331,349 ****
   */

  /*
!  * Add a catcache inval entry
   */
  static void
! AddCatcacheInvalidationMessage(InvalidationListHeader *hdr,
!                                int id, uint32 hashValue, Oid dbId)
  {
      SharedInvalidationMessage msg;

      Assert(id < CHAR_MAX);
!     msg.cc.id = SharedInvalCatcache;
!     msg.cc.cacheId = (int8) id;
!     msg.cc.dbId = dbId;
!     msg.cc.hashValue = hashValue;

      /*
       * Define padding bytes in SharedInvalidationMessage structs to be
--- 331,349 ----
   */

  /*
!  * Add a catcache inval-by-hash entry
   */
  static void
! AddCatcacheHashInvalidationMessage(InvalidationListHeader *hdr,
!                                    int id, uint32 hashValue, Oid dbId)
  {
      SharedInvalidationMessage msg;

      Assert(id < CHAR_MAX);
!     msg.cch.id = SharedInvalCatcacheHash;
!     msg.cch.cacheId = (int8) id;
!     msg.cch.dbId = dbId;
!     msg.cch.hashValue = hashValue;

      /*
       * Define padding bytes in SharedInvalidationMessage structs to be
*************** AddCatcacheInvalidationMessage(Invalidat
*** 360,365 ****
--- 360,386 ----
  }

  /*
+  * Add a catcache inval-by-OID entry
+  */
+ static void
+ AddCatcacheOidInvalidationMessage(InvalidationListHeader *hdr,
+                                   int id, int ckey, Oid oid, Oid dbId)
+ {
+     SharedInvalidationMessage msg;
+
+     Assert(id < CHAR_MAX);
+     msg.cco.id = SharedInvalCatcacheOid;
+     msg.cco.cacheId = (int8) id;
+     msg.cco.ckey = (int8) ckey;
+     msg.cco.oid = oid;
+     msg.cco.dbId = dbId;
+     /* check AddCatcacheHashInvalidationMessage() for an explanation */
+     VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));
+
+     AddInvalidationMessage(&hdr->cclist, &msg);
+ }
+
+ /*
   * Add a whole-catalog inval entry
   */
  static void
*************** AddCatalogInvalidationMessage(Invalidati
*** 371,377 ****
      msg.cat.id = SharedInvalCatalog;
      msg.cat.dbId = dbId;
      msg.cat.catId = catId;
!     /* check AddCatcacheInvalidationMessage() for an explanation */
      VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));

      AddInvalidationMessage(&hdr->cclist, &msg);
--- 392,398 ----
      msg.cat.id = SharedInvalCatalog;
      msg.cat.dbId = dbId;
      msg.cat.catId = catId;
!     /* check AddCatcacheHashInvalidationMessage() for an explanation */
      VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));

      AddInvalidationMessage(&hdr->cclist, &msg);
*************** AddRelcacheInvalidationMessage(Invalidat
*** 401,407 ****
      msg.rc.id = SharedInvalRelcache;
      msg.rc.dbId = dbId;
      msg.rc.relId = relId;
!     /* check AddCatcacheInvalidationMessage() for an explanation */
      VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));

      AddInvalidationMessage(&hdr->rclist, &msg);
--- 422,428 ----
      msg.rc.id = SharedInvalRelcache;
      msg.rc.dbId = dbId;
      msg.rc.relId = relId;
!     /* check AddCatcacheHashInvalidationMessage() for an explanation */
      VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));

      AddInvalidationMessage(&hdr->rclist, &msg);
*************** AddSnapshotInvalidationMessage(Invalidat
*** 427,433 ****
      msg.sn.id = SharedInvalSnapshot;
      msg.sn.dbId = dbId;
      msg.sn.relId = relId;
!     /* check AddCatcacheInvalidationMessage() for an explanation */
      VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));

      AddInvalidationMessage(&hdr->rclist, &msg);
--- 448,454 ----
      msg.sn.id = SharedInvalSnapshot;
      msg.sn.dbId = dbId;
      msg.sn.relId = relId;
!     /* check AddCatcacheHashInvalidationMessage() for an explanation */
      VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));

      AddInvalidationMessage(&hdr->rclist, &msg);
*************** ProcessInvalidationMessagesMulti(Invalid
*** 477,493 ****
   */

  /*
!  * RegisterCatcacheInvalidation
   *
!  * Register an invalidation event for a catcache tuple entry.
   */
  static void
! RegisterCatcacheInvalidation(int cacheId,
!                              uint32 hashValue,
!                              Oid dbId)
  {
!     AddCatcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
!                                    cacheId, hashValue, dbId);
  }

  /*
--- 498,529 ----
   */

  /*
!  * RegisterCatcacheHashInvalidation
   *
!  * Register an invalidation event for a catcache tuple entry identified
!  * by hash value.
   */
  static void
! RegisterCatcacheHashInvalidation(int cacheId,
!                                  uint32 hashValue,
!                                  Oid dbId)
  {
!     AddCatcacheHashInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
!                                        cacheId, hashValue, dbId);
! }
!
! /*
!  * RegisterCatcacheOidInvalidation
!  *
!  * Register an invalidation event for catcache tuple entries having
!  * the specified OID in a particular cache key column.
!  */
! static void
! RegisterCatcacheOidInvalidation(int cacheId,
!                                 int ckey, Oid oid, Oid dbId)
! {
!     AddCatcacheOidInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
!                                       cacheId, ckey, oid, dbId);
  }

  /*
*************** LocalExecuteInvalidationMessage(SharedIn
*** 556,569 ****
  {
      switch ((SharedInvalMsgType) msg->id)
      {
!         case SharedInvalCatcache:
!             if (msg->cc.dbId == MyDatabaseId || msg->cc.dbId == InvalidOid)
              {
                  InvalidateCatalogSnapshot();

!                 SysCacheInvalidate(msg->cc.cacheId, msg->cc.hashValue);

!                 CallSyscacheCallbacks(msg->cc.cacheId, msg->cc.hashValue);
              }
              break;
          case SharedInvalCatalog:
--- 592,612 ----
  {
      switch ((SharedInvalMsgType) msg->id)
      {
!         case SharedInvalCatcacheHash:
!             if (msg->cch.dbId == MyDatabaseId || msg->cch.dbId == InvalidOid)
              {
                  InvalidateCatalogSnapshot();

!                 SysCacheInvalidateByHash(msg->cch.cacheId, msg->cch.hashValue);

!                 CallSyscacheCallbacks(msg->cch.cacheId, msg->cch.hashValue);
!             }
!             break;
!         case SharedInvalCatcacheOid:
!             if (msg->cco.dbId == MyDatabaseId || msg->cco.dbId == InvalidOid)
!             {
!                 SysCacheInvalidateByOid(msg->cco.cacheId, msg->cco.ckey,
!                                         msg->cco.oid);
              }
              break;
          case SharedInvalCatalog:
*************** CacheInvalidateHeapTuple(Relation relati
*** 1157,1163 ****
      }
      else
          PrepareToInvalidateCacheTuple(relation, tuple, newtuple,
!                                       RegisterCatcacheInvalidation);

      /*
       * Now, is this tuple one of the primary definers of a relcache entry? See
--- 1200,1206 ----
      }
      else
          PrepareToInvalidateCacheTuple(relation, tuple, newtuple,
!                                       RegisterCatcacheHashInvalidation);

      /*
       * Now, is this tuple one of the primary definers of a relcache entry? See
*************** CacheInvalidateHeapTuple(Relation relati
*** 1217,1222 ****
--- 1260,1286 ----
  }

  /*
+  * CacheInvalidateCatcacheByOid
+  *        Register invalidation of catcache entries referencing a given OID.
+  *
+  * This is used to kill negative catcache entries that are believed to be
+  * no longer useful.  The entries are identified by which cache they are
+  * in, the cache key column to look at, and the target OID.
+  *
+  * Note: we expect caller to know whether the specified cache is on a
+  * shared or local system catalog.  We could ask syscache.c for that info,
+  * but it seems probably not worth the trouble, since this is likely to
+  * have few callers.
+  */
+ void
+ CacheInvalidateCatcacheByOid(int cacheId, bool isshared, int ckey, Oid oid)
+ {
+     Oid            dbId = isshared ? (Oid) 0 : MyDatabaseId;
+
+     RegisterCatcacheOidInvalidation(cacheId, ckey, oid, dbId);
+ }
+
+ /*
   * CacheInvalidateCatalog
   *        Register invalidation of the whole content of a system catalog.
   *
*************** CacheInvalidateSmgr(RelFileNodeBackend r
*** 1359,1365 ****
      msg.sm.backend_hi = rnode.backend >> 16;
      msg.sm.backend_lo = rnode.backend & 0xffff;
      msg.sm.rnode = rnode.node;
!     /* check AddCatcacheInvalidationMessage() for an explanation */
      VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));

      SendSharedInvalidMessages(&msg, 1);
--- 1423,1429 ----
      msg.sm.backend_hi = rnode.backend >> 16;
      msg.sm.backend_lo = rnode.backend & 0xffff;
      msg.sm.rnode = rnode.node;
!     /* check AddCatcacheHashInvalidationMessage() for an explanation */
      VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));

      SendSharedInvalidMessages(&msg, 1);
*************** CacheInvalidateRelmap(Oid databaseId)
*** 1387,1393 ****

      msg.rm.id = SharedInvalRelmap;
      msg.rm.dbId = databaseId;
!     /* check AddCatcacheInvalidationMessage() for an explanation */
      VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));

      SendSharedInvalidMessages(&msg, 1);
--- 1451,1457 ----

      msg.rm.id = SharedInvalRelmap;
      msg.rm.dbId = databaseId;
!     /* check AddCatcacheHashInvalidationMessage() for an explanation */
      VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));

      SendSharedInvalidMessages(&msg, 1);
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index ac98c19..3e5acd5 100644
*** a/src/backend/utils/cache/syscache.c
--- b/src/backend/utils/cache/syscache.c
*************** SearchSysCacheList(int cacheId, int nkey
*** 1434,1448 ****
  }

  /*
!  * SysCacheInvalidate
   *
   *    Invalidate entries in the specified cache, given a hash value.
!  *    See CatCacheInvalidate() for more info.
   *
   *    This routine is only quasi-public: it should only be used by inval.c.
   */
  void
! SysCacheInvalidate(int cacheId, uint32 hashValue)
  {
      if (cacheId < 0 || cacheId >= SysCacheSize)
          elog(ERROR, "invalid cache ID: %d", cacheId);
--- 1434,1448 ----
  }

  /*
!  * SysCacheInvalidateByHash
   *
   *    Invalidate entries in the specified cache, given a hash value.
!  *    See CatCacheInvalidateByHash() for more info.
   *
   *    This routine is only quasi-public: it should only be used by inval.c.
   */
  void
! SysCacheInvalidateByHash(int cacheId, uint32 hashValue)
  {
      if (cacheId < 0 || cacheId >= SysCacheSize)
          elog(ERROR, "invalid cache ID: %d", cacheId);
*************** SysCacheInvalidate(int cacheId, uint32 h
*** 1451,1457 ****
      if (!PointerIsValid(SysCache[cacheId]))
          return;

!     CatCacheInvalidate(SysCache[cacheId], hashValue);
  }

  /*
--- 1451,1478 ----
      if (!PointerIsValid(SysCache[cacheId]))
          return;

!     CatCacheInvalidateByHash(SysCache[cacheId], hashValue);
! }
!
! /*
!  * SysCacheInvalidateByOid
!  *
!  *    Invalidate negative entries in the specified cache, given a target OID.
!  *    See CatCacheInvalidateByOid() for more info.
!  *
!  *    This routine is only quasi-public: it should only be used by inval.c.
!  */
! void
! SysCacheInvalidateByOid(int cacheId, int ckey, Oid oid)
! {
!     if (cacheId < 0 || cacheId >= SysCacheSize)
!         elog(ERROR, "invalid cache ID: %d", cacheId);
!
!     /* if this cache isn't initialized yet, no need to do anything */
!     if (!PointerIsValid(SysCache[cacheId]))
!         return;
!
!     CatCacheInvalidateByOid(SysCache[cacheId], ckey, oid);
  }

  /*
diff --git a/src/include/storage/sinval.h b/src/include/storage/sinval.h
index d0d9ece..004cb45 100644
*** a/src/include/storage/sinval.h
--- b/src/include/storage/sinval.h
***************
*** 20,26 ****

  /*
   * We support several types of shared-invalidation messages:
!  *    * invalidate a specific tuple in a specific catcache
   *    * invalidate all catcache entries from a given system catalog
   *    * invalidate a relcache entry for a specific logical relation
   *    * invalidate all relcache entries
--- 20,27 ----

  /*
   * We support several types of shared-invalidation messages:
!  *    * invalidate a specific tuple (identified by hash) in a specific catcache
!  *    * invalidate negative entries matching a given OID in a specific catcache
   *    * invalidate all catcache entries from a given system catalog
   *    * invalidate a relcache entry for a specific logical relation
   *    * invalidate all relcache entries
***************
*** 30,36 ****
   * More types could be added if needed.  The message type is identified by
   * the first "int8" field of the message struct.
   *
!  * Catcache inval events are initially driven by detecting tuple inserts,
   * updates and deletions in system catalogs (see CacheInvalidateHeapTuple).
   * An update can generate two inval events, one for the old tuple and one for
   * the new, but this is reduced to one event if the tuple's hash key doesn't
--- 31,37 ----
   * More types could be added if needed.  The message type is identified by
   * the first "int8" field of the message struct.
   *
!  * Catcache hash inval events are initially driven by detecting tuple inserts,
   * updates and deletions in system catalogs (see CacheInvalidateHeapTuple).
   * An update can generate two inval events, one for the old tuple and one for
   * the new, but this is reduced to one event if the tuple's hash key doesn't
***************
*** 57,63 ****

  typedef enum SharedInvalMsgType
  {
!     SharedInvalCatcache,
      SharedInvalCatalog,
      SharedInvalRelcache,
      SharedInvalSmgr,
--- 58,65 ----

  typedef enum SharedInvalMsgType
  {
!     SharedInvalCatcacheHash,
!     SharedInvalCatcacheOid,
      SharedInvalCatalog,
      SharedInvalRelcache,
      SharedInvalSmgr,
*************** typedef struct
*** 71,77 ****
      int8        cacheId;        /* cache ID */
      Oid            dbId;            /* database ID, or 0 if a shared relation */
      uint32        hashValue;        /* hash value of key for this catcache */
! } SharedInvalCatcacheMsg;

  typedef struct
  {
--- 73,88 ----
      int8        cacheId;        /* cache ID */
      Oid            dbId;            /* database ID, or 0 if a shared relation */
      uint32        hashValue;        /* hash value of key for this catcache */
! } SharedInvalCatcacheHashMsg;
!
! typedef struct
! {
!     int8        id;                /* type field --- must be first */
!     int8        cacheId;        /* cache ID */
!     int8        ckey;            /* cache key column (1..CATCACHE_MAXKEYS) */
!     Oid            oid;            /* OID of cache entries to remove */
!     Oid            dbId;            /* database ID, or 0 if a shared relation */
! } SharedInvalCatcacheOidMsg;

  typedef struct
  {
*************** typedef struct
*** 112,118 ****
  typedef union
  {
      int8        id;                /* type field --- must be first */
!     SharedInvalCatcacheMsg cc;
      SharedInvalCatalogMsg cat;
      SharedInvalRelcacheMsg rc;
      SharedInvalSmgrMsg sm;
--- 123,130 ----
  typedef union
  {
      int8        id;                /* type field --- must be first */
!     SharedInvalCatcacheHashMsg cch;
!     SharedInvalCatcacheOidMsg cco;
      SharedInvalCatalogMsg cat;
      SharedInvalRelcacheMsg rc;
      SharedInvalSmgrMsg sm;
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 65d816a..47b72d6 100644
*** a/src/include/utils/catcache.h
--- b/src/include/utils/catcache.h
*************** extern void ReleaseCatCacheList(CatCList
*** 219,225 ****

  extern void ResetCatalogCaches(void);
  extern void CatalogCacheFlushCatalog(Oid catId);
! extern void CatCacheInvalidate(CatCache *cache, uint32 hashValue);
  extern void PrepareToInvalidateCacheTuple(Relation relation,
                                HeapTuple tuple,
                                HeapTuple newtuple,
--- 219,226 ----

  extern void ResetCatalogCaches(void);
  extern void CatalogCacheFlushCatalog(Oid catId);
! extern void CatCacheInvalidateByHash(CatCache *cache, uint32 hashValue);
! extern void CatCacheInvalidateByOid(CatCache *cache, int ckey, Oid oid);
  extern void PrepareToInvalidateCacheTuple(Relation relation,
                                HeapTuple tuple,
                                HeapTuple newtuple,
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index c557640..d1181bc 100644
*** a/src/include/utils/inval.h
--- b/src/include/utils/inval.h
*************** extern void CacheInvalidateHeapTuple(Rel
*** 39,44 ****
--- 39,47 ----
                           HeapTuple tuple,
                           HeapTuple newtuple);

+ extern void CacheInvalidateCatcacheByOid(int cacheId, bool isshared,
+                              int ckey, Oid oid);
+
  extern void CacheInvalidateCatalog(Oid catalogId);

  extern void CacheInvalidateRelcache(Relation relation);
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index 95ee489..983fd00 100644
*** a/src/include/utils/syscache.h
--- b/src/include/utils/syscache.h
*************** struct catclist;
*** 159,165 ****
  extern struct catclist *SearchSysCacheList(int cacheId, int nkeys,
                     Datum key1, Datum key2, Datum key3);

! extern void SysCacheInvalidate(int cacheId, uint32 hashValue);

  extern bool RelationInvalidatesSnapshotsOnly(Oid relid);
  extern bool RelationHasSysCache(Oid relid);
--- 159,166 ----
  extern struct catclist *SearchSysCacheList(int cacheId, int nkeys,
                     Datum key1, Datum key2, Datum key3);

! extern void SysCacheInvalidateByHash(int cacheId, uint32 hashValue);
! extern void SysCacheInvalidateByOid(int cacheId, int ckey, Oid oid);

  extern bool RelationInvalidatesSnapshotsOnly(Oid relid);
  extern bool RelationHasSysCache(Oid relid);

RE: Protect syscache from bloating with negative cache entries

От
"Tsunakawa, Takayuki"
Дата:
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> I'm really disappointed by the direction this thread is going in.
> The latest patches add an enormous amount of mechanism, and user-visible
> complexity, to do something that we learned was a bad idea decades ago.
> Putting a limit on the size of the syscaches doesn't accomplish anything
> except to add cycles if your cache working set is below the limit, or make
> performance fall off a cliff if it's above the limit.  I don't think there's
> any reason to believe that making it more complicated will avoid that
> problem.
> 
> What does seem promising is something similar to Horiguchi-san's original
> patches all the way back at
> 
> https://www.postgresql.org/message-id/20161219.201505.11562604.horiguc
> hi.kyotaro@lab.ntt.co.jp

> so I'd been thinking about ways to fix that case in particular.

You're suggesting that we go back to the original issue (bloat caused by negative cache entries) and give it a simpler solution
for now, aren't you?  That may be the way to go.
 

But the syscache/relcache bloat still remains a problem when there are many live tables and application connections.
Would you agree to solve this in some way?  I thought Horiguchi-san's latest patches would solve both this and the negative
entries.  Can we consider that his patch and yours are orthogonal, i.e., that we can pursue Horiguchi-san's patch after yours
is committed?
 

(As you said, some parts of Horiguchi-san's patches may be made simpler.  For example, the ability to change another
session's GUC variable can be discussed in a separate thread.)
 

I think we need some limit on the size of the relcache, syscache, and plancache.  Oracle and MySQL both have one, using
LRU to evict less frequently used entries.  You seem to be concerned about the LRU management based on your experience,
but would it really cost so much now, as long as each postgres process can change the LRU list without coordination with
other backends?  Could you share your experience?
 

FYI, Oracle provides one parameter, shared_pool_size, that determines the size of a memory area that contains SQL plans
and various dictionary objects.  Oracle decides how to divide the area among constituents.  So it could be possible that
one component (e.g. table/index metadata) is short of space, and another (e.g. SQL plans) has free space.  Oracle
provides a system view to see the free space and hit/miss of each component.  If one component suffers from memory
shortage, the user increases shared_pool_size.  This is similar to what Horiguchi-san is proposing.
 

MySQL enables fine-tuning of each component.  It provides the size parameters for six memory partitions of the
dictionary object cache, and the usage statistics of those partitions through the Performance Schema.
 

tablespace definition cache
schema definition cache
table definition cache
stored program definition cache
character set definition cache
collation definition cache

I wonder whether we can group existing relcache/syscache entries like this.



[MySQL]
14.4 Dictionary Object Cache
https://dev.mysql.com/doc/refman/8.0/en/data-dictionary-object-cache.html
--------------------------------------------------
The dictionary object cache is a shared global cache that stores previously accessed data dictionary objects in memory
to enable object reuse and minimize disk I/O. Similar to other cache mechanisms used by MySQL, the dictionary object
cache uses an LRU-based eviction strategy to evict least recently used objects from memory.
 

The dictionary object cache comprises cache partitions that store different object types. Some cache partition size
limits are configurable, whereas others are hardcoded.
 
--------------------------------------------------


8.12.3.1 How MySQL Uses Memory
https://dev.mysql.com/doc/refman/8.0/en/memory-use.html
--------------------------------------------------
table_open_cache
MySQL requires memory and descriptors for the table cache.

table_definition_cache
For InnoDB, table_definition_cache acts as a soft limit for the number of open table instances in the InnoDB data
dictionary cache. If the number of open table instances exceeds the table_definition_cache setting, the LRU mechanism
begins to mark table instances for eviction and eventually removes them from the data dictionary cache. The limit helps
address situations in which significant amounts of memory would be used to cache rarely used table instances until the
next server restart.
 
--------------------------------------------------

Regards
Takayuki Tsunakawa





Re: Protect syscache from bloating with negative cache entries

От
Tom Lane
Дата:
"Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:
> But the syscache/relcache bloat still remains a problem, when there are many live tables and application connections.
Would you agree to solve this in some way?  I thought Horiguchi-san's latest patches would solve this and the negative
entries. Can we consider that his patch and yours are orthogonal, i.e., we can pursue Horiguchi-san's patch after yours
is committed?

Certainly, what I've done here doesn't preclude adding some wider solution to
the issue of extremely large catcaches.  I think it takes the pressure off
for one rather narrow problem case, and the mechanism could be used to fix
other ones.  But if you've got an application that just plain accesses a
huge number of objects, this isn't going to make your life better.

> (As you said, some parts of Horiguchi-san's patches may be made simpler.  For example, the ability to change another
session's GUC variable can be discussed in a separate thread.)

Yeah, that idea seems just bad from here ...

> I think we need some limit to the size of the relcache, syscache, and plancache.  Oracle and MySQL both have it,
using LRU to evict less frequently used entries.  You seem to be concerned about the LRU management based on your
experience, but would it really cost so much as long as each postgres process can change the LRU list without
coordination with other backends now?  Could you share your experience?

Well, we *had* an LRU mechanism for the catcaches way back when.  We got
rid of it --- see commit 8b9bc234a --- because (a) maintaining the LRU
info was expensive and (b) performance fell off a cliff in scenarios where
the cache size limit was exceeded.  You could probably find some more info
about that by scanning the mail list archives from around the time of that
commit, but I'm too lazy to do so right now.

That was a dozen years ago, and it's possible that machine performance
has moved so much since then that the problems are gone or mitigated.
In particular I'm sure that any limit we would want to impose today will
be far more than the 5000-entries-across-all-caches limit that was in use
back then.  But I'm not convinced that a workload that would create 100K
cache entries in the first place wouldn't have severe problems if you
tried to constrain it to use only 80K entries.  I fear it's just wishful
thinking to imagine that the behavior of a larger cache won't be just
like a smaller one.  Also, IIRC some of the problem with the LRU code
was that it resulted in lots of touches of unrelated data, leading to
CPU cache miss problems.  It's hard to see how that doesn't get even
worse with a bigger cache.

As far as the relcache goes, we've never had a limit on that, but there
are enough routine causes of relcache flushes --- autovacuum for instance
--- that I'm not really convinced relcache bloat can be a big problem in
production.

The plancache has never had a limit either, which is a design choice that
was strongly influenced by our experience with catcaches.  Again, I'm
concerned about the costs of adding a management layer, and the likelihood
that cache flushes will simply remove entries we'll soon have to rebuild.

> FYI, Oracle provides one parameter, shared_pool_size, that determine the
> size of a memory area that contains SQL plans and various dictionary
> objects.  Oracle decides how to divide the area among constituents.  So
> it could be possible that one component (e.g. table/index metadata) is
> short of space, and another (e.g. SQL plans) has free space.  Oracle
> provides a system view to see the free space and hit/miss of each
> component.  If one component suffers from memory shortage, the user
> increases shared_pool_size.  This is similar to what Horiguchi-san is
> proposing.

Oracle seldom impresses me as having designs we ought to follow.
They have a well-earned reputation for requiring a lot of expertise to
operate, which is not the direction this project should be going in.
In particular, I don't want to "solve" cache size issues by exposing
a bunch of knobs that most users won't know how to twiddle.

            regards, tom lane


Re: Protect syscache from bloating with negative cache entries

От
"andres@anarazel.de"
Дата:
Hi,

On 2019-01-15 13:32:36 -0500, Tom Lane wrote:
> Well, we *had* an LRU mechanism for the catcaches way back when.  We got
> rid of it --- see commit 8b9bc234a --- because (a) maintaining the LRU
> info was expensive and (b) performance fell off a cliff in scenarios where
> the cache size limit was exceeded.  You could probably find some more info
> about that by scanning the mail list archives from around the time of that
> commit, but I'm too lazy to do so right now.
> 
> That was a dozen years ago, and it's possible that machine performance
> has moved so much since then that the problems are gone or mitigated.
> In particular I'm sure that any limit we would want to impose today will
> be far more than the 5000-entries-across-all-caches limit that was in use
> back then.  But I'm not convinced that a workload that would create 100K
> cache entries in the first place wouldn't have severe problems if you
> tried to constrain it to use only 80K entries.

I think that'd be true if the accesses were truly randomly
distributed - but that's not the case in the cases where I've seen huge
caches.  It's usually workloads that have tons of functions, partitions,
... and a lot of them are not that frequently accessed, but because we
have no cache purging mechanism they stay around for a long time. This is
often exacerbated by using a pooler to keep connections around for
longer (which you have to, to cope with other limits of PG).


> As far as the relcache goes, we've never had a limit on that, but there
> are enough routine causes of relcache flushes --- autovacuum for instance
> --- that I'm not really convinced relcache bloat can be a big problem in
> production.

It definitely is.


> The plancache has never had a limit either, which is a design choice that
> was strongly influenced by our experience with catcaches.

This sounds a lot like having learned lessons from one bad implementation
and applying them far outside of that situation.

Greetings,

Andres Freund


Re: Protect syscache from bloating with negative cache entries

От
Kenneth Marshall
Дата:
On Tue, Jan 15, 2019 at 01:32:36PM -0500, Tom Lane wrote:
> ... 
> > FYI, Oracle provides one parameter, shared_pool_size, that determine the
> > size of a memory area that contains SQL plans and various dictionary
> > objects.  Oracle decides how to divide the area among constituents.  So
> > it could be possible that one component (e.g. table/index metadata) is
> > short of space, and another (e.g. SQL plans) has free space.  Oracle
> > provides a system view to see the free space and hit/miss of each
> > component.  If one component suffers from memory shortage, the user
> > increases shared_pool_size.  This is similar to what Horiguchi-san is
> > proposing.
> 
> Oracle seldom impresses me as having designs we ought to follow.
> They have a well-earned reputation for requiring a lot of expertise to
> operate, which is not the direction this project should be going in.
> In particular, I don't want to "solve" cache size issues by exposing
> a bunch of knobs that most users won't know how to twiddle.
> 
>             regards, tom lane

+1

Regards,
Ken 


RE: Protect syscache from bloating with negative cache entries

От
"Tsunakawa, Takayuki"
Дата:
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Certainly, what I've done here doesn't preclude adding some wider solution
> to
> the issue of extremely large catcaches.

I'm relieved to hear that.

> I think it takes the pressure off
> for one rather narrow problem case, and the mechanism could be used to fix
> other ones.  But if you've got an application that just plain accesses a
> huge number of objects, this isn't going to make your life better.

I understand you're trying to solve the problem caused by negative cache entries as soon as possible, because the user
is really suffering from it.  I feel sympathy with that attitude, because you seem to be always addressing issues that
others are reluctant to take.  That's one of the reasons I respect you.
 


> Well, we *had* an LRU mechanism for the catcaches way back when.  We got
> rid of it --- see commit 8b9bc234a --- because (a) maintaining the LRU
> info was expensive and (b) performance fell off a cliff in scenarios where
> the cache size limit was exceeded.  You could probably find some more info
> about that by scanning the mail list archives from around the time of that
> commit, but I'm too lazy to do so right now.

Oh, in 2006...  I'll examine the patch and the discussion to see how the LRU management was done.


> That was a dozen years ago, and it's possible that machine performance
> has moved so much since then that the problems are gone or mitigated.

I really, really hope so.  Even if we see some visible impact from the LRU management, I think that's a debt PostgreSQL
had to pay but hasn't yet.  Even the single-process MySQL, which doesn't suffer from cache bloat across many server
processes, has the ability to limit the cache.  And PostgreSQL has many parameters for various memory components such as
shared_buffers, wal_buffers, work_mem, etc., so it would be reasonable to also have a limit for the catalog caches.
That said, we can avoid the penalty and retain the current performance by disabling the limit (some_size_param = 0).
 

I think we'll evaluate the impact of LRU management by adding prev and next members to the catcache and relcache
structures, and putting the entry at the front (or back) of the LRU chain every time the entry is obtained.  I think
pgbench's select-only mode is enough for evaluation.  I'd like to hear if any other workload is more appropriate to see
the CPU cache effect.
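
For concreteness, here is a minimal standalone sketch of the kind of per-backend LRU chain I mean (plain C with invented
names; this is not the actual catcache code or the dlist API, just an illustration of the pointer work per access):

#include <stddef.h>

/* A cache entry extended with the prev/next members mentioned above. */
typedef struct CacheEntry
{
    struct CacheEntry *lru_prev;
    struct CacheEntry *lru_next;
    /* ... key, cached tuple, etc. ... */
} CacheEntry;

typedef struct LruList
{
    CacheEntry *head;            /* most recently used */
    CacheEntry *tail;            /* least recently used */
} LruList;

/* Unlink an entry that is currently on the chain. */
static void
lru_unlink(LruList *lru, CacheEntry *e)
{
    if (e->lru_prev)
        e->lru_prev->lru_next = e->lru_next;
    else
        lru->head = e->lru_next;

    if (e->lru_next)
        e->lru_next->lru_prev = e->lru_prev;
    else
        lru->tail = e->lru_prev;

    e->lru_prev = e->lru_next = NULL;
}

/* Put an entry at the front; used on insertion and on every cache hit. */
static void
lru_push_head(LruList *lru, CacheEntry *e)
{
    e->lru_prev = NULL;
    e->lru_next = lru->head;
    if (lru->head)
        lru->head->lru_prev = e;
    lru->head = e;
    if (lru->tail == NULL)
        lru->tail = e;
}

/* Cache hit: move the touched entry to the front of the chain. */
static void
lru_touch(LruList *lru, CacheEntry *e)
{
    if (lru->head == e)
        return;                 /* already most recently used */
    lru_unlink(lru, e);
    lru_push_head(lru, e);
}

/* When the cache exceeds its limit, the tail is the eviction candidate. */
static CacheEntry *
lru_victim(LruList *lru)
{
    return lru->tail;
}

Only lru_touch() runs on the hot path, so the per-access cost is a handful of pointer writes; whether that and the extra
cache-line touches are acceptable is exactly what the pgbench run should tell us.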
 


> In particular I'm sure that any limit we would want to impose today will
> be far more than the 5000-entries-across-all-caches limit that was in use
> back then.  But I'm not convinced that a workload that would create 100K
> cache entries in the first place wouldn't have severe problems if you
> tried to constrain it to use only 80K entries.  I fear it's just wishful
> thinking to imagine that the behavior of a larger cache won't be just
> like a smaller one.  Also, IIRC some of the problem with the LRU code
> was that it resulted in lots of touches of unrelated data, leading to
> CPU cache miss problems.  It's hard to see how that doesn't get even
> worse with a bigger cache.
>
> As far as the relcache goes, we've never had a limit on that, but there
> are enough routine causes of relcache flushes --- autovacuum for instance
> --- that I'm not really convinced relcache bloat can be a big problem in
> production.

As Andres and Robert mentioned, we want to free less frequently used cache entries.  Right now we're suffering
from bloat of up to TBs of memory.  This is a real, not hypothetical, issue...
 



> The plancache has never had a limit either, which is a design choice that
> was strongly influenced by our experience with catcaches.  Again, I'm
> concerned about the costs of adding a management layer, and the likelihood
> that cache flushes will simply remove entries we'll soon have to rebuild.

Fortunately, we're not bothered by the plan cache.  But I remember you said you were annoyed by PL/pgSQL's plan cache
use at Salesforce.  Were you able to overcome it somehow?
 



> Oracle seldom impresses me as having designs we ought to follow.
> They have a well-earned reputation for requiring a lot of expertise to
> operate, which is not the direction this project should be going in.
> In particular, I don't want to "solve" cache size issues by exposing
> a bunch of knobs that most users won't know how to twiddle.


Oracle certainly seems to be difficult to use.  But they seem to be studying other DBMSs to make it simpler to use.
I'm sure they also have a lot we should learn from, and the cache limit is one of them (although MySQL's per-cache
tuning may be better).
 

And having limits for various components would be the first step toward the autonomous database: tunable -> auto tuning
-> autonomous.
 



Regards
Takayuki Tsunakawa





Re: Protect syscache from bloating with negative cache entries

От
Robert Haas
Дата:
On Sun, Jan 13, 2019 at 11:41 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Putting a limit on the size of the syscaches doesn't accomplish anything
> except to add cycles if your cache working set is below the limit, or
> make performance fall off a cliff if it's above the limit.

If you're running on a Turing machine, sure.  But real machines have
finite memory, or at least all the ones I use do.  Horiguchi-san is
right that this is a real, not theoretical problem.  It is one of the
most frequent operational concerns that EnterpriseDB customers have.
I'm not against solving specific cases with more targeted fixes, but I
really believe we need something more.  Andres mentioned one problem
case: connection poolers that eventually end up with a cache entry for
every object in the system.  Another case is that of people who keep
idle connections open for long periods of time; those connections can
gobble up large amounts of memory even though they're not going to use
any of their cache entries any time soon.

The flaw in your thinking, as it seems to me, is that in your concern
for "the likelihood that cache flushes will simply remove entries
we'll soon have to rebuild," you're apparently unwilling to consider
the possibility of workloads where cache flushes will remove entries
we *won't* soon have to rebuild.  Every time that issue gets raised,
you seem to blow it off as if it were not a thing that really happens.
I can't make sense of that position.  Is it really so hard to imagine
a connection pooler that switches the same connection back and forth
between two applications with different working sets?  Or a system
that keeps persistent connections open even when they are idle?  Do
you really believe that a connection that has not accessed a cache
entry in 10 minutes still derives more benefit from that cache entry
than it would from freeing up some memory?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Protect syscache from bloating with negative cache entries

От
Bruce Momjian
Дата:
On Thu, Jan 17, 2019 at 11:33:35AM -0500, Robert Haas wrote:
> The flaw in your thinking, as it seems to me, is that in your concern
> for "the likelihood that cache flushes will simply remove entries
> we'll soon have to rebuild," you're apparently unwilling to consider
> the possibility of workloads where cache flushes will remove entries
> we *won't* soon have to rebuild.  Every time that issue gets raised,
> you seem to blow it off as if it were not a thing that really happens.
> I can't make sense of that position.  Is it really so hard to imagine
> a connection pooler that switches the same connection back and forth
> between two applications with different working sets?  Or a system
> that keeps persistent connections open even when they are idle?  Do
> you really believe that a connection that has not accessed a cache
> entry in 10 minutes still derives more benefit from that cache entry
> than it would from freeing up some memory?

Well, I think everyone agrees there are workloads that cause undesired
cache bloat.  What we have not found is a solution that doesn't cause
code complexity or undesired overhead, or one that >1% of users will
know how to use.

Unfortunately, because we have not found something we are happy with, we
have done nothing.  I agree LRU can be expensive.  What if we do some
kind of clock sweep and expiration like we do for shared buffers?  I
think the trick is figuring out how frequently to do the sweep.  What if we
mark entries as unused every 10 queries, mark them as used on first use,
and delete cache entries that have not been used in the past 10 queries.
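
Roughly, I'm picturing something like the following (a toy standalone sketch with made-up names, not the real catcache
code; when to call cache_sweep(), whether every N queries, on a timer, or at hash resize, is the open question):

#include <stdlib.h>

#define NBUCKETS   1024
#define MAX_USAGE  2                /* cap, like shared buffers' usage_count */

typedef struct Entry
{
    struct Entry *next;             /* hash chain link */
    int           usage;            /* bumped on access, decayed by the sweep */
    /* ... key and cached data ... */
} Entry;

static Entry *buckets[NBUCKETS];

/* Called on every cache hit. */
static void
entry_accessed(Entry *e)
{
    if (e->usage < MAX_USAGE)
        e->usage++;
}

/*
 * Periodic sweep: decay usage counters and free entries that have not been
 * touched since the previous sweep.  Returns the number of entries removed.
 */
static int
cache_sweep(void)
{
    int         nremoved = 0;

    for (int i = 0; i < NBUCKETS; i++)
    {
        Entry     **prevp = &buckets[i];
        Entry      *e = buckets[i];

        while (e != NULL)
        {
            Entry      *next = e->next;

            if (e->usage > 0)
            {
                e->usage--;         /* survived this round; decay */
                prevp = &e->next;
            }
            else
            {
                *prevp = next;      /* untouched since last sweep: evict */
                free(e);
                nremoved++;
            }
            e = next;
        }
    }
    return nremoved;
}

The hot path only does the bounded increment in entry_accessed(), so the overhead question is mostly about how often
the full sweep runs.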

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


Re: Protect syscache from bloating with negative cache entries

От
Gavin Flower
Дата:
On 18/01/2019 08:48, Bruce Momjian wrote:
> On Thu, Jan 17, 2019 at 11:33:35AM -0500, Robert Haas wrote:
>> The flaw in your thinking, as it seems to me, is that in your concern
>> for "the likelihood that cache flushes will simply remove entries
>> we'll soon have to rebuild," you're apparently unwilling to consider
>> the possibility of workloads where cache flushes will remove entries
>> we *won't* soon have to rebuild.  Every time that issue gets raised,
>> you seem to blow it off as if it were not a thing that really happens.
>> I can't make sense of that position.  Is it really so hard to imagine
>> a connection pooler that switches the same connection back and forth
>> between two applications with different working sets?  Or a system
>> that keeps persistent connections open even when they are idle?  Do
>> you really believe that a connection that has not accessed a cache
>> entry in 10 minutes still derives more benefit from that cache entry
>> than it would from freeing up some memory?
> Well, I think everyone agrees there are workloads that cause undesired
> cache bloat.  What we have not found is a solution that doesn't cause
> code complexity or undesired overhead, or one that >1% of users will
> know how to use.
>
> Unfortunately, because we have not found something we are happy with, we
> have done nothing.  I agree LRU can be expensive.  What if we do some
> kind of clock sweep and expiration like we do for shared buffers?  I
> think the trick is figuring how frequently to do the sweep.  What if we
> mark entries as unused every 10 queries, mark them as used on first use,
> and delete cache entries that have not be used in the past 10 queries.
>
If you take that approach, then this number should be configurable.  
What if I had 12 common queries I used in rotation?

The ARM3 processor cache logic was to simply eject an entry at random,
as Acorn obviously felt that the silicon required for a more
sophisticated algorithm would reduce the cache size too much!

I upgraded my Acorn Archimedes, which had an 8MHz bus, from an 8MHz ARM2
to a 25MHz ARM3; that is a clock rate improvement of about 3 times.
However, BASIC programs ran about 7 times faster, which I put down to the
ARM3 having a cache.

Obviously for Postgres this is not directly relevant, but I think it
suggests that it may be worth considering replacing cache items at
random: there are no pathological corner cases, and the logic is
very simple.
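
Something along these lines, as a rough standalone sketch (invented names, nothing to do with the real catcache layout):

#include <stdlib.h>

typedef struct Entry
{
    struct Entry *next;             /* hash chain link */
    /* ... key and cached data ... */
} Entry;

/* Pick a pseudo-random occupied bucket and drop its first entry. */
static void
evict_random(Entry **buckets, int nbuckets)
{
    for (int tries = 0; tries < nbuckets; tries++)
    {
        int         i = rand() % nbuckets;

        if (buckets[i] != NULL)
        {
            Entry      *victim = buckets[i];

            buckets[i] = victim->next;
            free(victim);
            return;
        }
    }
}

No per-access bookkeeping at all; the only cost is paid when the cache is over its limit.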


Cheers,
Gavin





Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
Hello.

At Fri, 18 Jan 2019 11:46:03 +1300, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote in
<4e62e6b7-0ffb-54ae-3757-5583fcca38c0@archidevsys.co.nz>
> On 18/01/2019 08:48, Bruce Momjian wrote:
> > On Thu, Jan 17, 2019 at 11:33:35AM -0500, Robert Haas wrote:
> >> The flaw in your thinking, as it seems to me, is that in your concern
> >> for "the likelihood that cache flushes will simply remove entries
> >> we'll soon have to rebuild," you're apparently unwilling to consider
> >> the possibility of workloads where cache flushes will remove entries
> >> we *won't* soon have to rebuild.  Every time that issue gets raised,
> >> you seem to blow it off as if it were not a thing that really happens.
> >> I can't make sense of that position.  Is it really so hard to imagine
> >> a connection pooler that switches the same connection back and forth
> >> between two applications with different working sets?  Or a system
> >> that keeps persistent connections open even when they are idle?  Do
> >> you really believe that a connection that has not accessed a cache
> >> entry in 10 minutes still derives more benefit from that cache entry
> >> than it would from freeing up some memory?
> > Well, I think everyone agrees there are workloads that cause undesired
> > cache bloat.  What we have not found is a solution that doesn't cause
> > code complexity or undesired overhead, or one that >1% of users will
> > know how to use.
> >
> > Unfortunately, because we have not found something we are happy with,
> > we
> > have done nothing.  I agree LRU can be expensive.  What if we do some
> > kind of clock sweep and expiration like we do for shared buffers?  I

So, it doesn't use LRU but a kind of clock-sweep method. When the
current hash fills up and resizing (doubling) it would exceed the
threshold, it tries to trim away the entries that have been left
unused for a duration corresponding to their usage count. This is
not a hard limit but seems to be a good compromise.

> > think the trick is figuring how frequently to do the sweep.  What if
> > we
> > mark entries as unused every 10 queries, mark them as used on first
> > use,
> > and delete cache entries that have not be used in the past 10 queries.

As above, it tries pruning at every resize. So the only complexity
added to the frequent paths is setting the last-accessed time and
incrementing the access counter. It scans the whole hash at resize
time, but that doesn't add much compared to the resizing itself.

> If you take that approach, then this number should be configurable. 
> What if I had 12 common queries I used in rotation?

This basically has two knobs, per catcache: the minimum hash size
at which pruning starts, and the idle time before unused entries
are reaped.

> The ARM3 processor cache logic was to simply eject an entry at random,
> as the obviously Acorn felt that the silicon required to have a more
> sophisticated algorithm would reduce the cache size too much!
>
> I upgraded my Acorn Archimedes that had an 8MHZ bus, from an 8MHz ARM2
> to a 25MZ ARM3. that is a clock rate improvement of about 3 times. 
> However BASIC programs ran about 7 times faster, which I put down to
> the ARM3 having a cache.
>
> Obviously for Postgres this is not directly relevant, but I think it
> suggests that it may be worth considering replacing cache items at
> random.  As there are no pathological corner cases, and the logic is
> very simple.

Memory was more expensive than nowadays by... about 10^3 times?  An
obvious advantage of random reaping is that it requires less silicon.
I don't think we need to be so stingy, but perhaps clock-sweep is the
most we can afford to pay.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
At Fri, 18 Jan 2019 16:39:29 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20190118.163929.229869562.horiguchi.kyotaro@lab.ntt.co.jp>
> Hello.
>
> At Fri, 18 Jan 2019 11:46:03 +1300, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote in
<4e62e6b7-0ffb-54ae-3757-5583fcca38c0@archidevsys.co.nz>
> > On 18/01/2019 08:48, Bruce Momjian wrote:
> > > Unfortunately, because we have not found something we are happy with,
> > > we
> > > have done nothing.  I agree LRU can be expensive.  What if we do some
> > > kind of clock sweep and expiration like we do for shared buffers?  I
>
> So, it doesn't use LRU but a kind of clock-sweep method. If it
> finds the size is about to exceed the threshold by
> resiz(doubl)ing when the current hash is filled up, it tries to
> trim away the entries that are left for a duration corresponding
> to usage count. This is not a hard limit but seems to be a good
> compromise.
>
> > > think the trick is figuring how frequently to do the sweep.  What if
> > > we
> > > mark entries as unused every 10 queries, mark them as used on first
> > > use,
> > > and delete cache entries that have not be used in the past 10 queries.
>
> As above, it tires pruning at every resizing time. So this adds
> complexity to the frequent paths only by setting last accessed
> time and incrementing access counter. It scans the whole hash at
> resize time but it doesn't add much comparing to resizing itself.
>
> > If you take that approach, then this number should be configurable. 
> > What if I had 12 common queries I used in rotation?
>
> This basically has two knobs. The minimum hash size to do the
> pruning and idle time before reaping unused entries, per
> catcache.

This is the rebased version.

0001: catcache pruning

syscache_memory_target controls, per cache, the minimum size a
cache must exceed before pruning is considered.

syscache_prune_min_age controls the minimum idle duration after
which a catcache entry can be removed.
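
For example (just an illustration; note that the attached patch currently spells these cache_* in guc.c and
postgresql.conf.sample but syscache_* in the SGML docs, which should be unified):

SET cache_memory_target = '2MB';    -- don't consider pruning below about 2MB
SET cache_prune_min_age = '300s';   -- prune entries left unused for 5 minutes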

0002: catcache statistics view

track_syscache_usage_interval is the interval at which catcache
statistics are collected.

pg_stat_syscache is the view that shows the statistics.
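
For instance, the biggest caches per backend can be checked with something like this (illustration only; column names
as defined by the view in the attached patch):

SELECT pid, relname, cache_name, size, ntuples, searches, hits, neg_hits
  FROM pg_stat_syscache
 ORDER BY size DESC
 LIMIT 10;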


0003: Remote GUC setting

It is independent of the above two, and heavily arguable.

pg_set_backend_config(pid, name, value) changes the GUC <name> on
the backend with <pid> to <value>.
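
For example, to lower the pruning threshold on an idle pooled connection from another session (hypothetical usage,
assuming text arguments as in the attached 0003 patch; the pid is just a placeholder):

SELECT pg_set_backend_config(12345, 'cache_prune_min_age', '60');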

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
From 7071de30e79507f55d8021dc9c8b6801a292745c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 16 Oct 2018 13:04:30 +0900
Subject: [PATCH 1/3] Remove entries that haven't been used for a certain time

Catcache entries can be left alone for several reasons. It is not
desirable that they eat up memory. This patch adds consideration of
removing entries that haven't been used for a certain time before
enlarging the hash array.
---
 doc/src/sgml/config.sgml                      |  38 ++++++
 src/backend/access/transam/xact.c             |   5 +
 src/backend/utils/cache/catcache.c            | 166 ++++++++++++++++++++++++--
 src/backend/utils/misc/guc.c                  |  23 ++++
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/utils/catcache.h                  |  28 ++++-
 6 files changed, 254 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b6f5822b84..af3c52b868 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1662,6 +1662,44 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target">
+      <term><varname>syscache_memory_target</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_memory_target</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to which syscache is expanded
+        without pruning. The value defaults to 0, indicating that pruning is
+        always considered. After exceeding this size, syscache pruning is
+        considered according to
+        <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep
+        certain amount of syscache entries with intermittent usage, try
+        increase this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age">
+      <term><varname>syscache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the minimum amount of unused time in seconds after which a
+        syscache entry is considered for removal. -1 indicates that syscache
+        pruning is disabled entirely. The value defaults to 600 seconds
+        (<literal>10 minutes</literal>). The syscache entries that are not
+        used for the duration can be removed to prevent syscache bloat. This
+        behavior is suppressed until the size of syscache exceeds
+        <xref linkend="guc-syscache-memory-target"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 18467d96d2..dbffec8067 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -733,7 +733,12 @@ void
 SetCurrentStatementStartTimestamp(void)
 {
     if (!IsParallelWorker())
+    {
         stmtStartTimestamp = GetCurrentTimestamp();
+
+        /* Set this timestamp as the approximate current time */
+        SetCatCacheClock(stmtStartTimestamp);
+    }
     else
         Assert(stmtStartTimestamp != 0);
 }
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 8152f7e21e..ee40093553 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -72,9 +72,24 @@
 #define CACHE6_elog(a,b,c,d,e,f,g)
 #endif
 
+/*
+ * GUC variable to define the minimum size of hash to consider entry eviction.
+ * This variable is shared among various cache mechanisms.
+ */
+int cache_memory_target = 0;
+
+/* GUC variable to define the minimum age of entries that will be considered to
+ * be evicted, in seconds. This variable is shared among various cache
+ * mechanisms.
+ */
+int cache_prune_min_age = 600;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Timestamp used for any operation on caches. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -491,6 +506,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
         CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys,
                          cache->cc_keyno, ct->keys);
 
+    cache->cc_tupsize -= ct->size;
     pfree(ct);
 
     --cache->cc_ntup;
@@ -842,6 +858,7 @@ InitCatCache(int id,
     cp->cc_nkeys = nkeys;
     for (i = 0; i < nkeys; ++i)
         cp->cc_keyno[i] = key[i];
+    cp->cc_tupsize = 0;
 
     /*
      * new cache is initialized as far as we can go for now. print some
@@ -859,9 +876,129 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize catcache reference clock if we haven't done so yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left alone for several reasons. We remove them if
+ * they are not accessed for a certain time, to prevent the catcache from
+ * bloating. The eviction is performed with an algorithm similar to buffer
+ * eviction, using an access counter. Entries that are accessed several times can
+ * live longer than those that have had no access in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int            i;
+    int            nremoved = 0;
+    size_t        hash_size;
+#ifdef CATCACHE_STATS
+    /* These variables are only for debugging purpose */
+    int            ntotal = 0;
+    /*
+     * The nth element of nentries stores the number of cache entries that have
+     * lived unaccessed for the corresponding multiple (in ageclass) of
+     * cache_prune_min_age. The index of nremoved_entry is the value of the
+     * clock-sweep counter, which ranges from 0 to 2.
+     */
+    double        ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+    int            nentries[] = {0, 0, 0, 0, 0, 0};
+    int            nremoved_entry[3] = {0, 0, 0};
+    int            j;
+#endif
+
+    /* Return immediately if no pruning is wanted */
+    if (cache_prune_min_age < 0)
+        return false;
+
+    /*
+     * Return without pruning if the size of the hash is below the target.
+     */
+    hash_size = cp->cc_nbuckets * sizeof(dlist_head);
+    if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L)
+        return false;
+    
+    /* Search the whole hash for entries to remove */
+    for (i = 0; i < cp->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+
+            /*
+             * Calculate the duration from the time of the last access to the
+             * "current" time. Since catcacheclock is not advanced within a
+             * transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+#ifdef CATCACHE_STATS
+            /* count catcache entries for each age class */
+            ntotal++;
+            for (j = 0 ;
+                 ageclass[j] != 0.0 &&
+                     entry_age > cache_prune_min_age * ageclass[j] ;
+                 j++);
+            if (ageclass[j] == 0.0) j--;
+            nentries[j]++;
+#endif
+
+            /*
+             * Try to remove entries older than cache_prune_min_age seconds.
+             * Entries that have not been accessed since the last pruning are
+             * removed after that many seconds, and entries that have been
+             * accessed several times are removed only after being left alone
+             * for up to three times that duration. We don't try to shrink the
+             * buckets since pruning effectively caps catcache expansion in the
+             * long term.
+             */
+            if (entry_age > cache_prune_min_age)
+            {
+#ifdef CATCACHE_STATS
+                Assert (ct->naccess >= 0 && ct->naccess <= 2);
+                nremoved_entry[ct->naccess]++;
+#endif
+                if (ct->naccess > 0)
+                    ct->naccess--;
+                else
+                {
+                    if (!ct->c_list || ct->c_list->refcount == 0)
+                    {
+                        CatCacheRemoveCTup(cp, ct);
+                        nremoved++;
+                    }
+                }
+            }
+        }
+    }
+
+#ifdef CATCACHE_STATS
+    ereport(DEBUG1,
+            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d,
2:%d)",
+                     nremoved, ntotal,
+                     ageclass[0] * cache_prune_min_age, nentries[0],
+                     ageclass[1] * cache_prune_min_age, nentries[1],
+                     ageclass[2] * cache_prune_min_age, nentries[2],
+                     ageclass[3] * cache_prune_min_age, nentries[3],
+                     ageclass[4] * cache_prune_min_age, nentries[4],
+                     nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]),
+             errhidestmt(true)));
+#endif
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -1275,6 +1412,11 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /* Update access information for pruning */
+        if (ct->naccess < 2)
+            ct->naccess++;
+        ct->lastaccess = catcacheclock;
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1820,11 +1962,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     CatCTup    *ct;
     HeapTuple    dtp;
     MemoryContext oldcxt;
+    int            tupsize = 0;
 
     /* negative entries have no tuple associated */
     if (ntp)
     {
         int            i;
 
         Assert(!negative);
 
@@ -1843,13 +1987,14 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
         /* Allocate memory for CatCTup and the cached tuple in one go */
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
 
-        ct = (CatCTup *) palloc(sizeof(CatCTup) +
-                                MAXIMUM_ALIGNOF + dtp->t_len);
+        tupsize = sizeof(CatCTup) +    MAXIMUM_ALIGNOF + dtp->t_len;
+        ct = (CatCTup *) palloc(tupsize);
         ct->tuple.t_len = dtp->t_len;
         ct->tuple.t_self = dtp->t_self;
         ct->tuple.t_tableOid = dtp->t_tableOid;
         ct->tuple.t_data = (HeapTupleHeader)
             MAXALIGN(((char *) ct) + sizeof(CatCTup));
+        ct->size = tupsize;
         /* copy tuple contents */
         memcpy((char *) ct->tuple.t_data,
                (const char *) dtp->t_data,
@@ -1877,8 +2022,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     {
         Assert(negative);
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
-        ct = (CatCTup *) palloc(sizeof(CatCTup));
-
+        tupsize = sizeof(CatCTup);
+        ct = (CatCTup *) palloc(tupsize);
         /*
          * Store keys - they'll point into separately allocated memory if not
          * by-value.
@@ -1899,17 +2044,24 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
+    ct->size = tupsize;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
     cache->cc_ntup++;
     CacheHdr->ch_ntup++;
+    cache->cc_tupsize += tupsize;
 
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, try cleanup by removing
+     * infrequently used entries to make a room for the new entry. If it
+     * failed, enlarge the bucket array instead.  Quite arbitrarily, we try
+     * this when fill factor > 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
 
     return ct;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c216ed0922..134c357bf3 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -80,6 +80,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/memutils.h"
@@ -2190,6 +2191,28 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"cache_memory_target", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum syscache size to keep."),
+            gettext_noop("Cache is not pruned before exceeding this size."),
+            GUC_UNIT_KB
+        },
+        &cache_memory_target,
+        0, 0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum unused duration of cache entries before removal."),
+            gettext_noop("Cache entries that live unused for longer than this seconds are considered to be
removed."),
+            GUC_UNIT_S
+        },
+        &cache_prune_min_age,
+        600, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a21865a77f..d82af3bd6c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -128,6 +128,8 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#cache_memory_target = 0kB    # in kB
+#cache_prune_min_age = 600s    # -1 disables pruning
 #max_stack_depth = 2MB            # min 100kB
 #dynamic_shared_memory_type = posix    # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 65d816a583..5d24809900 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,7 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
+    int            cc_tupsize;        /* total amount of catcache tuples */
 
     /*
      * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,7 +121,9 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
-
+    int            naccess;        /* # of access to this entry, up to 2  */
+    TimestampTz    lastaccess;        /* approx. timestamp of the last usage */
+    int            size;            /* palloc'ed size of this tuple */
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
      * catcache is list-searched with varying numbers of keys, we may have to
@@ -189,6 +193,28 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLPMPORT'ed */
+extern int cache_prune_min_age;
+extern int cache_memory_target;
+
+/* to use as access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/*
+ * SetCatCacheClock - set timestamp for catcache access record
+ */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
+static inline TimestampTz
+GetCatCacheClock(void)
+{
+    return catcacheclock;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.16.3

From 7cc50a1bf62290c704d90cd9b5b740d68cd8f646 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 16 Oct 2018 15:48:28 +0900
Subject: [PATCH 2/3] Syscache usage tracking feature.

Collects syscache usage statistics and shows them using the view
pg_stat_syscache. The feature is controlled by the GUC variable
track_syscache_usage_interval.
---
 doc/src/sgml/config.sgml                      |  15 ++
 src/backend/catalog/system_views.sql          |  17 +++
 src/backend/postmaster/pgstat.c               | 206 ++++++++++++++++++++++++--
 src/backend/tcop/postgres.c                   |  23 +++
 src/backend/utils/adt/pgstatfuncs.c           | 134 +++++++++++++++++
 src/backend/utils/cache/catcache.c            | 115 ++++++++++----
 src/backend/utils/cache/syscache.c            |  24 +++
 src/backend/utils/init/globals.c              |   1 +
 src/backend/utils/init/postinit.c             |  11 ++
 src/backend/utils/misc/guc.c                  |  10 ++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/catalog/pg_proc.dat               |   9 ++
 src/include/miscadmin.h                       |   1 +
 src/include/pgstat.h                          |   7 +-
 src/include/utils/catcache.h                  |   9 +-
 src/include/utils/syscache.h                  |  19 +++
 src/include/utils/timeout.h                   |   1 +
 src/test/regress/expected/rules.out           |  24 ++-
 18 files changed, 582 insertions(+), 45 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index af3c52b868..6dd024340b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6662,6 +6662,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-syscache-usage-interval" xreflabel="track_syscache_usage_interval">
+      <term><varname>track_syscache_usage_interval</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>track_syscache_usage_interval</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the interval, in milliseconds, at which system cache usage
+        statistics are collected. This parameter is 0 by default, which means
+        the feature is disabled.
+        Only superusers can change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-io-timing" xreflabel="track_io_timing">
       <term><varname>track_io_timing</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f4d9e9daf7..30e2da935a 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -904,6 +904,22 @@ CREATE VIEW pg_stat_progress_vacuum AS
     FROM pg_stat_get_progress_info('VACUUM') AS S
         LEFT JOIN pg_database D ON S.datid = D.oid;
 
+CREATE VIEW pg_stat_syscache AS
+    SELECT
+        S.pid                AS pid,
+        S.relid::regclass    AS relname,
+        S.indid::regclass    AS cache_name,
+        S.size                AS size,
+        S.ntup                AS ntuples,
+        S.searches            AS searches,
+        S.hits                AS hits,
+        S.neg_hits            AS neg_hits,
+        S.ageclass            AS ageclass,
+        S.last_update        AS last_update
+    FROM pg_stat_activity A
+    JOIN LATERAL (SELECT A.pid, * FROM pg_get_syscache_stats(A.pid)) S
+        ON (A.pid = S.pid);
+
 CREATE VIEW pg_user_mappings AS
     SELECT
         U.oid       AS umid,
@@ -1183,6 +1199,7 @@ GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_get_syscache_stats(int) TO pg_monitor;
 
 GRANT pg_read_all_settings TO pg_monitor;
 GRANT pg_read_all_stats TO pg_monitor;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 13da412c59..2c0c6b343e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -66,6 +66,7 @@
 #include "utils/ps_status.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 #include "utils/tqual.h"
 
@@ -125,6 +126,7 @@
 bool        pgstat_track_activities = false;
 bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
+int            pgstat_track_syscache_usage_interval = 0;
 int            pgstat_track_activity_query_size = 1024;
 
 /* ----------
@@ -237,6 +239,11 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
+/* bitmap symbols to specify which types of stats file to remove */
+#define PGSTAT_REMFILE_DBSTAT    1        /* remove only database stats files */
+#define PGSTAT_REMFILE_SYSCACHE    2        /* remove only syscache stats files */
+#define PGSTAT_REMFILE_ALL        3        /* remove both types of files */
+
 /*
  * Info about current "snapshot" of stats file
  */
@@ -631,10 +638,13 @@ startup_failed:
 }
 
 /*
- * subroutine for pgstat_reset_all
+ * remove stats files
+ *
+ * Clean up stats files in the specified directory. target is one of
+ * PGSTAT_REMFILE_DBSTAT/SYSCACHE/ALL and restricts which files are removed.
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+pgstat_reset_remove_files(const char *directory, int target)
 {
     DIR           *dir;
     struct dirent *entry;
@@ -645,25 +655,39 @@ pgstat_reset_remove_files(const char *directory)
     {
         int            nchars;
         Oid            tmp_oid;
+        int            filetype = 0;
 
         /*
          * Skip directory entries that don't match the file names we write.
          * See get_dbstat_filename for the database-specific pattern.
          */
         if (strncmp(entry->d_name, "global.", 7) == 0)
+        {
+            filetype = PGSTAT_REMFILE_DBSTAT;
             nchars = 7;
+        }
         else
         {
+            char head[2];
+            
             nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
+            (void) sscanf(entry->d_name, "%c%c_%u.%n",
+                          head, head + 1, &tmp_oid, &nchars);
+
             /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
+            if (nchars < 3 || !isdigit(entry->d_name[3]))
                 continue;
+
+            if (strncmp(head, "db", 2) == 0)
+                filetype = PGSTAT_REMFILE_DBSTAT;
+            else if (strncmp(head, "cc", 2) == 0)
+                filetype = PGSTAT_REMFILE_SYSCACHE;
         }
 
+        /* skip if this is not a target */
+        if ((filetype & target) == 0)
+            continue;
+
         if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
             strcmp(entry->d_name + nchars, "stat") != 0)
             continue;
@@ -684,8 +708,9 @@ pgstat_reset_remove_files(const char *directory)
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_ALL);
+    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY,
+                              PGSTAT_REMFILE_ALL);
 }
 
 #ifdef EXEC_BACKEND
@@ -4286,6 +4311,9 @@ PgstatCollectorMain(int argc, char *argv[])
     pgStatRunningInCollector = true;
     pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
 
+    /* Remove left-over syscache stats files */
+    pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_SYSCACHE);
+
     /*
      * Loop to process messages until we get SIGQUIT or detect ungraceful
      * death of our parent postmaster.
@@ -6376,3 +6404,163 @@ pgstat_clip_activity(const char *raw_activity)
 
     return activity;
 }
+
+/*
+ * return the filename for a syscache stat file; filename is the output
+ * buffer, of length len.
+ */
+void
+pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid,
+                                 char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/cc_%u.%s",
+                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
+                       pgstat_stat_directory,
+                       backendid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
+}
+
+/*
+ * pgstat_write_syscache_stats() -
+ *        Write the syscache statistics files.
+ *
+ * If 'force' is false, this function skips writing a file and returns the
+ * time remaining in the current interval in milliseconds. If 'force' is true,
+ * it writes a file regardless of the remaining time and resets the interval.
+ */
+long
+pgstat_write_syscache_stats(bool force)
+{
+    static TimestampTz last_report = 0;
+    TimestampTz now;
+    long elapsed;
+    long secs;
+    int     usecs;
+    int    cacheId;
+    FILE    *fpout;
+    char    statfile[MAXPGPATH];
+    char    tmpfile[MAXPGPATH];
+
+    /* Return immediately if tracking is disabled and not forced */
+    if (!force && pgstat_track_syscache_usage_interval <= 0)
+        return 0;
+
+    /* Check the elapsed time against the reporting interval */
+    now = GetCurrentTransactionStopTimestamp();
+    TimestampDifference(last_report, now, &secs, &usecs);
+    elapsed = secs * 1000 + usecs / 1000;
+
+    if (!force && elapsed < pgstat_track_syscache_usage_interval)
+    {
+        /* not yet time; tell the caller the remaining time */
+        return pgstat_track_syscache_usage_interval - elapsed;
+    }
+
+    /* now write the file */
+    last_report = now;
+
+    pgstat_get_syscachestat_filename(false, true,
+                                     MyBackendId, tmpfile, MAXPGPATH);
+    pgstat_get_syscachestat_filename(false, false,
+                                     MyBackendId, statfile, MAXPGPATH);
+
+    /*
+     * This function can be called from ProcessInterrupts(). Hold interrupts
+     * to avoid being entered recursively.
+     */
+    HOLD_INTERRUPTS();
+
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        /*
+         * Failure writing this file is not critical. Just skip this time and
+         * tell caller to wait for the next interval.
+         */
+        RESUME_INTERRUPTS();
+        return pgstat_track_syscache_usage_interval;
+    }
+
+    /* write out the stats of every catcache */
+    for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++)
+    {
+        SysCacheStats *stats;
+        
+        stats = SysCacheGetStats(cacheId);
+        Assert (stats);
+
+        /* write error is checked later using ferror() */
+        fputc('T', fpout);
+        (void)fwrite(&cacheId, sizeof(int), 1, fpout);
+        (void)fwrite(&last_report, sizeof(TimestampTz), 1, fpout);
+        (void)fwrite(stats, sizeof(*stats), 1, fpout);
+    }
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write syscache statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close syscache statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename syscache statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+
+    RESUME_INTERRUPTS();
+    return 0;
+}
+
+/*
+ * GUC assignment callback for track_syscache_usage_interval.
+ *
+ * Create a statistics file immediately when syscache statistics tracking is
+ * turned on, and remove it as soon as it is turned off.
+ */
+void
+pgstat_track_syscache_assign_hook(int newval, void *extra)
+{
+    if (newval > 0)
+    {
+        /*
+         * Immediately create a stats file. This is safe since we are not in
+         * the middle of a syscache access.
+         */
+        pgstat_write_syscache_stats(true);
+    }
+    else
+    {
+        /* Turned off, immediately remove the statsfile */
+        char    fname[MAXPGPATH];
+
+        pgstat_get_syscachestat_filename(false, false, MyBackendId,
+                                         fname, MAXPGPATH);
+        unlink(fname);        /* we don't care about the result */
+    }
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 0c0891b33e..e7972e645f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3157,6 +3157,12 @@ ProcessInterrupts(void)
 
     }
 
+    if (IdleSyscacheStatsUpdateTimeoutPending)
+    {
+        IdleSyscacheStatsUpdateTimeoutPending = false;
+        pgstat_write_syscache_stats(true);
+    }
+
     if (ParallelMessagePending)
         HandleParallelMessages();
 }
@@ -3733,6 +3739,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_catcache_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4173,9 +4180,19 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long timeout;
+
                 ProcessCompletedNotifies();
                 pgstat_report_stat(false);
 
+                timeout = pgstat_write_syscache_stats(false);
+
+                if (timeout > 0)
+                {
+                    disable_idle_catcache_update_timeout = true;
+                    enable_timeout_after(IDLE_CATCACHE_UPDATE_TIMEOUT,
+                                         timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4218,6 +4235,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_catcache_update_timeout)
+        {
+            disable_timeout(IDLE_CATCACHE_UPDATE_TIMEOUT, false);
+            disable_idle_catcache_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 053bb73863..0d32bf8daa 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -14,6 +14,8 @@
  */
 #include "postgres.h"
 
+#include <sys/stat.h>
+
 #include "access/htup_details.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
@@ -28,6 +30,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/inet.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
@@ -1882,3 +1885,134 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     PG_RETURN_DATUM(HeapTupleGetDatum(
                                       heap_form_tuple(tupdesc, values, nulls)));
 }
+
+Datum
+pgstat_get_syscache_stats(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SYSCACHE_SIZE 9
+    int                    pid     = PG_GETARG_INT32(0);
+    ReturnSetInfo       *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+    TupleDesc            tupdesc;
+    Tuplestorestate    *tupstore;
+    MemoryContext        per_query_ctx;
+    MemoryContext        oldcontext;
+    PgBackendStatus       *beentry;
+    int                    beid;
+    char                fname[MAXPGPATH];
+    FILE                  *fpin;
+    char c;
+
+    if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("set-valued function called in context that cannot accept a set")));
+    if (!(rsinfo->allowedModes & SFRM_Materialize))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("materialize mode required, but it is not " \
+                        "allowed in this context")));
+
+    /* Build a tuple descriptor for our result type */
+    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+        elog(ERROR, "return type must be a row type");
+    
+
+    per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+
+    oldcontext = MemoryContextSwitchTo(per_query_ctx);
+    tupstore = tuplestore_begin_heap(true, false, work_mem);
+    rsinfo->returnMode = SFRM_Materialize;
+    rsinfo->setResult = tupstore;
+    rsinfo->setDesc = tupdesc;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    /* find the beentry for the given pid */
+    beentry = NULL;
+    for (beid = 1;
+         (beentry = pgstat_fetch_stat_beentry(beid)) &&
+             beentry->st_procpid != pid ;
+         beid++);
+
+    /*
+     * We silently return an empty result on failure or insufficient privileges.
+     */
+    if (!beentry ||
+        (!has_privs_of_role(GetUserId(), beentry->st_userid) &&
+         !is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS)))
+        goto no_data;
+
+    pgstat_get_syscachestat_filename(false, false, beid, fname, MAXPGPATH);
+
+    if ((fpin = AllocateFile(fname, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(WARNING,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            fname)));
+        /* also return empty on no statistics file */
+        goto no_data;
+    }
+
+    /* read the statistics file into tuplestore */
+    while ((c = fgetc(fpin)) == 'T')
+    {
+        TimestampTz last_update;
+        SysCacheStats stats;
+        int cacheid;
+        Datum values[PG_GET_SYSCACHE_SIZE];
+        bool nulls[PG_GET_SYSCACHE_SIZE] = {0};
+        Datum datums[SYSCACHE_STATS_NAGECLASSES * 2];
+        bool arrnulls[SYSCACHE_STATS_NAGECLASSES * 2] = {0};
+        int    dims[] = {SYSCACHE_STATS_NAGECLASSES, 2};
+        int lbs[] = {1, 1};
+        ArrayType *arr;
+        int i, j;
+
+        fread(&cacheid, sizeof(int), 1, fpin);
+        fread(&last_update, sizeof(TimestampTz), 1, fpin);
+        if (fread(&stats, 1, sizeof(stats), fpin) != sizeof(stats))
+        {
+            ereport(WARNING, 
+                    (errmsg("corrupted syscache statistics file \"%s\"",
+                            fname)));
+            goto no_data;
+        }
+
+        i = 0;
+        values[i++] = ObjectIdGetDatum(stats.reloid);
+        values[i++] = ObjectIdGetDatum(stats.indoid);
+        values[i++] = Int64GetDatum(stats.size);
+        values[i++] = Int64GetDatum(stats.ntuples);
+        values[i++] = Int64GetDatum(stats.nsearches);
+        values[i++] = Int64GetDatum(stats.nhits);
+        values[i++] = Int64GetDatum(stats.nneg_hits);
+
+        for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++)
+        {
+            datums[j * 2] = Int32GetDatum((int32) stats.ageclasses[j]);
+            datums[j * 2 + 1] = Int32GetDatum((int32) stats.nclass_entries[j]);
+        }            
+
+        arr = construct_md_array(datums, arrnulls, 2, dims, lbs,
+                              INT4OID, sizeof(int32), true, 'i');
+        values[i++] = PointerGetDatum(arr);
+
+        values[i++] = TimestampTzGetDatum(last_update);
+
+        Assert (i == PG_GET_SYSCACHE_SIZE);
+
+        tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+    }
+
+    /* Check for end of file; abandon the result if the file is broken. */
+    if (c != 'E' || fgetc(fpin) != EOF)
+        tuplestore_clear(tupstore);
+
+    FreeFile(fpin);
+
+no_data:
+    tuplestore_donestoring(tupstore);
+    return (Datum) 0;
+}
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index ee40093553..4a3b3094a0 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -90,6 +90,10 @@ static CatCacheHeader *CacheHdr = NULL;
 /* Timestamp used for any operation on caches. */
 TimestampTz    catcacheclock = 0;
 
+/* age classes for pruning */
+static double ageclass[SYSCACHE_STATS_NAGECLASSES]
+    = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -620,9 +624,7 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue)
             else
                 CatCacheRemoveCTup(cache, ct);
             CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated");
-#ifdef CATCACHE_STATS
             cache->cc_invals++;
-#endif
             /* could be multiple matches, so keep looking! */
         }
     }
@@ -698,9 +700,7 @@ ResetCatalogCache(CatCache *cache)
             }
             else
                 CatCacheRemoveCTup(cache, ct);
-#ifdef CATCACHE_STATS
             cache->cc_invals++;
-#endif
         }
     }
 }
@@ -907,10 +907,11 @@ CatCacheCleanupOldEntries(CatCache *cp)
      * cache_prune_min_age. The index of nremoved_entry is the value of the
      * clock-sweep counter, which takes from 0 up to 2.
      */
-    double        ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
-    int            nentries[] = {0, 0, 0, 0, 0, 0};
+    int            nentries[SYSCACHE_STATS_NAGECLASSES] = {0, 0, 0, 0, 0, 0};
     int            nremoved_entry[3] = {0, 0, 0};
     int            j;
+
+    Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0);
 #endif
 
     /* Return immediately if no pruning is wanted */
@@ -924,7 +925,11 @@ CatCacheCleanupOldEntries(CatCache *cp)
     if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L)
         return false;
     
-    /* Search the whole hash for entries to remove */
+    /*
+     * Search the whole hash for entries to remove. This is a rather
+     * time-consuming task to perform during a catcache lookup, but it is
+     * acceptable since we are about to expand the hash table anyway.
+     */
     for (i = 0; i < cp->cc_nbuckets; i++)
     {
         dlist_mutable_iter iter;
@@ -937,21 +942,21 @@ CatCacheCleanupOldEntries(CatCache *cp)
 
 
             /*
-             * Calculate the duration from the time of the last access to the
-             * "current" time. Since catcacheclock is not advanced within a
-             * transaction, the entries that are accessed within the current
-             * transaction won't be pruned.
+             * Calculate the duration from the time of the last access to
+             * the "current" time. Since catcacheclock is not advanced within
+             * a transaction, the entries that are accessed within the current
+             * transaction always get 0 as the result.
              */
             TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
 
 #ifdef CATCACHE_STATS
             /* count catcache entries for each age class */
             ntotal++;
-            for (j = 0 ;
-                 ageclass[j] != 0.0 &&
-                     entry_age > cache_prune_min_age * ageclass[j] ;
-                 j++);
-            if (ageclass[j] == 0.0) j--;
+
+            j = 0;
+            while (j < SYSCACHE_STATS_NAGECLASSES - 1 &&
+                   entry_age > cache_prune_min_age * ageclass[j])
+                j++;
             nentries[j]++;
 #endif
 
@@ -984,14 +989,17 @@ CatCacheCleanupOldEntries(CatCache *cp)
     }
 
 #ifdef CATCACHE_STATS
+    StaticAssertStmt(SYSCACHE_STATS_NAGECLASSES == 6,
+                     "number of syscache age class must be 6");
     ereport(DEBUG1,
-            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d,
2:%d)",
+            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d, rest:%d)
naccessed(0:%d,1:%d, 2:%d)",
 
                      nremoved, ntotal,
                      ageclass[0] * cache_prune_min_age, nentries[0],
                      ageclass[1] * cache_prune_min_age, nentries[1],
                      ageclass[2] * cache_prune_min_age, nentries[2],
                      ageclass[3] * cache_prune_min_age, nentries[3],
                      ageclass[4] * cache_prune_min_age, nentries[4],
+                     nentries[5],
                      nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]),
              errhidestmt(true)));
 #endif
@@ -1368,9 +1376,7 @@ SearchCatCacheInternal(CatCache *cache,
     if (unlikely(cache->cc_tupdesc == NULL))
         CatalogCacheInitializeCache(cache);
 
-#ifdef CATCACHE_STATS
     cache->cc_searches++;
-#endif
 
     /* Initialize local parameter array */
     arguments[0] = v1;
@@ -1430,9 +1436,7 @@ SearchCatCacheInternal(CatCache *cache,
             CACHE3_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d",
                         cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
             cache->cc_hits++;
-#endif
 
             return &ct->tuple;
         }
@@ -1441,9 +1445,7 @@ SearchCatCacheInternal(CatCache *cache,
             CACHE3_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d",
                         cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
             cache->cc_neg_hits++;
-#endif
 
             return NULL;
         }
@@ -1571,9 +1573,7 @@ SearchCatCacheMiss(CatCache *cache,
     CACHE3_elog(DEBUG2, "SearchCatCache(%s): put in bucket %d",
                 cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
     cache->cc_newloads++;
-#endif
 
     return &ct->tuple;
 }
@@ -1684,9 +1684,7 @@ SearchCatCacheList(CatCache *cache,
 
     Assert(nkeys > 0 && nkeys < cache->cc_nkeys);
 
-#ifdef CATCACHE_STATS
     cache->cc_lsearches++;
-#endif
 
     /* Initialize local parameter array */
     arguments[0] = v1;
@@ -1743,9 +1741,7 @@ SearchCatCacheList(CatCache *cache,
         CACHE2_elog(DEBUG2, "SearchCatCacheList(%s): found list",
                     cache->cc_relname);
 
-#ifdef CATCACHE_STATS
         cache->cc_lhits++;
-#endif
 
         return cl;
     }
@@ -2253,3 +2249,64 @@ PrintCatCacheListLeakWarning(CatCList *list)
          list->my_cache->cc_relname, list->my_cache->id,
          list, list->refcount);
 }
+
+/*
+ * CatCacheGetStats - fill in SysCacheStats struct.
+ *
+ * This is a support routine for SysCacheGetStats and fills in most of the
+ * result. The classification here uses the same criteria as
+ * CatCacheCleanupOldEntries().
+ */
+void
+CatCacheGetStats(CatCache *cache, SysCacheStats *stats)
+{
+    int    i, j;
+
+    Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0);
+
+    /* fill in the stats struct */
+    stats->size = cache->cc_tupsize + cache->cc_nbuckets * sizeof(dlist_head);
+    stats->ntuples = cache->cc_ntup;
+    stats->nsearches = cache->cc_searches;
+    stats->nhits = cache->cc_hits;
+    stats->nneg_hits = cache->cc_neg_hits;
+
+    /* cache_prune_min_age can be changed within a session, so fill this every time */
+    for (i = 0 ; i < SYSCACHE_STATS_NAGECLASSES ; i++)
+        stats->ageclasses[i] = (int) (cache_prune_min_age * ageclass[i]);
+
+    /*
+     * The n-th element of nclass_entries stores the number of cache entries
+     * that have gone unaccessed for the corresponding ageclass multiple of
+     * cache_prune_min_age.
+     */
+    memset(stats->nclass_entries, 0, sizeof(int) * SYSCACHE_STATS_NAGECLASSES);
+
+    /* Scan the whole hash */
+    for (i = 0; i < cache->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cache->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+            /*
+             * Calculate the duration from the time of the last access to
+             * the "current" time. Since catcacheclock is not advanced within
+             * a transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+            j = 0;
+            while (j < SYSCACHE_STATS_NAGECLASSES - 1 &&
+                   entry_age > stats->ageclasses[j])
+                j++;
+
+            stats->nclass_entries[j]++;
+        }
+    }
+}
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index ac98c19155..7b38a06708 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -20,6 +20,9 @@
  */
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "catalog/indexing.h"
@@ -1534,6 +1537,27 @@ RelationSupportsSysCache(Oid relid)
     return false;
 }
 
+/*
+ * SysCacheGetStats - returns the stats of the specified syscache
+ *
+ * This routine returns a pointer to function-local static memory, which is
+ * overwritten by the next call.
+ */
+SysCacheStats *
+SysCacheGetStats(int cacheId)
+{
+    static SysCacheStats stats;
+
+    Assert(cacheId >=0 && cacheId < SysCacheSize);
+
+    memset(&stats, 0, sizeof(stats));
+
+    stats.reloid = cacheinfo[cacheId].reloid;
+    stats.indoid = cacheinfo[cacheId].indoid;
+
+    CatCacheGetStats(SysCache[cacheId], &stats);
+
+    return &stats;
+}
 
 /*
  * OID comparator for pg_qsort
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index fd51934aaf..f039ecd805 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 7415c4faab..6b0fdbbd87 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleSyscacheStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -629,6 +630,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_CATCACHE_UPDATE_TIMEOUT,
+                        IdleSyscacheStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1240,6 +1243,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleSyscacheStatsUpdateTimeoutHandler(void)
+{
+    IdleSyscacheStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 134c357bf3..e8d7b6998a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3154,6 +3154,16 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"track_syscache_usage_interval", PGC_SUSET, STATS_COLLECTOR,
+            gettext_noop("Sets the interval between syscache usage collection, in milliseconds. Zero disables syscache
usagetracking."),
 
+            NULL
+        },
+        &pgstat_track_syscache_usage_interval,
+        0, 0, INT_MAX / 2,
+        NULL, NULL, NULL
+    },
+
     {
         {"gin_pending_list_limit", PGC_USERSET, CLIENT_CONN_STATEMENT,
             gettext_noop("Sets the maximum size of the pending list for GIN index."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d82af3bd6c..4a6c9fceb5 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -554,6 +554,7 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
+#track_syscache_usage_interval = 0    # zero disables tracking
 #stats_temp_directory = 'pg_stat_tmp'
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 3ecc2e12c3..11fc1f3075 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -9669,6 +9669,15 @@
   proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
   proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}',
   prosrc => 'pg_get_replication_slots' },
+{ oid => '3425',
+  descr => 'syscache statistics',
+  proname => 'pg_get_syscache_stats', prorows => '100', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', prorettype => 'record',
+  proargtypes => 'int4',
+  proallargtypes => '{int4,oid,oid,int8,int8,int8,int8,int8,_int4,timestamptz}',
+  proargmodes => '{i,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}',
+  prosrc => 'pgstat_get_syscache_stats' },
 { oid => '3786', descr => 'set up a logical replication slot',
   proname => 'pg_create_logical_replication_slot', provolatile => 'v',
   proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c9e35003a5..69b9a976f0 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 313ca5f3c3..ee9968f81a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1134,6 +1134,7 @@ extern bool pgstat_track_activities;
 extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
+extern int    pgstat_track_syscache_usage_interval;
 extern char *pgstat_stat_directory;
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
@@ -1218,7 +1219,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
-
+extern void pgstat_get_syscachestat_filename(bool permanent,
+                    bool tempname, int backendid, char *filename, int len);
 /* ----------
  * pgstat_report_wait_start() -
  *
@@ -1353,5 +1355,6 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
-
+extern long pgstat_write_syscache_stats(bool force);
+extern void pgstat_track_syscache_assign_hook(int newval, void *extra);
 #endif                            /* PGSTAT_H */
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 5d24809900..4d51975920 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -65,10 +65,8 @@ typedef struct catcache
     int            cc_tupsize;        /* total amount of catcache tuples */
 
     /*
-     * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
-     * doesn't break ABI for other modules
+     * Statistics entries
      */
-#ifdef CATCACHE_STATS
     long        cc_searches;    /* total # searches against this cache */
     long        cc_hits;        /* # of matches against existing entry */
     long        cc_neg_hits;    /* # of matches against negative entry */
@@ -81,7 +79,6 @@ typedef struct catcache
     long        cc_invals;        /* # of entries invalidated from cache */
     long        cc_lsearches;    /* total # list-searches */
     long        cc_lhits;        /* # of matches against existing lists */
-#endif
 } CatCache;
 
 
@@ -254,4 +251,8 @@ extern void PrepareToInvalidateCacheTuple(Relation relation,
 extern void PrintCatCacheLeakWarning(HeapTuple tuple);
 extern void PrintCatCacheListLeakWarning(CatCList *list);
 
+/* defined in syscache.h */
+typedef struct syscachestats SysCacheStats;
+extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats);
+
 #endif                            /* CATCACHE_H */
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index 95ee48954e..71b399c902 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -112,6 +112,24 @@ enum SysCacheIdentifier
 #define SysCacheSize (USERMAPPINGUSERSERVER + 1)
 };
 
+#define SYSCACHE_STATS_NAGECLASSES 6
+/* Struct for catcache tracking information */
+typedef struct syscachestats
+{
+    Oid        reloid;            /* target relation */
+    Oid        indoid;            /* index */
+    size_t    size;            /* size of the catcache */
+    int        ntuples;        /* number of tuples residing in the catcache */
+    int        nsearches;        /* number of searches */
+    int        nhits;            /* number of cache hits */
+    int        nneg_hits;        /* number of negative cache hits */
+    /* age classes in seconds */
+    int        ageclasses[SYSCACHE_STATS_NAGECLASSES];
+    /* number of tuples falling into the corresponding age class */
+    int        nclass_entries[SYSCACHE_STATS_NAGECLASSES];
+} SysCacheStats;
+
+
 extern void InitCatalogCache(void);
 extern void InitCatalogCachePhase2(void);
 
@@ -164,6 +182,7 @@ extern void SysCacheInvalidate(int cacheId, uint32 hashValue);
 extern bool RelationInvalidatesSnapshotsOnly(Oid relid);
 extern bool RelationHasSysCache(Oid relid);
 extern bool RelationSupportsSysCache(Oid relid);
+extern SysCacheStats *SysCacheGetStats(int cacheId);
 
 /*
  * The use of the macros below rather than direct calls to the corresponding
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 9244a2a7b7..0ab441a364 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_CATCACHE_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e384cd2279..1991e75e97 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1919,6 +1919,28 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
     pg_stat_all_tables.autoanalyze_count
    FROM pg_stat_all_tables
  WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
 
+pg_stat_syscache| SELECT s.pid,
+    (s.relid)::regclass AS relname,
+    (s.indid)::regclass AS cache_name,
+    s.size,
+    s.ntup AS ntuples,
+    s.searches,
+    s.hits,
+    s.neg_hits,
+    s.ageclass,
+    s.last_update
+   FROM (pg_stat_activity a
+     JOIN LATERAL ( SELECT a.pid,
+            pg_get_syscache_stats.relid,
+            pg_get_syscache_stats.indid,
+            pg_get_syscache_stats.size,
+            pg_get_syscache_stats.ntup,
+            pg_get_syscache_stats.searches,
+            pg_get_syscache_stats.hits,
+            pg_get_syscache_stats.neg_hits,
+            pg_get_syscache_stats.ageclass,
+            pg_get_syscache_stats.last_update
+           FROM pg_get_syscache_stats(a.pid) pg_get_syscache_stats(relid, indid, size, ntup, searches, hits, neg_hits, ageclass, last_update)) s ON ((a.pid = s.pid)));
 
 pg_stat_user_functions| SELECT p.oid AS funcid,
     n.nspname AS schemaname,
     p.proname AS funcname,
@@ -2350,7 +2372,7 @@ pg_settings|pg_settings_n|CREATE RULE pg_settings_n AS
     ON UPDATE TO pg_catalog.pg_settings DO INSTEAD NOTHING;
 pg_settings|pg_settings_u|CREATE RULE pg_settings_u AS
     ON UPDATE TO pg_catalog.pg_settings
-   WHERE (new.name = old.name) DO  SELECT set_config(old.name, new.setting, false) AS set_config;
+   WHERE (new.name = old.name) DO  SELECT set_config(old.name, new.setting, false, false) AS set_config;
 rtest_emp|rtest_emp_del|CREATE RULE rtest_emp_del AS
     ON DELETE TO public.rtest_emp DO  INSERT INTO rtest_emplog (ename, who, action, newsal, oldsal)
   VALUES (old.ename, CURRENT_USER, 'fired'::bpchar, '$0.00'::money, old.salary);
-- 
2.16.3

From 4434b92429d9b60baed6f45bf8132a67225b0671 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 18 Jan 2019 17:16:12 +0900
Subject: [PATCH 3/3] Remote GUC setting feature and non-xact GUC config.

This adds two features at once. (will be split later).

One is a non-transactional GUC setting feature. It allows a GUC
variable set with the action GUC_ACTION_NONXACT (the name requires
consideration) to survive beyond a rollback. This is required for
remote GUC setting to work sanely; without it, a remotely-set value
would disappear when the transaction it was set in is rolled back.
The only local interface for the NONXACT action is
set_config(name, value, is_local=false, is_nonxact=true).

The second is a remote GUC setting feature. It uses ProcSignal to
notify the target backend.
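
For illustration, a minimal sketch of how the two interfaces are
intended to be used (names as defined in this patch; the PID is
arbitrary):

    -- non-transactional local setting: the value survives the rollback
    BEGIN;
    SELECT set_config('work_mem', '64MB', false, true);
    ROLLBACK;
    SHOW work_mem;    -- still 64MB

    -- ask the session with PID 2134 to set work_mem locally
    SELECT pg_set_backend_config(2134, 'work_mem', '16MB');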
---
 doc/src/sgml/config.sgml             |   4 +
 doc/src/sgml/func.sgml               |  30 ++
 src/backend/catalog/system_views.sql |   7 +-
 src/backend/postmaster/pgstat.c      |   3 +
 src/backend/storage/ipc/ipci.c       |   2 +
 src/backend/storage/ipc/procsignal.c |   4 +
 src/backend/tcop/postgres.c          |  10 +
 src/backend/utils/misc/README        |  26 +-
 src/backend/utils/misc/guc.c         | 619 +++++++++++++++++++++++++++++++++--
 src/include/catalog/pg_proc.dat      |  10 +-
 src/include/pgstat.h                 |   3 +-
 src/include/storage/procsignal.h     |   3 +
 src/include/utils/guc.h              |  13 +-
 src/include/utils/guc_tables.h       |   5 +-
 src/test/regress/expected/guc.out    | 223 +++++++++++++
 src/test/regress/sql/guc.sql         |  88 +++++
 16 files changed, 1002 insertions(+), 48 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6dd024340b..d024d9b069 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -281,6 +281,10 @@ UPDATE pg_settings SET setting = reset_val WHERE name = 'configuration_parameter
      </listitem>
     </itemizedlist>
 
+    <para>
+     Values in other sessions can also be set using the SQL
+     function <function>pg_set_backend_config</function>.
+    </para>
    </sect2>
 
    <sect2>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 4930ec17f6..aeb0c4483a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -18687,6 +18687,20 @@ SELECT collation for ('foo' COLLATE "de_DE");
        <entry><type>text</type></entry>
        <entry>set parameter and return new value</entry>
       </row>
+      <row>
+       <entry>
+        <indexterm>
+         <primary>pg_set_backend_config</primary>
+        </indexterm>
+        <literal><function>pg_set_backend_config(
+                            <parameter>process_id</parameter>,
+                            <parameter>setting_name</parameter>,
+                            <parameter>new_value</parameter>)
+                            </function></literal>
+       </entry>
+       <entry><type>bool</type></entry>
+       <entry>set parameter on another session</entry>
+      </row>
      </tbody>
     </tgroup>
    </table>
@@ -18741,6 +18755,22 @@ SELECT set_config('log_statement_stats', 'off', false);
 ------------
  off
 (1 row)
+</programlisting>
+   </para>
+
+   <para>
+    <function>pg_set_backend_config</function> sets the parameter
+    <parameter>setting_name</parameter> to
+    <parameter>new_value</parameter> on the other session with PID
+    <parameter>process_id</parameter>. The setting is always session-local.
+    The function returns true on success.  An example:
+<programlisting>
+SELECT pg_set_backend_config(2134, 'work_mem', '16MB');
+
+pg_set_backend_config
+------------
+ t
+(1 row)
 </programlisting>
    </para>
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 30e2da935a..3d2e341c19 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -474,7 +474,7 @@ CREATE VIEW pg_settings AS
 CREATE RULE pg_settings_u AS
     ON UPDATE TO pg_settings
     WHERE new.name = old.name DO
-    SELECT set_config(old.name, new.setting, 'f');
+    SELECT set_config(old.name, new.setting, 'f', 'f');
 
 CREATE RULE pg_settings_n AS
     ON UPDATE TO pg_settings
@@ -1049,6 +1049,11 @@ CREATE OR REPLACE FUNCTION
   RETURNS boolean STRICT VOLATILE LANGUAGE INTERNAL AS 'pg_promote'
   PARALLEL SAFE;
 
+CREATE OR REPLACE FUNCTION set_config (
+        setting_name text, new_value text, is_local boolean, is_nonxact boolean DEFAULT false)
+        RETURNS text STRICT VOLATILE LANGUAGE internal AS 'set_config_by_name'
+        PARALLEL UNSAFE;
+
 -- legacy definition for compatibility with 9.3
 CREATE OR REPLACE FUNCTION
   json_populate_record(base anyelement, from_json json, use_json_as_text boolean DEFAULT false)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 2c0c6b343e..5d6c0edcd9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3707,6 +3707,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
         case WAIT_EVENT_SYNC_REP:
             event_name = "SyncRep";
             break;
+        case WAIT_EVENT_REMOTE_GUC:
+            event_name = "RemoteGUC";
+            break;
             /* no default case, so that compiler will warn */
     }
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2849e47d99..044107b354 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -148,6 +148,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, GucShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -267,6 +268,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    GucShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 7605b2c367..98c0f84378 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -27,6 +27,7 @@
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "tcop/tcopprot.h"
+#include "utils/guc.h"
 
 
 /*
@@ -292,6 +293,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
     if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
         RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
+    if (CheckProcSignal(PROCSIG_REMOTE_GUC))
+        HandleRemoteGucSetInterrupt();
+
     SetLatch(MyLatch);
 
     latch_sigusr1_handler();
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index e7972e645f..3db2a7eacc 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3165,6 +3165,10 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    /* We don't want to change GUC variables while running a query */
+    if (RemoteGucChangePending && DoingCommandRead)
+        HandleGucRemoteChanges();
 }
 
 
@@ -4201,6 +4205,12 @@ PostgresMain(int argc, char *argv[],
             send_ready_for_query = false;
         }
 
+        /*
+         * (2.5) Process some pending work.
+         */
+        if (RemoteGucChangePending)
+            HandleGucRemoteChanges();
+
         /*
          * (2) Allow asynchronous signals to be executed immediately if they
          * come in while we are waiting for client input. (This must be
diff --git a/src/backend/utils/misc/README b/src/backend/utils/misc/README
index 6e294386f7..42ae6c1a8f 100644
--- a/src/backend/utils/misc/README
+++ b/src/backend/utils/misc/README
@@ -169,10 +169,14 @@ Entry to a function with a SET option:
 Plain SET command:
 
     If no stack entry of current level:
-        Push new stack entry w/prior value and state SET
+        Push new stack entry w/prior value and state SET or
+        push new stack entry w/o value and state NONXACT.
     else if stack entry's state is SAVE, SET, or LOCAL:
         change stack state to SET, don't change saved value
         (here we are forgetting effects of prior set action)
+    else if stack entry's state is NONXACT:
+        change stack state to NONXACT_SET, copying the current value
+        into prior.
     else (entry must have state SET+LOCAL):
         discard its masked value, change state to SET
         (here we are forgetting effects of prior SET and SET LOCAL)
@@ -185,13 +189,20 @@ SET LOCAL command:
     else if stack entry's state is SAVE or LOCAL or SET+LOCAL:
         no change to stack entry
         (in SAVE case, SET LOCAL will be forgotten at func exit)
+    else if stack entry's state is NONXACT:
+        copy the current value into both the prior and masked slots,
+        set state NONXACT+LOCAL.
     else (entry must have state SET):
         put current active into its masked slot, set state SET+LOCAL
     Now set new value.
 
+Setting by NONXACT action (no command exists):
+    Always blow away the existing stack, then create a new NONXACT entry.
+
 Transaction or subtransaction abort:
 
-    Pop stack entries, restoring prior value, until top < subxact depth
+    Pop stack entries, restoring prior value unless the stack entry's
+    state is NONXACT, until top < subxact depth
 
 Transaction or subtransaction commit (incl. successful function exit):
 
@@ -199,9 +210,9 @@ Transaction or subtransaction commit (incl. successful function exit):
 
         if entry's state is SAVE:
             pop, restoring prior value
-        else if level is 1 and entry's state is SET+LOCAL:
+        else if level is 1 and entry's state is SET+LOCAL or NONXACT+LOCAL:
             pop, restoring *masked* value
-        else if level is 1 and entry's state is SET:
+        else if level is 1 and entry's state is SET or NONXACT+SET:
             pop, discarding old value
         else if level is 1 and entry's state is LOCAL:
             pop, restoring prior value
@@ -210,9 +221,9 @@ Transaction or subtransaction commit (incl. successful function exit):
         else
             merge entries of level N-1 and N as specified below
 
-The merged entry will have level N-1 and prior = older prior, so easiest
-to keep older entry and free newer.  There are 12 possibilities since
-we already handled level N state = SAVE:
+The merged entry will have level N-1 and prior = older prior, so
+easiest to keep older entry and free newer.  Disregarding NONXACT,
+there are 12 possibilities since we already handled level N state = SAVE:
 
 N-1        N
 
@@ -232,6 +243,7 @@ SET+LOCAL    SET        discard top prior and second masked, state SET
 SET+LOCAL    LOCAL        discard top prior, no change to stack entry
 SET+LOCAL    SET+LOCAL    discard top prior, copy masked, state S+L
 
+(TODO: states involving NONXACT)
 
 RESET is executed like a SET, but using the reset_val as the desired new
 value.  (We do not provide a RESET LOCAL command, but SET LOCAL TO DEFAULT
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e8d7b6998a..5a4eaed622 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -217,6 +217,37 @@ static ConfigVariable *ProcessConfigFileInternal(GucContext context,
                           bool applySettings, int elevel);
 
 
+/* Enum and struct to command GUC setting to another backend */
+typedef enum
+{
+    REMGUC_VACANT,
+    REMGUC_REQUEST,
+    REMGUC_INPROCESS,
+    REMGUC_DONE,
+    REMGUC_CANCELING,
+    REMGUC_CANCELED,
+} remote_guc_status;
+
+#define GUC_REMOTE_MAX_VALUE_LEN  1024        /* an arbitrary value */
+#define GUC_REMOTE_CANCEL_TIMEOUT 5000        /* in milliseconds */
+
+typedef struct
+{
+    remote_guc_status     state;
+    char name[NAMEDATALEN];
+    char value[GUC_REMOTE_MAX_VALUE_LEN];
+    int     sourcepid;
+    int     targetpid;
+    Oid     userid;
+    bool success;
+    volatile Latch *sender_latch;
+    LWLock    lock;
+} GucRemoteSetting;
+
+static GucRemoteSetting *remote_setting;
+
+volatile bool RemoteGucChangePending = false;
+
 /*
  * Options for enum values defined in this module.
  *
@@ -3161,7 +3192,7 @@ static struct config_int ConfigureNamesInt[] =
         },
         &pgstat_track_syscache_usage_interval,
         0, 0, INT_MAX / 2,
-        NULL, NULL, NULL
+        NULL, &pgstat_track_syscache_assign_hook, NULL
     },
 
     {
@@ -4730,7 +4761,6 @@ discard_stack_value(struct config_generic *gconf, config_var_value *val)
     set_extra_field(gconf, &(val->extra), NULL);
 }
 
-
 /*
  * Fetch the sorted array pointer (exported for help_config.c's use ONLY)
  */
@@ -5522,6 +5552,22 @@ push_old_value(struct config_generic *gconf, GucAction action)
 
     /* Do we already have a stack entry of the current nest level? */
     stack = gconf->stack;
+
+    /* A NONXACT action makes the existing stack useless */
+    if (action == GUC_ACTION_NONXACT)
+    {
+        while (stack)
+        {
+            GucStack *prev = stack->prev;
+
+            discard_stack_value(gconf, &stack->prior);
+            discard_stack_value(gconf, &stack->masked);
+            pfree(stack);
+            stack = prev;
+        }
+        stack = gconf->stack = NULL;
+    }
+
     if (stack && stack->nest_level >= GUCNestLevel)
     {
         /* Yes, so adjust its state if necessary */
@@ -5529,28 +5575,63 @@ push_old_value(struct config_generic *gconf, GucAction action)
         switch (action)
         {
             case GUC_ACTION_SET:
-                /* SET overrides any prior action at same nest level */
-                if (stack->state == GUC_SET_LOCAL)
+                if (stack->state == GUC_NONXACT)
                 {
-                    /* must discard old masked value */
-                    discard_stack_value(gconf, &stack->masked);
+                    /* NONXACT rolls back to the current value */
+                    stack->scontext = gconf->scontext;
+                    set_stack_value(gconf, &stack->prior);
+                    stack->state = GUC_NONXACT_SET;
                 }
-                stack->state = GUC_SET;
+                else 
+                {
+                    /* SET overrides other prior actions at same nest level */
+                    if (stack->state == GUC_SET_LOCAL)
+                    {
+                        /* must discard old masked value */
+                        discard_stack_value(gconf, &stack->masked);
+                    }
+                    stack->state = GUC_SET;
+                }
+
                 break;
+
             case GUC_ACTION_LOCAL:
                 if (stack->state == GUC_SET)
                 {
-                    /* SET followed by SET LOCAL, remember SET's value */
+                    /* SET followed by SET LOCAL, remember its value */
                     stack->masked_scontext = gconf->scontext;
                     set_stack_value(gconf, &stack->masked);
                     stack->state = GUC_SET_LOCAL;
                 }
+                else if (stack->state == GUC_NONXACT)
+                {
+                    /*
+                     * NONXACT followed by SET LOCAL, both prior and masked
+                     * are set to the current value
+                     */
+                    stack->scontext = gconf->scontext;
+                    set_stack_value(gconf, &stack->prior);
+                    stack->masked_scontext = stack->scontext;
+                    stack->masked = stack->prior;
+                    stack->state = GUC_NONXACT_LOCAL;
+                }
+                else if (stack->state == GUC_NONXACT_SET)
+                {
+                    /* NONXACT_SET followed by SET LOCAL, set masked */
+                    stack->masked_scontext = gconf->scontext;
+                    set_stack_value(gconf, &stack->masked);
+                    stack->state = GUC_NONXACT_LOCAL;
+                }
                 /* in all other cases, no change to stack entry */
                 break;
             case GUC_ACTION_SAVE:
                 /* Could only have a prior SAVE of same variable */
                 Assert(stack->state == GUC_SAVE);
                 break;
+
+            case GUC_ACTION_NONXACT:
+                Assert(false);
+                break;
         }
         Assert(guc_dirty);        /* must be set already */
         return;
@@ -5566,6 +5647,7 @@ push_old_value(struct config_generic *gconf, GucAction action)
 
     stack->prev = gconf->stack;
     stack->nest_level = GUCNestLevel;
+        
     switch (action)
     {
         case GUC_ACTION_SET:
@@ -5577,10 +5659,15 @@ push_old_value(struct config_generic *gconf, GucAction action)
         case GUC_ACTION_SAVE:
             stack->state = GUC_SAVE;
             break;
+        case GUC_ACTION_NONXACT:
+            stack->state = GUC_NONXACT;
+            break;
     }
     stack->source = gconf->source;
     stack->scontext = gconf->scontext;
-    set_stack_value(gconf, &stack->prior);
+
+    if (action != GUC_ACTION_NONXACT)
+        set_stack_value(gconf, &stack->prior);
 
     gconf->stack = stack;
 
@@ -5675,22 +5762,31 @@ AtEOXact_GUC(bool isCommit, int nestLevel)
              * stack entries to avoid leaking memory.  If we do set one of
              * those flags, unused fields will be cleaned up after restoring.
              */
-            if (!isCommit)        /* if abort, always restore prior value */
-                restorePrior = true;
+            if (!isCommit)
+            {
+                /* GUC_NONXACT doesn't roll back */
+                if (stack->state != GUC_NONXACT)
+                    restorePrior = true;
+            }
             else if (stack->state == GUC_SAVE)
                 restorePrior = true;
             else if (stack->nest_level == 1)
             {
                 /* transaction commit */
-                if (stack->state == GUC_SET_LOCAL)
+                if (stack->state == GUC_SET_LOCAL ||
+                    stack->state == GUC_NONXACT_LOCAL)
                     restoreMasked = true;
-                else if (stack->state == GUC_SET)
+                else if (stack->state == GUC_SET ||
+                         stack->state == GUC_NONXACT_SET)
                 {
                     /* we keep the current active value */
                     discard_stack_value(gconf, &stack->prior);
                 }
-                else            /* must be GUC_LOCAL */
+                else if (stack->state != GUC_NONXACT)
+                {
+                    /* must be GUC_LOCAL */
                     restorePrior = true;
+                }
             }
             else if (prev == NULL ||
                      prev->nest_level < stack->nest_level - 1)
@@ -5712,11 +5808,27 @@ AtEOXact_GUC(bool isCommit, int nestLevel)
                         break;
 
                     case GUC_SET:
-                        /* next level always becomes SET */
-                        discard_stack_value(gconf, &stack->prior);
-                        if (prev->state == GUC_SET_LOCAL)
+                        if (prev->state == GUC_SET ||
+                            prev->state == GUC_NONXACT_SET)
+                        {
+                            discard_stack_value(gconf, &stack->prior);
+                        }
+                        else if (prev->state == GUC_NONXACT)
+                        {
+                            prev->scontext = stack->scontext;
+                            prev->prior = stack->prior;
+                            prev->state = GUC_NONXACT_SET;
+                        }
+                        else if (prev->state == GUC_SET_LOCAL ||
+                                 prev->state == GUC_NONXACT_LOCAL)
+                        {
+                            discard_stack_value(gconf, &stack->prior);
                             discard_stack_value(gconf, &prev->masked);
-                        prev->state = GUC_SET;
+                            if (prev->state == GUC_SET_LOCAL)
+                                prev->state = GUC_SET;
+                            else
+                                prev->state = GUC_NONXACT_SET;
+                        }
                         break;
 
                     case GUC_LOCAL:
@@ -5727,6 +5839,16 @@ AtEOXact_GUC(bool isCommit, int nestLevel)
                             prev->masked = stack->prior;
                             prev->state = GUC_SET_LOCAL;
                         }
+                        else if (prev->state == GUC_NONXACT)
+                        {
+                            prev->prior = stack->masked;
+                            prev->scontext = stack->masked_scontext;
+                            prev->masked = stack->masked;
+                            prev->masked_scontext = stack->masked_scontext;
+                            discard_stack_value(gconf, &stack->prior);
+                            discard_stack_value(gconf, &stack->masked);
+                            prev->state = GUC_NONXACT_SET;
+                        }
                         else
                         {
                             /* else just forget this stack level */
@@ -5735,15 +5857,32 @@ AtEOXact_GUC(bool isCommit, int nestLevel)
                         break;
 
                     case GUC_SET_LOCAL:
-                        /* prior state at this level no longer wanted */
-                        discard_stack_value(gconf, &stack->prior);
-                        /* copy down the masked state */
-                        prev->masked_scontext = stack->masked_scontext;
-                        if (prev->state == GUC_SET_LOCAL)
-                            discard_stack_value(gconf, &prev->masked);
-                        prev->masked = stack->masked;
-                        prev->state = GUC_SET_LOCAL;
+                        if (prev->state == GUC_NONXACT)
+                        {
+                            prev->prior = stack->prior;
+                            prev->masked = stack->prior;
+                            discard_stack_value(gconf, &stack->prior);
+                            discard_stack_value(gconf, &stack->masked);
+                            prev->state = GUC_NONXACT_SET;
+                        }
+                        else if (prev->state != GUC_NONXACT_SET)
+                        {
+                            /* prior state at this level no longer wanted */
+                            discard_stack_value(gconf, &stack->prior);
+                            /* copy down the masked state */
+                            prev->masked_scontext = stack->masked_scontext;
+                            if (prev->state == GUC_SET_LOCAL)
+                                discard_stack_value(gconf, &prev->masked);
+                            prev->masked = stack->masked;
+                            prev->state = GUC_SET_LOCAL;
+                        }
                         break;
+                    case GUC_NONXACT:
+                    case GUC_NONXACT_SET:
+                    case GUC_NONXACT_LOCAL:
+                        Assert(false);
+                        break;
+                        
                 }
             }
 
@@ -8024,7 +8163,8 @@ set_config_by_name(PG_FUNCTION_ARGS)
     char       *name;
     char       *value;
     char       *new_value;
-    bool        is_local;
+    int            set_action = GUC_ACTION_SET;
+
 
     if (PG_ARGISNULL(0))
         ereport(ERROR,
@@ -8044,18 +8184,27 @@ set_config_by_name(PG_FUNCTION_ARGS)
      * Get the desired state of is_local. Default to false if provided value
      * is NULL
      */
-    if (PG_ARGISNULL(2))
-        is_local = false;
-    else
-        is_local = PG_GETARG_BOOL(2);
+    if (!PG_ARGISNULL(2) && PG_GETARG_BOOL(2))
+        set_action = GUC_ACTION_LOCAL;
+
+    /*
+     * Get the desired state of is_nonxact. Default to false if provided value
+     * is NULL
+     */
+    if (!PG_ARGISNULL(3) && PG_GETARG_BOOL(3))
+    {
+        if (set_action == GUC_ACTION_LOCAL)
+            ereport(ERROR,
+                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                     errmsg("Only one of is_local and is_nonxact can be true")));
+        set_action = GUC_ACTION_NONXACT;
+    }
 
     /* Note SET DEFAULT (argstring == NULL) is equivalent to RESET */
     (void) set_config_option(name,
                              value,
                              (superuser() ? PGC_SUSET : PGC_USERSET),
-                             PGC_S_SESSION,
-                             is_local ? GUC_ACTION_LOCAL : GUC_ACTION_SET,
-                             true, 0, false);
+                             PGC_S_SESSION, set_action, true, 0, false);
 
     /* get the new current value */
     new_value = GetConfigOptionByName(name, NULL, false);
@@ -8064,7 +8213,6 @@ set_config_by_name(PG_FUNCTION_ARGS)
     PG_RETURN_TEXT_P(cstring_to_text(new_value));
 }
 
-
 /*
  * Common code for DefineCustomXXXVariable subroutines: allocate the
  * new variable's config struct and fill in generic fields.
@@ -8263,6 +8411,13 @@ reapply_stacked_values(struct config_generic *variable,
                                          WARNING, false);
                 break;
 
+            case GUC_NONXACT:
+                (void) set_config_option(name, curvalue,
+                                         curscontext, cursource,
+                                         GUC_ACTION_NONXACT, true,
+                                         WARNING, false);
+                break;
+
             case GUC_LOCAL:
                 (void) set_config_option(name, curvalue,
                                          curscontext, cursource,
@@ -8282,6 +8437,33 @@ reapply_stacked_values(struct config_generic *variable,
                                          GUC_ACTION_LOCAL, true,
                                          WARNING, false);
                 break;
+
+            case GUC_NONXACT_SET:
+                /* first, apply the masked value as NONXACT */
+                (void) set_config_option(name, stack->masked.val.stringval,
+                                         stack->masked_scontext, PGC_S_SESSION,
+                                         GUC_ACTION_NONXACT, true,
+                                         WARNING, false);
+                /* then apply the current value as SET */
+                (void) set_config_option(name, curvalue,
+                                         curscontext, cursource,
+                                         GUC_ACTION_SET, true,
+                                         WARNING, false);
+                break;
+
+            case GUC_NONXACT_LOCAL:
+                /* first, apply the masked value as NONXACT */
+                (void) set_config_option(name, stack->masked.val.stringval,
+                                         stack->masked_scontext, PGC_S_SESSION,
+                                         GUC_ACTION_NONXACT, true,
+                                         WARNING, false);
+                /* then apply the current value as LOCAL */
+                (void) set_config_option(name, curvalue,
+                                         curscontext, cursource,
+                                         GUC_ACTION_LOCAL, true,
+                                         WARNING, false);
+                break;
+
         }
 
         /* If we successfully made a stack entry, adjust its nest level */
@@ -10260,6 +10442,373 @@ GUCArrayReset(ArrayType *array)
     return newarray;
 }
 
+Size
+GucShmemSize(void)
+{
+    Size size;
+
+    size = sizeof(GucRemoteSetting);
+
+    return size;
+}
+
+void
+GucShmemInit(void)
+{
+    Size    size;
+    bool    found;
+
+    size = sizeof(GucRemoteSetting);
+    remote_setting = (GucRemoteSetting *)
+        ShmemInitStruct("GUC remote setting", size, &found);
+
+    if (!found)
+    {
+        MemSet(remote_setting, 0, size);
+        LWLockInitialize(&remote_setting->lock, LWLockNewTrancheId());
+    }
+
+    LWLockRegisterTranche(remote_setting->lock.tranche, "guc_remote");
+}
+
+/*
+ * set_backend_config: SQL callable function to set GUC variable of remote
+ * session.
+ */
+Datum
+set_backend_config(PG_FUNCTION_ARGS)
+{
+    int        pid   = PG_GETARG_INT32(0);
+    char   *name  = text_to_cstring(PG_GETARG_TEXT_P(1));
+    char   *value = text_to_cstring(PG_GETARG_TEXT_P(2));
+    TimestampTz    cancel_start;
+    PgBackendStatus *beentry;
+    int beid;
+    int rc;
+
+    if (strlen(name) >= NAMEDATALEN)
+        ereport(ERROR,
+                (errcode(ERRCODE_NAME_TOO_LONG),
+                 errmsg("name of GUC variable is too long")));
+    if (strlen(value) >= GUC_REMOTE_MAX_VALUE_LEN)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("value is too long"),
+                 errdetail("Maximum acceptable length of value is %d",
+                     GUC_REMOTE_MAX_VALUE_LEN - 1)));
+
+    /* find beentry for given pid */
+    beentry = NULL;
+    for (beid = 1;
+         (beentry = pgstat_fetch_stat_beentry(beid)) &&
+             beentry->st_procpid != pid ;
+         beid++);
+
+    /*
+     * This is also checked by SendProcSignal, but we do it here to emit an
+     * appropriate error message.
+     */
+    if (!beentry)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("process PID %d not found", pid)));
+
+    /* allow only client backends */
+    if (beentry->st_backendType != B_BACKEND)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("not a client backend")));
+    
+    /*
+     * Wait if someone is sending a request. We need to wait with timeout
+     * since the current user of the struct doesn't wake me up.
+     */
+    LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+    while (remote_setting->state != REMGUC_VACANT)
+    {
+        LWLockRelease(&remote_setting->lock);
+        rc = WaitLatch(&MyProc->procLatch,
+                       WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+                       200, PG_WAIT_ACTIVITY);
+
+        if (rc & WL_POSTMASTER_DEATH)
+            return (Datum) BoolGetDatum(false);
+
+        CHECK_FOR_INTERRUPTS();
+
+        LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+    }
+
+    /* my turn, send a request */
+    Assert(remote_setting->state == REMGUC_VACANT);
+
+    remote_setting->state = REMGUC_REQUEST;
+    remote_setting->sourcepid = MyProcPid;
+    remote_setting->targetpid = pid;
+    remote_setting->userid = GetUserId();
+
+    strncpy(remote_setting->name, name, NAMEDATALEN);
+    remote_setting->name[NAMEDATALEN - 1] = 0;
+    strncpy(remote_setting->value, value, GUC_REMOTE_MAX_VALUE_LEN);
+    remote_setting->value[GUC_REMOTE_MAX_VALUE_LEN - 1] = 0;
+    remote_setting->sender_latch = MyLatch;
+
+    LWLockRelease(&remote_setting->lock);
+
+    if (SendProcSignal(pid, PROCSIG_REMOTE_GUC, InvalidBackendId) < 0)
+    {
+        remote_setting->state = REMGUC_VACANT;
+        ereport(ERROR,
+                (errmsg("could not signal backend with PID %d: %m", pid)));
+    }
+
+    /*
+     * This request is processed only during the peer's idle time, so it may
+     * take a long time before we get a response.
+     */
+    LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+    while (remote_setting->state != REMGUC_DONE)
+    {
+        LWLockRelease(&remote_setting->lock);
+        rc = WaitLatch(&MyProc->procLatch,
+                       WL_LATCH_SET | WL_POSTMASTER_DEATH,
+                       -1, PG_WAIT_ACTIVITY);
+
+        /* we don't care about the state in this case */
+        if (rc & WL_POSTMASTER_DEATH)
+            return (Datum) BoolGetDatum(false);
+
+        LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+
+        /* get out if we got a query cancel request */
+        if (QueryCancelPending)
+            break;
+    }
+
+    /*
+     * Cancel the request if possible. We cannot cancel the request once the
+     * peer has processed it. We check the request status rather than
+     * QueryCancelPending so that that case is handled properly.
+     */
+    if (remote_setting->state == REMGUC_REQUEST)
+    {
+        Assert(QueryCancelPending);
+
+        remote_setting->state = REMGUC_CANCELING;
+        LWLockRelease(&remote_setting->lock);
+
+        if (SendProcSignal(pid,
+                           PROCSIG_REMOTE_GUC, InvalidBackendId) < 0)
+        {
+            remote_setting->state = REMGUC_VACANT;
+            ereport(ERROR,
+                    (errmsg("could not signal backend with PID %d: %m",
+                            pid)));
+        }
+
+        /* Peer must respond shortly, don't sleep for a long time. */
+        
+        cancel_start = GetCurrentTimestamp();
+
+        LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+        while (remote_setting->state != REMGUC_CANCELED &&
+               !TimestampDifferenceExceeds(cancel_start, GetCurrentTimestamp(),
+                                           GUC_REMOTE_CANCEL_TIMEOUT))
+        {
+            LWLockRelease(&remote_setting->lock);
+            rc = WaitLatch(&MyProc->procLatch,
+                           WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+                           GUC_REMOTE_CANCEL_TIMEOUT, PG_WAIT_ACTIVITY);
+
+            /* we don't care about the state in this case */
+            if (rc & WL_POSTMASTER_DEATH)
+                return (Datum) BoolGetDatum(false);
+
+            LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+        }
+
+        if (remote_setting->state != REMGUC_CANCELED)
+        {
+            remote_setting->state = REMGUC_VACANT;
+            ereport(ERROR, (errmsg("failed cancelling remote GUC request")));
+        }
+
+        remote_setting->state = REMGUC_VACANT;
+        LWLockRelease(&remote_setting->lock);
+
+        ereport(INFO,
+                (errmsg("remote GUC change request to PID %d is canceled",
+                              pid)));
+
+        return (Datum) BoolGetDatum(false);
+    }
+
+    Assert (remote_setting->state == REMGUC_DONE);
+
+    /* ereport() exits on query cancel; do this before that can happen */
+    remote_setting->state = REMGUC_VACANT;
+
+    if (QueryCancelPending)
+        ereport(INFO,
+                (errmsg("remote GUC change request to PID %d already completed",
+                        pid)));
+                
+    if (!remote_setting->success)
+        ereport(ERROR,
+                (errmsg("%s", remote_setting->value)));
+
+    LWLockRelease(&remote_setting->lock);
+
+    return (Datum) BoolGetDatum(true);
+}
+
+
+void
+HandleRemoteGucSetInterrupt(void)
+{
+    LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+
+    /* check if any request is being sent to me */
+    if (remote_setting->targetpid == MyProcPid)
+    {
+        switch (remote_setting->state)
+        {
+        case REMGUC_REQUEST:
+            InterruptPending = true;
+            RemoteGucChangePending = true;
+            break;
+        case REMGUC_CANCELING:
+            InterruptPending = true;
+            RemoteGucChangePending = true;
+            remote_setting->state = REMGUC_CANCELED;
+            SetLatch(remote_setting->sender_latch);
+            break;
+        default:
+            break;
+        }
+    }
+    LWLockRelease(&remote_setting->lock);
+}
+
+void
+HandleGucRemoteChanges(void)
+{
+    MemoryContext currentcxt = CurrentMemoryContext;
+    bool    canceling = false;
+    bool    process_request = true;
+    int        saveInterruptHoldoffCount = 0;
+    int        saveQueryCancelHoldoffCount = 0;
+
+    RemoteGucChangePending = false;
+    LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+
+    /* skip if this request is no longer for me */
+    if (remote_setting->targetpid != MyProcPid)
+        process_request = false;
+    else
+    {
+        switch (remote_setting->state)
+        {
+        case REMGUC_REQUEST:
+            remote_setting->state = REMGUC_INPROCESS;
+            break;
+        case REMGUC_CANCELING:
+            /*
+             * This request was already canceled, but we entered this function
+             * before receiving the signal. Cancel the request here.
+             */
+            remote_setting->state = REMGUC_CANCELED;
+            remote_setting->success = false;
+            canceling = true;
+            break;
+        case REMGUC_VACANT:
+        case REMGUC_CANCELED:
+        case REMGUC_INPROCESS:
+        case REMGUC_DONE:
+            /* Just ignore the cases */
+            process_request = false;
+            break;
+        }
+    }
+
+    LWLockRelease(&remote_setting->lock);
+
+    if (!process_request)
+        return;
+
+    if (canceling)
+    {
+        SetLatch(remote_setting->sender_latch);
+        return;
+    }
+
+
+    /* Okay, actually modify variable */
+    remote_setting->success = true;
+
+    PG_TRY();
+    {
+        bool     has_privilege;
+        bool     is_superuser;
+        bool end_transaction = false;
+        /*
+         * XXXX: ERROR resets the following variables, but we don't want that.
+         */
+        saveInterruptHoldoffCount = InterruptHoldoffCount;
+        saveQueryCancelHoldoffCount = QueryCancelHoldoffCount;
+
+        /* superuser_arg requires a transaction */
+        if (!IsTransactionState())
+        {
+            StartTransactionCommand();
+            end_transaction  = true;
+        }
+        is_superuser = superuser_arg(remote_setting->userid);
+        has_privilege = is_superuser ||
+            has_privs_of_role(remote_setting->userid, GetUserId());
+
+        if (end_transaction)
+            CommitTransactionCommand();
+
+        if (!has_privilege)
+            elog(ERROR, "role %u is not allowed to set GUC variables on the session with PID %d",
+                 remote_setting->userid, MyProcPid);
+        
+        (void) set_config_option(remote_setting->name, remote_setting->value,
+                                 is_superuser ? PGC_SUSET : PGC_USERSET,
+                                 PGC_S_SESSION, GUC_ACTION_NONXACT,
+                                 true, ERROR, false);
+    }
+    PG_CATCH();
+    {
+        ErrorData *errdata;
+        MemoryContextSwitchTo(currentcxt);
+        errdata = CopyErrorData();
+        remote_setting->success = false;
+        strncpy(remote_setting->value, errdata->message,
+                GUC_REMOTE_MAX_VALUE_LEN);
+        remote_setting->value[GUC_REMOTE_MAX_VALUE_LEN - 1] = 0;
+        FlushErrorState();
+
+        /* restore the saved value */
+        InterruptHoldoffCount = saveInterruptHoldoffCount ;
+        QueryCancelHoldoffCount = saveQueryCancelHoldoffCount;
+        
+    }
+    PG_END_TRY();
+
+    ereport(LOG,
+            (errmsg("GUC variable \"%s\" is changed to \"%s\" by request from another backend with PID %d",
+                    remote_setting->name, remote_setting->value,
+                    remote_setting->sourcepid)));
+
+    LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE);
+    remote_setting->state = REMGUC_DONE;
+    LWLockRelease(&remote_setting->lock);
+
+    SetLatch(remote_setting->sender_latch);
+}
+
 /*
  * Validate a proposed option setting for GUCArrayAdd/Delete/Reset.
  *
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 11fc1f3075..54d0c3917e 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5700,8 +5700,8 @@
   proargtypes => 'text bool', prosrc => 'show_config_by_name_missing_ok' },
 { oid => '2078', descr => 'SET X as a function',
   proname => 'set_config', proisstrict => 'f', provolatile => 'v',
-  proparallel => 'u', prorettype => 'text', proargtypes => 'text text bool',
-  prosrc => 'set_config_by_name' },
+  proparallel => 'u', prorettype => 'text',
+  proargtypes => 'text text bool bool', prosrc => 'set_config_by_name' },
 { oid => '2084', descr => 'SHOW ALL as a function',
   proname => 'pg_show_all_settings', prorows => '1000', proretset => 't',
   provolatile => 's', prorettype => 'record', proargtypes => '',
@@ -9678,6 +9678,12 @@
   proargmodes => '{i,o,o,o,o,o,o,o,o,o}',
   proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}',
   prosrc => 'pgstat_get_syscache_stats' },
+{ oid => '3424',
+  descr => 'set config of another backend',
+  proname => 'pg_set_backend_config', proisstrict => 'f',
+  proretset => 'f', provolatile => 'v', proparallel => 'u',
+  prorettype => 'bool', proargtypes => 'int4 text text',
+  prosrc => 'set_backend_config' },
 { oid => '3786', descr => 'set up a logical replication slot',
   proname => 'pg_create_logical_replication_slot', provolatile => 'v',
   proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ee9968f81a..70b926a8d1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -833,7 +833,8 @@ typedef enum
     WAIT_EVENT_REPLICATION_ORIGIN_DROP,
     WAIT_EVENT_REPLICATION_SLOT_DROP,
     WAIT_EVENT_SAFE_SNAPSHOT,
-    WAIT_EVENT_SYNC_REP
+    WAIT_EVENT_SYNC_REP,
+    WAIT_EVENT_REMOTE_GUC
 } WaitEventIPC;
 
 /* ----------
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 9f2f965d5c..040877f5eb 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -42,6 +42,9 @@ typedef enum
     PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
     PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
 
+    /* Remote GUC setting */
+    PROCSIG_REMOTE_GUC,
+
     NUM_PROCSIGNALS                /* Must be last! */
 } ProcSignalReason;
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index c07e7b945e..1e12773906 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -193,7 +193,8 @@ typedef enum
     /* Types of set_config_option actions */
     GUC_ACTION_SET,                /* regular SET command */
     GUC_ACTION_LOCAL,            /* SET LOCAL command */
-    GUC_ACTION_SAVE                /* function SET option, or temp assignment */
+    GUC_ACTION_SAVE,            /* function SET option, or temp assignment */
+    GUC_ACTION_NONXACT            /* non-transactional setting */
 } GucAction;
 
 #define GUC_QUALIFIER_SEPARATOR '.'
@@ -269,6 +270,8 @@ extern int    tcp_keepalives_idle;
 extern int    tcp_keepalives_interval;
 extern int    tcp_keepalives_count;
 
+extern volatile bool RemoteGucChangePending;
+
 #ifdef TRACE_SORT
 extern bool trace_sort;
 #endif
@@ -276,6 +279,11 @@ extern bool trace_sort;
 /*
  * Functions exported by guc.c
  */
+extern Size GucShmemSize(void);
+extern void GucShmemInit(void);
+extern Datum set_backend_config(PG_FUNCTION_ARGS);
+extern void HandleRemoteGucSetInterrupt(void);
+extern void HandleGucRemoteChanges(void);
 extern void SetConfigOption(const char *name, const char *value,
                 GucContext context, GucSource source);
 
@@ -395,6 +403,9 @@ extern Size EstimateGUCStateSpace(void);
 extern void SerializeGUCState(Size maxsize, char *start_address);
 extern void RestoreGUCState(void *gucstate);
 
+/* Remote GUC setting */
+extern void HandleGucRemoteChanges(void);
+
 /* Support for messages reported from GUC check hooks */
 
 extern PGDLLIMPORT char *GUC_check_errmsg_string;
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index a0970b2e1c..c00520e90c 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -115,7 +115,10 @@ typedef enum
     GUC_SAVE,                    /* entry caused by function SET option */
     GUC_SET,                    /* entry caused by plain SET command */
     GUC_LOCAL,                    /* entry caused by SET LOCAL command */
-    GUC_SET_LOCAL                /* entry caused by SET then SET LOCAL */
+    GUC_NONXACT,                /* entry caused by non-transactional ops */
+    GUC_SET_LOCAL,                /* entry caused by SET then SET LOCAL */
+    GUC_NONXACT_SET,            /* entry caused by NONXACT then SET */
+    GUC_NONXACT_LOCAL            /* entry caused by NONXACT then (SET)LOCAL */
 } GucStackState;
 
 typedef struct guc_stack
diff --git a/src/test/regress/expected/guc.out b/src/test/regress/expected/guc.out
index b0d7351145..2d19697a8c 100644
--- a/src/test/regress/expected/guc.out
+++ b/src/test/regress/expected/guc.out
@@ -476,6 +476,229 @@ SELECT '2006-08-13 12:34:56'::timestamptz;
  2006-08-13 12:34:56-07
 (1 row)
 
+-- NONXACT followed by SET, SET LOCAL through COMMIT
+BEGIN;
+SELECT set_config('work_mem', '128kB', false, true); -- NONXACT
+ set_config 
+------------
+ 128kB
+(1 row)
+
+SET work_mem to '256kB';
+SET LOCAL work_mem to '512kB';
+SHOW work_mem;    -- must see 512kB
+ work_mem 
+----------
+ 512kB
+(1 row)
+
+COMMIT;
+SHOW work_mem;    -- must see 256kB
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+-- NONXACT followed by SET, SET LOCAL through ROLLBACK
+BEGIN;
+SELECT set_config('work_mem', '128kB', false, true); -- NONXACT
+ set_config 
+------------
+ 128kB
+(1 row)
+
+SET work_mem to '256kB';
+SET LOCAL work_mem to '512kB';
+SHOW work_mem;    -- must see 512kB
+ work_mem 
+----------
+ 512kB
+(1 row)
+
+ROLLBACK;
+SHOW work_mem;    -- must see 128kB
+ work_mem 
+----------
+ 128kB
+(1 row)
+
+-- SET, SET LOCAL followed by NONXACT through COMMIT
+BEGIN;
+SET work_mem to '256kB';
+SET LOCAL work_mem to '512kB';
+SELECT set_config('work_mem', '128kB', false, true); -- NONXACT
+ set_config 
+------------
+ 128kB
+(1 row)
+
+SHOW work_mem;    -- must see 128kB
+ work_mem 
+----------
+ 128kB
+(1 row)
+
+COMMIT;
+SHOW work_mem;    -- must see 128kB
+ work_mem 
+----------
+ 128kB
+(1 row)
+
+-- SET, SET LOCAL followed by NONXACT through ROLLBACK
+BEGIN;
+SET work_mem to '256kB';
+SET LOCAL work_mem to '512kB';
+SELECT set_config('work_mem', '128kB', false, true); -- NONXACT
+ set_config 
+------------
+ 128kB
+(1 row)
+
+SHOW work_mem;    -- must see 128kB
+ work_mem 
+----------
+ 128kB
+(1 row)
+
+ROLLBACK;
+SHOW work_mem;    -- must see 128kB
+ work_mem 
+----------
+ 128kB
+(1 row)
+
+-- NONXACT and SAVEPOINT
+SET work_mem TO '64kB';
+BEGIN;
+SET work_mem TO '128kB';
+SAVEPOINT a;
+SELECT set_config('work_mem', '256kB', false, true); -- NONXACT
+ set_config 
+------------
+ 256kB
+(1 row)
+
+SHOW work_mem;
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+SET LOCAL work_mem TO '384kB';
+RELEASE SAVEPOINT a;
+SHOW work_mem; -- will see 384kB
+ work_mem 
+----------
+ 384kB
+(1 row)
+
+COMMIT;
+SHOW work_mem; -- will see 256kB
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+--
+SET work_mem TO '64kB';
+BEGIN;
+SET work_mem TO '128kB';
+SAVEPOINT a;
+SELECT set_config('work_mem', '256kB', false, true); -- NONXACT
+ set_config 
+------------
+ 256kB
+(1 row)
+
+SHOW work_mem;
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+SET LOCAL work_mem TO '384kB';
+ROLLBACK TO SAVEPOINT a;
+SHOW work_mem; -- will see 256kB
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+ROLLBACK;
+SHOW work_mem; -- will see 256kB
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+--
+SET work_mem TO '64kB';
+BEGIN;
+SET work_mem TO '128kB';
+SET LOCAL work_mem TO '384kB';
+SAVEPOINT a;
+SELECT set_config('work_mem', '256kB', false, true); -- NONXACT
+ set_config 
+------------
+ 256kB
+(1 row)
+
+SHOW work_mem;
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+SET LOCAL work_mem TO '384kB';
+RELEASE SAVEPOINT a;
+SHOW work_mem; -- will see 384kB
+ work_mem 
+----------
+ 384kB
+(1 row)
+
+ROLLBACK;
+SHOW work_mem; -- will see 256kB
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+--
+SET work_mem TO '64kB';
+BEGIN;
+SET work_mem TO '128kB';
+SET LOCAL work_mem TO '384kB';
+SAVEPOINT a;
+SELECT set_config('work_mem', '256kB', false, true); -- NONXACT
+ set_config 
+------------
+ 256kB
+(1 row)
+
+SHOW work_mem;
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+SET LOCAL work_mem TO '384kB';
+ROLLBACK TO SAVEPOINT a;
+SHOW work_mem; -- will see 256kB
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+COMMIT;
+SHOW work_mem; -- will see 256kB
+ work_mem 
+----------
+ 256kB
+(1 row)
+
+SET work_mem TO DEFAULT;
 --
 -- Test RESET.  We use datestyle because the reset value is forced by
 -- pg_regress, so it doesn't depend on the installation's configuration.
diff --git a/src/test/regress/sql/guc.sql b/src/test/regress/sql/guc.sql
index 3b854ac496..bbb91aaa98 100644
--- a/src/test/regress/sql/guc.sql
+++ b/src/test/regress/sql/guc.sql
@@ -133,6 +133,94 @@ SHOW vacuum_cost_delay;
 SHOW datestyle;
 SELECT '2006-08-13 12:34:56'::timestamptz;
 
+-- NONXACT followed by SET, SET LOCAL through COMMIT
+BEGIN;
+SELECT set_config('work_mem', '128kB', false, true); -- NONXACT
+SET work_mem to '256kB';
+SET LOCAL work_mem to '512kB';
+SHOW work_mem;    -- must see 512kB
+COMMIT;
+SHOW work_mem;    -- must see 256kB
+
+-- NONXACT followed by SET, SET LOCAL through ROLLBACK
+BEGIN;
+SELECT set_config('work_mem', '128kB', false, true); -- NONXACT
+SET work_mem to '256kB';
+SET LOCAL work_mem to '512kB';
+SHOW work_mem;    -- must see 512kB
+ROLLBACK;
+SHOW work_mem;    -- must see 128kB
+
+-- SET, SET LOCAL followed by NONXACT through COMMIT
+BEGIN;
+SET work_mem to '256kB';
+SET LOCAL work_mem to '512kB';
+SELECT set_config('work_mem', '128kB', false, true); -- NONXACT
+SHOW work_mem;    -- must see 128kB
+COMMIT;
+SHOW work_mem;    -- must see 128kB
+
+-- SET, SET LOCAL followed by NONXACT through ROLLBACK
+BEGIN;
+SET work_mem to '256kB';
+SET LOCAL work_mem to '512kB';
+SELECT set_config('work_mem', '128kB', false, true); -- NONXACT
+SHOW work_mem;    -- must see 128kB
+ROLLBACK;
+SHOW work_mem;    -- must see 128kB
+
+-- NONXACT and SAVEPOINT
+SET work_mem TO '64kB';
+BEGIN;
+SET work_mem TO '128kB';
+SAVEPOINT a;
+SELECT set_config('work_mem', '256kB', false, true); -- NONXACT
+SHOW work_mem;
+SET LOCAL work_mem TO '384kB';
+RELEASE SAVEPOINT a;
+SHOW work_mem; -- will see 384kB
+COMMIT;
+SHOW work_mem; -- will see 256kB
+--
+SET work_mem TO '64kB';
+BEGIN;
+SET work_mem TO '128kB';
+SAVEPOINT a;
+SELECT set_config('work_mem', '256kB', false, true); -- NONXACT
+SHOW work_mem;
+SET LOCAL work_mem TO '384kB';
+ROLLBACK TO SAVEPOINT a;
+SHOW work_mem; -- will see 256kB
+ROLLBACK;
+SHOW work_mem; -- will see 256kB
+--
+SET work_mem TO '64kB';
+BEGIN;
+SET work_mem TO '128kB';
+SET LOCAL work_mem TO '384kB';
+SAVEPOINT a;
+SELECT set_config('work_mem', '256kB', false, true); -- NONXACT
+SHOW work_mem;
+SET LOCAL work_mem TO '384kB';
+RELEASE SAVEPOINT a;
+SHOW work_mem; -- will see 384kB
+ROLLBACK;
+SHOW work_mem; -- will see 256kB
+--
+SET work_mem TO '64kB';
+BEGIN;
+SET work_mem TO '128kB';
+SET LOCAL work_mem TO '384kB';
+SAVEPOINT a;
+SELECT set_config('work_mem', '256kB', false, true); -- NONXACT
+SHOW work_mem;
+SET LOCAL work_mem TO '384kB';
+ROLLBACK TO SAVEPOINT a;
+SHOW work_mem; -- will see 256kB
+COMMIT;
+SHOW work_mem; -- will see 256kB
+
+SET work_mem TO DEFAULT;
 --
 -- Test RESET.  We use datestyle because the reset value is forced by
 -- pg_regress, so it doesn't depend on the installation's configuration.
-- 
2.16.3


Re: Protect syscache from bloating with negative cache entries

From
Robert Haas
Date:
On Thu, Jan 17, 2019 at 2:48 PM Bruce Momjian <bruce@momjian.us> wrote:
> Well, I think everyone agrees there are workloads that cause undesired
> cache bloat.  What we have not found is a solution that doesn't cause
> code complexity or undesired overhead, or one that >1% of users will
> know how to use.
>
> Unfortunately, because we have not found something we are happy with, we
> have done nothing.  I agree LRU can be expensive.  What if we do some
> kind of clock sweep and expiration like we do for shared buffers?  I
> think the trick is figuring how frequently to do the sweep.  What if we
> mark entries as unused every 10 queries, mark them as used on first use,
> and delete cache entries that have not be used in the past 10 queries.

I still think wall-clock time is a perfectly reasonable heuristic.
Say every 5 or 10 minutes you walk through the cache.  Anything that
hasn't been touched since the last scan you throw away.  If you do
this, you MIGHT flush an entry that you're just about to need again,
but (1) it's not very likely, because if it hasn't been touched in
many minutes, the chances that it's about to be needed again are low,
and (2) even if it does happen, it probably won't cost all that much,
because *occasionally* reloading a cache entry unnecessarily isn't
that costly; the big problem is when you do it over and over again,
which can easily happen with a fixed size limit on the cache, and (3)
if somebody does have a workload where they touch the same object
every 11 minutes, we can give them a GUC to control the timeout
between cache sweeps and it's really not that hard to understand how
to set it.  And most people won't need to.
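
A minimal sketch of that heuristic, assuming a hypothetical per-entry flag
(touched_since_last_sweep) and a sweep routine driven by a timer; neither
exists in the posted patches, this only shows the shape of the idea:

static void
SweepCatCache(CatCache *cp)
{
    int         i;

    for (i = 0; i < cp->cc_nbuckets; i++)
    {
        dlist_mutable_iter iter;

        dlist_foreach_modify(iter, &cp->cc_bucket[i])
        {
            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);

            /* a real version must also skip entries referenced from catclists */
            if (!ct->touched_since_last_sweep && ct->refcount == 0)
                CatCacheRemoveCTup(cp, ct);     /* unused for a whole interval */
            else
                ct->touched_since_last_sweep = false;   /* re-arm for next sweep */
        }
    }
}

Accesses would only set the flag, so the hot paths pay no timing cost; the
sweep is the only place that walks the cache.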

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Protect syscache from bloating with negative cache entries

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jan 17, 2019 at 2:48 PM Bruce Momjian <bruce@momjian.us> wrote:
>> Unfortunately, because we have not found something we are happy with, we
>> have done nothing.  I agree LRU can be expensive.  What if we do some
>> kind of clock sweep and expiration like we do for shared buffers?  I
>> think the trick is figuring how frequently to do the sweep.  What if we
>> mark entries as unused every 10 queries, mark them as used on first use,
>> and delete cache entries that have not be used in the past 10 queries.

> I still think wall-clock time is a perfectly reasonable heuristic.

The easy implementations of that involve putting gettimeofday() calls
into hot code paths, which would be a Bad Thing.  But maybe we could
do this only at transaction or statement start, and piggyback on the
gettimeofday() calls that already happen at those times.
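
That is close to what the attached patches end up doing: the timestamp
already taken at statement start is pushed into the cache module as an
approximate clock, and cache hits only copy that cached value.  Roughly
(SetCatCacheClock and catcacheclock are the names used in the patches;
the parallel-worker check is omitted here):

/* catcache.c: approximate "now", refreshed for free at statement start */
static TimestampTz catcacheclock = 0;

void
SetCatCacheClock(TimestampTz ts)
{
    catcacheclock = ts;
}

/* xact.c: piggyback on the timestamp we were going to take anyway */
void
SetCurrentStatementStartTimestamp(void)
{
    stmtStartTimestamp = GetCurrentTimestamp();
    SetCatCacheClock(stmtStartTimestamp);
}

A cache hit then stamps the entry with ct->lastaccess = catcacheclock,
which is a plain assignment rather than a clock read.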

            regards, tom lane


Re: Protect syscache from bloating with negative cache entries

From
"andres@anarazel.de"
Date:
On 2019-01-18 15:57:17 -0500, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Thu, Jan 17, 2019 at 2:48 PM Bruce Momjian <bruce@momjian.us> wrote:
> >> Unfortunately, because we have not found something we are happy with, we
> >> have done nothing.  I agree LRU can be expensive.  What if we do some
> >> kind of clock sweep and expiration like we do for shared buffers?  I
> >> think the trick is figuring how frequently to do the sweep.  What if we
> >> mark entries as unused every 10 queries, mark them as used on first use,
> >> and delete cache entries that have not be used in the past 10 queries.
> 
> > I still think wall-clock time is a perfectly reasonable heuristic.
> 
> The easy implementations of that involve putting gettimeofday() calls
> into hot code paths, which would be a Bad Thing.  But maybe we could
> do this only at transaction or statement start, and piggyback on the
> gettimeofday() calls that already happen at those times.

My proposal for this was to attach a 'generation' to cache entries. Upon
access cache entries are marked to be of the current
generation. Whenever existing memory isn't sufficient for further cache
entries and, on a less frequent schedule, triggered by a timer, the
cache generation is increased and the new generation's "creation time" is
measured.  Then generations that are older than a certain threshold are
purged, and if there are any, the entries of the purged generation are
removed from the caches using a sequential scan through the cache.

This outline achieves:
- no additional time measurements in hot code paths
- no need for a sequential scan of the entire cache when no generations
  are too old
- both size and time limits can be implemented reasonably cheaply
- overhead when feature disabled should be close to zero
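
As a rough sketch only (every name below is hypothetical; nothing here is
taken from a posted patch), the bookkeeping for such generations could look
like this:

#define CACHE_GEN_RING  8            /* remember the last few generations */

typedef struct CacheGenState
{
    uint32      current_gen;                 /* stamped into entries on access */
    TimestampTz created_at[CACHE_GEN_RING];  /* creation time per generation */
} CacheGenState;

/* called on memory pressure or from a timer; one clock read, not one per entry */
static void
BumpCacheGeneration(CacheGenState *st)
{
    st->current_gen++;
    st->created_at[st->current_gen % CACHE_GEN_RING] = GetCurrentTimestamp();
}

/*
 * An entry stamped with entry_gen becomes prunable once its generation was
 * created longer ago than max_age_msec.  (A real implementation also has to
 * handle generations that have fallen out of the ring; this sketch doesn't.)
 */
static bool
EntryGenerationExpired(CacheGenState *st, uint32 entry_gen, int max_age_msec)
{
    if (entry_gen == st->current_gen)
        return false;
    return TimestampDifferenceExceeds(st->created_at[entry_gen % CACHE_GEN_RING],
                                      GetCurrentTimestamp(), max_age_msec);
}

The sequential scan over a cache then only needs to run when the oldest live
generation actually satisfies EntryGenerationExpired().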

Greetings,

Andres Freund


Re: Protect syscache from bloating with negative cache entries

From
Robert Haas
Date:
On Fri, Jan 18, 2019 at 4:23 PM andres@anarazel.de <andres@anarazel.de> wrote:
> My proposal for this was to attach a 'generation' to cache entries. Upon
> access cache entries are marked to be of the current
> generation. Whenever existing memory isn't sufficient for further cache
> entries and, on a less frequent schedule, triggered by a timer, the
> cache generation is increased and th new generation's "creation time" is
> measured.  Then generations that are older than a certain threshold are
> purged, and if there are any, the entries of the purged generation are
> removed from the caches using a sequential scan through the cache.
>
> This outline achieves:
> - no additional time measurements in hot code paths
> - no need for a sequential scan of the entire cache when no generations
>   are too old
> - both size and time limits can be implemented reasonably cheaply
> - overhead when feature disabled should be close to zero

Seems generally reasonable.  The "whenever existing memory isn't
sufficient for further cache entries" part I'm not sure about.
Couldn't that trigger very frequently and prevent necessary cache size
growth?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Protect syscache from bloating with negative cache entries

From
"andres@anarazel.de"
Date:
Hi,

On 2019-01-18 19:57:03 -0500, Robert Haas wrote:
> On Fri, Jan 18, 2019 at 4:23 PM andres@anarazel.de <andres@anarazel.de> wrote:
> > My proposal for this was to attach a 'generation' to cache entries. Upon
> > access cache entries are marked to be of the current
> > generation. Whenever existing memory isn't sufficient for further cache
> > entries and, on a less frequent schedule, triggered by a timer, the
> > cache generation is increased and th new generation's "creation time" is
> > measured.  Then generations that are older than a certain threshold are
> > purged, and if there are any, the entries of the purged generation are
> > removed from the caches using a sequential scan through the cache.
> >
> > This outline achieves:
> > - no additional time measurements in hot code paths
> > - no need for a sequential scan of the entire cache when no generations
> >   are too old
> > - both size and time limits can be implemented reasonably cheaply
> > - overhead when feature disabled should be close to zero
> 
> Seems generally reasonable.  The "whenever existing memory isn't
> sufficient for further cache entries" part I'm not sure about.
> Couldn't that trigger very frequently and prevent necessary cache size
> growth?

I'm thinking it'd just trigger a new generation, with its associated
"creation" time (which is cheap to acquire in comparison to creating a
number of cache entries). Depending on settings or just code policy we
can decide up to which generation to prune the cache, using that
creation time.  I'd imagine that we'd have some default cache-pruning
time in the minutes, and for workloads where relevant one can make
sizing configurations more aggressive - or something like that.

Greetings,

Andres Freund


RE: Protect syscache from bloating with negative cache entries

From
"Tsunakawa, Takayuki"
Date:
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> 0003: Remote GUC setting
> 
> It is independent from the above two, and heavily arguable.
> 
> pg_set_backend_config(pid, name, value) changes the GUC <name> on the
> backend with <pid> to <value>.
> 

Not having looked at the code yet, why did you think this is necessary?  Can't we always collect the cache stats?  Is
it heavy due to some locking in the shared memory, or sending the stats to the stats collector?


Regards
Takayuki Tsunakawa



Re: Protect syscache from bloating with negative cache entries

From
Kyotaro HORIGUCHI
Date:
Hello.

At Fri, 18 Jan 2019 17:09:41 -0800, "andres@anarazel.de" <andres@anarazel.de> wrote in
<20190119010941.6ruftewah7t3k3yk@alap3.anarazel.de>
> Hi,
> 
> On 2019-01-18 19:57:03 -0500, Robert Haas wrote:
> > On Fri, Jan 18, 2019 at 4:23 PM andres@anarazel.de <andres@anarazel.de> wrote:
> > > My proposal for this was to attach a 'generation' to cache entries. Upon
> > > access cache entries are marked to be of the current
> > > generation. Whenever existing memory isn't sufficient for further cache
> > > entries and, on a less frequent schedule, triggered by a timer, the
> > > cache generation is increased and th new generation's "creation time" is
> > > measured.  Then generations that are older than a certain threshold are
> > > purged, and if there are any, the entries of the purged generation are
> > > removed from the caches using a sequential scan through the cache.
> > >
> > > This outline achieves:
> > > - no additional time measurements in hot code paths

The timestamp is taken at every transaction start and stored as a
TimestampTz in this patch, so no additional time measurement is
added; however, cache pruning won't happen while a transaction
lives for a long time. A time-driven generation value, maybe with
a fixed interval of 10s-1min, is a possible option.

> > > - no need for a sequential scan of the entire cache when no generations
> > >   are too old

This patch didn't precheck against the oldest generation, but it
can easily be calculated (based on the last-access time rather
than on the creation time). (The attachment applies over the
v7-0001-Remove-entries-..patch.)

Using the generation time, entries are purged even if they were
recently accessed. I think the last-access time is more suitable
for the purpose. On the other hand, using the last-access time,
the oldest generation can become stale due to later accesses.

> > > - both size and time limits can be implemented reasonably cheaply
> > > - overhead when feature disabled should be close to zero

The overhead when disabled is already nothing, since scanning is
inhibited when cache_prune_min_age is a negative value.

> > Seems generally reasonable.  The "whenever existing memory isn't
> > sufficient for further cache entries" part I'm not sure about.
> > Couldn't that trigger very frequently and prevent necessary cache size
> > growth?
> 
> I'm thinking it'd just trigger a new generation, with it's associated
> "creation" time (which is cheap to acquire in comparison to creating a
> number of cache entries) . Depending on settings or just code policy we
> can decide up to which generation to prune the cache, using that
> creation time.  I'd imagine that we'd have some default cache-pruning
> time in the minutes, and for workloads where relevant one can make
> sizing configurations more aggressive - or something like that.

The current patch uses the last-access time, obtained without
calling gettimeofday(). The number of generations is fixed at 3,
and infrequently-accessed entries are removed sooner. The
generation interval is determined by cache_prune_min_age.

Although this doesn't put a hard cap on memory usage, it is
indirectly and softly limited by cache_prune_min_age and
cache_memory_target, which determine how large a cache can grow
before pruning happens. They apply on a per-cache basis.

If we prefer to set a budget on all the syscaches (or even
including other caches), it would be more complex.
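
Condensed into a hypothetical helper (the individual checks are the ones the
attached patches make; the helper itself and the exact size comparison are
only schematic), the pruning decision is roughly:

static bool
catcache_entry_is_prunable(CatCTup *ct, size_t hash_size)
{
    long        entry_age;

    if (cache_prune_min_age < 0)
        return false;               /* pruning disabled */

    /* soft limit: don't prune while the cache is below the memory target */
    if (hash_size < (size_t) cache_memory_target * 1024)
        return false;

    /* only entries unused for longer than cache_prune_min_age (in seconds) */
    entry_age = (catcacheclock - ct->lastaccess) / USECS_PER_SEC;

    return entry_age > cache_prune_min_age && ct->refcount == 0;
}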

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 4a3b3094a0..8274704af7 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -859,6 +859,7 @@ InitCatCache(int id,
     for (i = 0; i < nkeys; ++i)
         cp->cc_keyno[i] = key[i];
     cp->cc_tupsize = 0;
+    cp->cc_noprune_until = 0;
 
     /*
      * new cache is initialized as far as we can go for now. print some
@@ -898,6 +899,7 @@ CatCacheCleanupOldEntries(CatCache *cp)
     int            i;
     int            nremoved = 0;
     size_t        hash_size;
+    TimestampTz oldest_lastaccess = 0;
 #ifdef CATCACHE_STATS
     /* These variables are only for debugging purpose */
     int            ntotal = 0;
@@ -918,6 +920,10 @@ CatCacheCleanupOldEntries(CatCache *cp)
     if (cache_prune_min_age < 0)
         return false;
 
+    /* Return immediately if apparently no entry to remove */
+    if (cp->cc_noprune_until == 0 || catcacheclock <= cp->cc_noprune_until)
+        return false;
+
     /*
      * Return without pruning if the size of the hash is below the target.
      */
@@ -939,6 +945,7 @@ CatCacheCleanupOldEntries(CatCache *cp)
             CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
             long entry_age;
             int us;
+            bool removed = false;
 
 
             /*
@@ -982,12 +989,24 @@ CatCacheCleanupOldEntries(CatCache *cp)
                     {
                         CatCacheRemoveCTup(cp, ct);
                         nremoved++;
+                        removed = true;
                     }
                 }
             }
+
+            /* Take the oldest lastaccess among surviving entries */
+            if (!removed &&
+                (oldest_lastaccess == 0 || ct->lastaccess < oldest_lastaccess))
+                oldest_lastaccess = ct->lastaccess;
         }
     }
 
+    /* Calculate the next pruning time if any entry remains */
+    if (oldest_lastaccess > 0)
+        oldest_lastaccess += cache_prune_min_age * USECS_PER_SEC;
+
+    cp->cc_noprune_until = oldest_lastaccess;
+
 #ifdef CATCACHE_STATS
     StaticAssertStmt(SYSCACHE_STATS_NAGECLASSES == 6,
                      "number of syscache age class must be 6");
@@ -1423,6 +1442,11 @@ SearchCatCacheInternal(CatCache *cache,
             ct->naccess++;
         ct->lastaccess = catcacheclock;
 
+        /* the first entry determines the next pruning time */
+        if (cache_prune_min_age >= 0 && cache->cc_noprune_until == 0)
+            cache->cc_noprune_until =
+                ct->lastaccess + cache_prune_min_age * USECS_PER_SEC;
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 4d51975920..1750919399 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -63,7 +63,8 @@ typedef struct catcache
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
     int            cc_tupsize;        /* total amount of catcache tuples */
-
+    TimestampTz    cc_noprune_until; /* Skip pruning until this time has passed;
+                                   * zero means no entry lives in this cache */
     /*
      * Statistics entries
      */

Re: Protect syscache from bloating with negative cache entries

From
Kyotaro HORIGUCHI
Date:
Thank you for pointing out the stupidity. (Tom did earlier, though.)

At Mon, 21 Jan 2019 07:12:41 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in
<0A3221C70F24FB45833433255569204D1FB6C78A@G01JPEXMBYT05>
> From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> > 0003: Remote GUC setting
> > 
> > It is independent from the above two, and heavily arguable.
> > 
> > pg_set_backend_config(pid, name, value) changes the GUC <name> on the
> > backend with <pid> to <value>.
> > 
> 
> Not having looked at the code yet, why did you think this is necessary?  Can't we always collect the cache stats?  Is
> it heavy due to some locking in the shared memory, or sending the stats to the stats collector?

Yeah, I had fun making it, but I don't think it can be called very
good. I must admit that it is a kind of overkill, or something
stupid.

Anyway, it needs to scan the whole hash to collect the numbers, and
I don't see how to eliminate that complexity without a penalty on
regular code paths for now. For that reason I don't want to do it
unconditionally.

An option is an additional PGPROC member and interface functions.

struct PGPROC
{
 ...
 int syscache_usage_track_interval; /* track interval, 0 to disable */

=# select syscache_usage_track_add(<pid>, <intvl>[, <repetition>]);
=# select syscache_usage_track_remove(2134);


Or, just provide a one-shot triggering function.

=# select syscache_take_usage_track(<pid>);

This can use either a similar PGPROC variable or SendProcSignal(),
but the former doesn't fire during idle time unless a timer is used.
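
For the one-shot variant, a minimal sketch could reuse the same machinery as
the 0003 patch, with a hypothetical PROCSIG_SYSCACHE_STATS signal and a
SyscacheStatsPending flag (neither exists yet; all names are placeholders):

/* SQL-callable one-shot trigger */
Datum
syscache_take_usage_track(PG_FUNCTION_ARGS)
{
    int         pid = PG_GETARG_INT32(0);

    if (SendProcSignal(pid, PROCSIG_SYSCACHE_STATS, InvalidBackendId) < 0)
        ereport(ERROR,
                (errmsg("could not signal backend with PID %d: %m", pid)));

    PG_RETURN_BOOL(true);
}

/* signal handler on the target: only set flags, do the work at a safe point */
void
HandleSyscacheStatsInterrupt(void)
{
    InterruptPending = true;
    SyscacheStatsPending = true;
}

As with the remote-GUC patch, the target backend would act on the flag only
at its next interrupt check, so a backend sitting idle still wouldn't report
until it wakes up, which is the limitation mentioned above.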


Any thoughts?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

From
Bruce Momjian
Date:
On Fri, Jan 18, 2019 at 05:09:41PM -0800, Andres Freund wrote:
> Hi,
> 
> On 2019-01-18 19:57:03 -0500, Robert Haas wrote:
> > On Fri, Jan 18, 2019 at 4:23 PM andres@anarazel.de <andres@anarazel.de> wrote:
> > > My proposal for this was to attach a 'generation' to cache entries. Upon
> > > access cache entries are marked to be of the current
> > > generation. Whenever existing memory isn't sufficient for further cache
> > > entries and, on a less frequent schedule, triggered by a timer, the
> > > cache generation is increased and th new generation's "creation time" is
> > > measured.  Then generations that are older than a certain threshold are
> > > purged, and if there are any, the entries of the purged generation are
> > > removed from the caches using a sequential scan through the cache.
> > >
> > > This outline achieves:
> > > - no additional time measurements in hot code paths
> > > - no need for a sequential scan of the entire cache when no generations
> > >   are too old
> > > - both size and time limits can be implemented reasonably cheaply
> > > - overhead when feature disabled should be close to zero
> > 
> > Seems generally reasonable.  The "whenever existing memory isn't
> > sufficient for further cache entries" part I'm not sure about.
> > Couldn't that trigger very frequently and prevent necessary cache size
> > growth?
> 
> I'm thinking it'd just trigger a new generation, with it's associated
> "creation" time (which is cheap to acquire in comparison to creating a
> number of cache entries) . Depending on settings or just code policy we
> can decide up to which generation to prune the cache, using that
> creation time.  I'd imagine that we'd have some default cache-pruning
> time in the minutes, and for workloads where relevant one can make
> sizing configurations more aggressive - or something like that.

OK, so it seems everyone likes the idea of a timer.  The open questions
are whether we want multiple epochs, and whether we want some kind of
size trigger.

With only one time epoch, if the timer is 10 minutes, you could expire an
entry after 10-19 minutes, while with a new epoch every minute and a
10-minute expiration you can get 10-11 minute precision.  I am not sure the
complexity is worth it.

For a size trigger, should removal be affected by how many expired cache
entries there are?  If there were 10k expired entries or 50, wouldn't
you want them removed if they have not been accessed in X minutes?

In the worst case, if 10k entries were accessed in a query and never
accessed again, what would the ideal cleanup behavior be?  Would it
matter if it was expired in 10 or 19 minutes?  Would it matter if there
were only 50 entries?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


RE: Protect syscache from bloating with negative cache entries

From
"Tsunakawa, Takayuki"
Date:
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> Although this doesn't put a hard cap on memory usage, it is indirectly and
> softly limited by the cache_prune_min_age and cache_memory_target, which
> determins how large a cache can grow until pruning happens. They are
> per-cache basis.
> 
> If we prefer to set a budget on all the syschaches (or even including other
> caches), it would be more complex.
> 

This is a pure question.  How can we answer these questions from users?

* What value can I set to cache_memory_target when I can use 10 GB for the caches and max_connections = 100?
* How much RAM do I need to have for the caches when I set cache_memory_target = 1M?

The user tends to estimate memory to avoid OOM.


Regards
Takayuki Tsunakawa





Re: Protect syscache from bloating with negative cache entries

From
Kyotaro HORIGUCHI
Date:
At Mon, 21 Jan 2019 17:22:55 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20190121.172255.226467552.horiguchi.kyotaro@lab.ntt.co.jp>
> An option is an additional PGPROC member and interface functions.
> 
> struct PGPROC
> {
>  ...
>  int syscache_usage_track_interval; /* track interval, 0 to disable */
> 
> =# select syscache_usage_track_add(<pid>, <intvl>[, <repetition>]);
> =# select syscache_usage_track_remove(2134);
> 
> 
> Or, just provide a one-shot triggering function.
> 
> =# select syscache_take_usage_track(<pid>);
> 
> This can use either a similar PGPROC variable or SendProcSignal(),
> but the former doesn't fire during idle time unless a timer is used.

The attached is a revised version of this patchset, where the third
patch is the remote setting feature. It uses static shared memory.

=# select pg_backend_catcache_stats(<pid>, <millis>);

This activates or changes the catcache stats feature on the backend with
the given PID. (The name should be changed to .._syscache_stats, though.)
It is far smaller than the remote-GUC feature. (It contains a
part that should be in the previous patch. I will fix it later.)


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 067f8ad60f259453271d2bf8323505beb5b9e0a9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 16 Oct 2018 13:04:30 +0900
Subject: [PATCH 1/3] Remove entries that haven't been used for a certain time

Catcache entries can be left unused for several reasons, and it is not
desirable that they eat up memory. With this patch, entries that haven't
been used for a certain time are considered for removal before the hash
array is enlarged.
---
 doc/src/sgml/config.sgml                      |  38 ++++++
 src/backend/access/transam/xact.c             |   5 +
 src/backend/utils/cache/catcache.c            | 166 ++++++++++++++++++++++++--
 src/backend/utils/misc/guc.c                  |  23 ++++
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/utils/catcache.h                  |  28 ++++-
 6 files changed, 254 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b6f5822b84..af3c52b868 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1662,6 +1662,44 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target">
+      <term><varname>syscache_memory_target</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_memory_target</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the amount of memory up to which a syscache can grow without
+        pruning. The value defaults to 0, indicating that pruning is
+        always considered. After exceeding this size, syscache pruning is
+        considered according to
+        <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep a
+        certain number of syscache entries that are used only intermittently,
+        try increasing this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age">
+      <term><varname>syscache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the minimum amount of unused time in seconds after which a
+        syscache entry is considered for removal. -1 indicates that syscache
+        pruning is disabled entirely. The value defaults to 600 seconds
+        (<literal>10 minutes</literal>). Syscache entries that have not been
+        used for this duration can be removed to prevent syscache bloat. This
+        behavior is suppressed until the size of syscache exceeds
+        <xref linkend="guc-syscache-memory-target"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 18467d96d2..dbffec8067 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -733,7 +733,12 @@ void
 SetCurrentStatementStartTimestamp(void)
 {
     if (!IsParallelWorker())
+    {
         stmtStartTimestamp = GetCurrentTimestamp();
+
+        /* Set this timestamp as the approximate current time */
+        SetCatCacheClock(stmtStartTimestamp);
+    }
     else
         Assert(stmtStartTimestamp != 0);
 }
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 8152f7e21e..ee40093553 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -72,9 +72,24 @@
 #define CACHE6_elog(a,b,c,d,e,f,g)
 #endif
 
+/*
+ * GUC variable to define the minimum size of hash to consider entry eviction.
+ * This variable is shared among various cache mechanisms.
+ */
+int cache_memory_target = 0;
+
+/* GUC variable to define the minimum age of entries that will be considered to
+ * be evicted in seconds. This variable is shared among various cache
+ * mechanisms.
+ */
+int cache_prune_min_age = 600;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Timestamp used for any operation on caches. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -491,6 +506,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
         CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys,
                          cache->cc_keyno, ct->keys);
 
+    cache->cc_tupsize -= ct->size;
     pfree(ct);
 
     --cache->cc_ntup;
@@ -842,6 +858,7 @@ InitCatCache(int id,
     cp->cc_nkeys = nkeys;
     for (i = 0; i < nkeys; ++i)
         cp->cc_keyno[i] = key[i];
+    cp->cc_tupsize = 0;
 
     /*
      * new cache is initialized as far as we can go for now. print some
@@ -859,9 +876,129 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize catcache reference clock if it hasn't been done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left alone for several reasons. We remove them if
+ * they are not accessed for a certain time to prevent catcache from
+ * bloating. The eviction is performed with the similar algorithm with buffer
+ * eviction using access counter. Entries that are accessed several times can
+ * live longer than those that have had no access in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int            i;
+    int            nremoved = 0;
+    size_t        hash_size;
+#ifdef CATCACHE_STATS
+    /* These variables are only for debugging purpose */
+    int            ntotal = 0;
+    /*
+     * The nth element of nentries stores the number of cache entries that
+     * have lived unaccessed for the corresponding multiple (ageclass) of
+     * cache_prune_min_age. The index of nremoved_entry is the value of the
+     * clock-sweep counter, which takes from 0 up to 2.
+     */
+    double        ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+    int            nentries[] = {0, 0, 0, 0, 0, 0};
+    int            nremoved_entry[3] = {0, 0, 0};
+    int            j;
+#endif
+
+    /* Return immediately if no pruning is wanted */
+    if (cache_prune_min_age < 0)
+        return false;
+
+    /*
+     * Return without pruning if the size of the hash is below the target.
+     */
+    hash_size = cp->cc_nbuckets * sizeof(dlist_head);
+    if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L)
+        return false;
+    
+    /* Search the whole hash for entries to remove */
+    for (i = 0; i < cp->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+
+            /*
+             * Calculate the duration from the time of the last access to the
+             * "current" time. Since catcacheclock is not advanced within a
+             * transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+#ifdef CATCACHE_STATS
+            /* count catcache entries for each age class */
+            ntotal++;
+            for (j = 0 ;
+                 ageclass[j] != 0.0 &&
+                     entry_age > cache_prune_min_age * ageclass[j] ;
+                 j++);
+            if (ageclass[j] == 0.0) j--;
+            nentries[j]++;
+#endif
+
+            /*
+             * Try to remove entries older than cache_prune_min_age seconds.
+             * Entries that have not been accessed since the last pruning are
+             * removed after that duration, and entries that have been accessed
+             * several times are removed only after being left alone for up to
+             * three times that duration. We don't try to shrink buckets since
+             * pruning effectively caps catcache expansion in the long term.
+             */
+            if (entry_age > cache_prune_min_age)
+            {
+#ifdef CATCACHE_STATS
+                Assert (ct->naccess >= 0 && ct->naccess <= 2);
+                nremoved_entry[ct->naccess]++;
+#endif
+                if (ct->naccess > 0)
+                    ct->naccess--;
+                else
+                {
+                    if (!ct->c_list || ct->c_list->refcount == 0)
+                    {
+                        CatCacheRemoveCTup(cp, ct);
+                        nremoved++;
+                    }
+                }
+            }
+        }
+    }
+
+#ifdef CATCACHE_STATS
+    ereport(DEBUG1,
+            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)",
+                     nremoved, ntotal,
+                     ageclass[0] * cache_prune_min_age, nentries[0],
+                     ageclass[1] * cache_prune_min_age, nentries[1],
+                     ageclass[2] * cache_prune_min_age, nentries[2],
+                     ageclass[3] * cache_prune_min_age, nentries[3],
+                     ageclass[4] * cache_prune_min_age, nentries[4],
+                     nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]),
+             errhidestmt(true)));
+#endif
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -1275,6 +1412,11 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /* Update access information for pruning */
+        if (ct->naccess < 2)
+            ct->naccess++;
+        ct->lastaccess = catcacheclock;
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1820,11 +1962,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     CatCTup    *ct;
     HeapTuple    dtp;
     MemoryContext oldcxt;
+    int            tupsize = 0;
 
     /* negative entries have no tuple associated */
     if (ntp)
     {
         int            i;
+        int            tupsize;
 
         Assert(!negative);
 
@@ -1843,13 +1987,14 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
         /* Allocate memory for CatCTup and the cached tuple in one go */
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
 
-        ct = (CatCTup *) palloc(sizeof(CatCTup) +
-                                MAXIMUM_ALIGNOF + dtp->t_len);
+        tupsize = sizeof(CatCTup) +    MAXIMUM_ALIGNOF + dtp->t_len;
+        ct = (CatCTup *) palloc(tupsize);
         ct->tuple.t_len = dtp->t_len;
         ct->tuple.t_self = dtp->t_self;
         ct->tuple.t_tableOid = dtp->t_tableOid;
         ct->tuple.t_data = (HeapTupleHeader)
             MAXALIGN(((char *) ct) + sizeof(CatCTup));
+        ct->size = tupsize;
         /* copy tuple contents */
         memcpy((char *) ct->tuple.t_data,
                (const char *) dtp->t_data,
@@ -1877,8 +2022,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     {
         Assert(negative);
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
-        ct = (CatCTup *) palloc(sizeof(CatCTup));
-
+        tupsize = sizeof(CatCTup);
+        ct = (CatCTup *) palloc(tupsize);
         /*
          * Store keys - they'll point into separately allocated memory if not
          * by-value.
@@ -1899,17 +2044,24 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
+    ct->size = tupsize;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
     cache->cc_ntup++;
     CacheHdr->ch_ntup++;
+    cache->cc_tupsize += tupsize;
 
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, try cleanup by removing
+     * infrequently used entries to make room for the new entry. If that
+     * fails, enlarge the bucket array instead.  Quite arbitrarily, we try
+     * this when fill factor > 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
 
     return ct;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c216ed0922..134c357bf3 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -80,6 +80,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/memutils.h"
@@ -2190,6 +2191,28 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"cache_memory_target", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum syscache size to keep."),
+            gettext_noop("Cache is not pruned before exceeding this size."),
+            GUC_UNIT_KB
+        },
+        &cache_memory_target,
+        0, 0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum unused duration of cache entries before removal."),
+            gettext_noop("Cache entries that live unused for longer than this many seconds are considered for removal."),
+            GUC_UNIT_S
+        },
+        &cache_prune_min_age,
+        600, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a21865a77f..d82af3bd6c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -128,6 +128,8 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#cache_memory_target = 0kB    # in kB
+#cache_prune_min_age = 600s    # -1 disables pruning
 #max_stack_depth = 2MB            # min 100kB
 #dynamic_shared_memory_type = posix    # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 65d816a583..5d24809900 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,7 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
+    int            cc_tupsize;        /* total amount of catcache tuples */
 
     /*
      * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,7 +121,9 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
-
+    int            naccess;        /* # of accesses to this entry, capped at 2 */
+    TimestampTz    lastaccess;        /* approx. timestamp of the last usage */
+    int            size;            /* palloc'ed size of this tuple */
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
      * catcache is list-searched with varying numbers of keys, we may have to
@@ -189,6 +193,28 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int cache_prune_min_age;
+extern int cache_memory_target;
+
+/* to use as access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/*
+ * SetCatCacheClock - set timestamp for catcache access record
+ */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
+static inline TimestampTz
+GetCatCacheClock(void)
+{
+    return catcacheclock;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.16.3
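
(For anyone reading the DEBUG1 output added by 0001: with the default
cache_prune_min_age of 600s, the age-class boundaries it reports work out
as below.  The snippet is only illustrative arithmetic, not backend code.)

#include <stdio.h>

int
main(void)
{
    const double ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0};
    const int    cache_prune_min_age = 600;     /* seconds, the default */

    for (int i = 0; i < 5; i++)
        printf("age class %d: unused for up to %.0f s\n",
               i, ageclass[i] * cache_prune_min_age);
    /* prints 30, 60, 600, 1200 and 1800 seconds; anything older falls
     * into the final catch-all class */
    return 0;
}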

From f595a8a03f4438c52303c7fae3d95492550106b5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 16 Oct 2018 15:48:28 +0900
Subject: [PATCH 2/3] Syscache usage tracking feature.

Collects syscache usage statistics and shows them in the view
pg_stat_syscache. The feature is controlled by the GUC variable
track_syscache_usage_interval.
---
 doc/src/sgml/config.sgml                      |  15 ++
 src/backend/catalog/system_views.sql          |  17 +++
 src/backend/postmaster/pgstat.c               | 201 ++++++++++++++++++++++++--
 src/backend/tcop/postgres.c                   |  23 +++
 src/backend/utils/adt/pgstatfuncs.c           | 134 +++++++++++++++++
 src/backend/utils/cache/catcache.c            | 115 +++++++++++----
 src/backend/utils/cache/syscache.c            |  24 +++
 src/backend/utils/init/globals.c              |   1 +
 src/backend/utils/init/postinit.c             |  11 ++
 src/backend/utils/misc/guc.c                  |  10 ++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/catalog/pg_proc.dat               |   9 ++
 src/include/miscadmin.h                       |   1 +
 src/include/pgstat.h                          |   6 +-
 src/include/utils/catcache.h                  |   9 +-
 src/include/utils/syscache.h                  |  19 +++
 src/include/utils/timeout.h                   |   1 +
 src/test/regress/expected/rules.out           |  24 ++-
 18 files changed, 576 insertions(+), 45 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index af3c52b868..6dd024340b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6662,6 +6662,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-syscache-usage-interval" xreflabel="track_syscache_usage_interval">
+      <term><varname>track_syscache_usage_interval</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>track_syscache_usage_interval</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the interval, in milliseconds, at which system cache usage
+        statistics are collected. This parameter is 0 by default, which
+        disables collection. Only superusers can change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-io-timing" xreflabel="track_io_timing">
       <term><varname>track_io_timing</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f4d9e9daf7..30e2da935a 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -904,6 +904,22 @@ CREATE VIEW pg_stat_progress_vacuum AS
     FROM pg_stat_get_progress_info('VACUUM') AS S
         LEFT JOIN pg_database D ON S.datid = D.oid;
 
+CREATE VIEW pg_stat_syscache AS
+    SELECT
+        S.pid                AS pid,
+        S.relid::regclass    AS relname,
+        S.indid::regclass    AS cache_name,
+        S.size                AS size,
+        S.ntup                AS ntuples,
+        S.searches            AS searches,
+        S.hits                AS hits,
+        S.neg_hits            AS neg_hits,
+        S.ageclass            AS ageclass,
+        S.last_update        AS last_update
+    FROM pg_stat_activity A
+    JOIN LATERAL (SELECT A.pid, * FROM pg_get_syscache_stats(A.pid)) S
+        ON (A.pid = S.pid);
+
 CREATE VIEW pg_user_mappings AS
     SELECT
         U.oid       AS umid,
@@ -1183,6 +1199,7 @@ GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_get_syscache_stats(int) TO pg_monitor;
 
 GRANT pg_read_all_settings TO pg_monitor;
 GRANT pg_read_all_stats TO pg_monitor;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 13da412c59..2e8b7d0d91 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -66,6 +66,7 @@
 #include "utils/ps_status.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 #include "utils/tqual.h"
 
@@ -125,6 +126,7 @@
 bool        pgstat_track_activities = false;
 bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
+int            pgstat_track_syscache_usage_interval = 0;
 int            pgstat_track_activity_query_size = 1024;
 
 /* ----------
@@ -237,6 +239,11 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
+/* bitmap symbols to specify target file types to remove */
+#define PGSTAT_REMFILE_DBSTAT    1        /* remove only database stats files */
+#define PGSTAT_REMFILE_SYSCACHE    2        /* remove only syscache stats files */
+#define PGSTAT_REMFILE_ALL        3        /* remove both type of files */
+
 /*
  * Info about current "snapshot" of stats file
  */
@@ -336,6 +343,7 @@ static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
 static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
 static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+static void pgstat_syscache_remove_statsfile(void);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
@@ -631,10 +639,13 @@ startup_failed:
 }
 
 /*
- * subroutine for pgstat_reset_all
+ * remove stats files
+ *
+ * clean up stats files in the specified directory. target is one of
+ * PGSTAT_REMFILE_DBSTAT/SYSCACHE/ALL and restricts which files to remove.
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+pgstat_reset_remove_files(const char *directory, int target)
 {
     DIR           *dir;
     struct dirent *entry;
@@ -645,25 +656,39 @@ pgstat_reset_remove_files(const char *directory)
     {
         int            nchars;
         Oid            tmp_oid;
+        int            filetype = 0;
 
         /*
          * Skip directory entries that don't match the file names we write.
          * See get_dbstat_filename for the database-specific pattern.
          */
         if (strncmp(entry->d_name, "global.", 7) == 0)
+        {
+            filetype = PGSTAT_REMFILE_DBSTAT;
             nchars = 7;
+        }
         else
         {
+            char head[2];
+            
             nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
+            (void) sscanf(entry->d_name, "%c%c_%u.%n",
+                          head, head + 1, &tmp_oid, &nchars);
+
             /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
+            if (nchars < 3 || !isdigit(entry->d_name[3]))
                 continue;
+
+            if  (strncmp(head, "db", 2) == 0)
+                filetype = PGSTAT_REMFILE_DBSTAT;
+            else if (strncmp(head, "cc", 2) == 0)
+                filetype = PGSTAT_REMFILE_SYSCACHE;
         }
 
+        /* skip if this is not a target */
+        if ((filetype & target) == 0)
+            continue;
+
         if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
             strcmp(entry->d_name + nchars, "stat") != 0)
             continue;
@@ -684,8 +709,9 @@ pgstat_reset_remove_files(const char *directory)
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_ALL);
+    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY,
+                              PGSTAT_REMFILE_ALL);
 }
 
 #ifdef EXEC_BACKEND
@@ -2962,6 +2988,10 @@ pgstat_beshutdown_hook(int code, Datum arg)
     if (OidIsValid(MyDatabaseId))
         pgstat_report_stat(true);
 
+    /* clear syscache statistics files and temporary settings */
+    if (MyBackendId != InvalidBackendId)
+        pgstat_syscache_remove_statsfile();
+
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
      * before and after.  We use a volatile pointer here to ensure the
@@ -4286,6 +4316,9 @@ PgstatCollectorMain(int argc, char *argv[])
     pgStatRunningInCollector = true;
     pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
 
+    /* Remove left-over syscache stats files */
+    pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_SYSCACHE);
+
     /*
      * Loop to process messages until we get SIGQUIT or detect ungraceful
      * death of our parent postmaster.
@@ -6376,3 +6409,153 @@ pgstat_clip_activity(const char *raw_activity)
 
     return activity;
 }
+
+/*
+ * return the filename for a syscache stat file; filename is the output
+ * buffer, of length len.
+ */
+void
+pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid,
+                                 char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/cc_%u.%s",
+                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
+                       pgstat_stat_directory,
+                       backendid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
+}
+
+/* removes syscache stats files of this backend */
+static void
+pgstat_syscache_remove_statsfile(void)
+{
+    char    fname[MAXPGPATH];
+
+    pgstat_get_syscachestat_filename(false, false, MyBackendId,
+                                     fname, MAXPGPATH);
+    unlink(fname);        /* don't care about the result */
+}
+
+/*
+ * pgstat_write_syscache_stats() -
+ *        Write the syscache statistics files.
+ *
+ * If 'force' is false, this function skips writing a file and returns the
+ * time remaining in the current interval in milliseconds. If 'force' is true,
+ * it writes a file regardless of the remaining time and resets the interval.
+ */
+long
+pgstat_write_syscache_stats(bool force)
+{
+    static TimestampTz last_report = 0;
+    TimestampTz now;
+    long elapsed;
+    long secs;
+    int     usecs;
+    int    cacheId;
+    FILE    *fpout;
+    char    statfile[MAXPGPATH];
+    char    tmpfile[MAXPGPATH];
+
+    /* Return if we don't want it */
+    if (!force && pgstat_track_syscache_usage_interval <= 0)
+    {
+        /* disabled. remove the statistics file if any */
+        if (last_report > 0)
+        {
+            last_report = 0;
+            pgstat_syscache_remove_statsfile();
+        }
+        return 0;
+    }
+    
+    /* Check against the interval */
+    now = GetCurrentTransactionStopTimestamp();
+    TimestampDifference(last_report, now, &secs, &usecs);
+    elapsed = secs * 1000 + usecs / 1000;
+
+    if (!force && elapsed < pgstat_track_syscache_usage_interval)
+    {
+        /* not yet the time, inform the remaining time to the caller */
+        return pgstat_track_syscache_usage_interval - elapsed;
+    }
+
+    /* now update the stats */
+    last_report = now;
+
+    pgstat_get_syscachestat_filename(false, true,
+                                     MyBackendId, tmpfile, MAXPGPATH);
+    pgstat_get_syscachestat_filename(false, false,
+                                     MyBackendId, statfile, MAXPGPATH);
+
+    /*
+     * This function can be called from ProcessInterrupts(). Hold interrupts
+     * to avoid recursive entry.
+     */
+    HOLD_INTERRUPTS();
+
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        /*
+         * Failure writing this file is not critical. Just skip this time and
+         * tell caller to wait for the next interval.
+         */
+        RESUME_INTERRUPTS();
+        return pgstat_track_syscache_usage_interval;
+    }
+
+    /* write out every catcache stats */
+    for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++)
+    {
+        SysCacheStats *stats;
+        
+        stats = SysCacheGetStats(cacheId);
+        Assert (stats);
+
+        /* write error is checked later using ferror() */
+        fputc('T', fpout);
+        (void)fwrite(&cacheId, sizeof(int), 1, fpout);
+        (void)fwrite(&last_report, sizeof(TimestampTz), 1, fpout);
+        (void)fwrite(stats, sizeof(*stats), 1, fpout);
+    }
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write syscache statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close syscache statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename syscache statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+
+    RESUME_INTERRUPTS();
+    return 0;
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 0c0891b33e..e7972e645f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3157,6 +3157,12 @@ ProcessInterrupts(void)
 
     }
 
+    if (IdleSyscacheStatsUpdateTimeoutPending)
+    {
+        IdleSyscacheStatsUpdateTimeoutPending = false;
+        pgstat_write_syscache_stats(true);
+    }
+
     if (ParallelMessagePending)
         HandleParallelMessages();
 }
@@ -3733,6 +3739,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_catcache_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4173,9 +4180,19 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long timeout;
+
                 ProcessCompletedNotifies();
                 pgstat_report_stat(false);
 
+                timeout = pgstat_write_syscache_stats(false);
+
+                if (timeout > 0)
+                {
+                    disable_idle_catcache_update_timeout = true;
+                    enable_timeout_after(IDLE_CATCACHE_UPDATE_TIMEOUT,
+                                         timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4218,6 +4235,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_catcache_update_timeout)
+        {
+            disable_timeout(IDLE_CATCACHE_UPDATE_TIMEOUT, false);
+            disable_idle_catcache_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 053bb73863..0d32bf8daa 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -14,6 +14,8 @@
  */
 #include "postgres.h"
 
+#include <sys/stat.h>
+
 #include "access/htup_details.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
@@ -28,6 +30,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/inet.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
@@ -1882,3 +1885,134 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     PG_RETURN_DATUM(HeapTupleGetDatum(
                                       heap_form_tuple(tupdesc, values, nulls)));
 }
+
+Datum
+pgstat_get_syscache_stats(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SYSCACHE_SIZE 9
+    int                    pid     = PG_GETARG_INT32(0);
+    ReturnSetInfo       *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+    TupleDesc            tupdesc;
+    Tuplestorestate    *tupstore;
+    MemoryContext        per_query_ctx;
+    MemoryContext        oldcontext;
+    PgBackendStatus       *beentry;
+    int                    beid;
+    char                fname[MAXPGPATH];
+    FILE                  *fpin;
+    char c;
+
+    if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("set-valued function called in context that cannot accept a set")));
+    if (!(rsinfo->allowedModes & SFRM_Materialize))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("materialize mode required, but it is not " \
+                        "allowed in this context")));
+
+    /* Build a tuple descriptor for our result type */
+    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+        elog(ERROR, "return type must be a row type");
+    
+
+    per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+
+    oldcontext = MemoryContextSwitchTo(per_query_ctx);
+    tupstore = tuplestore_begin_heap(true, false, work_mem);
+    rsinfo->returnMode = SFRM_Materialize;
+    rsinfo->setResult = tupstore;
+    rsinfo->setDesc = tupdesc;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    /* find beentry for given pid*/
+    beentry = NULL;
+    for (beid = 1;
+         (beentry = pgstat_fetch_stat_beentry(beid)) &&
+             beentry->st_procpid != pid ;
+         beid++);
+
+    /*
+     * we silently return empty result on failure or insufficient privileges
+     */
+    if (!beentry ||
+        (!has_privs_of_role(GetUserId(), beentry->st_userid) &&
+         !is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS)))
+        goto no_data;
+
+    pgstat_get_syscachestat_filename(false, false, beid, fname, MAXPGPATH);
+
+    if ((fpin = AllocateFile(fname, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(WARNING,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            fname)));
+        /* also return empty on no statistics file */
+        goto no_data;
+    }
+
+    /* read the statistics file into tuplestore */
+    while ((c = fgetc(fpin)) == 'T')
+    {
+        TimestampTz last_update;
+        SysCacheStats stats;
+        int cacheid;
+        Datum values[PG_GET_SYSCACHE_SIZE];
+        bool nulls[PG_GET_SYSCACHE_SIZE] = {0};
+        Datum datums[SYSCACHE_STATS_NAGECLASSES * 2];
+        bool arrnulls[SYSCACHE_STATS_NAGECLASSES * 2] = {0};
+        int    dims[] = {SYSCACHE_STATS_NAGECLASSES, 2};
+        int lbs[] = {1, 1};
+        ArrayType *arr;
+        int i, j;
+
+        fread(&cacheid, sizeof(int), 1, fpin);
+        fread(&last_update, sizeof(TimestampTz), 1, fpin);
+        if (fread(&stats, 1, sizeof(stats), fpin) != sizeof(stats))
+        {
+            ereport(WARNING, 
+                    (errmsg("corrupted syscache statistics file \"%s\"",
+                            fname)));
+            goto no_data;
+        }
+
+        i = 0;
+        values[i++] = ObjectIdGetDatum(stats.reloid);
+        values[i++] = ObjectIdGetDatum(stats.indoid);
+        values[i++] = Int64GetDatum(stats.size);
+        values[i++] = Int64GetDatum(stats.ntuples);
+        values[i++] = Int64GetDatum(stats.nsearches);
+        values[i++] = Int64GetDatum(stats.nhits);
+        values[i++] = Int64GetDatum(stats.nneg_hits);
+
+        for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++)
+        {
+            datums[j * 2] = Int32GetDatum((int32) stats.ageclasses[j]);
+            datums[j * 2 + 1] = Int32GetDatum((int32) stats.nclass_entries[j]);
+        }            
+
+        arr = construct_md_array(datums, arrnulls, 2, dims, lbs,
+                              INT4OID, sizeof(int32), true, 'i');
+        values[i++] = PointerGetDatum(arr);
+
+        values[i++] = TimestampTzGetDatum(last_update);
+
+        Assert (i == PG_GET_SYSCACHE_SIZE);
+
+        tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+    }
+
+    /* check for the end of file. abandon the result if file is broken */
+    if (c != 'E' || fgetc(fpin) != EOF)
+        tuplestore_clear(tupstore);
+
+    FreeFile(fpin);
+
+no_data:
+    tuplestore_donestoring(tupstore);
+    return (Datum) 0;
+}
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index ee40093553..4a3b3094a0 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -90,6 +90,10 @@ static CatCacheHeader *CacheHdr = NULL;
 /* Timestamp used for any operation on caches. */
 TimestampTz    catcacheclock = 0;
 
+/* age classes for pruning */
+static double ageclass[SYSCACHE_STATS_NAGECLASSES]
+    = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -620,9 +624,7 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue)
             else
                 CatCacheRemoveCTup(cache, ct);
             CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated");
-#ifdef CATCACHE_STATS
             cache->cc_invals++;
-#endif
             /* could be multiple matches, so keep looking! */
         }
     }
@@ -698,9 +700,7 @@ ResetCatalogCache(CatCache *cache)
             }
             else
                 CatCacheRemoveCTup(cache, ct);
-#ifdef CATCACHE_STATS
             cache->cc_invals++;
-#endif
         }
     }
 }
@@ -907,10 +907,11 @@ CatCacheCleanupOldEntries(CatCache *cp)
      * cache_prune_min_age. The index of nremoved_entry is the value of the
      * clock-sweep counter, which takes from 0 up to 2.
      */
-    double        ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
-    int            nentries[] = {0, 0, 0, 0, 0, 0};
+    int            nentries[SYSCACHE_STATS_NAGECLASSES] = {0, 0, 0, 0, 0, 0};
     int            nremoved_entry[3] = {0, 0, 0};
     int            j;
+
+    Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0);
 #endif
 
     /* Return immediately if no pruning is wanted */
@@ -924,7 +925,11 @@ CatCacheCleanupOldEntries(CatCache *cp)
     if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L)
         return false;
     
-    /* Search the whole hash for entries to remove */
+    /*
+     * Search the whole hash for entries to remove. This is quite a
+     * time-consuming task during a catcache lookup, but acceptable since we
+     * are about to expand the hash table anyway.
+     */
     for (i = 0; i < cp->cc_nbuckets; i++)
     {
         dlist_mutable_iter iter;
@@ -937,21 +942,21 @@ CatCacheCleanupOldEntries(CatCache *cp)
 
 
             /*
-             * Calculate the duration from the time of the last access to the
-             * "current" time. Since catcacheclock is not advanced within a
-             * transaction, the entries that are accessed within the current
-             * transaction won't be pruned.
+             * Calculate the duration from the time of the last access to
+             * the "current" time. Since catcacheclock is not advanced within
+             * a transaction, the entries that are accessed within the current
+             * transaction always get 0 as the result.
              */
             TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
 
 #ifdef CATCACHE_STATS
             /* count catcache entries for each age class */
             ntotal++;
-            for (j = 0 ;
-                 ageclass[j] != 0.0 &&
-                     entry_age > cache_prune_min_age * ageclass[j] ;
-                 j++);
-            if (ageclass[j] == 0.0) j--;
+
+            j = 0;
+            while (j < SYSCACHE_STATS_NAGECLASSES - 1 &&
+                   entry_age > cache_prune_min_age * ageclass[j])
+                j++;
             nentries[j]++;
 #endif
 
@@ -984,14 +989,17 @@ CatCacheCleanupOldEntries(CatCache *cp)
     }
 
 #ifdef CATCACHE_STATS
+    StaticAssertStmt(SYSCACHE_STATS_NAGECLASSES == 6,
+                     "number of syscache age class must be 6");
     ereport(DEBUG1,
-            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)",
+            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d, rest:%d) naccessed(0:%d, 1:%d, 2:%d)",
                      nremoved, ntotal,
                      ageclass[0] * cache_prune_min_age, nentries[0],
                      ageclass[1] * cache_prune_min_age, nentries[1],
                      ageclass[2] * cache_prune_min_age, nentries[2],
                      ageclass[3] * cache_prune_min_age, nentries[3],
                      ageclass[4] * cache_prune_min_age, nentries[4],
+                     nentries[5],
                      nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]),
              errhidestmt(true)));
 #endif
@@ -1368,9 +1376,7 @@ SearchCatCacheInternal(CatCache *cache,
     if (unlikely(cache->cc_tupdesc == NULL))
         CatalogCacheInitializeCache(cache);
 
-#ifdef CATCACHE_STATS
     cache->cc_searches++;
-#endif
 
     /* Initialize local parameter array */
     arguments[0] = v1;
@@ -1430,9 +1436,7 @@ SearchCatCacheInternal(CatCache *cache,
             CACHE3_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d",
                         cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
             cache->cc_hits++;
-#endif
 
             return &ct->tuple;
         }
@@ -1441,9 +1445,7 @@ SearchCatCacheInternal(CatCache *cache,
             CACHE3_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d",
                         cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
             cache->cc_neg_hits++;
-#endif
 
             return NULL;
         }
@@ -1571,9 +1573,7 @@ SearchCatCacheMiss(CatCache *cache,
     CACHE3_elog(DEBUG2, "SearchCatCache(%s): put in bucket %d",
                 cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
     cache->cc_newloads++;
-#endif
 
     return &ct->tuple;
 }
@@ -1684,9 +1684,7 @@ SearchCatCacheList(CatCache *cache,
 
     Assert(nkeys > 0 && nkeys < cache->cc_nkeys);
 
-#ifdef CATCACHE_STATS
     cache->cc_lsearches++;
-#endif
 
     /* Initialize local parameter array */
     arguments[0] = v1;
@@ -1743,9 +1741,7 @@ SearchCatCacheList(CatCache *cache,
         CACHE2_elog(DEBUG2, "SearchCatCacheList(%s): found list",
                     cache->cc_relname);
 
-#ifdef CATCACHE_STATS
         cache->cc_lhits++;
-#endif
 
         return cl;
     }
@@ -2253,3 +2249,64 @@ PrintCatCacheListLeakWarning(CatCList *list)
          list->my_cache->cc_relname, list->my_cache->id,
          list, list->refcount);
 }
+
+/*
+ * CatCacheGetStats - fill in SysCacheStats struct.
+ *
+ * This is a support routine for SysCacheGetStats, and fills in most of the
+ * result. The classification here is based on the same criteria as
+ * CatCacheCleanupOldEntries().
+ */
+void
+CatCacheGetStats(CatCache *cache, SysCacheStats *stats)
+{
+    int    i, j;
+
+    Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0);
+
+    /* fill in the stats struct */
+    stats->size = cache->cc_tupsize + cache->cc_nbuckets * sizeof(dlist_head);
+    stats->ntuples = cache->cc_ntup;
+    stats->nsearches = cache->cc_searches;
+    stats->nhits = cache->cc_hits;
+    stats->nneg_hits = cache->cc_neg_hits;
+
+    /* cache_prune_min_age can be changed within a session, so fill this every time */
+    for (i = 0 ; i < SYSCACHE_STATS_NAGECLASSES ; i++)
+        stats->ageclasses[i] = (int) (cache_prune_min_age * ageclass[i]);
+
+    /*
+     * The nth element of nclass_entries stores the number of cache entries
+     * that have lived unaccessed for the corresponding multiple (ageclass) of
+     * cache_prune_min_age.
+     */
+    memset(stats->nclass_entries, 0, sizeof(int) * SYSCACHE_STATS_NAGECLASSES);
+
+    /* Scan the whole hash */
+    for (i = 0; i < cache->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cache->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+            /*
+             * Calculate the duration from the time of the last access to
+             * the "current" time. Since catcacheclock is not advanced within
+             * a transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+            j = 0;
+            while (j < SYSCACHE_STATS_NAGECLASSES - 1 &&
+                   entry_age > stats->ageclasses[j])
+                j++;
+
+            stats->nclass_entries[j]++;
+        }
+    }
+}
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index ac98c19155..7b38a06708 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -20,6 +20,9 @@
  */
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "catalog/indexing.h"
@@ -1534,6 +1537,27 @@ RelationSupportsSysCache(Oid relid)
     return false;
 }
 
+/*
+ * SysCacheGetStats - returns stats of specified syscache
+ *
+ * This routine returns the address of its local static memory.
+ */
+SysCacheStats *
+SysCacheGetStats(int cacheId)
+{
+    static SysCacheStats stats;
+
+    Assert(cacheId >=0 && cacheId < SysCacheSize);
+
+    memset(&stats, 0, sizeof(stats));
+
+    stats.reloid = cacheinfo[cacheId].reloid;
+    stats.indoid = cacheinfo[cacheId].indoid;
+
+    CatCacheGetStats(SysCache[cacheId], &stats);
+
+    return &stats;
+}
 
 /*
  * OID comparator for pg_qsort
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index fd51934aaf..f039ecd805 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 7415c4faab..6b0fdbbd87 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleSyscacheStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -629,6 +630,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_CATCACHE_UPDATE_TIMEOUT,
+                        IdleSyscacheStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1240,6 +1243,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleSyscacheStatsUpdateTimeoutHandler(void)
+{
+    IdleSyscacheStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 134c357bf3..e8d7b6998a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3154,6 +3154,16 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"track_syscache_usage_interval", PGC_SUSET, STATS_COLLECTOR,
+            gettext_noop("Sets the interval between syscache usage collection, in milliseconds. Zero disables syscache usage tracking."),
+            NULL
+        },
+        &pgstat_track_syscache_usage_interval,
+        0, 0, INT_MAX / 2,
+        NULL, NULL, NULL
+    },
+
     {
         {"gin_pending_list_limit", PGC_USERSET, CLIENT_CONN_STATEMENT,
             gettext_noop("Sets the maximum size of the pending list for GIN index."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d82af3bd6c..4a6c9fceb5 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -554,6 +554,7 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
+#track_syscache_usage_interval = 0    # zero disables tracking
 #stats_temp_directory = 'pg_stat_tmp'
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 3ecc2e12c3..11fc1f3075 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -9669,6 +9669,15 @@
   proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
   proargnames =>
'{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}',
   prosrc => 'pg_get_replication_slots' },
+{ oid => '3425',
+  descr => 'syscache statistics',
+  proname => 'pg_get_syscache_stats', prorows => '100', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', prorettype => 'record',
+  proargtypes => 'int4',
+  proallargtypes => '{int4,oid,oid,int8,int8,int8,int8,int8,_int4,timestamptz}',
+  proargmodes => '{i,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}',
+  prosrc => 'pgstat_get_syscache_stats' },
 { oid => '3786', descr => 'set up a logical replication slot',
   proname => 'pg_create_logical_replication_slot', provolatile => 'v',
   proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c9e35003a5..69b9a976f0 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 313ca5f3c3..4d0f5b8042 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1134,6 +1134,7 @@ extern bool pgstat_track_activities;
 extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
+extern int    pgstat_track_syscache_usage_interval;
 extern char *pgstat_stat_directory;
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
@@ -1218,7 +1219,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
-
+extern void pgstat_get_syscachestat_filename(bool permanent,
+                    bool tempname, int backendid, char *filename, int len);
 /* ----------
  * pgstat_report_wait_start() -
  *
@@ -1353,5 +1355,5 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
-
+extern long pgstat_write_syscache_stats(bool force);
 #endif                            /* PGSTAT_H */
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 5d24809900..4d51975920 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -65,10 +65,8 @@ typedef struct catcache
     int            cc_tupsize;        /* total amount of catcache tuples */
 
     /*
-     * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
-     * doesn't break ABI for other modules
+     * Statistics entries
      */
-#ifdef CATCACHE_STATS
     long        cc_searches;    /* total # searches against this cache */
     long        cc_hits;        /* # of matches against existing entry */
     long        cc_neg_hits;    /* # of matches against negative entry */
@@ -81,7 +79,6 @@ typedef struct catcache
     long        cc_invals;        /* # of entries invalidated from cache */
     long        cc_lsearches;    /* total # list-searches */
     long        cc_lhits;        /* # of matches against existing lists */
-#endif
 } CatCache;
 
 
@@ -254,4 +251,8 @@ extern void PrepareToInvalidateCacheTuple(Relation relation,
 extern void PrintCatCacheLeakWarning(HeapTuple tuple);
 extern void PrintCatCacheListLeakWarning(CatCList *list);
 
+/* defined in syscache.h */
+typedef struct syscachestats SysCacheStats;
+extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats);
+
 #endif                            /* CATCACHE_H */
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index 95ee48954e..71b399c902 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -112,6 +112,24 @@ enum SysCacheIdentifier
 #define SysCacheSize (USERMAPPINGUSERSERVER + 1)
 };
 
+#define SYSCACHE_STATS_NAGECLASSES 6
+/* Struct for catcache tracking information */
+typedef struct syscachestats
+{
+    Oid        reloid;            /* target relation */
+    Oid        indoid;            /* index */
+    size_t    size;            /* size of the catcache */
+    int        ntuples;        /* number of tuples residing in the catcache */
+    int        nsearches;        /* number of searches */
+    int        nhits;            /* number of cache hits */
+    int        nneg_hits;        /* number of negative cache hits */
+    /* age classes in seconds */
+    int        ageclasses[SYSCACHE_STATS_NAGECLASSES];
+    /* number of tuples that fall into the corresponding age class */
+    int        nclass_entries[SYSCACHE_STATS_NAGECLASSES];
+} SysCacheStats;
+
+
 extern void InitCatalogCache(void);
 extern void InitCatalogCachePhase2(void);
 
@@ -164,6 +182,7 @@ extern void SysCacheInvalidate(int cacheId, uint32 hashValue);
 extern bool RelationInvalidatesSnapshotsOnly(Oid relid);
 extern bool RelationHasSysCache(Oid relid);
 extern bool RelationSupportsSysCache(Oid relid);
+extern SysCacheStats *SysCacheGetStats(int cacheId);
 
 /*
  * The use of the macros below rather than direct calls to the corresponding
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 9244a2a7b7..0ab441a364 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_CATCACHE_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e384cd2279..1991e75e97 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1919,6 +1919,28 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
     pg_stat_all_tables.autoanalyze_count
    FROM pg_stat_all_tables
   WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR
(pg_stat_all_tables.schemaname~ '^pg_toast'::text));
 
+pg_stat_syscache| SELECT s.pid,
+    (s.relid)::regclass AS relname,
+    (s.indid)::regclass AS cache_name,
+    s.size,
+    s.ntup AS ntuples,
+    s.searches,
+    s.hits,
+    s.neg_hits,
+    s.ageclass,
+    s.last_update
+   FROM (pg_stat_activity a
+     JOIN LATERAL ( SELECT a.pid,
+            pg_get_syscache_stats.relid,
+            pg_get_syscache_stats.indid,
+            pg_get_syscache_stats.size,
+            pg_get_syscache_stats.ntup,
+            pg_get_syscache_stats.searches,
+            pg_get_syscache_stats.hits,
+            pg_get_syscache_stats.neg_hits,
+            pg_get_syscache_stats.ageclass,
+            pg_get_syscache_stats.last_update
+           FROM pg_get_syscache_stats(a.pid) pg_get_syscache_stats(relid, indid, size, ntup, searches, hits, neg_hits,
ageclass,last_update)) s ON ((a.pid = s.pid)));
 
 pg_stat_user_functions| SELECT p.oid AS funcid,
     n.nspname AS schemaname,
     p.proname AS funcname,
@@ -2350,7 +2372,7 @@ pg_settings|pg_settings_n|CREATE RULE pg_settings_n AS
     ON UPDATE TO pg_catalog.pg_settings DO INSTEAD NOTHING;
 pg_settings|pg_settings_u|CREATE RULE pg_settings_u AS
     ON UPDATE TO pg_catalog.pg_settings
-   WHERE (new.name = old.name) DO  SELECT set_config(old.name, new.setting, false) AS set_config;
+   WHERE (new.name = old.name) DO  SELECT set_config(old.name, new.setting, false, false) AS set_config;
 rtest_emp|rtest_emp_del|CREATE RULE rtest_emp_del AS
     ON DELETE TO public.rtest_emp DO  INSERT INTO rtest_emplog (ename, who, action, newsal, oldsal)
   VALUES (old.ename, CURRENT_USER, 'fired'::bpchar, '$0.00'::money, old.salary);
-- 
2.16.3

From 36f93fb3625e8f1753070d30ec81548a4dfe9eb1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 23 Jan 2019 17:32:03 +0900
Subject: [PATCH 3/3] Remote setting feature for catcache statistics.

---
 src/backend/postmaster/pgstat.c    |  26 ++++++---
 src/backend/storage/ipc/ipci.c     |   3 +
 src/backend/utils/cache/catcache.c | 116 +++++++++++++++++++++++++++++++++++++
 src/include/catalog/pg_proc.dat    |   8 +++
 src/include/utils/catcache.h       |   6 ++
 5 files changed, 151 insertions(+), 8 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 2e8b7d0d91..338a407552 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -62,6 +62,7 @@
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
+#include "utils/catcache.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
 #include "utils/rel.h"
@@ -343,7 +344,7 @@ static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
 static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
 static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-static void pgstat_syscache_remove_statsfile(void);
+static void pgstat_remove_syscache_statsfile(void);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
@@ -2990,7 +2991,10 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /* clear syscache statistics files and temprary settings */
     if (MyBackendId != InvalidBackendId)
-        pgstat_syscache_remove_statsfile();
+    {
+        pgstat_remove_syscache_statsfile();
+        SetCatcacheStatsParam(0);
+    }
 
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
@@ -6432,7 +6436,7 @@ pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid,
 
 /* removes syscache stats files of this backend */
 static void
-pgstat_syscache_remove_statsfile(void)
+pgstat_remove_syscache_statsfile(void)
 {
     char    fname[MAXPGPATH];
 
@@ -6461,15 +6465,21 @@ pgstat_write_syscache_stats(bool force)
     FILE    *fpout;
     char    statfile[MAXPGPATH];
     char    tmpfile[MAXPGPATH];
+    int        interval = pgstat_track_syscache_usage_interval;
+    int        interval_by_remote = GetCatcacheStatsParam();
+
+    /* remote setting overrides if any */
+    if (interval_by_remote > 0)
+        interval = interval_by_remote;
 
     /* Return if we don't want it */
-    if (!force && pgstat_track_syscache_usage_interval <= 0)
+    if (!force && interval <= 0)
     {
         /* disabled. remove the statistics file if any */
         if (last_report > 0)
         {
             last_report = 0;
-            pgstat_syscache_remove_statsfile();
+            pgstat_remove_syscache_statsfile();
         }
         return 0;
     }
@@ -6479,10 +6489,10 @@ pgstat_write_syscache_stats(bool force)
     TimestampDifference(last_report, now, &secs, &usecs);
     elapsed = secs * 1000 + usecs / 1000;
 
-    if (!force && elapsed < pgstat_track_syscache_usage_interval)
+    if (!force && elapsed < interval)
     {
         /* not yet the time, inform the remaining time to the caller */
-        return pgstat_track_syscache_usage_interval - elapsed;
+        return interval - elapsed;
     }
 
     /* now update the stats */
@@ -6511,7 +6521,7 @@ pgstat_write_syscache_stats(bool force)
          * tell caller to wait for the next interval.
          */
         RESUME_INTERRUPTS();
-        return pgstat_track_syscache_usage_interval;
+        return interval;
     }
 
     /* write out every catcache stats */
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2849e47d99..be5ee1f4ff 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -44,6 +44,7 @@
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "storage/spin.h"
+#include "utils/catcache.h"
 #include "utils/snapmgr.h"
 
 
@@ -148,6 +149,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, CatcacheStatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -267,6 +269,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    CatcacheStatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 4a3b3094a0..7ff8cd22ca 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -22,14 +22,17 @@
 #include "access/tuptoaster.h"
 #include "access/valid.h"
 #include "access/xact.h"
+#include "catalog/pg_authid.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_operator.h"
 #include "catalog/pg_type.h"
 #include "miscadmin.h"
+#include "pgstat.h"
 #ifdef CATCACHE_STATS
 #include "storage/ipc.h"        /* for on_proc_exit */
 #endif
 #include "storage/lmgr.h"
+#include "storage/procarray.h"
 #include "utils/builtins.h"
 #include "utils/datum.h"
 #include "utils/fmgroids.h"
@@ -94,6 +97,17 @@ TimestampTz    catcacheclock = 0;
 static double ageclass[SYSCACHE_STATS_NAGECLASSES]
     = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
 
+/* remote commanding facility */
+typedef struct CatcacheStatsParam
+{
+    int    interval;
+} CatcacheStatsParam;
+
+#define NumCatcacheStatsParam (MaxBackends + NUM_AUXPROCTYPES)
+
+static slock_t CatcacheStatsParamLock;
+static CatcacheStatsParam *CatcacheStatsParamArray = NULL;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -2310,3 +2324,105 @@ CatCacheGetStats(CatCache *cache, SysCacheStats *stats)
         }
     }
 }
+
+/* Report shared-memory space needed */
+Size
+CatcacheStatsShmemSize(void)
+{
+    Size size;
+
+    /* The same number of elements with backend status array */
+    size = mul_size(sizeof(CatcacheStatsParam), NumCatcacheStatsParam);
+
+    return size;
+}
+
+/* Initialize the shared parameter array for catcache statistics */
+void
+CatcacheStatsShmemInit(void)
+{
+    Size size;
+    bool found;
+
+    size = CatcacheStatsShmemSize();
+    CatcacheStatsParamArray = (CatcacheStatsParam *)
+        ShmemInitStruct("Backend Catcache Statistics Parameter Array",
+                        size, &found);
+
+    if (!found)
+    {
+        /* We're the first, initialize it */
+        MemSet(CatcacheStatsParamArray, 0, size);
+    }
+
+    SpinLockInit(&CatcacheStatsParamLock);
+}
+
+/*
+ * SQL callable function to take catcache statistics of another backend
+ */
+Datum
+backend_catcache_stats(PG_FUNCTION_ARGS)
+{
+    int    target_pid = PG_GETARG_INT32(0);
+    int interval = PG_GETARG_INT32(1);
+    PGPROC *target_proc;
+
+    if (interval < 0)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("interval must not be a negative number")));
+
+    LWLockAcquire(ProcArrayLock, LW_SHARED);
+    target_proc = BackendPidGetProcWithLock(target_pid);
+
+    if (target_proc == NULL)
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("PID %d is not a PostgreSQL server process",
+                        target_pid)));
+
+    /* The same condition to pg_signal_backend() */
+    if ((superuser_arg(target_proc->roleId) && !superuser()) ||
+        (!has_privs_of_role(GetUserId(), target_proc->roleId) &&
+         !has_privs_of_role(GetUserId(), DEFAULT_ROLE_SIGNAL_BACKENDID)))
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("permission denied")));
+
+    if (target_proc->backendId == InvalidBackendId)
+        ereport(ERROR,
+                (errmsg("invalid backendid")));
+
+    SpinLockAcquire(&CatcacheStatsParamLock);
+    CatcacheStatsParamArray[target_proc->backendId - 1].interval = interval;
+    SpinLockRelease(&CatcacheStatsParamLock);
+    LWLockRelease(ProcArrayLock);
+
+    PG_RETURN_VOID();
+}
+
+/* returns catcache stats parameter of this backend */
+int
+GetCatcacheStatsParam(void)
+{
+    int interval;
+
+    Assert(MyBackendId != InvalidBackendId);
+
+    SpinLockAcquire(&CatcacheStatsParamLock);
+    interval = CatcacheStatsParamArray[MyBackendId - 1].interval;
+    SpinLockRelease(&CatcacheStatsParamLock);
+
+    return interval;
+}    
+
+/* sets catcache stats parameter of this backend */
+void
+SetCatcacheStatsParam(int interval)
+{
+    Assert(MyBackendId != InvalidBackendId);
+    SpinLockAcquire(&CatcacheStatsParamLock);
+    CatcacheStatsParamArray[MyBackendId - 1].interval = interval;
+    SpinLockRelease(&CatcacheStatsParamLock);
+}    
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 11fc1f3075..8011d94d5d 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10518,4 +10518,12 @@
   proargnames => '{rootrelid,relid,parentrelid,isleaf,level}',
   prosrc => 'pg_partition_tree' },
 
+# catcache statistics
+{ oid => '3424',
+  descr => 'take catcache statistics of another backend',
+  proname => 'pg_backend_catcache_stats', proisstrict => 'f',
+  proretset => 'f', provolatile => 'v', proparallel => 'u',
+  prorettype => 'void', proargtypes => 'int4 int4',
+  prosrc => 'backend_catcache_stats' },
+
 ]
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 4d51975920..69031f1a5e 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -24,6 +24,7 @@
 #include "access/skey.h"
 #include "datatype/timestamp.h"
 #include "lib/ilist.h"
+#include "utils/catcache.h"
 #include "utils/relcache.h"
 
 /*
@@ -212,6 +213,9 @@ GetCatCacheClock(void)
     return catcacheclock;
 }
 
+extern Size CatcacheStatsShmemSize(void);
+extern void CatcacheStatsShmemInit(void);
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
@@ -254,5 +258,7 @@ extern void PrintCatCacheListLeakWarning(CatCList *list);
 /* defined in syscache.h */
 typedef struct syscachestats SysCacheStats;
 extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats);
+extern void SetCatcacheStatsParam(int interval);
+extern int GetCatcacheStatsParam(void);
 
 #endif                            /* CATCACHE_H */
-- 
2.16.3


Re: Protect syscache from bloating with negative cache entries

От
Bruce Momjian
Дата:
On Wed, Jan 23, 2019 at 05:35:02PM +0900, Kyotaro HORIGUCHI wrote:
> At Mon, 21 Jan 2019 17:22:55 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
in <20190121.172255.226467552.horiguchi.kyotaro@lab.ntt.co.jp>
 
> > An option is an additional PGPROC member and interface functions.
> > 
> > struct PGPROC
> > {
> >  ...
> >  int syscahe_usage_track_interval; /* track interval, 0 to disable */
> > 
> > =# select syscahce_usage_track_add(<pid>, <intvl>[, <repetition>]);
> > =# select syscahce_usage_track_remove(2134);
> > 
> > 
> > Or, just provide an one-shot triggering function.
> > 
> > =# select syscahce_take_usage_track(<pid>);
> > 
> > This can use both a similar PGPROC variable or SendProcSignal()
> > but the former doesn't fire while idle time unless using timer.
> 
> The attached is revised version of this patchset, where the third
> patch is the remote setting feature. It uses static shared memory.
> 
> =# select pg_backend_catcache_stats(<pid>, <millis>);
> 
> Activates or changes catcache stats feature on the backend with
> PID. (The name should be changed to .._syscache_stats, though.)
> It is far smaller than the remote-GUC feature. (It contains a
> part that should be in the previous patch. I will fix it later.)

I have a few questions to make sure we have not made the API too
complex.  First, for syscache_prune_min_age, that is the minimum age
at which we prune, and entries could last twice that long.  Is there any
value to doing the scan at 50% of the age so that the
syscache_prune_min_age is the max age?  For example, if our age cutoff
is 10 minutes, we could scan every 5 minutes so 10 minutes would be the
maximum age kept.

Second, when would you use syscache_memory_target != 0?  If you had
syscache_prune_min_age really fast, e.g. 10 seconds?  What is the
use-case for this?  You have a query that touches 10k objects, and then
the connection stays active but doesn't touch many of those 10k objects,
and you want it cleaned up in seconds instead of minutes?  (I can't see
why you would not clean up all unreferenced objects after _minutes_ of
disuse, but removing them after seconds of disuse seems undesirable.)
What are the odds you would retain the entries you want with a fast
target?

What is the value of being able to change a specific backend's stat
interval?  I don't remember any other setting having this ability.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
Thank you for the comments.

At Wed, 23 Jan 2019 18:21:45 -0500, Bruce Momjian <bruce@momjian.us> wrote in <20190123232145.GA8334@momjian.us>
> On Wed, Jan 23, 2019 at 05:35:02PM +0900, Kyotaro HORIGUCHI wrote:
> > At Mon, 21 Jan 2019 17:22:55 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
in <20190121.172255.226467552.horiguchi.kyotaro@lab.ntt.co.jp>
 
> > > An option is an additional PGPROC member and interface functions.
> > > 
> > > struct PGPROC
> > > {
> > >  ...
> > >  int syscahe_usage_track_interval; /* track interval, 0 to disable */
> > > 
> > > =# select syscahce_usage_track_add(<pid>, <intvl>[, <repetition>]);
> > > =# select syscahce_usage_track_remove(2134);
> > > 
> > > 
> > > Or, just provide an one-shot triggering function.
> > > 
> > > =# select syscahce_take_usage_track(<pid>);
> > > 
> > > This can use both a similar PGPROC variable or SendProcSignal()
> > > but the former doesn't fire while idle time unless using timer.
> > 
> > The attached is revised version of this patchset, where the third
> > patch is the remote setting feature. It uses static shared memory.
> > 
> > =# select pg_backend_catcache_stats(<pid>, <millis>);
> > 
> > Activates or changes catcache stats feature on the backend with
> > PID. (The name should be changed to .._syscache_stats, though.)
> > It is far smaller than the remote-GUC feature. (It contains a
> > part that should be in the previous patch. I will fix it later.)
> 
> I have a few questions to make sure we have not made the API too
> complex.  First, for syscache_prune_min_age, that is the minimum age
> that we prune, and entries could last twice that long.  Is there any
> value to doing the scan at 50% of the age so that the
> syscache_prune_min_age is the max age?  For example, if our age cutoff
> is 10 minutes, we could scan every 5 minutes so 10 minutes would be the
> maximum age kept.

(Looking into the patch...) Actually thrice, not twice.  That is
because I put significance on access frequency: I think it is
reasonable that entries with more frequent access get a longer
life (within a certain limit). The original problem here was
negative cache entries that are created but never accessed. However,
there's no firm reason for the number of steps (3). There
might be no difference if the extra lifetime were capped at one
extra s_p_m_age, or even with no extra time at all.
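
To make that concrete, here is a small standalone sketch (not the patch
code; the struct and function names are invented) of a rule where each
remaining access credit buys one more pruning interval, so a previously
accessed entry can survive up to roughly three times the threshold:

/*
 * Toy illustration of the aging rule above.  Not the actual patch code;
 * the struct and function names are invented for this example.
 */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

typedef struct ToyEntry
{
    time_t  lastaccess;     /* time of the last lookup that hit this entry */
    int     naccess;        /* access credits, capped at 2 in this toy */
} ToyEntry;

/* Called on every cache hit. */
static void
toy_note_access(ToyEntry *e, time_t now)
{
    e->lastaccess = now;
    if (e->naccess < 2)
        e->naccess++;
}

/*
 * Called during a pruning pass.  Returns true if the entry should be
 * removed.  An entry that was accessed earlier spends one credit per pass
 * instead of being removed, so its total lifetime can reach roughly three
 * times prune_min_age ("thrice, not twice").
 */
static bool
toy_prunable(ToyEntry *e, time_t now, int prune_min_age)
{
    if (now - e->lastaccess < prune_min_age)
        return false;               /* touched recently enough */
    if (e->naccess > 0)
    {
        e->naccess--;               /* grant one more step of life */
        e->lastaccess = now;
        return false;
    }
    return true;                    /* old and unused: prune */
}

int
main(void)
{
    ToyEntry    e = {0, 0};
    time_t      t;

    toy_note_access(&e, 0);         /* accessed twice at t = 0 */
    toy_note_access(&e, 0);

    /* pruning passes every 300 s with prune_min_age = 300 s */
    for (t = 300; t <= 1200; t += 300)
        printf("t=%ld prunable=%d\n", (long) t, (int) toy_prunable(&e, t, 300));
    return 0;
}

In this toy run the entry accessed at t = 0 only becomes prunable at
t = 900, i.e. three times the 300-second threshold.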

> Second, when would you use syscache_memory_target != 0? 

It comes from a suggestion upthread: we sometimes want to keep some
known amount of cache entries even when expiration is active.

> If you had
> syscache_prune_min_age really fast, e.g. 10 seconds?  What is the
> use-case for this? You have a query that touches 10k objects, and then
> the connection stays active but doesn't touch many of those 10k objects,
> and you want it cleaned up in seconds instead of minutes?  (I can't see
> why you would not clean up all unreferenced objects after _minutes_ of
> disuse, but removing them after seconds of disuse seems undesirable.)
> What are the odds you would retain the entires you want with a fast
> target?

Are you asking the reason for the unit? It's just because the value
won't be very large even in seconds, at most 3600 seconds.  Even
though I don't think such a short-duration setting is meaningful
in the real world, I don't think we need to prohibit it
either. (Actually it is useful for testing :p) Another reason is
that GUC_UNIT_MIN doesn't seem very common; it is used only by
two variables, log_rotation_age and old_snapshot_threshold.

> What is the value of being able to change a specific backend's stat
> interval?  I don't remember any other setting having this ability.

As mentioned upthread, taking the statistics costs significant time,
so I believe no one is willing to keep it turned on at all
times.  That would make it mostly useless, because it could not be
turned on for an already-running backend at the moment it actually
gets bloated. So I wanted to provide a remote switching feature.

I also thought there are some other features that would be useful
if they could be turned on remotely, hence the remote-GUC idea, but
it was too complex...

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

От
Bruce Momjian
Дата:
On Thu, Jan 24, 2019 at 06:39:24PM +0900, Kyotaro HORIGUCHI wrote:
> > Second, when would you use syscache_memory_target != 0? 
> 
> It is a suggestion upthread, we sometimes want to keep some known
> amount of caches despite that expration should be activated.
> 
> > If you had
> > syscache_prune_min_age really fast, e.g. 10 seconds?  What is the
> > use-case for this? You have a query that touches 10k objects, and then
> > the connection stays active but doesn't touch many of those 10k objects,
> > and you want it cleaned up in seconds instead of minutes?  (I can't see
> > why you would not clean up all unreferenced objects after _minutes_ of
> > disuse, but removing them after seconds of disuse seems undesirable.)
> > What are the odds you would retain the entires you want with a fast
> > target?
> 
> Do you asking the reason for the unit? It's just because it won't
> be so large even in seconds, to the utmost 3600 seconds.  Even
> though I don't think such a short dutaion setting is meaningful
> in the real world, either I don't think we need to inhibit
> that. (Actually it is useful for testing:p) Another reason is

We have gone from ignoring the cache bloat problem to designing an API
whose value even we don't know, and if we don't know,
we can be sure our users will not know.  Every GUC has a cost, even if
it is not used.

I suggest you go with just syscache_prune_min_age, get that into PG 12,
and we can then reevaluate what we need.  If you want to hard-code a
minimum cache size where no pruning will happen, maybe based on the system
catalogs or typical load, that is fine.

> that GUC_UNIT_MIN doesn't seem so common that it is used only by
> two variables, log_rotation_age and old_snapshot_threshold.
> 
> > What is the value of being able to change a specific backend's stat
> > interval?  I don't remember any other setting having this ability.
> 
> As mentioned upthread, it takes significant time to take
> statistics so I believe no one is willing to turn it on at all
> times.  As the result it should be useless because it cannot be
> turned on on an active backend when it actually gets bloat. So I
> wanted to provide a remote switching feture.
> 
> I also thought that there's some other features that is useful if
> it could be turned on remotely so the remote GUC feature but it
> was too complex...

Well, I am thinking if we want to do something like this, we should do
it for all GUCs, not just for this one, so I suggest we not do this now
either.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


Re: Protect syscache from bloating with negative cache entries

От
Tom Lane
Дата:
Bruce Momjian <bruce@momjian.us> writes:
> On Thu, Jan 24, 2019 at 06:39:24PM +0900, Kyotaro HORIGUCHI wrote:
>> I also thought that there's some other features that is useful if
>> it could be turned on remotely so the remote GUC feature but it
>> was too complex...

> Well, I am thinking if we want to do something like this, we should do
> it for all GUCs, not just for this one, so I suggest we not do this now
> either.

I will argue hard that we should not do it at all, ever.

There is already a mechanism for broadcasting global GUC changes:
apply them to postgresql.conf (or use ALTER SYSTEM) and SIGHUP.
I do not think we need something that can remotely change a GUC's
value in just one session.  The potential for bugs, misuse, and
just plain confusion is enormous, and the advantage seems minimal.

            regards, tom lane


Re: Protect syscache from bloating with negative cache entries

От
Robert Haas
Дата:
On Thu, Jan 24, 2019 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > On Thu, Jan 24, 2019 at 06:39:24PM +0900, Kyotaro HORIGUCHI wrote:
> >> I also thought that there's some other features that is useful if
> >> it could be turned on remotely so the remote GUC feature but it
> >> was too complex...
>
> > Well, I am thinking if we want to do something like this, we should do
> > it for all GUCs, not just for this one, so I suggest we not do this now
> > either.
>
> I will argue hard that we should not do it at all, ever.
>
> There is already a mechanism for broadcasting global GUC changes:
> apply them to postgresql.conf (or use ALTER SYSTEM) and SIGHUP.
> I do not think we need something that can remotely change a GUC's
> value in just one session.  The potential for bugs, misuse, and
> just plain confusion is enormous, and the advantage seems minimal.

I think there might be some merit in being able to activate debugging
or tracing facilities for a particular session remotely, but designing
something that will do that sort of thing well seems like a very
complex problem that certainly should not be sandwiched into another
patch that is mostly about something else.  And if we ever get such a
thing I suspect it should be entirely separate from the GUC system.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


RE: Protect syscache from bloating with negative cache entries

От
"Tsunakawa, Takayuki"
Дата:
From: Robert Haas [mailto:robertmhaas@gmail.com]
> On Thu, Jan 24, 2019 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I will argue hard that we should not do it at all, ever.
> >
> > There is already a mechanism for broadcasting global GUC changes:
> > apply them to postgresql.conf (or use ALTER SYSTEM) and SIGHUP.
> > I do not think we need something that can remotely change a GUC's
> > value in just one session.  The potential for bugs, misuse, and
> > just plain confusion is enormous, and the advantage seems minimal.
> 
> I think there might be some merit in being able to activate debugging
> or tracing facilities for a particular session remotely, but designing
> something that will do that sort of thing well seems like a very
> complex problem that certainly should not be sandwiched into another
> patch that is mostly about something else.  And if we ever get such a
> thing I suspect it should be entirely separate from the GUC system.

+1 for a separate patch for remote session configuration.  ALTER SYSTEM + SIGHUP targeted at a particular backend would
do if the DBA can log into the database server (so it can't be used for DBaaS).  It would be useful to have
pg_reload_conf(pid).


Regards
Takayuki Tsunakawa



RE: Protect syscache from bloating with negative cache entries

От
"Tsunakawa, Takayuki"
Дата:
Hi Horiguchi-san, Bruce,

From: Bruce Momjian [mailto:bruce@momjian.us]
> I suggest you go with just syscache_prune_min_age, get that into PG 12,
> and we can then reevaluate what we need.  If you want to hard-code a
> minimum cache size where no pruning will happen, maybe based on the system
> catalogs or typical load, that is fine.

Please forgive me if I say something silly (I might have got lost.)

Are you suggesting to make the cache size limit system-defined and uncontrollable by the user?  I think it's necessary
for the DBA to be able to control the cache memory amount.  Otherwise, if many concurrent connections access many
partitions within a not-so-long duration, then the cache eviction can't catch up and ends up in OOM.  How about the
following questions I asked in my previous mail?
 

--------------------------------------------------
This is a pure question.  How can we answer these questions from users?

* What value can I set to cache_memory_target when I can use 10 GB for the caches and max_connections = 100?
* How much RAM do I need to have for the caches when I set cache_memory_target = 1M?

The user tends to estimate memory to avoid OOM.
--------------------------------------------------


Regards
Takayuki Tsunakawa






Re: Protect syscache from bloating with negative cache entries

От
'Bruce Momjian'
Дата:
On Fri, Jan 25, 2019 at 08:14:19AM +0000, Tsunakawa, Takayuki wrote:
> Hi Horiguchi-san, Bruce,
>
> From: Bruce Momjian [mailto:bruce@momjian.us]
> > I suggest you go with just syscache_prune_min_age, get that into
> > PG 12, and we can then reevaluate what we need.  If you want to
> > hard-code a minimum cache size where no pruning will happen, maybe
> > based on the system catalogs or typical load, that is fine.
>
> Please forgive me if I say something silly (I might have got lost.)
>
> Are you suggesting to make the cache size limit system-defined and
> uncontrollable by the user?  I think it's necessary for the DBA to
> be able to control the cache memory amount.  Otherwise, if many
> concurrent connections access many partitions within a not-so-long
> duration, then the cache eviction can't catch up and ends up in OOM.
> How about the following questions I asked in my previous mail?
>
> ----------------------------------------------------------------------
> This is a pure question.  How can we answer these questions from
> users?
>
> * What value can I set to cache_memory_target when I can use 10 GB for
> * the caches and max_connections = 100?  How much RAM do I need to
> * have for the caches when I set cache_memory_target = 1M?
>
> The user tends to estimate memory to avoid OOM.

Well, let's walk through this.  Suppose the default for
syscache_prune_min_age is 10 minutes, and that we prune all cache
entries unreferenced in the past 10 minutes, or we only prune every 10
minutes if the cache size is larger than some fixed size like 100.

So, when would you change syscache_prune_min_age?  If you reference many
objects and then don't reference them at all for minutes, you might want
to lower syscache_prune_min_age to maybe 1 minute.  Why would you want
to change the behavior of removing all unreferenced cache items, at
least when there are more than 100?  (You called this
syscache_memory_target.)

My point is I can see someone wanting to change syscache_prune_min_age,
but I can't see someone wanting to change syscache_memory_target.  Who
would want to keep 5k cache entries that have not been accessed in X
minutes?  If we had some global resource manager that would allow you to
control work_mem, maintenance_work_mem, cache size, and set global
limits on their sizes, I can see where maybe it might make sense, but
right now the memory usage of a backend is so fluid that setting some
limit on its size  for unreferenced entries just doesn't make sense.

One of my big points is that syscache_memory_target doesn't even
guarantee that the cache will be this size or lower, it only controls
whether the cleanup happens at syscache_prune_min_age intervals.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
At Fri, 25 Jan 2019 08:14:19 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in
<0A3221C70F24FB45833433255569204D1FB70EFB@G01JPEXMBYT05>
> Hi Horiguchi-san, Bruce,
> 
> From: Bruce Momjian [mailto:bruce@momjian.us]
> > I suggest you go with just syscache_prune_min_age, get that into PG 12,
> > and we can then reevaluate what we need.  If you want to hard-code a
> > minimum cache size where no pruning will happen, maybe based on the system
> > catalogs or typical load, that is fine.
> 
> Please forgive me if I say something silly (I might have got lost.)
> 
> Are you suggesting to make the cache size limit system-defined and uncontrollable by the user?  I think it's
necessary for the DBA to be able to control the cache memory amount.  Otherwise, if many concurrent connections access
many partitions within a not-so-long duration, then the cache eviction can't catch up and ends up in OOM.  How about the
following questions I asked in my previous mail?
 

cache_memory_target does the opposite of limiting memory usage. It
keeps some amount of syscache entries unpruned. It is intended for
sessions in which cache-effective queries run intermittently.
syscache_prune_min_age also doesn't directly limit the size. It
just eventually prevents unbounded memory consumption.

The knobs are not hard to understand at all and don't need tuning in most
cases.
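
To illustrate how the two knobs relate as described here, a simplified,
self-contained sketch (invented names and values, not the patch code)
could look like this:

/*
 * Simplified sketch of the two knobs described above.  All names and
 * numbers are invented for illustration.
 */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static const size_t toy_memory_target = 1024 * 1024;   /* bytes kept unpruned */
static const int    toy_prune_min_age = 600;            /* seconds */

/* The target acts as a floor: below it, no pruning pass runs at all. */
static bool
toy_should_prune(size_t cache_size)
{
    return cache_size > toy_memory_target;
}

/* Above the floor, only entries idle at least the minimum age are removed. */
static bool
toy_entry_expired(time_t lastaccess, time_t now)
{
    return now - lastaccess >= toy_prune_min_age;
}

int
main(void)
{
    printf("prune pass needed at 2 MB: %d\n", (int) toy_should_prune(2 * 1024 * 1024));
    printf("entry idle for 900 s expired: %d\n", (int) toy_entry_expired(0, 900));
    return 0;
}

So neither value caps memory directly; they only decide when pruning
runs and which entries qualify.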

> --------------------------------------------------
> This is a pure question.  How can we answer these questions from users?
> 
> * What value can I set to cache_memory_target when I can use 10 GB for the caches and max_connections = 100?
> * How much RAM do I need to have for the caches when I set cache_memory_target = 1M?
> 
> The user tends to estimate memory to avoid OOM.
> --------------------------------------------------

You don't have direct control over syscache memory usage. When
you find a query slowed by the default cache expiration, you can
set cache_memory_target to keep the entries it needs across
intermittent executions, or you can increase syscache_prune_min_age
to let cache entries live for a longer time.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
At Fri, 25 Jan 2019 07:26:46 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in
<0A3221C70F24FB45833433255569204D1FB70E6B@G01JPEXMBYT05>
> From: Robert Haas [mailto:robertmhaas@gmail.com]
> > On Thu, Jan 24, 2019 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > I will argue hard that we should not do it at all, ever.
> > >
> > > There is already a mechanism for broadcasting global GUC changes:
> > > apply them to postgresql.conf (or use ALTER SYSTEM) and SIGHUP.
> > > I do not think we need something that can remotely change a GUC's
> > > value in just one session.  The potential for bugs, misuse, and
> > > just plain confusion is enormous, and the advantage seems minimal.
> > 
> > I think there might be some merit in being able to activate debugging
> > or tracing facilities for a particular session remotely, but designing
> > something that will do that sort of thing well seems like a very
> > complex problem that certainly should not be sandwiched into another
> > patch that is mostly about something else.  And if we ever get such a
> > thing I suspect it should be entirely separate from the GUC system.

It means that we would have a lesser copy of the GUC system that can be
set remotely, and some features would explicitly register their own
knobs on the new system, with names that I suspect should be the same
as the related GUCs (for users' convenience).

> +1 for a separate patch for remote session configuration.

That sounds reasonable to me. As I said, there should be some such
variables.

> ALTER SYSTEM + SIGHUP targeted at a particular backend would do
> if the DBA can log into the database server (so, it can't be
> used for DBaaS.)  It would be useful to have
> pg_reload_conf(pid).

I don't think that is reasonable. ALTER SYSTEM alters a *system*
configuration, which is assumed to be the same for all sessions and
other processes. If another ALTER SYSTEM for some other variable,
followed by pg_reload_conf(), came after doing the above, every
session would start the syscache tracking. I think the change should
persist no longer than the session lifetime.

I think a consensus on backend-targeted remote tuning has been
reached here :)

A. Let GUC variables settable by a remote session.

  A-1. Variables are changed at a busy time (my first patch).
      (transaction-awareness of GUC makes this complex)

  A-2. Variables are changed when the session is idle (or outside
       a transaction).

B. Override some variables via values laid on shared memory. (my
   second or the last patch).

    Very specific to a target feature. I think it consumes a bit
    too much memory.

C. Provide a session-specific GUC variable (that overrides the global one)

   - Add a new configuration file "postgresql.conf.<PID>", and have
     pg_reload_conf() let the session with that PID load it as if
     it were the last included file. All such files are removed at
     startup or at the end of the corresponding session.

   - Add a new syntax like this:
     ALTER SESSION WITH (pid=xxxx)
        SET configuration_parameter {TO | =} {value | 'value' | DEFAULT}
        RESET configuration_parameter
        RESET ALL

   - Target variables are marked with GUC_REMOTE.

I'll consider the last choice and will come up with a patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



RE: Protect syscache from bloating with negative cache entries

От
"Ideriha, Takeshi"
Дата:
Hi

>> > I suggest you go with just syscache_prune_min_age, get that into PG
>> > 12, and we can then reevaluate what we need.  If you want to
>> > hard-code a minimum cache size where no pruning will happen, maybe
>> > based on the system catalogs or typical load, that is fine.
>>
>> Please forgive me if I say something silly (I might have got lost.)
>>
>> Are you suggesting to make the cache size limit system-defined and uncontrollable
>by the user?  I think it's necessary for the DBA to be able to control the cache memory
>amount.  Otherwise, if many concurrent connections access many partitions within a
>not-so-long duration, then the cache eviction can't catch up and ends up in OOM.
>How about the following questions I asked in my previous mail?
>
>cache_memory_target does the opposit of limiting memory usage. It keeps some
>amount of syscahe entries unpruned. It is intended for sessions on where
>cache-effective queries runs intermittently.
>syscache_prune_min_age also doesn't directly limit the size. It just eventually
>prevents infinite memory consumption.
>
>The knobs are not no-brainer at all and don't need tuning in most cases.
>
>> --------------------------------------------------
>> This is a pure question.  How can we answer these questions from users?
>>
>> * What value can I set to cache_memory_target when I can use 10 GB for the
>caches and max_connections = 100?
>> * How much RAM do I need to have for the caches when I set cache_memory_target
>= 1M?
>>
>> The user tends to estimate memory to avoid OOM.
>> --------------------------------------------------
>
>You don't have a direct control on syscache memory usage. When you find a queriy
>slowed by the default cache expiration, you can set cache_memory_taret to keep
>them for intermittent execution of a query, or you can increase
>syscache_prune_min_age to allow cache live for a longer time.
>


In the current v8 patch there is a stats view representing the age class distribution.
https://www.postgresql.org/message-id/20181019.173457.68080786.horiguchi.kyotaro%40lab.ntt.co.jp
Does it help the DBA with tuning cache_prune_age and/or cache_prune_target?
If the number of cache entries in the older age classes is large, are people supposed to lower prune_age and
not change cache_prune_target?
(I am a little bit confused.)

Regards,
Takeshi Ideriha



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
At Wed, 30 Jan 2019 05:06:30 +0000, "Ideriha, Takeshi" <ideriha.takeshi@jp.fujitsu.com> wrote in
<4E72940DA2BF16479384A86D54D0988A6F4156D4@G01JPEXMBKW04>
> >You don't have a direct control on syscache memory usage. When you find a queriy
> >slowed by the default cache expiration, you can set cache_memory_taret to keep
> >them for intermittent execution of a query, or you can increase
> >syscache_prune_min_age to allow cache live for a longer time.
> 
> In current ver8 patch there is a stats view representing age class distribution.
> https://www.postgresql.org/message-id/20181019.173457.68080786.horiguchi.kyotaro%40lab.ntt.co.jp
> Does it help DBA with tuning cache_prune_age and/or cache_prune_target?

Definitely. At the moment the DBA can see nothing at all about cache usage.

> If the amount of cache entries of older age class is large, are people supposed to lower prune_age and 
> not to change cache_prune_target?
> (I get confusion a little bit.)

This feature just removes cache entries that have not been accessed
for a certain time.

If older entries occupy the major portion, it means that the
syscache is being used effectively (in other words, most of the entries
are accessed frequently enough). In that case I believe the
syscache doesn't put pressure on memory usage. If the total
memory usage still exceeds expectations, reducing the pruning
age may reduce it, but not necessarily. An extremely short pruning
age will work, in exchange for performance degradation.

If newer entries occupy the major portion, it means that the
syscache may not be used effectively. The total amount of memory
usage will be limited by the pruning feature, so tuning won't be
needed.

In both cases, if pruning causes slowdown of intermittent large
queries, cache_memory_target will alleviate the slowdown.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

От
Michael Paquier
Дата:
On Mon, Jan 28, 2019 at 01:31:43PM +0900, Kyotaro HORIGUCHI wrote:
> I'll consider the last choice and will come up with a patch.

Update is recent, so I have just moved the patch to next CF.
--
Michael

Вложения

RE: Protect syscache from bloating with negative cache entries

От
"Tsunakawa, Takayuki"
Дата:
Horiguchi-san, Bruce,

Thank you for telling me your ideas behind this feature.  Frankly, I'm not convinced the proposed
specification is OK, but I can't explain it well at this instant.  So, let me discuss that in a subsequent mail.
 

Anyway, here are my review comments on 0001:


(1)
+/* GUC variable to define the minimum age of entries that will be cosidered to
+    /* initilize catcache reference clock if haven't done yet */

cosidered -> considered
initilize -> initialize

I remember seeing some other wrong spellings and/or missing words, but I forget where (sorry).


(2)
Only the doc prefixes "sys" to the new parameter names.  Other places don't have it.  I think we should prefix sys,
because relcache and plancache should be configurable separately because of their different usage patterns/lifecycle.
 


(3)
The doc doesn't describe the unit of syscache_memory_target.  Kilobytes?


(4)
+    hash_size = cp->cc_nbuckets * sizeof(dlist_head);
+        tupsize = sizeof(CatCTup) +    MAXIMUM_ALIGNOF + dtp->t_len;
+        tupsize = sizeof(CatCTup);

GetMemoryChunkSpace() should be used to include the memory context overhead.  That's what the files in
src/backend/utils/sort/ do.
 


(5)
+            if (entry_age > cache_prune_min_age)

">=" instead of ">"?


(6)
+                    if (!ct->c_list || ct->c_list->refcount == 0)
+                    {
+                        CatCacheRemoveCTup(cp, ct);

It's better to write "ct->c_list == NULL" to follow the style in this file.

"ct->refcount == 0" should also be checked prior to removing the catcache tuple, just in case the tuple hasn't been
released for a long time, though that should rarely happen.
 


(7)
CatalogCacheCreateEntry

+    int            tupsize = 0;
     if (ntp)
     {
         int            i;
+        int            tupsize;

tupsize is defined twice.



(8)
CatalogCacheCreateEntry

In the negative entry case, the memory allocated by CatCacheCopyKeys() is not counted.  I'm afraid that's not
negligible.


(9)
The memory for CatCList is not taken into account for syscache_memory_target.


Regards
Takayuki Tsunakawa




RE: Protect syscache from bloating with negative cache entries

От
"Tsunakawa, Takayuki"
Дата:
Horiguchi-san, Bruce, all,

I hesitate to say this, but I think there are the following problems with the proposed approach:

1) Tries to prune the catalog tuples only when the hash table is about to expand.
If no tuple is found to be eligible for eviction at first and the hash table expands, it gets difficult for unnecessary
or less frequently accessed tuples to be removed, because it takes longer and longer until the next hash table
expansion. The hash table doubles in size each time.

For example, if many transactions are executed in a short duration that create and drop temporary tables and indexes,
the hash table could become large quickly.
 

2) syscache_prune_min_age is difficult to set to meet contradictory requirements.
e.g., in the above temporary objects case, the user wants to shorten syscache_prune_min_age so that the catalog tuples
for temporary objects are removed.  But that also is likely to result in the necessary catalog tuples for non-temporary
objects being removed.
 

3) The DBA cannot control the memory usage.  It's not predictable.
syscache_memory_target doesn't set the limit on memory usage despite the impression from its name.  In general, the
cache should be able to set the upper limit on its size so that the DBA can manage things within a given amount of
memory. I think other PostgreSQL parameters are based on that idea -- shared_buffers, wal_buffers, work_mem,
temp_buffers, etc.
 

4) The memory usage doesn't decrease once allocated.
The normal allocation memory context, aset.c, which CacheMemoryContext uses, doesn't return pfree()d memory to the
operating system.  Once CacheMemoryContext becomes big, it won't get smaller.
 

5) Catcaches are managed independently of each other.
Even if there are many unnecessary catalog tuples in one catcache, they are not freed to make room for other
catcaches.


So, why don't we make syscache_memory_target the upper limit on the total size of all catcaches, and rethink the past
LRU management?
 


Regards
Takayuki Tsunakawa




Re: Protect syscache from bloating with negative cache entries

От
"bruce@momjian.us"
Дата:
On Mon, Feb 4, 2019 at 08:23:39AM +0000, Tsunakawa, Takayuki wrote:
> Horiguchi-san, Bruce, all, So, why don't we make
> syscache_memory_target the upper limit on the total size of all
> catcaches, and rethink the past LRU management?

I was going to say that our experience with LRU has been that the
overhead is not worth the value, but that was in shared resource cases,
which this is not.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


RE: Protect syscache from bloating with negative cache entries

От
"Tsunakawa, Takayuki"
Дата:
From: bruce@momjian.us [mailto:bruce@momjian.us]
> On Mon, Feb 4, 2019 at 08:23:39AM +0000, Tsunakawa, Takayuki wrote:
> > Horiguchi-san, Bruce, all, So, why don't we make
> > syscache_memory_target the upper limit on the total size of all
> > catcaches, and rethink the past LRU management?
> 
> I was going to say that our experience with LRU has been that the
> overhead is not worth the value, but that was in shared resource cases,
> which this is not.

That's good news!  Then, let's proceed with the approach involving LRU, Horiguchi-san, Ideriha-san.


Regards
Takayuki Tsunakawa





RE: Protect syscache from bloating with negative cache entries

От
"Ideriha, Takeshi"
Дата:
>From: bruce@momjian.us [mailto:bruce@momjian.us]
>On Mon, Feb 4, 2019 at 08:23:39AM +0000, Tsunakawa, Takayuki wrote:
>> Horiguchi-san, Bruce, all, So, why don't we make
>> syscache_memory_target the upper limit on the total size of all
>> catcaches, and rethink the past LRU management?
>
>I was going to say that our experience with LRU has been that the overhead is not
>worth the value, but that was in shared resource cases, which this is not.

One idea is to build a list with an access counter, implementing an LRU list on top of the current patch.
The list is ordered by last access time. When a catcache entry is referenced, the list is maintained,
which is just a few pointer manipulations.
As Bruce mentioned, it's not shared, so there is no cost related to lock contention.

When it comes to pruning, entries older than a certain timestamp with a zero access counter are pruned.
This would improve performance because it only scans a limited range (bounded by sys_cache_min_age).
The current patch scans all hash entries and checks each timestamp, which would decrease performance as the cache size
grows.
I'm hoping to implement this idea and measure the performance.

And when we want to set a memory size limit, as Tsunakawa-san said, the LRU list would be suitable.
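
A minimal sketch of that idea in terms of PostgreSQL's dlist API from
lib/ilist.h might look like the following (the entry struct, list name,
and functions are invented for illustration; only the dlist and
timestamp calls are the existing API):

/*
 * Sketch of an LRU list for catcache entries on top of PostgreSQL's dlist
 * API.  ToyCacheEntry, toy_lru, toy_touch and toy_prune are invented for
 * illustration; only the dlist and timestamp calls are the existing API.
 */
#include "postgres.h"

#include "lib/ilist.h"
#include "utils/timestamp.h"

typedef struct ToyCacheEntry
{
    dlist_node  lru_node;       /* position in the LRU list */
    TimestampTz lastaccess;     /* time of the last hit */
    /* ... the cached tuple would live here ... */
} ToyCacheEntry;

static dlist_head toy_lru = DLIST_STATIC_INIT(toy_lru);

/* On every cache hit: O(1) pointer manipulation, no list search. */
static void
toy_touch(ToyCacheEntry *entry)
{
    entry->lastaccess = GetCurrentTimestamp();
    dlist_move_head(&toy_lru, &entry->lru_node);
}

/*
 * Pruning: walk from the cold (tail) end and stop at the first entry that
 * is still young enough, so only the stale tail of the list is scanned.
 */
static void
toy_prune(int min_age_msec)
{
    TimestampTz now = GetCurrentTimestamp();

    while (!dlist_is_empty(&toy_lru))
    {
        ToyCacheEntry *victim =
            dlist_container(ToyCacheEntry, lru_node, dlist_tail_node(&toy_lru));

        if (!TimestampDifferenceExceeds(victim->lastaccess, now, min_age_msec))
            break;              /* everything closer to the head is newer */

        dlist_delete(&victim->lru_node);
        /* ... remove the entry from its hash bucket and free it here ... */
    }
}

New entries would be pushed onto the head when created, and invalidation
would dlist_delete them along with the hash-bucket removal.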

Regards,
Takeshi Ideriha



Re: Protect syscache from bloating with negative cache entries

От
Tomas Vondra
Дата:
Hi,

I find it a bit surprising there are almost no results demonstrating the
impact of the proposed changes on some typical workloads. It touches
code (syscache, ...) that is quite sensitive performance-wise, and
adding even just a little bit of overhead may hurt significantly. Even
on systems that don't have issues with cache bloat, etc.

I think this is something we need - benchmarks measuring the overhead on
a bunch of workloads (both typical and corner cases). Especially when
there was a limit on cache size in the past, and it was removed because
it was too expensive / hurting in some cases. I can't imagine committing
any such changes without this information.

This is particularly important as the patch was about one particular
issue (bloat due to negative entries) initially, but then the scope grew
quite a bit. AFAICS the thread now talks about these workloads:

* negative entries (due to search_path lookups etc.)
* many tables accessed randomly
* many tables with only a small subset accessed frequently
* many tables with subsets accessed in subsets (due to pooling)
* ...

Unfortunately, some of those cases seem somewhat contradictory (i.e.
what works for one hurts the other), so I doubt it's possible to improve
all of them at once. But that makes the benchmarking even more important.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Protect syscache from bloating with negative cache entries

От
Tomas Vondra
Дата:
On 1/21/19 9:56 PM, Bruce Momjian wrote:
> On Fri, Jan 18, 2019 at 05:09:41PM -0800, Andres Freund wrote:
>> Hi,
>>
>> On 2019-01-18 19:57:03 -0500, Robert Haas wrote:
>>> On Fri, Jan 18, 2019 at 4:23 PM andres@anarazel.de <andres@anarazel.de> wrote:
>>>> My proposal for this was to attach a 'generation' to cache entries. Upon
>>>> access cache entries are marked to be of the current
>>>> generation. Whenever existing memory isn't sufficient for further cache
>>>> entries and, on a less frequent schedule, triggered by a timer, the
>>>> cache generation is increased and th new generation's "creation time" is
>>>> measured.  Then generations that are older than a certain threshold are
>>>> purged, and if there are any, the entries of the purged generation are
>>>> removed from the caches using a sequential scan through the cache.
>>>>
>>>> This outline achieves:
>>>> - no additional time measurements in hot code paths
>>>> - no need for a sequential scan of the entire cache when no generations
>>>>   are too old
>>>> - both size and time limits can be implemented reasonably cheaply
>>>> - overhead when feature disabled should be close to zero
>>>
>>> Seems generally reasonable.  The "whenever existing memory isn't
>>> sufficient for further cache entries" part I'm not sure about.
>>> Couldn't that trigger very frequently and prevent necessary cache size
>>> growth?
>>
>> I'm thinking it'd just trigger a new generation, with it's associated
>> "creation" time (which is cheap to acquire in comparison to creating a
>> number of cache entries) . Depending on settings or just code policy we
>> can decide up to which generation to prune the cache, using that
>> creation time.  I'd imagine that we'd have some default cache-pruning
>> time in the minutes, and for workloads where relevant one can make
>> sizing configurations more aggressive - or something like that.
> 
> OK, so it seems everyone likes the idea of a timer.  The open questions
> are whether we want multiple epochs, and whether we want some kind of
> size trigger.
> 

FWIW I share the view that time-based eviction (be it some sort of
timestamp or epoch) seems promising; it seems cheaper than pretty much any
other LRU metric (requiring usage count / clock sweep / ...).

> With only one time epoch, if the timer is 10 minutes, you could expire an
> entry after 10-19 minutes, while with a new epoch every minute and
> 10-minute expire, you can do 10-11 minute precision.  I am not sure the
> complexity is worth it.
> 

I don't think having just a single epoch would be significantly less
complex than having more of them. In fact, having more of them might
make it actually cheaper.
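
For what it's worth, the generation idea quoted above could be sketched,
in a self-contained toy form with invented names and thresholds, roughly
like this:

/*
 * Toy sketch of generation-based eviction: a timer (not shown) bumps
 * current_generation, lookups stamp entries with it, and pruning removes
 * entries whose stamp is too many generations behind.  Names and the
 * threshold are invented for this example.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct ToyEntry
{
    uint64_t    generation;     /* generation at last access */
} ToyEntry;

static uint64_t current_generation = 0;

/* Bumped by a timer, or when the cache wants to grow; very cheap. */
static void
toy_new_generation(void)
{
    current_generation++;
}

/* On every hit: no clock reading in the hot path, just copy the counter. */
static void
toy_touch(ToyEntry *e)
{
    e->generation = current_generation;
}

/* An entry is purgeable once it is more than max_old generations behind. */
static bool
toy_is_stale(const ToyEntry *e, uint64_t max_old)
{
    return current_generation - e->generation > max_old;
}

int
main(void)
{
    ToyEntry    e;

    toy_touch(&e);              /* accessed in generation 0 */
    toy_new_generation();       /* the timer fires twice with no further access */
    toy_new_generation();
    printf("stale with max_old=1: %d\n", (int) toy_is_stale(&e, 1));
    return 0;
}

The hot path only copies a counter, which is what makes the scheme cheap
compared with reading the clock on every access.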


> For a size trigger, should removal be effected by how many expired cache
> entries there are?  If there were 10k expired entries or 50, wouldn't
> you want them removed if they have not been accessed in X minutes?
> 
> In the worst case, if 10k entries were accessed in a query and never
> accessed again, what would the ideal cleanup behavior be?  Would it
> matter if it was expired in 10 or 19 minutes?  Would it matter if there
> were only 50 entries?
> 

I don't think we need to remove the expired entries right away, if there
are only very few of them. The cleanup requires walking the hash table,
which means significant fixed cost. So if there are only few expired
entries (say, less than 25% of the cache), we can just leave them around
and clean them if we happen to stumble on them (although that may not be
possible with dynahash, which has no concept of expiration) or before
enlarging the hash table.

FWIW when it comes to memory consumption, it's important to realize the
cache memory context won't release the memory to the system, even if we
remove the expired entries. It'll simply stash them into a freelist.
That's OK when the entries are to be reused, but the memory usage won't
decrease after a sudden spike for example (and there may be other chunks
allocated on the same page, so paging it out will hurt).

So if we want to address this case too (and we probably want to), we may
need to discard the old cache memory context somehow (e.g. rebuild the
cache in a new one, and copy the non-expired entries). Which is a nice
opportunity to do the "full" cleanup, of course.
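
A rough sketch of that rebuild-into-a-fresh-context idea, with an
invented entry type and liveness check but using the real memory-context
calls, might look like this:

/*
 * Sketch of rebuilding a cache into a fresh memory context so that memory
 * from expired entries is actually released rather than kept on the old
 * context's freelists.  ToyEntry, toy_entries and toy_entry_is_live() are
 * invented; the memory-context calls are the existing PostgreSQL API.
 */
#include "postgres.h"

#include "utils/memutils.h"

typedef struct ToyEntry
{
    struct ToyEntry *next;
    /* ... cached data ... */
} ToyEntry;

extern bool toy_entry_is_live(ToyEntry *entry);     /* expiration check, not shown */

static MemoryContext toy_cache_context = NULL;
static ToyEntry *toy_entries = NULL;

static void
toy_rebuild_cache(void)
{
    MemoryContext newcxt;
    MemoryContext oldcxt = toy_cache_context;
    ToyEntry   *entry;
    ToyEntry   *newlist = NULL;

    newcxt = AllocSetContextCreate(CacheMemoryContext,
                                   "toy cache context",
                                   ALLOCSET_DEFAULT_SIZES);

    /* copy only the entries that survived expiration into the new context */
    for (entry = toy_entries; entry != NULL; entry = entry->next)
    {
        if (toy_entry_is_live(entry))
        {
            ToyEntry   *copy = MemoryContextAlloc(newcxt, sizeof(ToyEntry));

            memcpy(copy, entry, sizeof(ToyEntry));
            copy->next = newlist;
            newlist = copy;
        }
    }

    toy_entries = newlist;
    toy_cache_context = newcxt;

    /* dropping the old context returns its memory, freelists and all */
    if (oldcxt != NULL)
        MemoryContextDelete(oldcxt);
}

The actual catcache has hash buckets and reference counts to worry
about, so this only shows the shape of the operation.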


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Protect syscache from bloating with negative cache entries

От
Alvaro Herrera
Дата:
On 2019-Feb-05, Tomas Vondra wrote:

> I don't think we need to remove the expired entries right away, if there
> are only very few of them. The cleanup requires walking the hash table,
> which means significant fixed cost. So if there are only few expired
> entries (say, less than 25% of the cache), we can just leave them around
> and clean them if we happen to stumble on them (although that may not be
> possible with dynahash, which has no concept of expiration) of before
> enlarging the hash table.

I think seqscanning the hash table is going to be too slow; Ideriha-san's
idea of having a dlist with the entries in LRU order (where each entry
is moved to the head of the list when it is touched) seemed good: it allows you
to evict older ones when the time comes, without having to scan the rest
of the entries.  Having a dlist means two more pointers on each cache
entry AFAIR, so it's not a huge amount of memory.

> So if we want to address this case too (and we probably want), we may
> need to discard the old cache memory context someho (e.g. rebuild the
> cache in a new one, and copy the non-expired entries). Which is a nice
> opportunity to do the "full" cleanup, of course.

Yeah, we probably don't want to do this super frequently though.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Protect syscache from bloating with negative cache entries

От
Tomas Vondra
Дата:
On 2/5/19 11:05 PM, Alvaro Herrera wrote:
> On 2019-Feb-05, Tomas Vondra wrote:
> 
>> I don't think we need to remove the expired entries right away, if there
>> are only very few of them. The cleanup requires walking the hash table,
>> which means a significant fixed cost. So if there are only a few expired
>> entries (say, less than 25% of the cache), we can just leave them around
>> and clean them if we happen to stumble on them (although that may not be
>> possible with dynahash, which has no concept of expiration) or before
>> enlarging the hash table.
> 
> I think seqscanning the hash table is going to be too slow; Ideriha-san's
> idea of having a dlist with the entries in LRU order (where each entry
> is moved to the head of the list when it is touched) seemed good: it allows
> you to evict older ones when the time comes, without having to scan the rest
> of the entries.  Having a dlist means two more pointers on each cache
> entry AFAIR, so it's not a huge amount of memory.
> 

Possibly, although my guess is it will depend on the number of entries
to remove. For a small number of entries, the dlist approach is going to
be faster, but at some point the bulk seqscan gets more efficient.

FWIW this is exactly where a bit of benchmarking would help.

>> So if we want to address this case too (and we probably want to), we may
>> need to discard the old cache memory context somehow (e.g. rebuild the
>> cache in a new one, and copy the non-expired entries). Which is a nice
>> opportunity to do the "full" cleanup, of course.
> 
> Yeah, we probably don't want to do this super frequently though.
> 

Right. I've also realized the resizing is built into dynahash and is
kinda incremental - we add (and split) buckets one by one, instead of
immediately rebuilding the whole hash table. So yes, this would need
more care and might need to interact with dynahash in some way.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
At Tue, 5 Feb 2019 02:40:35 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in
<0A3221C70F24FB45833433255569204D1FB93A16@G01JPEXMBYT05>
> From: bruce@momjian.us [mailto:bruce@momjian.us]
> > On Mon, Feb 4, 2019 at 08:23:39AM +0000, Tsunakawa, Takayuki wrote:
> > > Horiguchi-san, Bruce, all, So, why don't we make
> > > syscache_memory_target the upper limit on the total size of all
> > > catcaches, and rethink the past LRU management?
> > 
> > I was going to say that our experience with LRU has been that the
> > overhead is not worth the value, but that was in shared resource cases,
> > which this is not.
> 
> That's good news!  Then, let's proceed with the approach involving LRU, Horiguchi-san, Ideriha-san.

If by "LRU" you mean an access-time-ordered list of entries, I
still object to involving it, since it adds too much complexity to
the search code paths. Invalidation would make things even more
complex. The current patch sorts entries by ct->lastaccess and
discards entries not accessed for more than the threshold, only
when doubling the cache capacity. It is already a kind of LRU in
behavior.

This patch intends to keep caches from bloating with unnecessary
entries, which meant negative entries at first, and now also
less-accessed ones. If by "LRU" you mean something that puts a
hard limit on the number or size of a catcache, or of all caches,
that would be doable by adding a sort phase before pruning, like
CatCacheCleanOldEntriesByNum() in the attached PoC (first
attachment), offered as food for discussion.
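
In rough shape, that sort-then-cap phase would be something like the
sketch below. It is a simplified illustration of the idea only, not the
exact code of the attached PoC: CatCacheCapEntries, lastaccess_cmp and the
cap parameter are made-up names here, while cc_bucket, cc_nbuckets,
cc_ntup, lastaccess, refcount, c_list and CatCacheRemoveCTup come from the
existing (or patched) catcache code.

/* simplified sketch of a hard cap on the number of entries per catcache */
static int
lastaccess_cmp(const void *a, const void *b)
{
    const CatCTup *ca = *(const CatCTup *const *) a;
    const CatCTup *cb = *(const CatCTup *const *) b;

    if (ca->lastaccess == cb->lastaccess)
        return 0;
    return (ca->lastaccess < cb->lastaccess) ? -1 : 1;
}

static void
CatCacheCapEntries(CatCache *cp, int cap)
{
    CatCTup   **ents;
    int         nents = 0;
    int         i;

    if (cp->cc_ntup <= cap)
        return;

    ents = (CatCTup **) palloc(cp->cc_ntup * sizeof(CatCTup *));

    /* collect all entries, then sort them by the time of last access */
    for (i = 0; i < cp->cc_nbuckets; i++)
    {
        dlist_iter  iter;

        dlist_foreach(iter, &cp->cc_bucket[i])
            ents[nents++] = dlist_container(CatCTup, cache_elem, iter.cur);
    }

    qsort(ents, nents, sizeof(CatCTup *), lastaccess_cmp);

    /* evict from the oldest end until we are down to the cap */
    for (i = 0; i < nents - cap; i++)
    {
        CatCTup    *ct = ents[i];

        if (ct->refcount == 0 &&
            (ct->c_list == NULL || ct->c_list->refcount == 0))
            CatCacheRemoveCTup(cp, ct);
    }

    pfree(ents);
}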

With the second attached script, we can observe what is happening
from another session by the following query.

select relname, size, ntuples, ageclass from pg_stat_syscache where relname = 'pg_statistic'::regclass;

> pg_statistic | 1041024 |    7109 | {{1,1109},{3,0},{30,0},{60,0},{90,6000},{0,0

On the other hand, unlike the original pruning, this happens
independently of hash resizing, so it will cause another kind of
observable intermittent slowdown in addition to rehashing.

The two should have the same extent of impact on performance when
disabled. I'll take some numbers shortly using pgbench.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 21f7b5528be03274dae9e58690c35cee9e68c82f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 16 Oct 2018 13:04:30 +0900
Subject: [PATCH 1/4] Remove entries that haven't been used for a certain time

Catcache entries can be left unused for long periods for several reasons,
and it is not desirable that they eat up memory. With this patch, removal
of entries that haven't been used for a certain time is considered before
enlarging the hash array.
---
 doc/src/sgml/config.sgml                      |  38 ++++++
 src/backend/access/transam/xact.c             |   5 +
 src/backend/utils/cache/catcache.c            | 166 ++++++++++++++++++++++++--
 src/backend/utils/misc/guc.c                  |  23 ++++
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/utils/catcache.h                  |  28 ++++-
 6 files changed, 254 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 9b7a7388d5..d0d2374944 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1662,6 +1662,44 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target">
+      <term><varname>syscache_memory_target</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_memory_target</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to which syscache is expanded
+        without pruning. The value defaults to 0, indicating that pruning is
+        always considered. After exceeding this size, syscache pruning is
+        considered according to
+        <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep
+        a certain amount of syscache entries with intermittent usage, try
+        increasing this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age">
+      <term><varname>syscache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the minimum amount of unused time in seconds after which a
+        syscache entry is considered for removal. -1 indicates that syscache
+        pruning is disabled entirely. The value defaults to 600 seconds
+        (<literal>10 minutes</literal>). Syscache entries that have not been
+        used for that duration can be removed to prevent syscache bloat. This
+        behavior is suppressed until the size of syscache exceeds
+        <xref linkend="guc-syscache-memory-target"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 92bda87804..ddc433c59e 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -734,7 +734,12 @@ void
 SetCurrentStatementStartTimestamp(void)
 {
     if (!IsParallelWorker())
+    {
         stmtStartTimestamp = GetCurrentTimestamp();
+
+        /* Use this timestamp as the approximate current time for catcaches */
+        SetCatCacheClock(stmtStartTimestamp);
+    }
     else
         Assert(stmtStartTimestamp != 0);
 }
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 258a1d64cc..2a996d740a 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -71,9 +71,24 @@
 #define CACHE6_elog(a,b,c,d,e,f,g)
 #endif
 
+/*
+ * GUC variable to define the minimum size of hash to consider entry eviction.
+ * This variable is shared among various cache mechanisms.
+ */
+int cache_memory_target = 0;
+
+/* GUC variable to define the minimum age of entries that will be considered to
+ * be evicted in seconds. This variable is shared among various cache
+ * mechanisms.
+ */
+int cache_prune_min_age = 600;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Timestamp used for any operation on caches. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -490,6 +505,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
         CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys,
                          cache->cc_keyno, ct->keys);
 
+    cache->cc_tupsize -= ct->size;
     pfree(ct);
 
     --cache->cc_ntup;
@@ -841,6 +857,7 @@ InitCatCache(int id,
     cp->cc_nkeys = nkeys;
     for (i = 0; i < nkeys; ++i)
         cp->cc_keyno[i] = key[i];
+    cp->cc_tupsize = 0;
 
     /*
      * new cache is initialized as far as we can go for now. print some
@@ -858,9 +875,129 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize catcache reference clock if not done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left alone for several reasons. We remove them if
+ * they are not accessed for a certain time to prevent catcache from
+ * bloating. The eviction is performed with an algorithm similar to buffer
+ * eviction, using an access counter. Entries that are accessed several times can
+ * live longer than those that have had no access in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int            i;
+    int            nremoved = 0;
+    size_t        hash_size;
+#ifdef CATCACHE_STATS
+    /* These variables are only for debugging purposes */
+    int            ntotal = 0;
+    /*
+     * The nth element of nentries stores the number of cache entries that have
+     * lived unaccessed for the corresponding multiple (given in ageclass) of
+     * cache_prune_min_age. The index of nremoved_entry is the value of the
+     * clock-sweep counter, which takes from 0 up to 2.
+     */
+    double        ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+    int            nentries[] = {0, 0, 0, 0, 0, 0};
+    int            nremoved_entry[3] = {0, 0, 0};
+    int            j;
+#endif
+
+    /* Return immediately if no pruning is wanted */
+    if (cache_prune_min_age < 0)
+        return false;
+
+    /*
+     * Return without pruning if the size of the hash is below the target.
+     */
+    hash_size = cp->cc_nbuckets * sizeof(dlist_head);
+    if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L)
+        return false;
+    
+    /* Search the whole hash for entries to remove */
+    for (i = 0; i < cp->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+
+            /*
+             * Calculate the duration from the time of the last access to the
+             * "current" time. Since catcacheclock is not advanced within a
+             * transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+#ifdef CATCACHE_STATS
+            /* count catcache entries for each age class */
+            ntotal++;
+            for (j = 0 ;
+                 ageclass[j] != 0.0 &&
+                     entry_age > cache_prune_min_age * ageclass[j] ;
+                 j++);
+            if (ageclass[j] == 0.0) j--;
+            nentries[j]++;
+#endif
+
+            /*
+             * Try to remove entries older than cache_prune_min_age seconds.
+             * Entries not accessed since the last pruning are removed after
+             * that many seconds, while entries that have been accessed several
+             * times are removed only after being left alone for up to three
+             * times that duration. We don't try to shrink the buckets since
+             * pruning effectively caps catcache expansion in the long term.
+             */
+            if (entry_age > cache_prune_min_age)
+            {
+#ifdef CATCACHE_STATS
+                Assert (ct->naccess >= 0 && ct->naccess <= 2);
+                nremoved_entry[ct->naccess]++;
+#endif
+                if (ct->naccess > 0)
+                    ct->naccess--;
+                else
+                {
+                    if (!ct->c_list || ct->c_list->refcount == 0)
+                    {
+                        CatCacheRemoveCTup(cp, ct);
+                        nremoved++;
+                    }
+                }
+            }
+        }
+    }
+
+#ifdef CATCACHE_STATS
+    ereport(DEBUG1,
+            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d,
2:%d)",
+                     nremoved, ntotal,
+                     ageclass[0] * cache_prune_min_age, nentries[0],
+                     ageclass[1] * cache_prune_min_age, nentries[1],
+                     ageclass[2] * cache_prune_min_age, nentries[2],
+                     ageclass[3] * cache_prune_min_age, nentries[3],
+                     ageclass[4] * cache_prune_min_age, nentries[4],
+                     nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]),
+             errhidestmt(true)));
+#endif
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -1274,6 +1411,11 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /* Update access information for pruning */
+        if (ct->naccess < 2)
+            ct->naccess++;
+        ct->lastaccess = catcacheclock;
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1819,11 +1961,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     CatCTup    *ct;
     HeapTuple    dtp;
     MemoryContext oldcxt;
+    int            tupsize = 0;
 
     /* negative entries have no tuple associated */
     if (ntp)
     {
         int            i;
+        /* tupsize is computed below and charged to cc_tupsize at the end */
 
         Assert(!negative);
 
@@ -1842,13 +1986,14 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
         /* Allocate memory for CatCTup and the cached tuple in one go */
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
 
-        ct = (CatCTup *) palloc(sizeof(CatCTup) +
-                                MAXIMUM_ALIGNOF + dtp->t_len);
+        tupsize = sizeof(CatCTup) +    MAXIMUM_ALIGNOF + dtp->t_len;
+        ct = (CatCTup *) palloc(tupsize);
         ct->tuple.t_len = dtp->t_len;
         ct->tuple.t_self = dtp->t_self;
         ct->tuple.t_tableOid = dtp->t_tableOid;
         ct->tuple.t_data = (HeapTupleHeader)
             MAXALIGN(((char *) ct) + sizeof(CatCTup));
+        ct->size = tupsize;
         /* copy tuple contents */
         memcpy((char *) ct->tuple.t_data,
                (const char *) dtp->t_data,
@@ -1876,8 +2021,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     {
         Assert(negative);
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
-        ct = (CatCTup *) palloc(sizeof(CatCTup));
-
+        tupsize = sizeof(CatCTup);
+        ct = (CatCTup *) palloc(tupsize);
         /*
          * Store keys - they'll point into separately allocated memory if not
          * by-value.
@@ -1898,17 +2043,24 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
+    ct->size = tupsize;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
     cache->cc_ntup++;
     CacheHdr->ch_ntup++;
+    cache->cc_tupsize += tupsize;
 
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, try cleanup by removing
+     * infrequently used entries to make room for the new entry. If that
+     * fails, enlarge the bucket array instead.  Quite arbitrarily, we try
+     * this when fill factor > 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
 
     return ct;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8681ada33a..06c589f725 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -81,6 +81,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/memutils.h"
@@ -2204,6 +2205,28 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"cache_memory_target", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum syscache size to keep."),
+            gettext_noop("Cache is not pruned before exceeding this size."),
+            GUC_UNIT_KB
+        },
+        &cache_memory_target,
+        0, 0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum unused duration of cache entries before removal."),
+            gettext_noop("Cache entries that live unused for longer than this seconds are considered to be
removed."),
+            GUC_UNIT_S
+        },
+        &cache_prune_min_age,
+        600, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c7f53470df..108d332f2c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -128,6 +128,8 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#cache_memory_target = 0kB    # in kB
+#cache_prune_min_age = 600s    # -1 disables pruning
 #max_stack_depth = 2MB            # min 100kB
 #shared_memory_type = mmap        # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 65d816a583..5d24809900 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,7 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
+    int            cc_tupsize;        /* total size in bytes of the entries in this catcache */
 
     /*
      * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,7 +121,9 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
-
+    int            naccess;        /* # of accesses to this entry, up to 2 */
+    TimestampTz    lastaccess;        /* approx. timestamp of the last usage */
+    int            size;            /* palloc'ed size of this tuple */
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
      * catcache is list-searched with varying numbers of keys, we may have to
@@ -189,6 +193,28 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLPMPORT'ed */
+extern int cache_prune_min_age;
+extern int cache_memory_target;
+
+/* to use as access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/*
+ * SetCatCacheClock - set timestamp for catcache access record
+ */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
+static inline TimestampTz
+GetCatCacheClock(void)
+{
+    return catcacheclock;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.16.3

From 9f243e2fa6c6aaa5e333662f63c28c18ea72ed0f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 16 Oct 2018 15:48:28 +0900
Subject: [PATCH 2/4] Syscache usage tracking feature.

Collects syscache usage statistics and shows them using the view
pg_stat_syscache. The feature is controlled by the GUC variable
track_syscache_usage_interval.
---
 doc/src/sgml/config.sgml                      |  15 ++
 src/backend/catalog/system_views.sql          |  17 +++
 src/backend/postmaster/pgstat.c               | 201 ++++++++++++++++++++++++--
 src/backend/tcop/postgres.c                   |  23 +++
 src/backend/utils/adt/pgstatfuncs.c           | 134 +++++++++++++++++
 src/backend/utils/cache/catcache.c            | 115 +++++++++++----
 src/backend/utils/cache/syscache.c            |  24 +++
 src/backend/utils/init/globals.c              |   1 +
 src/backend/utils/init/postinit.c             |  11 ++
 src/backend/utils/misc/guc.c                  |  10 ++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/catalog/pg_proc.dat               |   9 ++
 src/include/miscadmin.h                       |   1 +
 src/include/pgstat.h                          |   6 +-
 src/include/utils/catcache.h                  |   9 +-
 src/include/utils/syscache.h                  |  19 +++
 src/include/utils/timeout.h                   |   1 +
 src/test/regress/expected/rules.out           |  24 ++-
 18 files changed, 576 insertions(+), 45 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d0d2374944..5ff3ebeb4e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6687,6 +6687,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-syscache-usage-interval" xreflabel="track_syscache_usage_interval">
+      <term><varname>track_syscache_usage_interval</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>track_syscache_usage_interval</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the interval, in milliseconds, at which system cache usage
+        statistics are collected. This parameter is 0 by default, which
+        disables collection. Only superusers can change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-io-timing" xreflabel="track_io_timing">
       <term><varname>track_io_timing</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 3e229c693c..f5d1aaf96f 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -906,6 +906,22 @@ CREATE VIEW pg_stat_progress_vacuum AS
     FROM pg_stat_get_progress_info('VACUUM') AS S
         LEFT JOIN pg_database D ON S.datid = D.oid;
 
+CREATE VIEW pg_stat_syscache AS
+    SELECT
+        S.pid                AS pid,
+        S.relid::regclass    AS relname,
+        S.indid::regclass    AS cache_name,
+        S.size                AS size,
+        S.ntup                AS ntuples,
+        S.searches            AS searches,
+        S.hits                AS hits,
+        S.neg_hits            AS neg_hits,
+        S.ageclass            AS ageclass,
+        S.last_update        AS last_update
+    FROM pg_stat_activity A
+    JOIN LATERAL (SELECT A.pid, * FROM pg_get_syscache_stats(A.pid)) S
+        ON (A.pid = S.pid);
+
 CREATE VIEW pg_user_mappings AS
     SELECT
         U.oid       AS umid,
@@ -1185,6 +1201,7 @@ GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_get_syscache_stats(int) TO pg_monitor;
 
 GRANT pg_read_all_settings TO pg_monitor;
 GRANT pg_read_all_stats TO pg_monitor;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 81c6499251..a1939958b7 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -66,6 +66,7 @@
 #include "utils/ps_status.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 
@@ -124,6 +125,7 @@
 bool        pgstat_track_activities = false;
 bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
+int            pgstat_track_syscache_usage_interval = 0;
 int            pgstat_track_activity_query_size = 1024;
 
 /* ----------
@@ -236,6 +238,11 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
+/* bitmap symbols to specify which types of stats file to remove */
+#define PGSTAT_REMFILE_DBSTAT    1        /* remove only database stats files */
+#define PGSTAT_REMFILE_SYSCACHE    2        /* remove only syscache stats files */
+#define PGSTAT_REMFILE_ALL        3        /* remove both types of files */
+
 /*
  * Info about current "snapshot" of stats file
  */
@@ -335,6 +342,7 @@ static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
 static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
 static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+static void pgstat_remove_syscache_statsfile(void);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
@@ -630,10 +638,13 @@ startup_failed:
 }
 
 /*
- * subroutine for pgstat_reset_all
+ * remove stats files
+ *
+ * clean up stats files in the specified directory. target is one of
+ * PGSTAT_REMFILE_DBSTAT/SYSCACHE/ALL and restricts which files to remove.
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+pgstat_reset_remove_files(const char *directory, int target)
 {
     DIR           *dir;
     struct dirent *entry;
@@ -644,25 +655,39 @@ pgstat_reset_remove_files(const char *directory)
     {
         int            nchars;
         Oid            tmp_oid;
+        int            filetype = 0;
 
         /*
          * Skip directory entries that don't match the file names we write.
          * See get_dbstat_filename for the database-specific pattern.
          */
         if (strncmp(entry->d_name, "global.", 7) == 0)
+        {
+            filetype = PGSTAT_REMFILE_DBSTAT;
             nchars = 7;
+        }
         else
         {
+            char head[2];
+            
             nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
+            (void) sscanf(entry->d_name, "%c%c_%u.%n",
+                          head, head + 1, &tmp_oid, &nchars);
+
             /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
+            if (nchars < 3 || !isdigit(entry->d_name[3]))
                 continue;
+
+            if  (strncmp(head, "db", 2) == 0)
+                filetype = PGSTAT_REMFILE_DBSTAT;
+            else if (strncmp(head, "cc", 2) == 0)
+                filetype = PGSTAT_REMFILE_SYSCACHE;
         }
 
+        /* skip if this is not a target */
+        if ((filetype & target) == 0)
+            continue;
+
         if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
             strcmp(entry->d_name + nchars, "stat") != 0)
             continue;
@@ -683,8 +708,9 @@ pgstat_reset_remove_files(const char *directory)
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_ALL);
+    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY,
+                              PGSTAT_REMFILE_ALL);
 }
 
 #ifdef EXEC_BACKEND
@@ -2963,6 +2989,10 @@ pgstat_beshutdown_hook(int code, Datum arg)
     if (OidIsValid(MyDatabaseId))
         pgstat_report_stat(true);
 
+    /* clear syscache statistics files and temporary settings */
+    if (MyBackendId != InvalidBackendId)
+        pgstat_remove_syscache_statsfile();
+
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
      * before and after.  We use a volatile pointer here to ensure the
@@ -4287,6 +4317,9 @@ PgstatCollectorMain(int argc, char *argv[])
     pgStatRunningInCollector = true;
     pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
 
+    /* Remove left-over syscache stats files */
+    pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_SYSCACHE);
+
     /*
      * Loop to process messages until we get SIGQUIT or detect ungraceful
      * death of our parent postmaster.
@@ -6377,3 +6410,153 @@ pgstat_clip_activity(const char *raw_activity)
 
     return activity;
 }
+
+/*
+ * return the filename for a syscache stat file; filename is the output
+ * buffer, of length len.
+ */
+void
+pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid,
+                                 char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/cc_%u.%s",
+                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
+                       pgstat_stat_directory,
+                       backendid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
+}
+
+/* remove the syscache stats file of this backend */
+static void
+pgstat_remove_syscache_statsfile(void)
+{
+    char    fname[MAXPGPATH];
+
+    pgstat_get_syscachestat_filename(false, false, MyBackendId,
+                                     fname, MAXPGPATH);
+    unlink(fname);        /* we don't care about the result */
+}
+
+/*
+ * pgstat_write_syscache_stats() -
+ *        Write the syscache statistics files.
+ *
+ * If 'force' is false, this function skips writing a file and returns the
+ * time remaining in the current interval in milliseconds. If 'force' is true,
+ * it writes a file regardless of the remaining time and resets the interval.
+ */
+long
+pgstat_write_syscache_stats(bool force)
+{
+    static TimestampTz last_report = 0;
+    TimestampTz now;
+    long elapsed;
+    long secs;
+    int     usecs;
+    int    cacheId;
+    FILE    *fpout;
+    char    statfile[MAXPGPATH];
+    char    tmpfile[MAXPGPATH];
+
+    /* Return if we don't want it */
+    if (!force && pgstat_track_syscache_usage_interval <= 0)
+    {
+        /* disabled. remove the statistics file if any */
+        if (last_report > 0)
+        {
+            last_report = 0;
+            pgstat_remove_syscache_statsfile();
+        }
+        return 0;
+    }
+    
+    /* Check against the interval */
+    now = GetCurrentTransactionStopTimestamp();
+    TimestampDifference(last_report, now, &secs, &usecs);
+    elapsed = secs * 1000 + usecs / 1000;
+
+    if (!force && elapsed < pgstat_track_syscache_usage_interval)
+    {
+        /* not time yet; report the remaining time to the caller */
+        return pgstat_track_syscache_usage_interval - elapsed;
+    }
+
+    /* now update the stats */
+    last_report = now;
+
+    pgstat_get_syscachestat_filename(false, true,
+                                     MyBackendId, tmpfile, MAXPGPATH);
+    pgstat_get_syscachestat_filename(false, false,
+                                     MyBackendId, statfile, MAXPGPATH);
+
+    /*
+     * This function can be called from ProcessInterrupts(). Hold interrupts
+     * here to avoid recursive entry.
+     */
+    HOLD_INTERRUPTS();
+
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        /*
+         * Failure writing this file is not critical. Just skip this time and
+         * tell caller to wait for the next interval.
+         */
+        RESUME_INTERRUPTS();
+        return pgstat_track_syscache_usage_interval;
+    }
+
+    /* write out the stats of every catcache */
+    for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++)
+    {
+        SysCacheStats *stats;
+        
+        stats = SysCacheGetStats(cacheId);
+        Assert (stats);
+
+        /* write error is checked later using ferror() */
+        fputc('T', fpout);
+        (void)fwrite(&cacheId, sizeof(int), 1, fpout);
+        (void)fwrite(&last_report, sizeof(TimestampTz), 1, fpout);
+        (void)fwrite(stats, sizeof(*stats), 1, fpout);
+    }
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write syscache statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close syscache statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename syscache statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+
+    RESUME_INTERRUPTS();
+    return 0;
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 36cfd507b2..fb77a0ce4c 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3157,6 +3157,12 @@ ProcessInterrupts(void)
 
     }
 
+    if (IdleSyscacheStatsUpdateTimeoutPending)
+    {
+        IdleSyscacheStatsUpdateTimeoutPending = false;
+        pgstat_write_syscache_stats(true);
+    }
+
     if (ParallelMessagePending)
         HandleParallelMessages();
 }
@@ -3733,6 +3739,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_catcache_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4173,9 +4180,19 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long timeout;
+
                 ProcessCompletedNotifies();
                 pgstat_report_stat(false);
 
+                timeout = pgstat_write_syscache_stats(false);
+
+                if (timeout > 0)
+                {
+                    disable_idle_catcache_update_timeout = true;
+                    enable_timeout_after(IDLE_CATCACHE_UPDATE_TIMEOUT,
+                                         timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4218,6 +4235,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_catcache_update_timeout)
+        {
+            disable_timeout(IDLE_CATCACHE_UPDATE_TIMEOUT, false);
+            disable_idle_catcache_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index b6ba856ebe..6526cfefb4 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -14,6 +14,8 @@
  */
 #include "postgres.h"
 
+#include <sys/stat.h>
+
 #include "access/htup_details.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
@@ -28,6 +30,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/inet.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
@@ -1899,3 +1902,134 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     PG_RETURN_DATUM(HeapTupleGetDatum(
                                       heap_form_tuple(tupdesc, values, nulls)));
 }
+
+Datum
+pgstat_get_syscache_stats(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SYSCACHE_SIZE 9
+    int                    pid     = PG_GETARG_INT32(0);
+    ReturnSetInfo       *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+    TupleDesc            tupdesc;
+    Tuplestorestate    *tupstore;
+    MemoryContext        per_query_ctx;
+    MemoryContext        oldcontext;
+    PgBackendStatus       *beentry;
+    int                    beid;
+    char                fname[MAXPGPATH];
+    FILE                  *fpin;
+    char c;
+
+    if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("set-valued function called in context that cannot accept a set")));
+    if (!(rsinfo->allowedModes & SFRM_Materialize))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("materialize mode required, but it is not " \
+                        "allowed in this context")));
+
+    /* Build a tuple descriptor for our result type */
+    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+        elog(ERROR, "return type must be a row type");
+    
+
+    per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+
+    oldcontext = MemoryContextSwitchTo(per_query_ctx);
+    tupstore = tuplestore_begin_heap(true, false, work_mem);
+    rsinfo->returnMode = SFRM_Materialize;
+    rsinfo->setResult = tupstore;
+    rsinfo->setDesc = tupdesc;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    /* find the beentry for the given pid */
+    beentry = NULL;
+    for (beid = 1;
+         (beentry = pgstat_fetch_stat_beentry(beid)) &&
+             beentry->st_procpid != pid ;
+         beid++);
+
+    /*
+     * We silently return an empty result on failure or insufficient privileges.
+     */
+    if (!beentry ||
+        (!has_privs_of_role(GetUserId(), beentry->st_userid) &&
+         !is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS)))
+        goto no_data;
+
+    pgstat_get_syscachestat_filename(false, false, beid, fname, MAXPGPATH);
+
+    if ((fpin = AllocateFile(fname, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(WARNING,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            fname)));
+        /* also return an empty result when there is no statistics file */
+        goto no_data;
+    }
+
+    /* read the statistics file into tuplestore */
+    while ((c = fgetc(fpin)) == 'T')
+    {
+        TimestampTz last_update;
+        SysCacheStats stats;
+        int cacheid;
+        Datum values[PG_GET_SYSCACHE_SIZE];
+        bool nulls[PG_GET_SYSCACHE_SIZE] = {0};
+        Datum datums[SYSCACHE_STATS_NAGECLASSES * 2];
+        bool arrnulls[SYSCACHE_STATS_NAGECLASSES * 2] = {0};
+        int    dims[] = {SYSCACHE_STATS_NAGECLASSES, 2};
+        int lbs[] = {1, 1};
+        ArrayType *arr;
+        int i, j;
+
+        fread(&cacheid, sizeof(int), 1, fpin);
+        fread(&last_update, sizeof(TimestampTz), 1, fpin);
+        if (fread(&stats, 1, sizeof(stats), fpin) != sizeof(stats))
+        {
+            ereport(WARNING, 
+                    (errmsg("corrupted syscache statistics file \"%s\"",
+                            fname)));
+            goto no_data;
+        }
+
+        i = 0;
+        values[i++] = ObjectIdGetDatum(stats.reloid);
+        values[i++] = ObjectIdGetDatum(stats.indoid);
+        values[i++] = Int64GetDatum(stats.size);
+        values[i++] = Int64GetDatum(stats.ntuples);
+        values[i++] = Int64GetDatum(stats.nsearches);
+        values[i++] = Int64GetDatum(stats.nhits);
+        values[i++] = Int64GetDatum(stats.nneg_hits);
+
+        for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++)
+        {
+            datums[j * 2] = Int32GetDatum((int32) stats.ageclasses[j]);
+            datums[j * 2 + 1] = Int32GetDatum((int32) stats.nclass_entries[j]);
+        }            
+
+        arr = construct_md_array(datums, arrnulls, 2, dims, lbs,
+                              INT4OID, sizeof(int32), true, 'i');
+        values[i++] = PointerGetDatum(arr);
+
+        values[i++] = TimestampTzGetDatum(last_update);
+
+        Assert (i == PG_GET_SYSCACHE_SIZE);
+
+        tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+    }
+
+    /* check for the end of file. abandon the result if file is broken */
+    if (c != 'E' || fgetc(fpin) != EOF)
+        tuplestore_clear(tupstore);
+
+    FreeFile(fpin);
+
+no_data:
+    tuplestore_donestoring(tupstore);
+    return (Datum) 0;
+}
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 2a996d740a..4ccda06795 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -89,6 +89,10 @@ static CatCacheHeader *CacheHdr = NULL;
 /* Timestamp used for any operation on caches. */
 TimestampTz    catcacheclock = 0;
 
+/* age classes for pruning */
+static double ageclass[SYSCACHE_STATS_NAGECLASSES]
+    = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -619,9 +623,7 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue)
             else
                 CatCacheRemoveCTup(cache, ct);
             CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated");
-#ifdef CATCACHE_STATS
             cache->cc_invals++;
-#endif
             /* could be multiple matches, so keep looking! */
         }
     }
@@ -697,9 +699,7 @@ ResetCatalogCache(CatCache *cache)
             }
             else
                 CatCacheRemoveCTup(cache, ct);
-#ifdef CATCACHE_STATS
             cache->cc_invals++;
-#endif
         }
     }
 }
@@ -906,10 +906,11 @@ CatCacheCleanupOldEntries(CatCache *cp)
      * cache_prune_min_age. The index of nremoved_entry is the value of the
      * clock-sweep counter, which takes from 0 up to 2.
      */
-    double        ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
-    int            nentries[] = {0, 0, 0, 0, 0, 0};
+    int            nentries[SYSCACHE_STATS_NAGECLASSES] = {0, 0, 0, 0, 0, 0};
     int            nremoved_entry[3] = {0, 0, 0};
     int            j;
+
+    Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0);
 #endif
 
     /* Return immediately if no pruning is wanted */
@@ -923,7 +924,11 @@ CatCacheCleanupOldEntries(CatCache *cp)
     if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L)
         return false;
     
-    /* Search the whole hash for entries to remove */
+    /*
+     * Search the whole hash for entries to remove. This is quite a
+     * time-consuming task during a catcache lookup, but acceptable since
+     * we are about to expand the hash table anyway.
+     */
     for (i = 0; i < cp->cc_nbuckets; i++)
     {
         dlist_mutable_iter iter;
@@ -936,21 +941,21 @@ CatCacheCleanupOldEntries(CatCache *cp)
 
 
             /*
-             * Calculate the duration from the time of the last access to the
-             * "current" time. Since catcacheclock is not advanced within a
-             * transaction, the entries that are accessed within the current
-             * transaction won't be pruned.
+             * Calculate the duration from the time of the last access to
+             * the "current" time. Since catcacheclock is not advanced within
+             * a transaction, the entries that are accessed within the current
+             * transaction always get 0 as the result.
              */
             TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
 
 #ifdef CATCACHE_STATS
             /* count catcache entries for each age class */
             ntotal++;
-            for (j = 0 ;
-                 ageclass[j] != 0.0 &&
-                     entry_age > cache_prune_min_age * ageclass[j] ;
-                 j++);
-            if (ageclass[j] == 0.0) j--;
+
+            j = 0;
+            while (j < SYSCACHE_STATS_NAGECLASSES - 1 &&
+                   entry_age > cache_prune_min_age * ageclass[j])
+                j++;
             nentries[j]++;
 #endif
 
@@ -983,14 +988,17 @@ CatCacheCleanupOldEntries(CatCache *cp)
     }
 
 #ifdef CATCACHE_STATS
+    StaticAssertStmt(SYSCACHE_STATS_NAGECLASSES == 6,
+                     "number of syscache age class must be 6");
     ereport(DEBUG1,
-            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)",
+            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d, rest:%d) naccessed(0:%d, 1:%d, 2:%d)",
                      nremoved, ntotal,
                      ageclass[0] * cache_prune_min_age, nentries[0],
                      ageclass[1] * cache_prune_min_age, nentries[1],
                      ageclass[2] * cache_prune_min_age, nentries[2],
                      ageclass[3] * cache_prune_min_age, nentries[3],
                      ageclass[4] * cache_prune_min_age, nentries[4],
+                     nentries[5],
                      nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]),
              errhidestmt(true)));
 #endif
@@ -1367,9 +1375,7 @@ SearchCatCacheInternal(CatCache *cache,
     if (unlikely(cache->cc_tupdesc == NULL))
         CatalogCacheInitializeCache(cache);
 
-#ifdef CATCACHE_STATS
     cache->cc_searches++;
-#endif
 
     /* Initialize local parameter array */
     arguments[0] = v1;
@@ -1429,9 +1435,7 @@ SearchCatCacheInternal(CatCache *cache,
             CACHE3_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d",
                         cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
             cache->cc_hits++;
-#endif
 
             return &ct->tuple;
         }
@@ -1440,9 +1444,7 @@ SearchCatCacheInternal(CatCache *cache,
             CACHE3_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d",
                         cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
             cache->cc_neg_hits++;
-#endif
 
             return NULL;
         }
@@ -1570,9 +1572,7 @@ SearchCatCacheMiss(CatCache *cache,
     CACHE3_elog(DEBUG2, "SearchCatCache(%s): put in bucket %d",
                 cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
     cache->cc_newloads++;
-#endif
 
     return &ct->tuple;
 }
@@ -1683,9 +1683,7 @@ SearchCatCacheList(CatCache *cache,
 
     Assert(nkeys > 0 && nkeys < cache->cc_nkeys);
 
-#ifdef CATCACHE_STATS
     cache->cc_lsearches++;
-#endif
 
     /* Initialize local parameter array */
     arguments[0] = v1;
@@ -1742,9 +1740,7 @@ SearchCatCacheList(CatCache *cache,
         CACHE2_elog(DEBUG2, "SearchCatCacheList(%s): found list",
                     cache->cc_relname);
 
-#ifdef CATCACHE_STATS
         cache->cc_lhits++;
-#endif
 
         return cl;
     }
@@ -2252,3 +2248,64 @@ PrintCatCacheListLeakWarning(CatCList *list)
          list->my_cache->cc_relname, list->my_cache->id,
          list, list->refcount);
 }
+
+/*
+ * CatCacheGetStats - fill in SysCacheStats struct.
+ *
+ * This is a support routine for SysCacheGetStats, substantially fills in the
+ * result. The classification here is based on the same criteria to
+ * CatCacheCleanupOldEntries().
+ */
+void
+CatCacheGetStats(CatCache *cache, SysCacheStats *stats)
+{
+    int    i, j;
+
+    Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0);
+
+    /* fill in the stats struct */
+    stats->size = cache->cc_tupsize + cache->cc_nbuckets * sizeof(dlist_head);
+    stats->ntuples = cache->cc_ntup;
+    stats->nsearches = cache->cc_searches;
+    stats->nhits = cache->cc_hits;
+    stats->nneg_hits = cache->cc_neg_hits;
+
+    /* cache_prune_min_age can be changed during a session, so fill this every time */
+    for (i = 0 ; i < SYSCACHE_STATS_NAGECLASSES ; i++)
+        stats->ageclasses[i] = (int) (cache_prune_min_age * ageclass[i]);
+
+    /*
+     * The nth element of nclass_entries stores the number of cache entries
+     * that have lived unaccessed for the corresponding multiple (given in
+     * ageclass) of cache_prune_min_age.
+     */
+    memset(stats->nclass_entries, 0, sizeof(int) * SYSCACHE_STATS_NAGECLASSES);
+
+    /* Scan the whole hash */
+    for (i = 0; i < cache->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cache->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+            /*
+             * Calculate the duration from the time of the last access to
+             * the "current" time. Since catcacheclock is not advanced within
+             * a transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+            j = 0;
+            while (j < SYSCACHE_STATS_NAGECLASSES - 1 &&
+                   entry_age > stats->ageclasses[j])
+                j++;
+
+            stats->nclass_entries[j]++;
+        }
+    }
+}
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index ac98c19155..7b38a06708 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -20,6 +20,9 @@
  */
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "catalog/indexing.h"
@@ -1534,6 +1537,27 @@ RelationSupportsSysCache(Oid relid)
     return false;
 }
 
+/*
+ * SysCacheGetStats - returns stats of specified syscache
+ *
+ * This routine returns the address of its local static memory.
+ */
+SysCacheStats *
+SysCacheGetStats(int cacheId)
+{
+    static SysCacheStats stats;
+
+    Assert(cacheId >=0 && cacheId < SysCacheSize);
+
+    memset(&stats, 0, sizeof(stats));
+
+    stats.reloid = cacheinfo[cacheId].reloid;
+    stats.indoid = cacheinfo[cacheId].indoid;
+
+    CatCacheGetStats(SysCache[cacheId], &stats);
+
+    return &stats;
+}
 
 /*
  * OID comparator for pg_qsort
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index fd51934aaf..f039ecd805 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index c0b6231458..dee7f19475 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleSyscacheStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_CATCACHE_UPDATE_TIMEOUT,
+                        IdleSyscacheStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1239,6 +1242,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleSyscacheStatsUpdateTimeoutHandler(void)
+{
+    IdleSyscacheStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 06c589f725..32e41253a6 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3168,6 +3168,16 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"track_syscache_usage_interval", PGC_SUSET, STATS_COLLECTOR,
+            gettext_noop("Sets the interval between syscache usage collection, in milliseconds. Zero disables syscache
usagetracking."),
 
+            NULL
+        },
+        &pgstat_track_syscache_usage_interval,
+        0, 0, INT_MAX / 2,
+        NULL, NULL, NULL
+    },
+
     {
         {"gin_pending_list_limit", PGC_USERSET, CLIENT_CONN_STATEMENT,
             gettext_noop("Sets the maximum size of the pending list for GIN index."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 108d332f2c..4d4fb42251 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -560,6 +560,7 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
+#track_syscache_usage_interval = 0    # zero disables tracking
 #stats_temp_directory = 'pg_stat_tmp'
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b8de13f03b..6099a828d2 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -9669,6 +9669,15 @@
   proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
  proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}',
   prosrc => 'pg_get_replication_slots' },
+{ oid => '3425',
+  descr => 'syscache statistics',
+  proname => 'pg_get_syscache_stats', prorows => '100', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', prorettype => 'record',
+  proargtypes => 'int4',
+  proallargtypes => '{int4,oid,oid,int8,int8,int8,int8,int8,_int4,timestamptz}',
+  proargmodes => '{i,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}',
+  prosrc => 'pgstat_get_syscache_stats' },
 { oid => '3786', descr => 'set up a logical replication slot',
   proname => 'pg_create_logical_replication_slot', provolatile => 'v',
   proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c9e35003a5..69b9a976f0 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 88a75fb798..b6bfd7d644 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1144,6 +1144,7 @@ extern bool pgstat_track_activities;
 extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
+extern int    pgstat_track_syscache_usage_interval;
 extern char *pgstat_stat_directory;
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
@@ -1228,7 +1229,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
-
+extern void pgstat_get_syscachestat_filename(bool permanent,
+                    bool tempname, int backendid, char *filename, int len);
 /* ----------
  * pgstat_report_wait_start() -
  *
@@ -1363,5 +1365,5 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
-
+extern long pgstat_write_syscache_stats(bool force);
 #endif                            /* PGSTAT_H */
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 5d24809900..4d51975920 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -65,10 +65,8 @@ typedef struct catcache
     int            cc_tupsize;        /* total amount of catcache tuples */
 
     /*
-     * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
-     * doesn't break ABI for other modules
+     * Statistics entries
      */
-#ifdef CATCACHE_STATS
     long        cc_searches;    /* total # searches against this cache */
     long        cc_hits;        /* # of matches against existing entry */
     long        cc_neg_hits;    /* # of matches against negative entry */
@@ -81,7 +79,6 @@ typedef struct catcache
     long        cc_invals;        /* # of entries invalidated from cache */
     long        cc_lsearches;    /* total # list-searches */
     long        cc_lhits;        /* # of matches against existing lists */
-#endif
 } CatCache;
 
 
@@ -254,4 +251,8 @@ extern void PrepareToInvalidateCacheTuple(Relation relation,
 extern void PrintCatCacheLeakWarning(HeapTuple tuple);
 extern void PrintCatCacheListLeakWarning(CatCList *list);
 
+/* defined in syscache.h */
+typedef struct syscachestats SysCacheStats;
+extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats);
+
 #endif                            /* CATCACHE_H */
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index 95ee48954e..71b399c902 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -112,6 +112,24 @@ enum SysCacheIdentifier
 #define SysCacheSize (USERMAPPINGUSERSERVER + 1)
 };
 
+#define SYSCACHE_STATS_NAGECLASSES 6
+/* Struct for catcache tracking information */
+typedef struct syscachestats
+{
+    Oid        reloid;            /* target relation */
+    Oid        indoid;            /* index */
+    size_t    size;            /* size of the catcache */
+    int        ntuples;        /* number of tuples residing in the catcache */
+    int        nsearches;        /* number of searches */
+    int        nhits;            /* number of cache hits */
+    int        nneg_hits;        /* number of negative cache hits */
+    /* age classes in seconds */
+    int        ageclasses[SYSCACHE_STATS_NAGECLASSES];
+    /* number of tuples that fall into the corresponding age class */
+    int        nclass_entries[SYSCACHE_STATS_NAGECLASSES];
+} SysCacheStats;
+
+
 extern void InitCatalogCache(void);
 extern void InitCatalogCachePhase2(void);
 
@@ -164,6 +182,7 @@ extern void SysCacheInvalidate(int cacheId, uint32 hashValue);
 extern bool RelationInvalidatesSnapshotsOnly(Oid relid);
 extern bool RelationHasSysCache(Oid relid);
 extern bool RelationSupportsSysCache(Oid relid);
+extern SysCacheStats *SysCacheGetStats(int cacheId);
 
 /*
  * The use of the macros below rather than direct calls to the corresponding
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 9244a2a7b7..0ab441a364 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_CATCACHE_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2c8e21baa7..7bd77e9972 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1921,6 +1921,28 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
     pg_stat_all_tables.autoanalyze_count
    FROM pg_stat_all_tables
  WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
 
+pg_stat_syscache| SELECT s.pid,
+    (s.relid)::regclass AS relname,
+    (s.indid)::regclass AS cache_name,
+    s.size,
+    s.ntup AS ntuples,
+    s.searches,
+    s.hits,
+    s.neg_hits,
+    s.ageclass,
+    s.last_update
+   FROM (pg_stat_activity a
+     JOIN LATERAL ( SELECT a.pid,
+            pg_get_syscache_stats.relid,
+            pg_get_syscache_stats.indid,
+            pg_get_syscache_stats.size,
+            pg_get_syscache_stats.ntup,
+            pg_get_syscache_stats.searches,
+            pg_get_syscache_stats.hits,
+            pg_get_syscache_stats.neg_hits,
+            pg_get_syscache_stats.ageclass,
+            pg_get_syscache_stats.last_update
+           FROM pg_get_syscache_stats(a.pid) pg_get_syscache_stats(relid, indid, size, ntup, searches, hits, neg_hits, ageclass, last_update)) s ON ((a.pid = s.pid)));
 
 pg_stat_user_functions| SELECT p.oid AS funcid,
     n.nspname AS schemaname,
     p.proname AS funcname,
@@ -2352,7 +2374,7 @@ pg_settings|pg_settings_n|CREATE RULE pg_settings_n AS
     ON UPDATE TO pg_catalog.pg_settings DO INSTEAD NOTHING;
 pg_settings|pg_settings_u|CREATE RULE pg_settings_u AS
     ON UPDATE TO pg_catalog.pg_settings
-   WHERE (new.name = old.name) DO  SELECT set_config(old.name, new.setting, false) AS set_config;
+   WHERE (new.name = old.name) DO  SELECT set_config(old.name, new.setting, false, false) AS set_config;
 rtest_emp|rtest_emp_del|CREATE RULE rtest_emp_del AS
     ON DELETE TO public.rtest_emp DO  INSERT INTO rtest_emplog (ename, who, action, newsal, oldsal)
   VALUES (old.ename, CURRENT_USER, 'fired'::bpchar, '$0.00'::money, old.salary);
-- 
2.16.3
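
For reference, a minimal way to exercise the statistics view added by the
patch above; this is an illustrative sketch only, assuming the patch is
applied and the session is allowed to change the PGC_SUSET parameter:

-- enable collection for this session, every 10 seconds (arbitrary value)
SET track_syscache_usage_interval = 10000;

-- largest per-backend catcaches, using the columns exposed by pg_stat_syscache
SELECT pid, relname, cache_name, size, ntuples, searches, hits, neg_hits
  FROM pg_stat_syscache
 ORDER BY size DESC
 LIMIT 10;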

From 5be729e44acf3f9c94dd9d13fa84cb4ae598406f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 6 Feb 2019 14:36:29 +0900
Subject: [PATCH 3/4] PoC add prune-by-number-of-entries feature

Adds pruning based on the number of cache entries on top of the current
pruning patch. It is controlled by two GUC variables.

cache_entry_limit: limit of the number of entries per catcache
cache_entry_limit_prune_ratio: how much of entries to remove at pruning
---
 src/backend/utils/cache/catcache.c | 100 ++++++++++++++++++++++++++++++++++++-
 src/backend/utils/misc/guc.c       |  40 +++++++++++++++
 src/include/utils/catcache.h       |   2 +
 3 files changed, 141 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 4ccda06795..d15eac87d8 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -77,6 +77,11 @@
  */
 int cache_memory_target = 0;
 
+
+/* PoC entry limit */
+int cache_entry_limit = 0;
+double cache_entry_limit_prune_ratio = 0.8;
+
 /* GUC variable to define the minimum age of entries that will be considered to
  * be evicted in seconds. This variable is shared among various cache
  * mechanisms.
@@ -882,6 +887,95 @@ InitCatCache(int id,
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntriesByNum - 
+ *    PoC: remove infrequently-used entries by number of entries.
+ */
+static bool
+CatCacheCleanupOldEntriesByNum(CatCache *cp, int cache_entry_limit)
+{
+    int            i;
+    int         n;
+    int            oldndelelem = cp->cc_ntup;
+    int            ndelelem;
+    CatCTup        **ct_array;
+
+    ndelelem = oldndelelem - (int)(cache_entry_limit * cache_entry_limit_prune_ratio);
+
+    /* lower limit: quite arbitrary */
+    if (ndelelem < 256)
+        ndelelem = 256;
+
+    /*
+     * partial sort array: [0] contains latest access entry
+     * Partial sort array, kept as a max-heap on lastaccess:
+     * [0] contains the latest-accessed entry among the collected candidates.
+    ct_array = (CatCTup **) palloc(ndelelem * sizeof(CatCTup*));
+    n = 0;
+
+    /*
+     * Collect the entries to be removed, i.e. those with the oldest
+     * lastaccess, using a bounded heap as in tuplesort.c.
+     */
+    for (i = 0; i < cp->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+
+            if (n < ndelelem)
+            {
+                int j = n++;
+
+                while (j > 0)
+                {
+                    int i = (j - 1) >> 1;
+
+                    if (ct->lastaccess <= ct_array[i]->lastaccess)
+                        break;
+                    ct_array[j] = ct_array[i];
+                    j = i;
+                }
+                ct_array[j] = ct;
+            }
+            else if (ct->lastaccess < ct_array[0]->lastaccess)
+            {
+                unsigned int i;
+
+                i = 0;
+
+                for (;;)
+                {
+                    unsigned int j = 2 * i + 1;
+
+                    if (j >= n)
+                        break;
+                    if (j + 1 < n &&
+                        ct_array[j]->lastaccess < ct_array[j + 1]->lastaccess)
+                        j++;
+                    if (ct->lastaccess >= ct_array[j]->lastaccess)
+                        break;
+                    ct_array[i] = ct_array[j];
+                    i = j;
+                }
+                ct_array[i] = ct;
+            }
+        }
+    }
+
+    /* Now we have the list of elements to be deleted */
+    for (i = 0 ; i < n ; i++)
+        CatCacheRemoveCTup(cp, ct_array[i]);
+
+    pfree(ct_array);
+
+    elog(LOG, "Catcache pruned by entry number: id=%d, %d => %d", cp->id, oldndelelem, cp->cc_ntup);
+
+    return true;
+}
+
 /*
  * CatCacheCleanupOldEntries - Remove infrequently-used entries
  *
@@ -923,7 +1017,7 @@ CatCacheCleanupOldEntries(CatCache *cp)
     hash_size = cp->cc_nbuckets * sizeof(dlist_head);
     if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L)
         return false;
-    
+
     /*
      * Search the whole hash for entries to remove. This is quite a time
      * consuming task during catcache lookup, but acceptable since now we are
@@ -2049,6 +2143,10 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     CacheHdr->ch_ntup++;
     cache->cc_tupsize += tupsize;
 
+    /* cap number of entries */
+    if (cache_entry_limit > 0 && cache->cc_ntup > cache_entry_limit)
+        CatCacheCleanupOldEntriesByNum(cache, cache_entry_limit);
+    
     /*
      * If the hash table has become too full, try cleanup by removing
     * infrequently used entries to make room for the new entry. If that
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 32e41253a6..7bb239a07e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2227,6 +2227,36 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"cache_entry_limit", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the maximum entries of catcache."),
+             NULL
+        },
+        &cache_entry_limit,
+        0, 0, INT_MAX,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"cache_entry_limit", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the maximum entries of catcache."),
+             NULL
+        },
+        &cache_entry_limit,
+        0, 0, INT_MAX,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"cache_entry_limit", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the maximum entries of catcache."),
+             NULL
+        },
+        &cache_entry_limit,
+        0, 0, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
@@ -3401,6 +3431,16 @@ static struct config_real ConfigureNamesReal[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"cache_entry_limit_prune_ratio", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the fraction of cache_entry_limit to which a catcache is pruned."),
+             NULL
+        },
+        &cache_entry_limit_prune_ratio,
+        0.8, 0.0, 1.0,
+        NULL, NULL, NULL
+    },
+
     /* End-of-list marker */
     {
         {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 4d51975920..1f7fb51ac0 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -193,6 +193,8 @@ extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 /* for guc.c, not PGDLLIMPORT'ed */
 extern int cache_prune_min_age;
 extern int cache_memory_target;
+extern int cache_entry_limit;
+extern double cache_entry_limit_prune_ratio;
 
 /* to use as access timestamp of catcache entries */
 extern TimestampTz catcacheclock;
-- 
2.16.3
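
As a usage sketch only (the values are arbitrary), the PoC above is driven
entirely by the two GUC variables it adds, so it can be exercised per
session like this:

-- prune a catcache once it holds more than 10000 entries ...
SET cache_entry_limit = 10000;
-- ... pruning it back down to roughly cache_entry_limit * ratio entries
SET cache_entry_limit_prune_ratio = 0.7;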


Re: Protect syscache from bloating with negative cache entries

From
Kyotaro HORIGUCHI
Date:
At Wed, 06 Feb 2019 14:43:34 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20190206.144334.193118280.horiguchi.kyotaro@lab.ntt.co.jp>
> At Tue, 5 Feb 2019 02:40:35 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in
<0A3221C70F24FB45833433255569204D1FB93A16@G01JPEXMBYT05>
> > From: bruce@momjian.us [mailto:bruce@momjian.us]
> > > On Mon, Feb 4, 2019 at 08:23:39AM +0000, Tsunakawa, Takayuki wrote:
> > > > Horiguchi-san, Bruce, all, So, why don't we make
> > > > syscache_memory_target the upper limit on the total size of all
> > > > catcaches, and rethink the past LRU management?
> > > 
> > > I was going to say that our experience with LRU has been that the
> > > overhead is not worth the value, but that was in shared resource cases,
> > > which this is not.
> > 
> > That's good news!  Then, let's proceed with the approach involving LRU, Horiguchi-san, Ideriha-san.
> 
> If you mean an accessed-time-ordered list of entries by "LRU", I
> still object to involving it since it is too complex in the search
> code paths. Invalidation would make things more complex. The
> current patch sorts entries by ct->lastaccess and discards
> entries not accessed for more than the threshold, only when doubling
> the cache capacity. It is already a kind of LRU in behavior.
> 
> This patch intends not to let caches bloat by unnecessary
> entries, which are negative ones at first, then less-accessed ones
> currently. If you mean by "LRU" something to put a hard limit on
> the number or size of a catcache or all caches, it would be
> doable by adding a sort phase before pruning, like
> CatCacheCleanupOldEntriesByNum() in the attached PoC (first
> attached), as food for discussion.
> 
> With the second attached script, we can observe what is happening
> from another session by the following query.
> 
> select relname, size, ntuples, ageclass from pg_stat_syscache where relname = 'pg_statistic'::regclass;
> 
> > pg_statistic | 1041024 |    7109 | {{1,1109},{3,0},{30,0},{60,0},{90,6000},{0,0
> 
> On the other hand, unlike the original pruning, this happens
> independently of hash resizing, so it will cause another observable
> intermittent slowdown, separate from the one caused by rehashing.
> 
> The two should have the same extent of impact on performance when
> disabled. I'll take some quick numbers using pgbench.

Sorry, I forgot to consider references in the previous patch, and
attach the test script.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
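
To make the ageclass column shown above easier to read, one possible query
is sketched below; it assumes the two-dimensional array layout produced by
the patch, where each inner pair is {age class boundary in seconds, number
of entries}:

SELECT relname,
       ageclass[i][1] AS age_boundary_s,
       ageclass[i][2] AS n_entries
  FROM pg_stat_syscache,
       generate_subscripts(ageclass, 1) AS i
 WHERE relname = 'pg_statistic'::regclass;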
From 952497a1fad57ac49e0b772a147201aa31065183 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 16 Oct 2018 13:04:30 +0900
Subject: [PATCH 1/4] Remove entries that haven't been used for a certain time

Catcache entries can be left unused for long periods for several reasons,
and it is not desirable that they eat up memory. This patch adds
consideration of removing entries that haven't been used for a certain
time before enlarging the hash array.
---
 doc/src/sgml/config.sgml                      |  38 ++++++
 src/backend/access/transam/xact.c             |   5 +
 src/backend/utils/cache/catcache.c            | 164 ++++++++++++++++++++++++--
 src/backend/utils/misc/guc.c                  |  23 ++++
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/utils/catcache.h                  |  28 ++++-
 6 files changed, 252 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 9b7a7388d5..d0d2374944 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1662,6 +1662,44 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target">
+      <term><varname>syscache_memory_target</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_memory_target</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to which syscache is expanded
+        without pruning. The value defaults to 0, indicating that pruning is
+        always considered. After exceeding this size, syscache pruning is
+        considered according to
+        <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep
+        certain amount of syscache entries with intermittent usage, try
+        a certain amount of syscache entries with intermittent usage, try
+        increasing this setting.
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age">
+      <term><varname>syscache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the minimum amount of time in seconds for which a syscache
+        entry must be left unused before it is considered for removal. -1
+        indicates that syscache pruning is disabled entirely. The value defaults to 600 seconds
+        (<literal>10 minutes</literal>). The syscache entries that are not
+        used for the duration can be removed to prevent syscache bloat. This
+        behavior is suppressed until the size of syscache exceeds
+        <xref linkend="guc-syscache-memory-target"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 92bda87804..ddc433c59e 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -734,7 +734,12 @@ void
 SetCurrentStatementStartTimestamp(void)
 {
     if (!IsParallelWorker())
+    {
         stmtStartTimestamp = GetCurrentTimestamp();
+
+        /* Set this timestamp as the approximate current time */
+        SetCatCacheClock(stmtStartTimestamp);
+    }
     else
         Assert(stmtStartTimestamp != 0);
 }
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 258a1d64cc..769e173844 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -71,9 +71,24 @@
 #define CACHE6_elog(a,b,c,d,e,f,g)
 #endif
 
+/*
+ * GUC variable to define the minimum size of hash to consider entry eviction.
+ * This variable is shared among various cache mechanisms.
+ */
+int cache_memory_target = 0;
+
+/* GUC variable to define the minimum age of entries that will be considered to
+ * be evicted in seconds. This variable is shared among various cache
+ * mechanisms.
+ */
+int cache_prune_min_age = 600;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Timestamp used for any operation on caches. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -490,6 +505,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
         CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys,
                          cache->cc_keyno, ct->keys);
 
+    cache->cc_tupsize -= ct->size;
     pfree(ct);
 
     --cache->cc_ntup;
@@ -841,6 +857,7 @@ InitCatCache(int id,
     cp->cc_nkeys = nkeys;
     for (i = 0; i < nkeys; ++i)
         cp->cc_keyno[i] = key[i];
+    cp->cc_tupsize = 0;
 
     /*
      * new cache is initialized as far as we can go for now. print some
@@ -858,9 +875,127 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize catcache reference clock if it hasn't been done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left alone for several reasons. We remove them if
+ * they are not accessed for a certain time to prevent catcache from
+ * bloating. The eviction is performed with an algorithm similar to buffer
+ * eviction, using an access counter. Entries that are accessed several times can
+ * live longer than those that have had no access in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int            i;
+    int            nremoved = 0;
+    size_t        hash_size;
+#ifdef CATCACHE_STATS
+    /* These variables are only for debugging purpose */
+    int            ntotal = 0;
+    /*
+     * nth element in nentries stores the number of cache entries that have
+     * lived unaccessed for the corresponding multiple, given in ageclass, of
+     * cache_prune_min_age. The index of nremoved_entry is the value of the
+     * clock-sweep counter, which takes from 0 up to 2.
+     */
+    double        ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+    int            nentries[] = {0, 0, 0, 0, 0, 0};
+    int            nremoved_entry[3] = {0, 0, 0};
+    int            j;
+#endif
+
+    /* Return immediately if no pruning is wanted */
+    if (cache_prune_min_age < 0)
+        return false;
+
+    /*
+     * Return without pruning if the size of the hash is below the target.
+     */
+    hash_size = cp->cc_nbuckets * sizeof(dlist_head);
+    if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L)
+        return false;
+    
+    /* Search the whole hash for entries to remove */
+    for (i = 0; i < cp->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+
+            /*
+             * Calculate the duration from the time of the last access to the
+             * "current" time. Since catcacheclock is not advanced within a
+             * transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+#ifdef CATCACHE_STATS
+            /* count catcache entries for each age class */
+            ntotal++;
+            for (j = 0 ;
+                 ageclass[j] != 0.0 &&
+                     entry_age > cache_prune_min_age * ageclass[j] ;
+                 j++);
+            if (ageclass[j] == 0.0) j--;
+            nentries[j]++;
+#endif
+
+            /*
+             * Try to remove entries older than cache_prune_min_age seconds.
+             * Entries that have not been accessed since the last pruning are
+             * removed after that duration, while entries accessed several
+             * times are removed only after being left unused for up to three
+             * times that duration. We don't try to shrink the buckets since
+             * pruning effectively caps catcache expansion in the long term.
+             */
+            if (entry_age > cache_prune_min_age)
+            {
+#ifdef CATCACHE_STATS
+                Assert (ct->naccess >= 0 && ct->naccess <= 2);
+                nremoved_entry[ct->naccess]++;
+#endif
+                if (ct->naccess > 0)
+                    ct->naccess--;
+                else if (ct->refcount == 0 &&
+                         (!ct->c_list || ct->c_list->refcount == 0))
+                {
+                    CatCacheRemoveCTup(cp, ct);
+                    nremoved++;
+                }
+            }
+        }
+    }
+
+#ifdef CATCACHE_STATS
+    ereport(DEBUG1,
+            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)",
+                     nremoved, ntotal,
+                     ageclass[0] * cache_prune_min_age, nentries[0],
+                     ageclass[1] * cache_prune_min_age, nentries[1],
+                     ageclass[2] * cache_prune_min_age, nentries[2],
+                     ageclass[3] * cache_prune_min_age, nentries[3],
+                     ageclass[4] * cache_prune_min_age, nentries[4],
+                     nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]),
+             errhidestmt(true)));
+#endif
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -1274,6 +1409,11 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /* Update access information for pruning */
+        if (ct->naccess < 2)
+            ct->naccess++;
+        ct->lastaccess = catcacheclock;
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1819,11 +1959,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     CatCTup    *ct;
     HeapTuple    dtp;
     MemoryContext oldcxt;
+    int            tupsize = 0;
 
     /* negative entries have no tuple associated */
     if (ntp)
     {
         int            i;
 
         Assert(!negative);
 
@@ -1842,13 +1984,14 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
         /* Allocate memory for CatCTup and the cached tuple in one go */
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
 
-        ct = (CatCTup *) palloc(sizeof(CatCTup) +
-                                MAXIMUM_ALIGNOF + dtp->t_len);
+        tupsize = sizeof(CatCTup) +    MAXIMUM_ALIGNOF + dtp->t_len;
+        ct = (CatCTup *) palloc(tupsize);
         ct->tuple.t_len = dtp->t_len;
         ct->tuple.t_self = dtp->t_self;
         ct->tuple.t_tableOid = dtp->t_tableOid;
         ct->tuple.t_data = (HeapTupleHeader)
             MAXALIGN(((char *) ct) + sizeof(CatCTup));
+        ct->size = tupsize;
         /* copy tuple contents */
         memcpy((char *) ct->tuple.t_data,
                (const char *) dtp->t_data,
@@ -1876,8 +2019,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     {
         Assert(negative);
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
-        ct = (CatCTup *) palloc(sizeof(CatCTup));
-
+        tupsize = sizeof(CatCTup);
+        ct = (CatCTup *) palloc(tupsize);
         /*
          * Store keys - they'll point into separately allocated memory if not
          * by-value.
@@ -1898,17 +2041,24 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
+    ct->size = tupsize;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
     cache->cc_ntup++;
     CacheHdr->ch_ntup++;
+    cache->cc_tupsize += tupsize;
 
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, try cleanup by removing
+     * infrequently used entries to make room for the new entry. If that
+     * fails, enlarge the bucket array instead.  Quite arbitrarily, we try
+     * this when fill factor > 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
 
     return ct;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8681ada33a..06c589f725 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -81,6 +81,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/memutils.h"
@@ -2204,6 +2205,28 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"cache_memory_target", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum syscache size to keep."),
+            gettext_noop("Cache is not pruned before exceeding this size."),
+            GUC_UNIT_KB
+        },
+        &cache_memory_target,
+        0, 0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum unused duration of cache entries before removal."),
+            gettext_noop("Cache entries that live unused for longer than this many seconds are considered to be removed."),
+            GUC_UNIT_S
+        },
+        &cache_prune_min_age,
+        600, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c7f53470df..108d332f2c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -128,6 +128,8 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#cache_memory_target = 0kB    # in kB
+#cache_prune_min_age = 600s    # -1 disables pruning
 #max_stack_depth = 2MB            # min 100kB
 #shared_memory_type = mmap        # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 65d816a583..5d24809900 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,7 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
+    int            cc_tupsize;        /* total amount of catcache tuples */
 
     /*
      * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,7 +121,9 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
-
+    int            naccess;        /* # of accesses to this entry, up to 2 */
+    TimestampTz    lastaccess;        /* approx. timestamp of the last usage */
+    int            size;            /* palloc'ed size of this tuple */
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
      * catcache is list-searched with varying numbers of keys, we may have to
@@ -189,6 +193,28 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int cache_prune_min_age;
+extern int cache_memory_target;
+
+/* to use as access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/*
+ * SetCatCacheClock - set timestamp for catcache access record
+ */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
+static inline TimestampTz
+GetCatCacheClock(void)
+{
+    return catcacheclock;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.16.3
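
As a configuration sketch only: the pruning added by this patch is driven by
the two GUC variables registered in guc.c above (note that guc.c registers
them as cache_*, while the sgml documentation text refers to syscache_*),
for example:

-- consider entries unused for five minutes or more for removal
SET cache_prune_min_age = '300s';
-- but do not prune at all while a catcache stays below this size
SET cache_memory_target = '10MB';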

From 19f9ec0a86f9d0a86e54a39188dd8e75a7d8061a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 16 Oct 2018 15:48:28 +0900
Subject: [PATCH 2/4] Syscache usage tracking feature.

Collects syscache usage statistics and shows them using the view
pg_stat_syscache. The feature is controlled by the GUC variable
track_syscache_usage_interval.
---
 doc/src/sgml/config.sgml                      |  15 ++
 src/backend/catalog/system_views.sql          |  17 +++
 src/backend/postmaster/pgstat.c               | 201 ++++++++++++++++++++++++--
 src/backend/tcop/postgres.c                   |  23 +++
 src/backend/utils/adt/pgstatfuncs.c           | 134 +++++++++++++++++
 src/backend/utils/cache/catcache.c            | 115 +++++++++++----
 src/backend/utils/cache/syscache.c            |  24 +++
 src/backend/utils/init/globals.c              |   1 +
 src/backend/utils/init/postinit.c             |  11 ++
 src/backend/utils/misc/guc.c                  |  10 ++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/catalog/pg_proc.dat               |   9 ++
 src/include/miscadmin.h                       |   1 +
 src/include/pgstat.h                          |   6 +-
 src/include/utils/catcache.h                  |   9 +-
 src/include/utils/syscache.h                  |  19 +++
 src/include/utils/timeout.h                   |   1 +
 src/test/regress/expected/rules.out           |  24 ++-
 18 files changed, 576 insertions(+), 45 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d0d2374944..5ff3ebeb4e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6687,6 +6687,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-syscache-usage-interval" xreflabel="track_syscache_usage_interval">
+      <term><varname>track_syscache_usage_interval</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>track_syscache_usage_interval</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the interval, in milliseconds, at which system cache usage
+        statistics are collected. This parameter is 0 by default, which
+        disables collection. Only superusers can change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-io-timing" xreflabel="track_io_timing">
       <term><varname>track_io_timing</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 3e229c693c..f5d1aaf96f 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -906,6 +906,22 @@ CREATE VIEW pg_stat_progress_vacuum AS
     FROM pg_stat_get_progress_info('VACUUM') AS S
         LEFT JOIN pg_database D ON S.datid = D.oid;
 
+CREATE VIEW pg_stat_syscache AS
+    SELECT
+        S.pid                AS pid,
+        S.relid::regclass    AS relname,
+        S.indid::regclass    AS cache_name,
+        S.size                AS size,
+        S.ntup                AS ntuples,
+        S.searches            AS searches,
+        S.hits                AS hits,
+        S.neg_hits            AS neg_hits,
+        S.ageclass            AS ageclass,
+        S.last_update        AS last_update
+    FROM pg_stat_activity A
+    JOIN LATERAL (SELECT A.pid, * FROM pg_get_syscache_stats(A.pid)) S
+        ON (A.pid = S.pid);
+
 CREATE VIEW pg_user_mappings AS
     SELECT
         U.oid       AS umid,
@@ -1185,6 +1201,7 @@ GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_get_syscache_stats(int) TO pg_monitor;
 
 GRANT pg_read_all_settings TO pg_monitor;
 GRANT pg_read_all_stats TO pg_monitor;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 81c6499251..a1939958b7 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -66,6 +66,7 @@
 #include "utils/ps_status.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 
@@ -124,6 +125,7 @@
 bool        pgstat_track_activities = false;
 bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
+int            pgstat_track_syscache_usage_interval = 0;
 int            pgstat_track_activity_query_size = 1024;
 
 /* ----------
@@ -236,6 +238,11 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
+/* bitmap symbols to specify which stats file types to remove */
+#define PGSTAT_REMFILE_DBSTAT    1        /* remove only database stats files */
+#define PGSTAT_REMFILE_SYSCACHE    2        /* remove only syscache stats files */
+#define PGSTAT_REMFILE_ALL        3        /* remove both types of files */
+
 /*
  * Info about current "snapshot" of stats file
  */
@@ -335,6 +342,7 @@ static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
 static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
 static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+static void pgstat_remove_syscache_statsfile(void);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
@@ -630,10 +638,13 @@ startup_failed:
 }
 
 /*
- * subroutine for pgstat_reset_all
+ * remove stats files
+ *
+ * clean up stats files in specified directory. target is one of
+ * PGSTAT_REMFILE_DBSTAT/SYSCACHE/ALL and restricts which files are removed.
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+pgstat_reset_remove_files(const char *directory, int target)
 {
     DIR           *dir;
     struct dirent *entry;
@@ -644,25 +655,39 @@ pgstat_reset_remove_files(const char *directory)
     {
         int            nchars;
         Oid            tmp_oid;
+        int            filetype = 0;
 
         /*
          * Skip directory entries that don't match the file names we write.
          * See get_dbstat_filename for the database-specific pattern.
          */
         if (strncmp(entry->d_name, "global.", 7) == 0)
+        {
+            filetype = PGSTAT_REMFILE_DBSTAT;
             nchars = 7;
+        }
         else
         {
+            char head[2];
+            
             nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
+            (void) sscanf(entry->d_name, "%c%c_%u.%n",
+                          head, head + 1, &tmp_oid, &nchars);
+
             /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
+            if (nchars < 3 || !isdigit(entry->d_name[3]))
                 continue;
+
+            if  (strncmp(head, "db", 2) == 0)
+                filetype = PGSTAT_REMFILE_DBSTAT;
+            else if (strncmp(head, "cc", 2) == 0)
+                filetype = PGSTAT_REMFILE_SYSCACHE;
         }
 
+        /* skip if this is not a target */
+        if ((filetype & target) == 0)
+            continue;
+
         if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
             strcmp(entry->d_name + nchars, "stat") != 0)
             continue;
@@ -683,8 +708,9 @@ pgstat_reset_remove_files(const char *directory)
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_ALL);
+    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY,
+                              PGSTAT_REMFILE_ALL);
 }
 
 #ifdef EXEC_BACKEND
@@ -2963,6 +2989,10 @@ pgstat_beshutdown_hook(int code, Datum arg)
     if (OidIsValid(MyDatabaseId))
         pgstat_report_stat(true);
 
+    /* clear syscache statistics files and temporary settings */
+    if (MyBackendId != InvalidBackendId)
+        pgstat_remove_syscache_statsfile();
+
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
      * before and after.  We use a volatile pointer here to ensure the
@@ -4287,6 +4317,9 @@ PgstatCollectorMain(int argc, char *argv[])
     pgStatRunningInCollector = true;
     pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
 
+    /* Remove left-over syscache stats files */
+    pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_SYSCACHE);
+
     /*
      * Loop to process messages until we get SIGQUIT or detect ungraceful
      * death of our parent postmaster.
@@ -6377,3 +6410,153 @@ pgstat_clip_activity(const char *raw_activity)
 
     return activity;
 }
+
+/*
+ * return the filename for a syscache stat file; filename is the output
+ * buffer, of length len.
+ */
+void
+pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid,
+                                 char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/cc_%u.%s",
+                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
+                       pgstat_stat_directory,
+                       backendid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
+}
+
+/* removes syscache stats files of this backend */
+static void
+pgstat_remove_syscache_statsfile(void)
+{
+    char    fname[MAXPGPATH];
+
+    pgstat_get_syscachestat_filename(false, false, MyBackendId,
+                                     fname, MAXPGPATH);
+    unlink(fname);        /* don't care about the result */
+}
+
+/*
+ * pgstat_write_syscache_stats() -
+ *        Write the syscache statistics files.
+ *
+ * If 'force' is false, this function skips writing a file and returns the
+ * time remaining in the current interval in milliseconds. If 'force' is true,
+ * it writes a file regardless of the remaining time and resets the interval.
+ */
+long
+pgstat_write_syscache_stats(bool force)
+{
+    static TimestampTz last_report = 0;
+    TimestampTz now;
+    long elapsed;
+    long secs;
+    int     usecs;
+    int    cacheId;
+    FILE    *fpout;
+    char    statfile[MAXPGPATH];
+    char    tmpfile[MAXPGPATH];
+
+    /* Return if we don't want it */
+    if (!force && pgstat_track_syscache_usage_interval <= 0)
+    {
+        /* disabled. remove the statistics file if any */
+        if (last_report > 0)
+        {
+            last_report = 0;
+            pgstat_remove_syscache_statsfile();
+        }
+        return 0;
+    }
+    
+    /* Check against the interval */
+    now = GetCurrentTransactionStopTimestamp();
+    TimestampDifference(last_report, now, &secs, &usecs);
+    elapsed = secs * 1000 + usecs / 1000;
+
+    if (!force && elapsed < pgstat_track_syscache_usage_interval)
+    {
+        /* not yet the time, inform the remaining time to the caller */
+        return pgstat_track_syscache_usage_interval - elapsed;
+    }
+
+    /* now update the stats */
+    last_report = now;
+
+    pgstat_get_syscachestat_filename(false, true,
+                                     MyBackendId, tmpfile, MAXPGPATH);
+    pgstat_get_syscachestat_filename(false, false,
+                                     MyBackendId, statfile, MAXPGPATH);
+
+    /*
+     * This function can be called from ProcessInterrupts(). Inhibit recursive
+     * interrupts to avoid recursive entry.
+     */
+    HOLD_INTERRUPTS();
+
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        /*
+         * Failure writing this file is not critical. Just skip this time and
+         * tell caller to wait for the next interval.
+         */
+        RESUME_INTERRUPTS();
+        return pgstat_track_syscache_usage_interval;
+    }
+
+    /* write out every catcache stats */
+    for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++)
+    {
+        SysCacheStats *stats;
+        
+        stats = SysCacheGetStats(cacheId);
+        Assert (stats);
+
+        /* write error is checked later using ferror() */
+        fputc('T', fpout);
+        (void)fwrite(&cacheId, sizeof(int), 1, fpout);
+        (void)fwrite(&last_report, sizeof(TimestampTz), 1, fpout);
+        (void)fwrite(stats, sizeof(*stats), 1, fpout);
+    }
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write syscache statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close syscache statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename syscache statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+
+    RESUME_INTERRUPTS();
+    return 0;
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 36cfd507b2..fb77a0ce4c 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3157,6 +3157,12 @@ ProcessInterrupts(void)
 
     }
 
+    if (IdleSyscacheStatsUpdateTimeoutPending)
+    {
+        IdleSyscacheStatsUpdateTimeoutPending = false;
+        pgstat_write_syscache_stats(true);
+    }
+
     if (ParallelMessagePending)
         HandleParallelMessages();
 }
@@ -3733,6 +3739,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_catcache_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4173,9 +4180,19 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long timeout;
+
                 ProcessCompletedNotifies();
                 pgstat_report_stat(false);
 
+                timeout = pgstat_write_syscache_stats(false);
+
+                if (timeout > 0)
+                {
+                    disable_idle_catcache_update_timeout = true;
+                    enable_timeout_after(IDLE_CATCACHE_UPDATE_TIMEOUT,
+                                         timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4218,6 +4235,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_catcache_update_timeout)
+        {
+            disable_timeout(IDLE_CATCACHE_UPDATE_TIMEOUT, false);
+            disable_idle_catcache_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index b6ba856ebe..6526cfefb4 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -14,6 +14,8 @@
  */
 #include "postgres.h"
 
+#include <sys/stat.h>
+
 #include "access/htup_details.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
@@ -28,6 +30,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/inet.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
@@ -1899,3 +1902,134 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     PG_RETURN_DATUM(HeapTupleGetDatum(
                                       heap_form_tuple(tupdesc, values, nulls)));
 }
+
+Datum
+pgstat_get_syscache_stats(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SYSCACHE_SIZE 9
+    int                    pid     = PG_GETARG_INT32(0);
+    ReturnSetInfo       *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+    TupleDesc            tupdesc;
+    Tuplestorestate    *tupstore;
+    MemoryContext        per_query_ctx;
+    MemoryContext        oldcontext;
+    PgBackendStatus       *beentry;
+    int                    beid;
+    char                fname[MAXPGPATH];
+    FILE                  *fpin;
+    char c;
+
+    if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("set-valued function called in context that cannot accept a set")));
+    if (!(rsinfo->allowedModes & SFRM_Materialize))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("materialize mode required, but it is not " \
+                        "allowed in this context")));
+
+    /* Build a tuple descriptor for our result type */
+    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+        elog(ERROR, "return type must be a row type");
+    
+
+    per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+
+    oldcontext = MemoryContextSwitchTo(per_query_ctx);
+    tupstore = tuplestore_begin_heap(true, false, work_mem);
+    rsinfo->returnMode = SFRM_Materialize;
+    rsinfo->setResult = tupstore;
+    rsinfo->setDesc = tupdesc;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    /* find beentry for given pid*/
+    beentry = NULL;
+    for (beid = 1;
+         (beentry = pgstat_fetch_stat_beentry(beid)) &&
+             beentry->st_procpid != pid ;
+         beid++);
+
+    /*
+     * we silently return empty result on failure or insufficient privileges
+     */
+    if (!beentry ||
+        (!has_privs_of_role(GetUserId(), beentry->st_userid) &&
+         !is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS)))
+        goto no_data;
+
+    pgstat_get_syscachestat_filename(false, false, beid, fname, MAXPGPATH);
+
+    if ((fpin = AllocateFile(fname, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(WARNING,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            fname)));
+        /* also return empty on no statistics file */
+        goto no_data;
+    }
+
+    /* read the statistics file into tuplestore */
+    while ((c = fgetc(fpin)) == 'T')
+    {
+        TimestampTz last_update;
+        SysCacheStats stats;
+        int cacheid;
+        Datum values[PG_GET_SYSCACHE_SIZE];
+        bool nulls[PG_GET_SYSCACHE_SIZE] = {0};
+        Datum datums[SYSCACHE_STATS_NAGECLASSES * 2];
+        bool arrnulls[SYSCACHE_STATS_NAGECLASSES * 2] = {0};
+        int    dims[] = {SYSCACHE_STATS_NAGECLASSES, 2};
+        int lbs[] = {1, 1};
+        ArrayType *arr;
+        int i, j;
+
+        fread(&cacheid, sizeof(int), 1, fpin);
+        fread(&last_update, sizeof(TimestampTz), 1, fpin);
+        if (fread(&stats, 1, sizeof(stats), fpin) != sizeof(stats))
+        {
+            ereport(WARNING, 
+                    (errmsg("corrupted syscache statistics file \"%s\"",
+                            fname)));
+            goto no_data;
+        }
+
+        i = 0;
+        values[i++] = ObjectIdGetDatum(stats.reloid);
+        values[i++] = ObjectIdGetDatum(stats.indoid);
+        values[i++] = Int64GetDatum(stats.size);
+        values[i++] = Int64GetDatum(stats.ntuples);
+        values[i++] = Int64GetDatum(stats.nsearches);
+        values[i++] = Int64GetDatum(stats.nhits);
+        values[i++] = Int64GetDatum(stats.nneg_hits);
+
+        for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++)
+        {
+            datums[j * 2] = Int32GetDatum((int32) stats.ageclasses[j]);
+            datums[j * 2 + 1] = Int32GetDatum((int32) stats.nclass_entries[j]);
+        }            
+
+        arr = construct_md_array(datums, arrnulls, 2, dims, lbs,
+                              INT4OID, sizeof(int32), true, 'i');
+        values[i++] = PointerGetDatum(arr);
+
+        values[i++] = TimestampTzGetDatum(last_update);
+
+        Assert (i == PG_GET_SYSCACHE_SIZE);
+
+        tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+    }
+
+    /* check for the end of file. abandon the result if file is broken */
+    if (c != 'E' || fgetc(fpin) != EOF)
+        tuplestore_clear(tupstore);
+
+    FreeFile(fpin);
+
+no_data:
+    tuplestore_donestoring(tupstore);
+    return (Datum) 0;
+}
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 769e173844..1da1589a5d 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -89,6 +89,10 @@ static CatCacheHeader *CacheHdr = NULL;
 /* Timestamp used for any operation on caches. */
 TimestampTz    catcacheclock = 0;
 
+/* age classes for pruning */
+static double ageclass[SYSCACHE_STATS_NAGECLASSES]
+    = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -619,9 +623,7 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue)
             else
                 CatCacheRemoveCTup(cache, ct);
             CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated");
-#ifdef CATCACHE_STATS
             cache->cc_invals++;
-#endif
             /* could be multiple matches, so keep looking! */
         }
     }
@@ -697,9 +699,7 @@ ResetCatalogCache(CatCache *cache)
             }
             else
                 CatCacheRemoveCTup(cache, ct);
-#ifdef CATCACHE_STATS
             cache->cc_invals++;
-#endif
         }
     }
 }
@@ -906,10 +906,11 @@ CatCacheCleanupOldEntries(CatCache *cp)
      * cache_prune_min_age. The index of nremoved_entry is the value of the
      * clock-sweep counter, which takes from 0 up to 2.
      */
-    double        ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
-    int            nentries[] = {0, 0, 0, 0, 0, 0};
+    int            nentries[SYSCACHE_STATS_NAGECLASSES] = {0, 0, 0, 0, 0, 0};
     int            nremoved_entry[3] = {0, 0, 0};
     int            j;
+
+    Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0);
 #endif
 
     /* Return immediately if no pruning is wanted */
@@ -923,7 +924,11 @@ CatCacheCleanupOldEntries(CatCache *cp)
     if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L)
         return false;
     
-    /* Search the whole hash for entries to remove */
+    /*
+     * Search the whole hash for entries to remove. This is quite a time
+     * consuming task during catcache lookup, but acceptable since we are
+     * going to expand the hash table anyway.
+     */
     for (i = 0; i < cp->cc_nbuckets; i++)
     {
         dlist_mutable_iter iter;
@@ -936,21 +941,21 @@ CatCacheCleanupOldEntries(CatCache *cp)
 
 
             /*
-             * Calculate the duration from the time of the last access to the
-             * "current" time. Since catcacheclock is not advanced within a
-             * transaction, the entries that are accessed within the current
-             * transaction won't be pruned.
+             * Calculate the duration from the time of the last access to
+             * the "current" time. Since catcacheclock is not advanced within
+             * a transaction, the entries that are accessed within the current
+             * transaction always get 0 as the result.
              */
             TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
 
 #ifdef CATCACHE_STATS
             /* count catcache entries for each age class */
             ntotal++;
-            for (j = 0 ;
-                 ageclass[j] != 0.0 &&
-                     entry_age > cache_prune_min_age * ageclass[j] ;
-                 j++);
-            if (ageclass[j] == 0.0) j--;
+
+            j = 0;
+            while (j < SYSCACHE_STATS_NAGECLASSES - 1 &&
+                   entry_age > cache_prune_min_age * ageclass[j])
+                j++;
             nentries[j]++;
 #endif
 
@@ -981,14 +986,17 @@ CatCacheCleanupOldEntries(CatCache *cp)
     }
 
 #ifdef CATCACHE_STATS
+    StaticAssertStmt(SYSCACHE_STATS_NAGECLASSES == 6,
+                     "number of syscache age class must be 6");
     ereport(DEBUG1,
-            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)",
+            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d, rest:%d) naccessed(0:%d, 1:%d, 2:%d)",
                      nremoved, ntotal,
                      ageclass[0] * cache_prune_min_age, nentries[0],
                      ageclass[1] * cache_prune_min_age, nentries[1],
                      ageclass[2] * cache_prune_min_age, nentries[2],
                      ageclass[3] * cache_prune_min_age, nentries[3],
                      ageclass[4] * cache_prune_min_age, nentries[4],
+                     nentries[5],
                      nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]),
              errhidestmt(true)));
 #endif
@@ -1365,9 +1373,7 @@ SearchCatCacheInternal(CatCache *cache,
     if (unlikely(cache->cc_tupdesc == NULL))
         CatalogCacheInitializeCache(cache);
 
-#ifdef CATCACHE_STATS
     cache->cc_searches++;
-#endif
 
     /* Initialize local parameter array */
     arguments[0] = v1;
@@ -1427,9 +1433,7 @@ SearchCatCacheInternal(CatCache *cache,
             CACHE3_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d",
                         cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
             cache->cc_hits++;
-#endif
 
             return &ct->tuple;
         }
@@ -1438,9 +1442,7 @@ SearchCatCacheInternal(CatCache *cache,
             CACHE3_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d",
                         cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
             cache->cc_neg_hits++;
-#endif
 
             return NULL;
         }
@@ -1568,9 +1570,7 @@ SearchCatCacheMiss(CatCache *cache,
     CACHE3_elog(DEBUG2, "SearchCatCache(%s): put in bucket %d",
                 cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
     cache->cc_newloads++;
-#endif
 
     return &ct->tuple;
 }
@@ -1681,9 +1681,7 @@ SearchCatCacheList(CatCache *cache,
 
     Assert(nkeys > 0 && nkeys < cache->cc_nkeys);
 
-#ifdef CATCACHE_STATS
     cache->cc_lsearches++;
-#endif
 
     /* Initialize local parameter array */
     arguments[0] = v1;
@@ -1740,9 +1738,7 @@ SearchCatCacheList(CatCache *cache,
         CACHE2_elog(DEBUG2, "SearchCatCacheList(%s): found list",
                     cache->cc_relname);
 
-#ifdef CATCACHE_STATS
         cache->cc_lhits++;
-#endif
 
         return cl;
     }
@@ -2250,3 +2246,64 @@ PrintCatCacheListLeakWarning(CatCList *list)
          list->my_cache->cc_relname, list->my_cache->id,
          list, list->refcount);
 }
+
+/*
+ * CatCacheGetStats - fill in SysCacheStats struct.
+ *
+ * This is a support routine for SysCacheGetStats and fills in most of the
+ * result. The classification here is based on the same criteria as
+ * CatCacheCleanupOldEntries().
+ */
+void
+CatCacheGetStats(CatCache *cache, SysCacheStats *stats)
+{
+    int    i, j;
+
+    Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0);
+
+    /* fill in the stats struct */
+    stats->size = cache->cc_tupsize + cache->cc_nbuckets * sizeof(dlist_head);
+    stats->ntuples = cache->cc_ntup;
+    stats->nsearches = cache->cc_searches;
+    stats->nhits = cache->cc_hits;
+    stats->nneg_hits = cache->cc_neg_hits;
+
+    /* cache_prune_min_age can be changed within a session, so fill this in every time */
+    for (i = 0 ; i < SYSCACHE_STATS_NAGECLASSES ; i++)
+        stats->ageclasses[i] = (int) (cache_prune_min_age * ageclass[i]);
+
+    /*
+     * The nth element in nclass_entries stores the number of cache entries
+     * that have lived unaccessed for the corresponding ageclass multiple of
+     * cache_prune_min_age.
+     */
+    memset(stats->nclass_entries, 0, sizeof(int) * SYSCACHE_STATS_NAGECLASSES);
+
+    /* Scan the whole hash */
+    for (i = 0; i < cache->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cache->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+            /*
+             * Calculate the duration from the time of the last access to
+             * the "current" time. Since catcacheclock is not advanced within
+             * a transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+            j = 0;
+            while (j < SYSCACHE_STATS_NAGECLASSES - 1 &&
+                   entry_age > stats->ageclasses[j])
+                j++;
+
+            stats->nclass_entries[j]++;
+        }
+    }
+}
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index ac98c19155..7b38a06708 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -20,6 +20,9 @@
  */
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "catalog/indexing.h"
@@ -1534,6 +1537,27 @@ RelationSupportsSysCache(Oid relid)
     return false;
 }
 
+/*
+ * SysCacheGetStats - returns stats of specified syscache
+ *
+ * This routine returns the address of its local static memory.
+ */
+SysCacheStats *
+SysCacheGetStats(int cacheId)
+{
+    static SysCacheStats stats;
+
+    Assert(cacheId >=0 && cacheId < SysCacheSize);
+
+    memset(&stats, 0, sizeof(stats));
+
+    stats.reloid = cacheinfo[cacheId].reloid;
+    stats.indoid = cacheinfo[cacheId].indoid;
+
+    CatCacheGetStats(SysCache[cacheId], &stats);
+
+    return &stats;
+}
 
 /*
  * OID comparator for pg_qsort
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index fd51934aaf..f039ecd805 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index c0b6231458..dee7f19475 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleSyscacheStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_CATCACHE_UPDATE_TIMEOUT,
+                        IdleSyscacheStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1239,6 +1242,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleSyscacheStatsUpdateTimeoutHandler(void)
+{
+    IdleSyscacheStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 06c589f725..32e41253a6 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3168,6 +3168,16 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"track_syscache_usage_interval", PGC_SUSET, STATS_COLLECTOR,
+            gettext_noop("Sets the interval between syscache usage collection, in milliseconds. Zero disables syscache usage tracking."),
+            NULL
+        },
+        &pgstat_track_syscache_usage_interval,
+        0, 0, INT_MAX / 2,
+        NULL, NULL, NULL
+    },
+
     {
         {"gin_pending_list_limit", PGC_USERSET, CLIENT_CONN_STATEMENT,
             gettext_noop("Sets the maximum size of the pending list for GIN index."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 108d332f2c..4d4fb42251 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -560,6 +560,7 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
+#track_syscache_usage_interval = 0    # zero disables tracking
 #stats_temp_directory = 'pg_stat_tmp'
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b8de13f03b..6099a828d2 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -9669,6 +9669,15 @@
   proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
  proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}',
   prosrc => 'pg_get_replication_slots' },
+{ oid => '3425',
+  descr => 'syscache statistics',
+  proname => 'pg_get_syscache_stats', prorows => '100', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', prorettype => 'record',
+  proargtypes => 'int4',
+  proallargtypes => '{int4,oid,oid,int8,int8,int8,int8,int8,_int4,timestamptz}',
+  proargmodes => '{i,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}',
+  prosrc => 'pgstat_get_syscache_stats' },
 { oid => '3786', descr => 'set up a logical replication slot',
   proname => 'pg_create_logical_replication_slot', provolatile => 'v',
   proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c9e35003a5..69b9a976f0 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 88a75fb798..b6bfd7d644 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1144,6 +1144,7 @@ extern bool pgstat_track_activities;
 extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
+extern int    pgstat_track_syscache_usage_interval;
 extern char *pgstat_stat_directory;
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
@@ -1228,7 +1229,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
-
+extern void pgstat_get_syscachestat_filename(bool permanent,
+                    bool tempname, int backendid, char *filename, int len);
 /* ----------
  * pgstat_report_wait_start() -
  *
@@ -1363,5 +1365,5 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
-
+extern long pgstat_write_syscache_stats(bool force);
 #endif                            /* PGSTAT_H */
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 5d24809900..4d51975920 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -65,10 +65,8 @@ typedef struct catcache
     int            cc_tupsize;        /* total size of catcache tuples */
 
     /*
-     * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
-     * doesn't break ABI for other modules
+     * Statistics entries
      */
-#ifdef CATCACHE_STATS
     long        cc_searches;    /* total # searches against this cache */
     long        cc_hits;        /* # of matches against existing entry */
     long        cc_neg_hits;    /* # of matches against negative entry */
@@ -81,7 +79,6 @@ typedef struct catcache
     long        cc_invals;        /* # of entries invalidated from cache */
     long        cc_lsearches;    /* total # list-searches */
     long        cc_lhits;        /* # of matches against existing lists */
-#endif
 } CatCache;
 
 
@@ -254,4 +251,8 @@ extern void PrepareToInvalidateCacheTuple(Relation relation,
 extern void PrintCatCacheLeakWarning(HeapTuple tuple);
 extern void PrintCatCacheListLeakWarning(CatCList *list);
 
+/* defined in syscache.h */
+typedef struct syscachestats SysCacheStats;
+extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats);
+
 #endif                            /* CATCACHE_H */
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index 95ee48954e..71b399c902 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -112,6 +112,24 @@ enum SysCacheIdentifier
 #define SysCacheSize (USERMAPPINGUSERSERVER + 1)
 };
 
+#define SYSCACHE_STATS_NAGECLASSES 6
+/* Struct for catcache tracking information */
+typedef struct syscachestats
+{
+    Oid        reloid;            /* target relation */
+    Oid        indoid;            /* index */
+    size_t    size;            /* size of the catcache */
+    int        ntuples;        /* number of tuples residing in the catcache */
+    int        nsearches;        /* number of searches */
+    int        nhits;            /* number of cache hits */
+    int        nneg_hits;        /* number of negative cache hits */
+    /* age classes in seconds */
+    int        ageclasses[SYSCACHE_STATS_NAGECLASSES];
+    /* number of tuples that fall into the corresponding age class */
+    int        nclass_entries[SYSCACHE_STATS_NAGECLASSES];
+} SysCacheStats;
+
+
 extern void InitCatalogCache(void);
 extern void InitCatalogCachePhase2(void);
 
@@ -164,6 +182,7 @@ extern void SysCacheInvalidate(int cacheId, uint32 hashValue);
 extern bool RelationInvalidatesSnapshotsOnly(Oid relid);
 extern bool RelationHasSysCache(Oid relid);
 extern bool RelationSupportsSysCache(Oid relid);
+extern SysCacheStats *SysCacheGetStats(int cacheId);
 
 /*
  * The use of the macros below rather than direct calls to the corresponding
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 9244a2a7b7..0ab441a364 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_CATCACHE_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2c8e21baa7..7bd77e9972 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1921,6 +1921,28 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
     pg_stat_all_tables.autoanalyze_count
    FROM pg_stat_all_tables
  WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
+pg_stat_syscache| SELECT s.pid,
+    (s.relid)::regclass AS relname,
+    (s.indid)::regclass AS cache_name,
+    s.size,
+    s.ntup AS ntuples,
+    s.searches,
+    s.hits,
+    s.neg_hits,
+    s.ageclass,
+    s.last_update
+   FROM (pg_stat_activity a
+     JOIN LATERAL ( SELECT a.pid,
+            pg_get_syscache_stats.relid,
+            pg_get_syscache_stats.indid,
+            pg_get_syscache_stats.size,
+            pg_get_syscache_stats.ntup,
+            pg_get_syscache_stats.searches,
+            pg_get_syscache_stats.hits,
+            pg_get_syscache_stats.neg_hits,
+            pg_get_syscache_stats.ageclass,
+            pg_get_syscache_stats.last_update
+           FROM pg_get_syscache_stats(a.pid) pg_get_syscache_stats(relid, indid, size, ntup, searches, hits, neg_hits, ageclass, last_update)) s ON ((a.pid = s.pid)));
 pg_stat_user_functions| SELECT p.oid AS funcid,
     n.nspname AS schemaname,
     p.proname AS funcname,
@@ -2352,7 +2374,7 @@ pg_settings|pg_settings_n|CREATE RULE pg_settings_n AS
     ON UPDATE TO pg_catalog.pg_settings DO INSTEAD NOTHING;
 pg_settings|pg_settings_u|CREATE RULE pg_settings_u AS
     ON UPDATE TO pg_catalog.pg_settings
-   WHERE (new.name = old.name) DO  SELECT set_config(old.name, new.setting, false) AS set_config;
+   WHERE (new.name = old.name) DO  SELECT set_config(old.name, new.setting, false, false) AS set_config;
 rtest_emp|rtest_emp_del|CREATE RULE rtest_emp_del AS
     ON DELETE TO public.rtest_emp DO  INSERT INTO rtest_emplog (ename, who, action, newsal, oldsal)
   VALUES (old.ename, CURRENT_USER, 'fired'::bpchar, '$0.00'::money, old.salary);
-- 
2.16.3
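
As a usage sketch for the tracking patch above (GUC, view and function
names are the ones defined in the patch; the interval value is only
illustrative), collection can be switched on cluster-wide and the
numbers read back through the view:

    -- superuser: publish per-backend syscache statistics every second
    ALTER SYSTEM SET track_syscache_usage_interval = 1000;
    SELECT pg_reload_conf();

    -- then, from a session allowed to call pg_get_syscache_stats():
    SELECT pid, relname, cache_name, ntuples, last_update
      FROM pg_stat_syscache;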

From 83444ebafff25babd94c48080b5ba420a27db430 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 6 Feb 2019 14:36:29 +0900
Subject: [PATCH 3/4] PoC add prune-by-number-of-entries feature

Adds pruning based on the number of cache entries on top of the current
pruning patch. It is controlled by two GUC variables.

cache_entry_limit: limit on the number of entries per catcache
cache_entry_limit_prune_ratio: fraction of cache_entry_limit that a
  catcache is pruned down to once the limit is exceeded
---
 src/backend/utils/cache/catcache.c | 107 ++++++++++++++++++++++++++++++++++++-
 src/backend/utils/misc/guc.c       |  20 +++++++
 src/include/utils/catcache.h       |   2 +
 3 files changed, 128 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 1da1589a5d..70ae5da988 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -77,6 +77,11 @@
  */
 int cache_memory_target = 0;
 
+
+/* PoC entry limit */
+int cache_entry_limit = 0;
+double cache_entry_limit_prune_ratio = 0.8;
+
 /* GUC variable to define the minimum age of entries that will be considered to
  * be evicted in seconds. This variable is shared among various cache
  * mechanisms.
@@ -882,6 +887,102 @@ InitCatCache(int id,
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntriesByNum -
+ *    PoC: remove infrequently-used entries based on the number of entries.
+ */
+static bool
+CatCacheCleanupOldEntriesByNum(CatCache *cp, int cache_entry_limit)
+{
+    int            i;
+    int         n;
+    int            oldndelelem = cp->cc_ntup;
+    int            ndelelem;
+    CatCTup        **ct_array;
+
+    ndelelem = oldndelelem - (int)(cache_entry_limit * cache_entry_limit_prune_ratio);
+
+    /* lower limit: quite arbitrary */
+    if (ndelelem < 256)
+        ndelelem = 256;
+
+    /*
+     * partial sort array: [0] contains latest access entry
+     *                     [1] contains earliest access entry
+     */
+    ct_array = (CatCTup **) palloc(ndelelem * sizeof(CatCTup*));
+    n = 0;
+
+    /*
+     * Collect entries to be removed, which have the oldest lastaccess.
+     * Using a bounded heap sort as in tuplesort.c.
+     */
+    for (i = 0; i < cp->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+
+            /* Don't remove referenced entries */
+            if (ct->refcount != 0 ||
+                (ct->c_list && ct->c_list->refcount != 0))
+                continue;
+
+            if (n < ndelelem)
+            {
+                /* Fill up the min heap array */
+                int j = n++;
+
+                while (j > 0)
+                {
+                    int i = (j - 1) >> 1;
+
+                    if (ct->lastaccess >= ct_array[i]->lastaccess)
+                        break;
+                    ct_array[j] = ct_array[i];
+                    j = i;
+                }
+                ct_array[j] = ct;
+            }
+            else if (ct->lastaccess > ct_array[0]->lastaccess)
+            {
+                /* older than the oldest in the array, add it */
+                unsigned int i;
+
+                i = 0;
+
+                for (;;)
+                {
+                    unsigned int j = 2 * i + 1;
+
+                    if (j >= n)
+                        break;
+                    if (j + 1 < n &&
+                        ct_array[j]->lastaccess > ct_array[j + 1]->lastaccess)
+                        j++;
+                    if (ct->lastaccess <= ct_array[j]->lastaccess)
+                        break;
+                    ct_array[i] = ct_array[j];
+                    i = j;
+                }
+                ct_array[i] = ct;
+            }
+        }
+    }
+
+    /* Now we have the list of elements to be deleted */
+    for (i = 0 ; i < n && ct_array[i]; i++)
+        CatCacheRemoveCTup(cp, ct_array[i]);
+
+    pfree(ct_array);
+
+    elog(LOG, "Catcache pruned by entry number: id=%d, %d => %d", cp->id, oldndelelem, cp->cc_ntup);
+
+    return true;
+}
+
 /*
  * CatCacheCleanupOldEntries - Remove infrequently-used entries
  *
@@ -923,7 +1024,7 @@ CatCacheCleanupOldEntries(CatCache *cp)
     hash_size = cp->cc_nbuckets * sizeof(dlist_head);
     if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L)
         return false;
-    
+
     /*
      * Search the whole hash for entries to remove. This is quite a time
      * consuming task during catcache lookup, but acceptable since we are
@@ -2047,6 +2148,10 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     CacheHdr->ch_ntup++;
     cache->cc_tupsize += tupsize;
 
+    /* cap number of entries */
+    if (cache_entry_limit > 0 && cache->cc_ntup > cache_entry_limit)
+        CatCacheCleanupOldEntriesByNum(cache, cache_entry_limit);
+    
     /*
      * If the hash table has become too full, try cleanup by removing
      * infrequently used entries to make room for the new entry. If it
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 32e41253a6..7bb239a07e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2227,6 +2227,16 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"cache_entry_limit", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the maximum number of entries in each catcache."),
+             NULL
+        },
+        &cache_entry_limit,
+        0, 0, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
@@ -3401,6 +3411,16 @@ static struct config_real ConfigureNamesReal[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"cache_entry_limit_prune_ratio", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the fraction of cache_entry_limit that catcache entries are pruned down to."),
+             NULL
+        },
+        &cache_entry_limit_prune_ratio,
+        0.8, 0.0, 1.0,
+        NULL, NULL, NULL
+    },
+
     /* End-of-list marker */
     {
         {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 4d51975920..1f7fb51ac0 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -193,6 +193,8 @@ extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 /* for guc.c, not PGDLLIMPORT'ed */
 extern int cache_prune_min_age;
 extern int cache_memory_target;
+extern int cache_entry_limit;
+extern double cache_entry_limit_prune_ratio;
 
 /* to use as access timestamp of catcache entries */
 extern TimestampTz catcacheclock;
-- 
2.16.3
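
As far as I read this PoC, once a catcache holds more than
cache_entry_limit entries it is pruned back down to roughly
cache_entry_limit * cache_entry_limit_prune_ratio entries.  A minimal
way to try it in a session (the values are arbitrary; the script below
sets similar ones) would be:

    SET cache_entry_limit = 10000;
    SET cache_entry_limit_prune_ratio = 0.8;
    -- optionally also publish the usage statistics (superuser only)
    SET track_syscache_usage_interval = 1000;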

#! /usr/bin/perl

print "set track_syscache_usage_interval to 1000;\n";

## for time-based pruning
#print "set cache_prune_min_age to '5s';\n";
#print "set cache_memory_target to '0';\n";

## for limit-based pruning
print "set cache_memory_target to '100MB';\n";
print "set cache_entry_limit to 10000;\n";
print "set cache_entry_limit_prune_ratio to 0.6;\n";

while (1) {
    print "begin; create temp table t1 (a int, b int, c int, d int, e int, f int, g int, h int, i int, j int) on commit drop;insert into t1 values (1, 2, 3, 4, 5, 6, 7, 8, 9, 10); select * from t1; commit;\n";
}
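
While the script above runs, the effect of the settings can be watched
from another session through the pg_stat_syscache view of the tracking
patch (the script already enables track_syscache_usage_interval for its
own backend); for example:

    SELECT pid, relname, cache_name, ntuples, size, searches, hits, neg_hits
      FROM pg_stat_syscache
     ORDER BY ntuples DESC
     LIMIT 10;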

Re: Protect syscache from bloating with negative cache entries

From
Kyotaro HORIGUCHI
Date:
At Wed, 06 Feb 2019 15:16:53 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20190206.151653.117382256.horiguchi.kyotaro@lab.ntt.co.jp>
> > The two should have the same extent of impact on performance when
> > disabled. I'll take numbers briefly using pgbench.

(pgbench -j 10 -c 10 -T 120) x 5 times for each.

A: unpatched            : 118.58 tps (stddev 0.44)
B: patched-not-used[1]  : 118.41 tps (stddev 0.29)
C: patched-timedprune[2]: 118.41 tps (stddev 0.51)
D: patched-capped...... : none[3]

[1]: cache_prune_min_age = 0, cache_entry_limit = 0

[2]: cache_prune_min_age = 100, cache_entry_limit = 0
     (Prunes every 100ms)

[3] I didn't find a sane benchmark for the capping case using
    vanilla pgbench.
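
For what it's worth, one crude way to exercise the capping path without
pgbench might be to force many distinct failed name lookups in a single
backend, which should leave negative entries behind in RELNAMENSP.
Just an illustration, not part of the attached scripts:

    DO $$
    BEGIN
      FOR i IN 1..100000 LOOP
        PERFORM to_regclass('no_such_rel_' || i);
      END LOOP;
    END;
    $$;

Once the entry count passes cache_entry_limit, the LOG line emitted by
CatCacheCleanupOldEntriesByNum should show the pruning happening.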

It doesn't seem to me to show significant degradation on *my*
box...

# I found a bug that could remove a newly created entry. So v11.

regards.
-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 288f499393a1b6dd8c37781205fd7e553974fa1d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 16 Oct 2018 13:04:30 +0900
Subject: [PATCH 1/4] Remove entries that haven't been used for a certain time

Catcache entries can be left alone for several reasons, and it is not
desirable that they eat up memory. With this patch, removal of entries
that haven't been used for a certain time is considered before the hash
array is enlarged.
---
 doc/src/sgml/config.sgml                      |  38 ++++++
 src/backend/access/transam/xact.c             |   5 +
 src/backend/utils/cache/catcache.c            | 168 ++++++++++++++++++++++++--
 src/backend/utils/misc/guc.c                  |  23 ++++
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/utils/catcache.h                  |  28 ++++-
 6 files changed, 256 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 9b7a7388d5..d0d2374944 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1662,6 +1662,44 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target">
+      <term><varname>syscache_memory_target</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_memory_target</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to which syscache is expanded
+        without pruning. The value defaults to 0, indicating that pruning is
+        always considered. After exceeding this size, syscache pruning is
+        considered according to
+        <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep a
+        certain amount of syscache entries with intermittent usage, try
+        increasing this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age">
+      <term><varname>syscache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the minimum time in seconds for which a syscache entry must
+        remain unused before it is considered for removal. -1 disables
+        syscache pruning entirely. The value defaults to 600 seconds
+        (<literal>10 minutes</literal>). Syscache entries that are not used
+        for this duration can be removed to prevent syscache bloat. This
+        behavior is suppressed until the size of the syscache exceeds
+        <xref linkend="guc-syscache-memory-target"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 92bda87804..ddc433c59e 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -734,7 +734,12 @@ void
 SetCurrentStatementStartTimestamp(void)
 {
     if (!IsParallelWorker())
+    {
         stmtStartTimestamp = GetCurrentTimestamp();
+
+        /* Set this timestamp as the approximate current time */
+        SetCatCacheClock(stmtStartTimestamp);
+    }
     else
         Assert(stmtStartTimestamp != 0);
 }
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 258a1d64cc..5106ed896a 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -71,9 +71,24 @@
 #define CACHE6_elog(a,b,c,d,e,f,g)
 #endif
 
+/*
+ * GUC variable to define the minimum size of hash to consider entry eviction.
+ * This variable is shared among various cache mechanisms.
+ */
+int cache_memory_target = 0;
+
+/* GUC variable to define the minimum age of entries that will be considered to
+ * be evicted in seconds. This variable is shared among various cache
+ * mechanisms.
+ */
+int cache_prune_min_age = 600;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Timestamp used for any operation on caches. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -490,6 +505,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
         CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys,
                          cache->cc_keyno, ct->keys);
 
+    cache->cc_tupsize -= ct->size;
     pfree(ct);
 
     --cache->cc_ntup;
@@ -841,6 +857,7 @@ InitCatCache(int id,
     cp->cc_nkeys = nkeys;
     for (i = 0; i < nkeys; ++i)
         cp->cc_keyno[i] = key[i];
+    cp->cc_tupsize = 0;
 
     /*
      * new cache is initialized as far as we can go for now. print some
@@ -858,9 +875,127 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize the catcache reference clock if not done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left alone for several reasons. We remove them if
+ * they are not accessed for a certain time to prevent catcache from
+ * bloating. The eviction is performed with the similar algorithm with buffer
+ * eviction using access counter. Entries that are accessed several times can
+ * live longer than those that have had no access in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int            i;
+    int            nremoved = 0;
+    size_t        hash_size;
+#ifdef CATCACHE_STATS
+    /* These variables are only for debugging purpose */
+    int            ntotal = 0;
+    /*
+     * The nth element in nentries stores the number of cache entries that
+     * have lived unaccessed for the corresponding ageclass multiple of
+     * cache_prune_min_age. The index of nremoved_entry is the value of the
+     * clock-sweep counter, which takes from 0 up to 2.
+     */
+    double        ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+    int            nentries[] = {0, 0, 0, 0, 0, 0};
+    int            nremoved_entry[3] = {0, 0, 0};
+    int            j;
+#endif
+
+    /* Return immediately if no pruning is wanted */
+    if (cache_prune_min_age < 0)
+        return false;
+
+    /*
+     * Return without pruning if the size of the hash is below the target.
+     */
+    hash_size = cp->cc_nbuckets * sizeof(dlist_head);
+    if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L)
+        return false;
+    
+    /* Search the whole hash for entries to remove */
+    for (i = 0; i < cp->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+
+            /*
+             * Calculate the duration from the time of the last access to the
+             * "current" time. Since catcacheclock is not advanced within a
+             * transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+#ifdef CATCACHE_STATS
+            /* count catcache entries for each age class */
+            ntotal++;
+            for (j = 0 ;
+                 ageclass[j] != 0.0 &&
+                     entry_age > cache_prune_min_age * ageclass[j] ;
+                 j++);
+            if (ageclass[j] == 0.0) j--;
+            nentries[j]++;
+#endif
+
+            /*
+             * Try to remove entries older than cache_prune_min_age seconds.
+             * Entries not accessed since the last pruning are removed after
+             * that many seconds, and entries that have been accessed several
+             * times are removed only after being left unused for up to three
+             * times that duration. We don't try to shrink the buckets since
+             * pruning effectively caps catcache expansion in the long term.
+             */
+            if (entry_age > cache_prune_min_age)
+            {
+#ifdef CATCACHE_STATS
+                Assert (ct->naccess >= 0 && ct->naccess <= 2);
+                nremoved_entry[ct->naccess]++;
+#endif
+                if (ct->naccess > 0)
+                    ct->naccess--;
+                else if (ct->refcount == 0 &&
+                         (!ct->c_list || ct->c_list->refcount == 0))
+                {
+                    CatCacheRemoveCTup(cp, ct);
+                    nremoved++;
+                }
+            }
+        }
+    }
+
+#ifdef CATCACHE_STATS
+    ereport(DEBUG1,
+            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)",
+                     nremoved, ntotal,
+                     ageclass[0] * cache_prune_min_age, nentries[0],
+                     ageclass[1] * cache_prune_min_age, nentries[1],
+                     ageclass[2] * cache_prune_min_age, nentries[2],
+                     ageclass[3] * cache_prune_min_age, nentries[3],
+                     ageclass[4] * cache_prune_min_age, nentries[4],
+                     nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]),
+             errhidestmt(true)));
+#endif
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -1274,6 +1409,11 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /* Update access information for pruning */
+        if (ct->naccess < 2)
+            ct->naccess++;
+        ct->lastaccess = catcacheclock;
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1819,11 +1959,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     CatCTup    *ct;
     HeapTuple    dtp;
     MemoryContext oldcxt;
+    int            tupsize = 0;
 
     /* negative entries have no tuple associated */
     if (ntp)
     {
         int            i;
+        int            tupsize;
 
         Assert(!negative);
 
@@ -1842,13 +1984,14 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
         /* Allocate memory for CatCTup and the cached tuple in one go */
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
 
-        ct = (CatCTup *) palloc(sizeof(CatCTup) +
-                                MAXIMUM_ALIGNOF + dtp->t_len);
+        tupsize = sizeof(CatCTup) +    MAXIMUM_ALIGNOF + dtp->t_len;
+        ct = (CatCTup *) palloc(tupsize);
         ct->tuple.t_len = dtp->t_len;
         ct->tuple.t_self = dtp->t_self;
         ct->tuple.t_tableOid = dtp->t_tableOid;
         ct->tuple.t_data = (HeapTupleHeader)
             MAXALIGN(((char *) ct) + sizeof(CatCTup));
+        ct->size = tupsize;
         /* copy tuple contents */
         memcpy((char *) ct->tuple.t_data,
                (const char *) dtp->t_data,
@@ -1876,8 +2019,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     {
         Assert(negative);
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
-        ct = (CatCTup *) palloc(sizeof(CatCTup));
-
+        tupsize = sizeof(CatCTup);
+        ct = (CatCTup *) palloc(tupsize);
         /*
          * Store keys - they'll point into separately allocated memory if not
          * by-value.
@@ -1898,19 +2041,30 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
+    ct->size = tupsize;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
     cache->cc_ntup++;
     CacheHdr->ch_ntup++;
+    cache->cc_tupsize += tupsize;
 
+    /* increase refcount so that this survives pruning */
+    ct->refcount++;
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, try cleanup by removing
+     * infrequently used entries to make room for the new entry. If it
+     * failed, enlarge the bucket array instead.  Quite arbitrarily, we try
+     * this when fill factor > 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
 
+    ct->refcount--;
+
     return ct;
 }
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8681ada33a..06c589f725 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -81,6 +81,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/memutils.h"
@@ -2204,6 +2205,28 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"cache_memory_target", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum syscache size to keep."),
+            gettext_noop("Cache is not pruned before exceeding this size."),
+            GUC_UNIT_KB
+        },
+        &cache_memory_target,
+        0, 0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum unused duration of cache entries before removal."),
+            gettext_noop("Cache entries that stay unused for longer than this many seconds are considered for removal."),
+            GUC_UNIT_S
+        },
+        &cache_prune_min_age,
+        600, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c7f53470df..108d332f2c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -128,6 +128,8 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#cache_memory_target = 0kB    # in kB
+#cache_prune_min_age = 600s    # -1 disables pruning
 #max_stack_depth = 2MB            # min 100kB
 #shared_memory_type = mmap        # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 65d816a583..5d24809900 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,7 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
+    int            cc_tupsize;        /* total size of catcache tuples */
 
     /*
      * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,7 +121,9 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
-
+    int            naccess;        /* # of accesses to this entry, up to 2 */
+    TimestampTz    lastaccess;        /* approx. timestamp of the last usage */
+    int            size;            /* palloc'ed size of this tuple */
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
      * catcache is list-searched with varying numbers of keys, we may have to
@@ -189,6 +193,28 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int cache_prune_min_age;
+extern int cache_memory_target;
+
+/* to use as access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/*
+ * SetCatCacheClock - set timestamp for catcache access record
+ */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
+static inline TimestampTz
+GetCatCacheClock(void)
+{
+    return catcacheclock;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.16.3
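
For reference, the knobs added by the pruning patch above (0001) can be
tried out per session like this (the values are only illustrative;
pruning of a catcache is considered only once it grows past
cache_memory_target, and only entries unused for longer than
cache_prune_min_age are candidates):

    SET cache_memory_target = '2MB';
    SET cache_prune_min_age = '60s';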

From 1ee885cef5cc66a1246e4929954cdcc1949f162a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 16 Oct 2018 15:48:28 +0900
Subject: [PATCH 2/4] Syscache usage tracking feature.

Collects syscache usage statistics and shows them through the view
pg_stat_syscache. The feature is controlled by the GUC variable
track_syscache_usage_interval.
---
 doc/src/sgml/config.sgml                      |  15 ++
 src/backend/catalog/system_views.sql          |  17 +++
 src/backend/postmaster/pgstat.c               | 201 ++++++++++++++++++++++++--
 src/backend/tcop/postgres.c                   |  23 +++
 src/backend/utils/adt/pgstatfuncs.c           | 134 +++++++++++++++++
 src/backend/utils/cache/catcache.c            | 115 +++++++++++----
 src/backend/utils/cache/syscache.c            |  24 +++
 src/backend/utils/init/globals.c              |   1 +
 src/backend/utils/init/postinit.c             |  11 ++
 src/backend/utils/misc/guc.c                  |  10 ++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/catalog/pg_proc.dat               |   9 ++
 src/include/miscadmin.h                       |   1 +
 src/include/pgstat.h                          |   6 +-
 src/include/utils/catcache.h                  |   9 +-
 src/include/utils/syscache.h                  |  19 +++
 src/include/utils/timeout.h                   |   1 +
 src/test/regress/expected/rules.out           |  24 ++-
 18 files changed, 576 insertions(+), 45 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d0d2374944..5ff3ebeb4e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6687,6 +6687,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-syscache-usage-interval" xreflabel="track_syscache_usage_interval">
+      <term><varname>track_syscache_usage_interval</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>track_syscache_usage_interval</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the interval, in milliseconds, at which system cache usage
+        statistics are collected. The default is 0, which disables collection.
+        Only superusers can change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-io-timing" xreflabel="track_io_timing">
       <term><varname>track_io_timing</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 3e229c693c..f5d1aaf96f 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -906,6 +906,22 @@ CREATE VIEW pg_stat_progress_vacuum AS
     FROM pg_stat_get_progress_info('VACUUM') AS S
         LEFT JOIN pg_database D ON S.datid = D.oid;
 
+CREATE VIEW pg_stat_syscache AS
+    SELECT
+        S.pid                AS pid,
+        S.relid::regclass    AS relname,
+        S.indid::regclass    AS cache_name,
+        S.size                AS size,
+        S.ntup                AS ntuples,
+        S.searches            AS searches,
+        S.hits                AS hits,
+        S.neg_hits            AS neg_hits,
+        S.ageclass            AS ageclass,
+        S.last_update        AS last_update
+    FROM pg_stat_activity A
+    JOIN LATERAL (SELECT A.pid, * FROM pg_get_syscache_stats(A.pid)) S
+        ON (A.pid = S.pid);
+
 CREATE VIEW pg_user_mappings AS
     SELECT
         U.oid       AS umid,
@@ -1185,6 +1201,7 @@ GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_get_syscache_stats(int) TO pg_monitor;
 
 GRANT pg_read_all_settings TO pg_monitor;
 GRANT pg_read_all_stats TO pg_monitor;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 81c6499251..a1939958b7 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -66,6 +66,7 @@
 #include "utils/ps_status.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 
@@ -124,6 +125,7 @@
 bool        pgstat_track_activities = false;
 bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
+int            pgstat_track_syscache_usage_interval = 0;
 int            pgstat_track_activity_query_size = 1024;
 
 /* ----------
@@ -236,6 +238,11 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
+/* bitmap symbols to specify which types of stats files to remove */
+#define PGSTAT_REMFILE_DBSTAT    1        /* remove only database stats files */
+#define PGSTAT_REMFILE_SYSCACHE    2        /* remove only syscache stats files */
+#define PGSTAT_REMFILE_ALL        3        /* remove both types of files */
+
 /*
  * Info about current "snapshot" of stats file
  */
@@ -335,6 +342,7 @@ static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
 static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
 static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+static void pgstat_remove_syscache_statsfile(void);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
@@ -630,10 +638,13 @@ startup_failed:
 }
 
 /*
- * subroutine for pgstat_reset_all
+ * remove stats files
+ *
+ * clean up stats files in the specified directory. target is one of
+ * PGSTAT_REMFILE_DBSTAT/SYSCACHE/ALL and restricts which files are removed.
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+pgstat_reset_remove_files(const char *directory, int target)
 {
     DIR           *dir;
     struct dirent *entry;
@@ -644,25 +655,39 @@ pgstat_reset_remove_files(const char *directory)
     {
         int            nchars;
         Oid            tmp_oid;
+        int            filetype = 0;
 
         /*
          * Skip directory entries that don't match the file names we write.
          * See get_dbstat_filename for the database-specific pattern.
          */
         if (strncmp(entry->d_name, "global.", 7) == 0)
+        {
+            filetype = PGSTAT_REMFILE_DBSTAT;
             nchars = 7;
+        }
         else
         {
+            char head[2];
+            
             nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
+            (void) sscanf(entry->d_name, "%c%c_%u.%n",
+                          head, head + 1, &tmp_oid, &nchars);
+
             /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
+            if (nchars < 3 || !isdigit(entry->d_name[3]))
                 continue;
+
+            if  (strncmp(head, "db", 2) == 0)
+                filetype = PGSTAT_REMFILE_DBSTAT;
+            else if (strncmp(head, "cc", 2) == 0)
+                filetype = PGSTAT_REMFILE_SYSCACHE;
         }
 
+        /* skip if this is not a target */
+        if ((filetype & target) == 0)
+            continue;
+
         if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
             strcmp(entry->d_name + nchars, "stat") != 0)
             continue;
@@ -683,8 +708,9 @@ pgstat_reset_remove_files(const char *directory)
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_ALL);
+    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY,
+                              PGSTAT_REMFILE_ALL);
 }
 
 #ifdef EXEC_BACKEND
@@ -2963,6 +2989,10 @@ pgstat_beshutdown_hook(int code, Datum arg)
     if (OidIsValid(MyDatabaseId))
         pgstat_report_stat(true);
 
+    /* clear syscache statistics files and temporary settings */
+    if (MyBackendId != InvalidBackendId)
+        pgstat_remove_syscache_statsfile();
+
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
      * before and after.  We use a volatile pointer here to ensure the
@@ -4287,6 +4317,9 @@ PgstatCollectorMain(int argc, char *argv[])
     pgStatRunningInCollector = true;
     pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
 
+    /* Remove left-over syscache stats files */
+    pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_SYSCACHE);
+
     /*
      * Loop to process messages until we get SIGQUIT or detect ungraceful
      * death of our parent postmaster.
@@ -6377,3 +6410,153 @@ pgstat_clip_activity(const char *raw_activity)
 
     return activity;
 }
+
+/*
+ * return the filename for a syscache stat file; filename is the output
+ * buffer, of length len.
+ */
+void
+pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid,
+                                 char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/cc_%u.%s",
+                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
+                       pgstat_stat_directory,
+                       backendid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
+}
+
+/* removes syscache stats files of this backend */
+static void
+pgstat_remove_syscache_statsfile(void)
+{
+    char    fname[MAXPGPATH];
+
+    pgstat_get_syscachestat_filename(false, false, MyBackendId,
+                                     fname, MAXPGPATH);
+    unlink(fname);        /* don't care about the result */
+}
+
+/*
+ * pgstat_write_syscache_stats() -
+ *        Write the syscache statistics files.
+ *
+ * If 'force' is false, this function skips writing the file and returns the
+ * time remaining in the current interval in milliseconds. If 'force' is true,
+ * it writes the file regardless of the remaining time and resets the interval.
+ */
+long
+pgstat_write_syscache_stats(bool force)
+{
+    static TimestampTz last_report = 0;
+    TimestampTz now;
+    long elapsed;
+    long secs;
+    int     usecs;
+    int    cacheId;
+    FILE    *fpout;
+    char    statfile[MAXPGPATH];
+    char    tmpfile[MAXPGPATH];
+
+    /* Return if we don't want it */
+    if (!force && pgstat_track_syscache_usage_interval <= 0)
+    {
+        /* disabled. remove the statistics file if any */
+        if (last_report > 0)
+        {
+            last_report = 0;
+            pgstat_remove_syscache_statsfile();
+        }
+        return 0;
+    }
+    
+    /* Check against the interval */
+    now = GetCurrentTransactionStopTimestamp();
+    TimestampDifference(last_report, now, &secs, &usecs);
+    elapsed = secs * 1000 + usecs / 1000;
+
+    if (!force && elapsed < pgstat_track_syscache_usage_interval)
+    {
+        /* not time yet; tell the caller the remaining time */
+        return pgstat_track_syscache_usage_interval - elapsed;
+    }
+
+    /* now update the stats */
+    last_report = now;
+
+    pgstat_get_syscachestat_filename(false, true,
+                                     MyBackendId, tmpfile, MAXPGPATH);
+    pgstat_get_syscachestat_filename(false, false,
+                                     MyBackendId, statfile, MAXPGPATH);
+
+    /*
+     * This function can be called from ProcessInterrupts(). Inhibit recursive
+     * interrupts to avoid recursive entry.
+     */
+    HOLD_INTERRUPTS();
+
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        /*
+         * Failure writing this file is not critical. Just skip this time and
+         * tell caller to wait for the next interval.
+         */
+        RESUME_INTERRUPTS();
+        return pgstat_track_syscache_usage_interval;
+    }
+
+    /* write out every catcache stats */
+    for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++)
+    {
+        SysCacheStats *stats;
+        
+        stats = SysCacheGetStats(cacheId);
+        Assert (stats);
+
+        /* write error is checked later using ferror() */
+        fputc('T', fpout);
+        (void)fwrite(&cacheId, sizeof(int), 1, fpout);
+        (void)fwrite(&last_report, sizeof(TimestampTz), 1, fpout);
+        (void)fwrite(stats, sizeof(*stats), 1, fpout);
+    }
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write syscache statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close syscache statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename syscache statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+
+    RESUME_INTERRUPTS();
+    return 0;
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 36cfd507b2..fb77a0ce4c 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3157,6 +3157,12 @@ ProcessInterrupts(void)
 
     }
 
+    if (IdleSyscacheStatsUpdateTimeoutPending)
+    {
+        IdleSyscacheStatsUpdateTimeoutPending = false;
+        pgstat_write_syscache_stats(true);
+    }
+
     if (ParallelMessagePending)
         HandleParallelMessages();
 }
@@ -3733,6 +3739,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_catcache_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4173,9 +4180,19 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long timeout;
+
                 ProcessCompletedNotifies();
                 pgstat_report_stat(false);
 
+                timeout = pgstat_write_syscache_stats(false);
+
+                if (timeout > 0)
+                {
+                    disable_idle_catcache_update_timeout = true;
+                    enable_timeout_after(IDLE_CATCACHE_UPDATE_TIMEOUT,
+                                         timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4218,6 +4235,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_catcache_update_timeout)
+        {
+            disable_timeout(IDLE_CATCACHE_UPDATE_TIMEOUT, false);
+            disable_idle_catcache_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index b6ba856ebe..6526cfefb4 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -14,6 +14,8 @@
  */
 #include "postgres.h"
 
+#include <sys/stat.h>
+
 #include "access/htup_details.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
@@ -28,6 +30,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/inet.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
@@ -1899,3 +1902,134 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     PG_RETURN_DATUM(HeapTupleGetDatum(
                                       heap_form_tuple(tupdesc, values, nulls)));
 }
+
+Datum
+pgstat_get_syscache_stats(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SYSCACHE_SIZE 9
+    int                    pid     = PG_GETARG_INT32(0);
+    ReturnSetInfo       *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+    TupleDesc            tupdesc;
+    Tuplestorestate    *tupstore;
+    MemoryContext        per_query_ctx;
+    MemoryContext        oldcontext;
+    PgBackendStatus       *beentry;
+    int                    beid;
+    char                fname[MAXPGPATH];
+    FILE                  *fpin;
+    char c;
+
+    if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("set-valued function called in context that cannot accept a set")));
+    if (!(rsinfo->allowedModes & SFRM_Materialize))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("materialize mode required, but it is not " \
+                        "allowed in this context")));
+
+    /* Build a tuple descriptor for our result type */
+    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+        elog(ERROR, "return type must be a row type");
+    
+
+    per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+
+    oldcontext = MemoryContextSwitchTo(per_query_ctx);
+    tupstore = tuplestore_begin_heap(true, false, work_mem);
+    rsinfo->returnMode = SFRM_Materialize;
+    rsinfo->setResult = tupstore;
+    rsinfo->setDesc = tupdesc;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    /* find beentry for given pid*/
+    beentry = NULL;
+    for (beid = 1;
+         (beentry = pgstat_fetch_stat_beentry(beid)) &&
+             beentry->st_procpid != pid ;
+         beid++);
+
+    /*
+     * we silently return empty result on failure or insufficient privileges
+     */
+    if (!beentry ||
+        (!has_privs_of_role(GetUserId(), beentry->st_userid) &&
+         !is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS)))
+        goto no_data;
+
+    pgstat_get_syscachestat_filename(false, false, beid, fname, MAXPGPATH);
+
+    if ((fpin = AllocateFile(fname, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(WARNING,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            fname)));
+        /* also return empty on no statistics file */
+        goto no_data;
+    }
+
+    /* read the statistics file into tuplestore */
+    while ((c = fgetc(fpin)) == 'T')
+    {
+        TimestampTz last_update;
+        SysCacheStats stats;
+        int cacheid;
+        Datum values[PG_GET_SYSCACHE_SIZE];
+        bool nulls[PG_GET_SYSCACHE_SIZE] = {0};
+        Datum datums[SYSCACHE_STATS_NAGECLASSES * 2];
+        bool arrnulls[SYSCACHE_STATS_NAGECLASSES * 2] = {0};
+        int    dims[] = {SYSCACHE_STATS_NAGECLASSES, 2};
+        int lbs[] = {1, 1};
+        ArrayType *arr;
+        int i, j;
+
+        fread(&cacheid, sizeof(int), 1, fpin);
+        fread(&last_update, sizeof(TimestampTz), 1, fpin);
+        if (fread(&stats, 1, sizeof(stats), fpin) != sizeof(stats))
+        {
+            ereport(WARNING, 
+                    (errmsg("corrupted syscache statistics file \"%s\"",
+                            fname)));
+            goto no_data;
+        }
+
+        i = 0;
+        values[i++] = ObjectIdGetDatum(stats.reloid);
+        values[i++] = ObjectIdGetDatum(stats.indoid);
+        values[i++] = Int64GetDatum(stats.size);
+        values[i++] = Int64GetDatum(stats.ntuples);
+        values[i++] = Int64GetDatum(stats.nsearches);
+        values[i++] = Int64GetDatum(stats.nhits);
+        values[i++] = Int64GetDatum(stats.nneg_hits);
+
+        for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++)
+        {
+            datums[j * 2] = Int32GetDatum((int32) stats.ageclasses[j]);
+            datums[j * 2 + 1] = Int32GetDatum((int32) stats.nclass_entries[j]);
+        }            
+
+        arr = construct_md_array(datums, arrnulls, 2, dims, lbs,
+                              INT4OID, sizeof(int32), true, 'i');
+        values[i++] = PointerGetDatum(arr);
+
+        values[i++] = TimestampTzGetDatum(last_update);
+
+        Assert (i == PG_GET_SYSCACHE_SIZE);
+
+        tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+    }
+
+    /* check for the end of file. abandon the result if file is broken */
+    if (c != 'E' || fgetc(fpin) != EOF)
+        tuplestore_clear(tupstore);
+
+    FreeFile(fpin);
+
+no_data:
+    tuplestore_donestoring(tupstore);
+    return (Datum) 0;
+}
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 5106ed896a..950576fea0 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -89,6 +89,10 @@ static CatCacheHeader *CacheHdr = NULL;
 /* Timestamp used for any operation on caches. */
 TimestampTz    catcacheclock = 0;
 
+/* age classes for pruning */
+static double ageclass[SYSCACHE_STATS_NAGECLASSES]
+    = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -619,9 +623,7 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue)
             else
                 CatCacheRemoveCTup(cache, ct);
             CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated");
-#ifdef CATCACHE_STATS
             cache->cc_invals++;
-#endif
             /* could be multiple matches, so keep looking! */
         }
     }
@@ -697,9 +699,7 @@ ResetCatalogCache(CatCache *cache)
             }
             else
                 CatCacheRemoveCTup(cache, ct);
-#ifdef CATCACHE_STATS
             cache->cc_invals++;
-#endif
         }
     }
 }
@@ -906,10 +906,11 @@ CatCacheCleanupOldEntries(CatCache *cp)
      * cache_prune_min_age. The index of nremoved_entry is the value of the
      * clock-sweep counter, which takes from 0 up to 2.
      */
-    double        ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
-    int            nentries[] = {0, 0, 0, 0, 0, 0};
+    int            nentries[SYSCACHE_STATS_NAGECLASSES] = {0, 0, 0, 0, 0, 0};
     int            nremoved_entry[3] = {0, 0, 0};
     int            j;
+
+    Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0);
 #endif
 
     /* Return immediately if no pruning is wanted */
@@ -923,7 +924,11 @@ CatCacheCleanupOldEntries(CatCache *cp)
     if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L)
         return false;
     
-    /* Search the whole hash for entries to remove */
+    /*
+     * Search the whole hash for entries to remove. This is quite a
+     * time-consuming task during a catcache lookup, but acceptable since we are
+     * going to expand the hash table anyway.
+     */
     for (i = 0; i < cp->cc_nbuckets; i++)
     {
         dlist_mutable_iter iter;
@@ -936,21 +941,21 @@ CatCacheCleanupOldEntries(CatCache *cp)
 
 
             /*
-             * Calculate the duration from the time of the last access to the
-             * "current" time. Since catcacheclock is not advanced within a
-             * transaction, the entries that are accessed within the current
-             * transaction won't be pruned.
+             * Calculate the duration from the time of the last access to
+             * the "current" time. Since catcacheclock is not advanced within
+             * a transaction, the entries that are accessed within the current
+             * transaction always get 0 as the result.
              */
             TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
 
 #ifdef CATCACHE_STATS
             /* count catcache entries for each age class */
             ntotal++;
-            for (j = 0 ;
-                 ageclass[j] != 0.0 &&
-                     entry_age > cache_prune_min_age * ageclass[j] ;
-                 j++);
-            if (ageclass[j] == 0.0) j--;
+
+            j = 0;
+            while (j < SYSCACHE_STATS_NAGECLASSES - 1 &&
+                   entry_age > cache_prune_min_age * ageclass[j])
+                j++;
             nentries[j]++;
 #endif
 
@@ -981,14 +986,17 @@ CatCacheCleanupOldEntries(CatCache *cp)
     }
 
 #ifdef CATCACHE_STATS
+    StaticAssertStmt(SYSCACHE_STATS_NAGECLASSES == 6,
+                     "number of syscache age class must be 6");
     ereport(DEBUG1,
-            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)",
+            (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d, rest:%d) naccessed(0:%d, 1:%d, 2:%d)",
 
                      nremoved, ntotal,
                      ageclass[0] * cache_prune_min_age, nentries[0],
                      ageclass[1] * cache_prune_min_age, nentries[1],
                      ageclass[2] * cache_prune_min_age, nentries[2],
                      ageclass[3] * cache_prune_min_age, nentries[3],
                      ageclass[4] * cache_prune_min_age, nentries[4],
+                     nentries[5],
                      nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]),
              errhidestmt(true)));
 #endif
@@ -1365,9 +1373,7 @@ SearchCatCacheInternal(CatCache *cache,
     if (unlikely(cache->cc_tupdesc == NULL))
         CatalogCacheInitializeCache(cache);
 
-#ifdef CATCACHE_STATS
     cache->cc_searches++;
-#endif
 
     /* Initialize local parameter array */
     arguments[0] = v1;
@@ -1427,9 +1433,7 @@ SearchCatCacheInternal(CatCache *cache,
             CACHE3_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d",
                         cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
             cache->cc_hits++;
-#endif
 
             return &ct->tuple;
         }
@@ -1438,9 +1442,7 @@ SearchCatCacheInternal(CatCache *cache,
             CACHE3_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d",
                         cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
             cache->cc_neg_hits++;
-#endif
 
             return NULL;
         }
@@ -1568,9 +1570,7 @@ SearchCatCacheMiss(CatCache *cache,
     CACHE3_elog(DEBUG2, "SearchCatCache(%s): put in bucket %d",
                 cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
     cache->cc_newloads++;
-#endif
 
     return &ct->tuple;
 }
@@ -1681,9 +1681,7 @@ SearchCatCacheList(CatCache *cache,
 
     Assert(nkeys > 0 && nkeys < cache->cc_nkeys);
 
-#ifdef CATCACHE_STATS
     cache->cc_lsearches++;
-#endif
 
     /* Initialize local parameter array */
     arguments[0] = v1;
@@ -1740,9 +1738,7 @@ SearchCatCacheList(CatCache *cache,
         CACHE2_elog(DEBUG2, "SearchCatCacheList(%s): found list",
                     cache->cc_relname);
 
-#ifdef CATCACHE_STATS
         cache->cc_lhits++;
-#endif
 
         return cl;
     }
@@ -2254,3 +2250,64 @@ PrintCatCacheListLeakWarning(CatCList *list)
          list->my_cache->cc_relname, list->my_cache->id,
          list, list->refcount);
 }
+
+/*
+ * CatCacheGetStats - fill in SysCacheStats struct.
+ *
+ * This is a support routine for SysCacheGetStats and substantially fills in
+ * the result. The classification here is based on the same criteria as
+ * CatCacheCleanupOldEntries().
+ */
+void
+CatCacheGetStats(CatCache *cache, SysCacheStats *stats)
+{
+    int    i, j;
+
+    Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0);
+
+    /* fill in the stats struct */
+    stats->size = cache->cc_tupsize + cache->cc_nbuckets * sizeof(dlist_head);
+    stats->ntuples = cache->cc_ntup;
+    stats->nsearches = cache->cc_searches;
+    stats->nhits = cache->cc_hits;
+    stats->nneg_hits = cache->cc_neg_hits;
+
+    /* cache_prune_min_age can be changed within a session, so fill this in every time */
+    for (i = 0 ; i < SYSCACHE_STATS_NAGECLASSES ; i++)
+        stats->ageclasses[i] = (int) (cache_prune_min_age * ageclass[i]);
+
+    /*
+     * The nth element of nclass_entries stores the number of cache entries
+     * that have lived unaccessed for the corresponding multiple of
+     * cache_prune_min_age given in ageclass.
+     */
+    memset(stats->nclass_entries, 0, sizeof(int) * SYSCACHE_STATS_NAGECLASSES);
+
+    /* Scan the whole hash */
+    for (i = 0; i < cache->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cache->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+            /*
+             * Calculate the duration from the time of the last access to
+             * the "current" time. Since catcacheclock is not advanced within
+             * a transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+            j = 0;
+            while (j < SYSCACHE_STATS_NAGECLASSES - 1 &&
+                   entry_age > stats->ageclasses[j])
+                j++;
+
+            stats->nclass_entries[j]++;
+        }
+    }
+}
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index ac98c19155..7b38a06708 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -20,6 +20,9 @@
  */
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "catalog/indexing.h"
@@ -1534,6 +1537,27 @@ RelationSupportsSysCache(Oid relid)
     return false;
 }
 
+/*
+ * SysCacheGetStats - returns stats of specified syscache
+ *
+ * This routine returns the address of its local static memory.
+ */
+SysCacheStats *
+SysCacheGetStats(int cacheId)
+{
+    static SysCacheStats stats;
+
+    Assert(cacheId >=0 && cacheId < SysCacheSize);
+
+    memset(&stats, 0, sizeof(stats));
+
+    stats.reloid = cacheinfo[cacheId].reloid;
+    stats.indoid = cacheinfo[cacheId].indoid;
+
+    CatCacheGetStats(SysCache[cacheId], &stats);
+
+    return &stats;
+}
 
 /*
  * OID comparator for pg_qsort
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index fd51934aaf..f039ecd805 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index c0b6231458..dee7f19475 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleSyscacheStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_CATCACHE_UPDATE_TIMEOUT,
+                        IdleSyscacheStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1239,6 +1242,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleSyscacheStatsUpdateTimeoutHandler(void)
+{
+    IdleSyscacheStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 06c589f725..32e41253a6 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3168,6 +3168,16 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"track_syscache_usage_interval", PGC_SUSET, STATS_COLLECTOR,
+            gettext_noop("Sets the interval between syscache usage collection, in milliseconds. Zero disables syscache
usagetracking."),
 
+            NULL
+        },
+        &pgstat_track_syscache_usage_interval,
+        0, 0, INT_MAX / 2,
+        NULL, NULL, NULL
+    },
+
     {
         {"gin_pending_list_limit", PGC_USERSET, CLIENT_CONN_STATEMENT,
             gettext_noop("Sets the maximum size of the pending list for GIN index."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 108d332f2c..4d4fb42251 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -560,6 +560,7 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
+#track_syscache_usage_interval = 0    # zero disables tracking
 #stats_temp_directory = 'pg_stat_tmp'
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b8de13f03b..6099a828d2 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -9669,6 +9669,15 @@
   proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
  proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}',
   prosrc => 'pg_get_replication_slots' },
+{ oid => '3425',
+  descr => 'syscache statistics',
+  proname => 'pg_get_syscache_stats', prorows => '100', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', prorettype => 'record',
+  proargtypes => 'int4',
+  proallargtypes => '{int4,oid,oid,int8,int8,int8,int8,int8,_int4,timestamptz}',
+  proargmodes => '{i,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}',
+  prosrc => 'pgstat_get_syscache_stats' },
 { oid => '3786', descr => 'set up a logical replication slot',
   proname => 'pg_create_logical_replication_slot', provolatile => 'v',
   proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c9e35003a5..69b9a976f0 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 88a75fb798..b6bfd7d644 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1144,6 +1144,7 @@ extern bool pgstat_track_activities;
 extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
+extern int    pgstat_track_syscache_usage_interval;
 extern char *pgstat_stat_directory;
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
@@ -1228,7 +1229,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
-
+extern void pgstat_get_syscachestat_filename(bool permanent,
+                    bool tempname, int backendid, char *filename, int len);
 /* ----------
  * pgstat_report_wait_start() -
  *
@@ -1363,5 +1365,5 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
-
+extern long pgstat_write_syscache_stats(bool force);
 #endif                            /* PGSTAT_H */
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 5d24809900..4d51975920 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -65,10 +65,8 @@ typedef struct catcache
     int            cc_tupsize;        /* total amount of catcache tuples */
 
     /*
-     * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
-     * doesn't break ABI for other modules
+     * Statistics entries
      */
-#ifdef CATCACHE_STATS
     long        cc_searches;    /* total # searches against this cache */
     long        cc_hits;        /* # of matches against existing entry */
     long        cc_neg_hits;    /* # of matches against negative entry */
@@ -81,7 +79,6 @@ typedef struct catcache
     long        cc_invals;        /* # of entries invalidated from cache */
     long        cc_lsearches;    /* total # list-searches */
     long        cc_lhits;        /* # of matches against existing lists */
-#endif
 } CatCache;
 
 
@@ -254,4 +251,8 @@ extern void PrepareToInvalidateCacheTuple(Relation relation,
 extern void PrintCatCacheLeakWarning(HeapTuple tuple);
 extern void PrintCatCacheListLeakWarning(CatCList *list);
 
+/* defined in syscache.h */
+typedef struct syscachestats SysCacheStats;
+extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats);
+
 #endif                            /* CATCACHE_H */
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index 95ee48954e..71b399c902 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -112,6 +112,24 @@ enum SysCacheIdentifier
 #define SysCacheSize (USERMAPPINGUSERSERVER + 1)
 };
 
+#define SYSCACHE_STATS_NAGECLASSES 6
+/* Struct for catcache tracking information */
+typedef struct syscachestats
+{
+    Oid        reloid;            /* target relation */
+    Oid        indoid;            /* index */
+    size_t    size;            /* size of the catcache */
+    int        ntuples;        /* number of tuples resides in the catcache */
+    int        nsearches;        /* number of searches */
+    int        nhits;            /* number of cache hits */
+    int        nneg_hits;        /* number of negative cache hits */
+    /* age classes in seconds */
+    int        ageclasses[SYSCACHE_STATS_NAGECLASSES];
+    /* number of tuples fall into the corresponding age class */
+    int        nclass_entries[SYSCACHE_STATS_NAGECLASSES];
+} SysCacheStats;
+
+
 extern void InitCatalogCache(void);
 extern void InitCatalogCachePhase2(void);
 
@@ -164,6 +182,7 @@ extern void SysCacheInvalidate(int cacheId, uint32 hashValue);
 extern bool RelationInvalidatesSnapshotsOnly(Oid relid);
 extern bool RelationHasSysCache(Oid relid);
 extern bool RelationSupportsSysCache(Oid relid);
+extern SysCacheStats *SysCacheGetStats(int cacheId);
 
 /*
  * The use of the macros below rather than direct calls to the corresponding
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 9244a2a7b7..0ab441a364 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_CATCACHE_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2c8e21baa7..7bd77e9972 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1921,6 +1921,28 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
     pg_stat_all_tables.autoanalyze_count
    FROM pg_stat_all_tables
  WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
 
+pg_stat_syscache| SELECT s.pid,
+    (s.relid)::regclass AS relname,
+    (s.indid)::regclass AS cache_name,
+    s.size,
+    s.ntup AS ntuples,
+    s.searches,
+    s.hits,
+    s.neg_hits,
+    s.ageclass,
+    s.last_update
+   FROM (pg_stat_activity a
+     JOIN LATERAL ( SELECT a.pid,
+            pg_get_syscache_stats.relid,
+            pg_get_syscache_stats.indid,
+            pg_get_syscache_stats.size,
+            pg_get_syscache_stats.ntup,
+            pg_get_syscache_stats.searches,
+            pg_get_syscache_stats.hits,
+            pg_get_syscache_stats.neg_hits,
+            pg_get_syscache_stats.ageclass,
+            pg_get_syscache_stats.last_update
+           FROM pg_get_syscache_stats(a.pid) pg_get_syscache_stats(relid, indid, size, ntup, searches, hits, neg_hits, ageclass, last_update)) s ON ((a.pid = s.pid)));
 
 pg_stat_user_functions| SELECT p.oid AS funcid,
     n.nspname AS schemaname,
     p.proname AS funcname,
@@ -2352,7 +2374,7 @@ pg_settings|pg_settings_n|CREATE RULE pg_settings_n AS
     ON UPDATE TO pg_catalog.pg_settings DO INSTEAD NOTHING;
 pg_settings|pg_settings_u|CREATE RULE pg_settings_u AS
     ON UPDATE TO pg_catalog.pg_settings
-   WHERE (new.name = old.name) DO  SELECT set_config(old.name, new.setting, false) AS set_config;
+   WHERE (new.name = old.name) DO  SELECT set_config(old.name, new.setting, false, false) AS set_config;
 rtest_emp|rtest_emp_del|CREATE RULE rtest_emp_del AS
     ON DELETE TO public.rtest_emp DO  INSERT INTO rtest_emplog (ename, who, action, newsal, oldsal)
   VALUES (old.ename, CURRENT_USER, 'fired'::bpchar, '$0.00'::money, old.salary);
-- 
2.16.3
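
For reference, the per-backend statistics file written by pgstat_write_syscache_stats()
above is just a sequence of records: a 'T' byte followed by the raw int cache id, the
TimestampTz of the report and the SysCacheStats struct, terminated by a single 'E' byte
(this is also what pgstat_get_syscache_stats() parses). Below is a rough standalone
reader, as an illustration only -- the struct mirrors the SysCacheStats definition from
the patch and assumes the reader is built on the same platform as the server; the program
name is made up:

#include <stdio.h>
#include <stdint.h>

typedef int64_t TimestampTz;           /* same representation as the backend's */
typedef unsigned int Oid;

#define NAGECLASSES 6                  /* SYSCACHE_STATS_NAGECLASSES */

typedef struct SysCacheStats
{
    Oid     reloid;                    /* target relation */
    Oid     indoid;                    /* index backing the cache */
    size_t  size;                      /* size of the catcache in bytes */
    int     ntuples;
    int     nsearches;
    int     nhits;
    int     nneg_hits;
    int     ageclasses[NAGECLASSES];
    int     nclass_entries[NAGECLASSES];
} SysCacheStats;

int
main(int argc, char **argv)
{
    FILE   *f;
    int     c;

    if (argc != 2 || (f = fopen(argv[1], "rb")) == NULL)
    {
        fprintf(stderr, "usage: ccstatdump <path to cc_N.stat>\n");
        return 1;
    }

    /* every record starts with 'T'; a single 'E' terminates the file */
    while ((c = fgetc(f)) == 'T')
    {
        int             cacheid;
        TimestampTz     last_update;
        SysCacheStats   stats;

        if (fread(&cacheid, sizeof(cacheid), 1, f) != 1 ||
            fread(&last_update, sizeof(last_update), 1, f) != 1 ||
            fread(&stats, sizeof(stats), 1, f) != 1)
        {
            fprintf(stderr, "truncated record\n");
            return 1;
        }

        printf("cache %d: rel=%u ntuples=%d searches=%d hits=%d neg_hits=%d size=%zu\n",
               cacheid, stats.reloid, stats.ntuples, stats.nsearches,
               stats.nhits, stats.nneg_hits, stats.size);
    }

    if (c != 'E')
        fprintf(stderr, "file does not end with 'E'; possibly corrupt\n");

    fclose(f);
    return 0;
}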

From c1a947892dd3f96cc6200c4a27b0c8a24d1c3469 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 6 Feb 2019 14:36:29 +0900
Subject: [PATCH 3/4] PoC add prune-by-number-of-entries feature

Adds pruning based on the number of cache entries, on top of the current
pruning patch. It is controlled by two GUC variables.

cache_entry_limit: limit on the number of entries per catcache
cache_entry_limit_prune_ratio: fraction of cache_entry_limit down to which entries are pruned
---
 src/backend/utils/cache/catcache.c | 108 ++++++++++++++++++++++++++++++++++++-
 src/backend/utils/misc/guc.c       |  20 ++++++++
 src/include/utils/catcache.h       |   2 +
 3 files changed, 129 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 950576fea0..ecea5b603c 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -77,6 +77,11 @@
  */
 int cache_memory_target = 0;
 
+
+/* PoC entry limit */
+int cache_entry_limit = 0;
+double cache_entry_limit_prune_ratio = 0.8;
+
 /* GUC variable to define the minimum age of entries that will be cosidered to
  * be evicted in seconds. This variable is shared among various cache
  * mechanisms.
@@ -882,6 +887,102 @@ InitCatCache(int id,
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntriesByNum - 
+ *    PoC: remove the least recently accessed entries to enforce the entry limit.
+ */
+static bool
+CatCacheCleanupOldEntriesByNum(CatCache *cp, int cache_entry_limit)
+{
+    int            i;
+    int         n;
+    int            oldndelelem = cp->cc_ntup;
+    int            ndelelem;
+    CatCTup        **ct_array;
+
+    ndelelem = oldndelelem - (int)(cache_entry_limit * cache_entry_limit_prune_ratio);
+
+    /* lower limit: quite arbitrary */
+    if (ndelelem < 256)
+        ndelelem = 256;
+
+    /*
+     * Bounded max-heap of removal candidates keyed on lastaccess: [0] holds
+     * the newest of the collected entries, so older entries displace it.
+     */
+    ct_array = (CatCTup **) palloc(ndelelem * sizeof(CatCTup*));
+    n = 0;
+
+    /*
+     * Collect entries to be removed, which have older lastaccess.
+     * Using heap bound sort like tuplesort.c.
+     */
+    for (i = 0; i < cp->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+
+            /* Don't remove referenced entries */
+            if (ct->refcount != 0 ||
+                (ct->c_list && ct->c_list->refcount != 0))
+                continue;
+
+            if (n < ndelelem)
+            {
+                /* Fill up the max-heap array */
+                int j = n++;
+
+                while (j > 0)
+                {
+                    int i = (j - 1) >> 1;
+
+                    if (ct->lastaccess <= ct_array[i]->lastaccess)
+                        break;
+                    ct_array[j] = ct_array[i];
+                    j = i;
+                }
+                ct_array[j] = ct;
+            }
+            else if (ct->lastaccess < ct_array[0]->lastaccess)
+            {
+                /* older than the newest collected entry, replace it */
+                unsigned int i;
+
+                i = 0;
+
+                for (;;)
+                {
+                    unsigned int j = 2 * i + 1;
+
+                    if (j >= n)
+                        break;
+                    if (j + 1 < n &&
+                        ct_array[j]->lastaccess < ct_array[j + 1]->lastaccess)
+                        j++;
+                    if (ct->lastaccess >= ct_array[j]->lastaccess)
+                        break;
+                    ct_array[i] = ct_array[j];
+                    i = j;
+                }
+                ct_array[i] = ct;
+            }
+        }
+    }
+
+    /* Now we have the list of elements to be deleted */
+    for (i = 0 ; i < n && ct_array[i]; i++)
+        CatCacheRemoveCTup(cp, ct_array[i]);
+
+    pfree(ct_array);
+
+    elog(LOG, "Catcache pruned by entry number: id=%d, %d => %d", cp->id, oldndelelem, cp->cc_ntup);
+
+    return true;
+}
+
 /*
  * CatCacheCleanupOldEntries - Remove infrequently-used entries
  *
@@ -923,7 +1024,7 @@ CatCacheCleanupOldEntries(CatCache *cp)
     hash_size = cp->cc_nbuckets * sizeof(dlist_head);
     if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L)
         return false;
-    
+
     /*
      * Search the whole hash for entries to remove. This is quite a
      * time-consuming task during a catcache lookup, but acceptable since we are
@@ -2049,6 +2150,11 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
 
     /* increase refcount so that this survives pruning */
     ct->refcount++;
+
+    /* cap number of entries */
+    if (cache_entry_limit > 0 && cache->cc_ntup > cache_entry_limit)
+        CatCacheCleanupOldEntriesByNum(cache, cache_entry_limit);
+    
     /*
      * If the hash table has become too full, try cleanup by removing
      * infrequently used entries to make a room for the new entry. If it
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 32e41253a6..7bb239a07e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2227,6 +2227,16 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"cache_entry_limit", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the maximum entries of catcache."),
+             NULL
+        },
+        &cache_entry_limit,
+        0, 0, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
@@ -3401,6 +3411,16 @@ static struct config_real ConfigureNamesReal[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"cache_entry_limit_prune_ratio", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the maximum entries of catcache."),
+             NULL
+        },
+        &cache_entry_limit_prune_ratio,
+        0.8, 0.0, 1.0,
+        NULL, NULL, NULL
+    },
+
     /* End-of-list marker */
     {
         {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 4d51975920..1f7fb51ac0 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -193,6 +193,8 @@ extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 /* for guc.c, not PGDLLPMPORT'ed */
 extern int cache_prune_min_age;
 extern int cache_memory_target;
+extern int cache_entry_limit;
+extern double cache_entry_limit_prune_ratio;
 
 /* to use as access timestamp of catcache entries */
 extern TimestampTz catcacheclock;
-- 
2.16.3
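
As an aside on the PoC above: the candidate collection in CatCacheCleanupOldEntriesByNum
is a bounded-heap selection, i.e. picking the K oldest of N entries without sorting all
of them, by keeping a max-heap of size K keyed on lastaccess and letting older values
displace the newest one at the root. A small standalone illustration of just that
selection step (plain C over an array of timestamps; not backend code):

#include <stdio.h>

/* sift the value at the root down through a max-heap of size n */
static void
sift_down(long *heap, int n, long val)
{
    int     i = 0;

    for (;;)
    {
        int     child = 2 * i + 1;

        if (child >= n)
            break;
        if (child + 1 < n && heap[child] < heap[child + 1])
            child++;                    /* take the larger child */
        if (val >= heap[child])
            break;
        heap[i] = heap[child];
        i = child;
    }
    heap[i] = val;
}

/* collect the k smallest (oldest) of the n input values into heap[] */
static int
select_oldest(const long *lastaccess, int n, long *heap, int k)
{
    int     size = 0;

    for (int i = 0; i < n; i++)
    {
        long    val = lastaccess[i];

        if (size < k)
        {
            /* fill phase: sift up so that heap[0] stays the maximum */
            int     j = size++;

            while (j > 0 && heap[(j - 1) / 2] < val)
            {
                heap[j] = heap[(j - 1) / 2];
                j = (j - 1) / 2;
            }
            heap[j] = val;
        }
        else if (val < heap[0])
            sift_down(heap, size, val); /* older than the newest kept value */
    }
    return size;
}

int
main(void)
{
    long    t[] = {50, 10, 40, 30, 20, 60};
    long    oldest[3];
    int     k = select_oldest(t, 6, oldest, 3);

    for (int i = 0; i < k; i++)
        printf("%ld ", oldest[i]);      /* the three oldest, in heap order: 30 10 20 */
    printf("\n");
    return 0;
}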


Re: Protect syscache from bloating with negative cache entries

From
Andres Freund
Date:
On 2019-02-06 17:37:04 +0900, Kyotaro HORIGUCHI wrote:
> At Wed, 06 Feb 2019 15:16:53 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190206.151653.117382256.horiguchi.kyotaro@lab.ntt.co.jp>
 
> > > The two should have the same extent of impact on performance when
> > > disabled. I'll take numbers briefly using pgbench.
> 
> (pgbench -j 10 -c 10 -T 120) x 5 times for each.
> 
> A: unpatched            : 118.58 tps (stddev 0.44)
> B: patched-not-used[1]  : 118.41 tps (stddev 0.29)
> C: patched-timedprune[2]: 118.41 tps (stddev 0.51)
> D: patched-capped...... : none[3]
> 
> [1]: cache_prune_min_age = 0, cache_entry_limit = 0
> 
> [2]: cache_prune_min_age = 100, cache_entry_limit = 0
>      (Prunes every 100ms)
> 
> [3] I didn't find a sane benchmark for the capping case using
>     vanilla pgbench.
> 
> It doesn't seem to me showing significant degradation on *my*
> box...
> 
> # I found a bug that can remove newly created entry. So v11.

This seems to just benchmark your disk speed, no? ISTM you need to
measure readonly performance, not read/write. And with plenty more
tables than just standard pgbench -S.

Greetings,

Andres Freund


Re: Protect syscache from bloating with negative cache entries

From
Kyotaro HORIGUCHI
Date:
At Tue, 5 Feb 2019 19:05:26 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
<20190205220526.GA1442@alvherre.pgsql>
> On 2019-Feb-05, Tomas Vondra wrote:
> 
> > I don't think we need to remove the expired entries right away, if there
> > are only very few of them. The cleanup requires walking the hash table,
> > which means significant fixed cost. So if there are only few expired
> > entries (say, less than 25% of the cache), we can just leave them around
> > and clean them if we happen to stumble on them (although that may not be
> > possible with dynahash, which has no concept of expiration) or before
> > enlarging the hash table.
> 
> I think seqscanning the hash table is going to be too slow; Ideriha-san
> idea of having a dlist with the entries in LRU order (where each entry
> is moved to head of list when it is touched) seemed good: it allows you
> to evict older ones when the time comes, without having to scan the rest
> of the entries.  Having a dlist means two more pointers on each cache
> entry AFAIR, so it's not a huge amount of memory.

Ah, I had a separate list in mind. Sounds reasonable to have the
pointers in the cache entry, though I'm not sure how much overhead
the additional dlist_* calls add.
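
To make the idea concrete, here is a minimal standalone sketch of the intrusive-list LRU
discussed above (plain C with a hand-rolled circular list standing in for the backend's
dlist; the entry layout and function names are invented for the example). The point is
that touching an entry on a cache hit is an O(1) unlink-plus-push-to-tail, and eviction
only ever looks at the head, so no full hash scan is needed:

#include <stdio.h>

typedef struct lru_node
{
    struct lru_node *prev;
    struct lru_node *next;
} lru_node;

typedef struct cache_entry
{
    lru_node    lru;                /* embedded node, like ct->lru_node */
    long        lastaccess;
} cache_entry;

/* circular list with a sentinel; head->next is the least recently used entry */
static void
lru_init(lru_node *head)
{
    head->prev = head->next = head;
}

static void
lru_unlink(lru_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

static void
lru_push_tail(lru_node *head, lru_node *n)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* called on every cache hit: O(1), no scan */
static void
lru_touch(lru_node *head, cache_entry *e, long now)
{
    e->lastaccess = now;
    lru_unlink(&e->lru);
    lru_push_tail(head, &e->lru);
}

/* pruning starts here and walks towards newer entries */
static cache_entry *
lru_oldest(lru_node *head)
{
    if (head->next == head)
        return NULL;
    return (cache_entry *) head->next;  /* lru is the first member */
}

int
main(void)
{
    lru_node    head;
    cache_entry a = {{NULL, NULL}, 0};
    cache_entry b = {{NULL, NULL}, 0};

    lru_init(&head);
    lru_push_tail(&head, &a.lru);
    lru_push_tail(&head, &b.lru);

    lru_touch(&head, &a, 42);           /* a becomes the most recently used */
    printf("oldest lastaccess = %ld\n", lru_oldest(&head)->lastaccess);  /* b's 0 */
    return 0;
}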

The attached is the new version with the following properties:

- Both the prune-by-age and hard-limit features.
  (Merged into a single function with a single scan.)
  The debug tracking feature in CatCacheCleanupOldEntries is removed
  since it no longer runs a full scan.

  Prune-by-age can be a single-setting-for-all-caches feature, but
  the hard limit obviously cannot. We could use reloptions for the
  purpose (which are not currently available on pg_class and
  pg_attribute :p). I'll add that if there's no strong objection.
  Or can anyone come up with something suitable for the purpose?

- Using an LRU list to get rid of the full scan.

I added a new API, dlist_move_tail, which was needed to maintain the LRU list.

I'm going to retake numbers with search-only queries.

> > So if we want to address this case too (and we probably want), we may
> > need to discard the old cache memory context somehow (e.g. rebuild the
> > cache in a new one, and copy the non-expired entries). Which is a nice
> > opportunity to do the "full" cleanup, of course.
> 
> Yeah, we probably don't want to do this super frequently though.

MemoryContext per cache?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 72a569703662b93fb57c55c337b16107ebccfce3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 7 Feb 2019 14:56:07 +0900
Subject: [PATCH 1/4] Add dlist_move_tail

We have dlist_push_head/tail and dlist_move_head but not
dlist_move_tail. Add it.
---
 src/include/lib/ilist.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/src/include/lib/ilist.h b/src/include/lib/ilist.h
index b1a5974ee4..659ab1ac87 100644
--- a/src/include/lib/ilist.h
+++ b/src/include/lib/ilist.h
@@ -394,6 +394,25 @@ dlist_move_head(dlist_head *head, dlist_node *node)
     dlist_check(head);
 }
 
+/*
+ * Move element from its current position in the list to the tail position in
+ * the same list.
+ *
+ * Undefined behaviour if 'node' is not already part of the list.
+ */
+static inline void
+dlist_move_tail(dlist_head *head, dlist_node *node)
+{
+    /* fast path if it's already at the tail */
+    if (head->head.prev == node)
+        return;
+
+    dlist_delete(node);
+    dlist_push_tail(head, node);
+
+    dlist_check(head);
+}
+
 /*
  * Check whether 'node' has a following node.
  * Caution: unreliable if 'node' is not in the list.
-- 
2.16.3

From 5919f1495f27faefdc09abe65fd6e374fa83d9ff Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 16 Oct 2018 13:04:30 +0900
Subject: [PATCH 2/4] Remove entries that haven't been used for a certain time

Catcache entries can be left unused for long periods for several
reasons, and it is not desirable that they eat up memory. With this
patch, entries that haven't been used for a certain time are
considered for removal before the hash array is enlarged.

This also can put a hard limit on the number of catcache entries.
---
 doc/src/sgml/config.sgml                      |  38 ++++++
 src/backend/access/transam/xact.c             |   5 +
 src/backend/utils/cache/catcache.c            | 190 +++++++++++++++++++++++++-
 src/backend/utils/misc/guc.c                  |  43 ++++++
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/utils/catcache.h                  |  32 ++++-
 6 files changed, 302 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 9b7a7388d5..d0d2374944 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1662,6 +1662,44 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target">
+      <term><varname>syscache_memory_target</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_memory_target</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory up to which the syscache can
+        expand without pruning. The value defaults to 0, indicating that
+        pruning is always considered. After this size is exceeded, syscache
+        pruning is considered according to
+        <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep a
+        certain number of syscache entries with intermittent usage, try
+        increasing this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age">
+      <term><varname>syscache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the minimum unused time in seconds after which a syscache
+        entry is considered for removal. -1 indicates that syscache pruning
+        is disabled entirely. The value defaults to 600 seconds
+        (<literal>10 minutes</literal>). Syscache entries that have not been
+        used for this duration can be removed to prevent syscache bloat. This
+        behavior is suppressed until the size of the syscache exceeds
+        <xref linkend="guc-syscache-memory-target"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 92bda87804..ddc433c59e 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -734,7 +734,12 @@ void
 SetCurrentStatementStartTimestamp(void)
 {
     if (!IsParallelWorker())
+    {
         stmtStartTimestamp = GetCurrentTimestamp();
+
+        /* Set this timestamp as the approximated current time */
+        SetCatCacheClock(stmtStartTimestamp);
+    }
     else
         Assert(stmtStartTimestamp != 0);
 }
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 258a1d64cc..0a56390352 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -71,9 +71,32 @@
 #define CACHE6_elog(a,b,c,d,e,f,g)
 #endif
 
+/*
+ * GUC variable to define the minimum hash size at which to consider entry eviction.
+ * This variable is shared among various cache mechanisms.
+ */
+int cache_memory_target = 0;
+
+
+/*
+ * GUC for the entry limit. When the number of entries exceeds cache_entry_limit,
+ * they are pruned down to the fraction of that limit given by cache_entry_limit_prune_ratio.
+ */
+int cache_entry_limit = 0;
+double cache_entry_limit_prune_ratio = 0.8;
+
+/* GUC variable to define the minimum age of entries that will be considered to
+ * be evicted in seconds. This variable is shared among various cache
+ * mechanisms.
+ */
+int cache_prune_min_age = 600;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Timestamp used for any operation on caches. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -481,6 +504,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
 
     /* delink from linked list */
     dlist_delete(&ct->cache_elem);
+    dlist_delete(&ct->lru_node);
 
     /*
      * Free keys when we're dealing with a negative entry, normal entries just
@@ -490,6 +514,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
         CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys,
                          cache->cc_keyno, ct->keys);
 
+    cache->cc_tupsize -= ct->size;
     pfree(ct);
 
     --cache->cc_ntup;
@@ -841,7 +866,9 @@ InitCatCache(int id,
     cp->cc_nkeys = nkeys;
     for (i = 0; i < nkeys; ++i)
         cp->cc_keyno[i] = key[i];
+    cp->cc_tupsize = 0;
 
+    dlist_init(&cp->cc_lru_list);
     /*
      * new cache is initialized as far as we can go for now. print some
      * debugging information, if appropriate.
@@ -858,9 +885,133 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize the catcache reference clock if it hasn't been done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left unused for long periods for several reasons.
+ * We remove them if they are not accessed for a certain time, to prevent the
+ * catcache from bloating. The eviction uses an algorithm similar to buffer
+ * eviction, based on an access counter: entries that have been accessed
+ * several times can live longer than those with no access in the same duration.
+ */
+#define PRUNE_BY_AGE    0x01
+#define PRUNE_BY_NUMBER    0x02
+
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int            nremoved = 0;
+    size_t        hash_size;
+    int            nelems_before = cp->cc_ntup;
+    int            ndelelems = 0;
+    int            action = 0;
+    dlist_mutable_iter    iter;
+
+    if (cache_prune_min_age >= 0)
+    {
+        /* prune only if the size of the hash is above the target */
+
+        hash_size = cp->cc_nbuckets * sizeof(dlist_head);
+        if (hash_size + cp->cc_tupsize > (Size) cache_memory_target * 1024L)
+            action |= PRUNE_BY_AGE;
+    }
+
+    if (cache_entry_limit > 0 && nelems_before >= cache_entry_limit)
+    {
+        ndelelems = nelems_before -
+            (int) (cache_entry_limit * cache_entry_limit_prune_ratio);
+
+        if (ndelelems < 256)
+            ndelelems = 256;
+        if (ndelelems > nelems_before)
+            ndelelems = nelems_before;
+
+        action |= PRUNE_BY_NUMBER;
+    }
+
+    /* Return immediately if no pruning is wanted */
+    if (action == 0)
+        return false;
+
+    /* Scan over LRU to find entries to remove */
+    dlist_foreach_modify(iter, &cp->cc_lru_list)
+    {
+        CatCTup    *ct = dlist_container(CatCTup, lru_node, iter.cur);
+        bool        remove_this = false;
+
+        /* We don't remove referenced entry */
+        if (ct->refcount != 0 ||
+            (ct->c_list && ct->c_list->refcount != 0))
+            continue;
+
+        /* check against age */
+        if (action & PRUNE_BY_AGE)
+        {
+            long    entry_age;
+            int        us;
+
+            /*
+             * Calculate the duration from the time of the last access to the
+             * "current" time. Since catcacheclock is not advanced within a
+             * transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+            if (entry_age < cache_prune_min_age)
+            {
+                /* the remaining entries are all newer than the threshold; exit */
+                action &= ~PRUNE_BY_AGE;
+                break;
+            }
+
+            /*
+             * Entries that have not been accessed since the last pruning are
+             * removed at this point, while entries that have been accessed
+             * several times are removed only after being left alone for up to
+             * three times that duration. We don't try to shrink the buckets,
+             * since pruning effectively caps catcache expansion in the long term.
+             */
+            if (ct->naccess > 0)
+                ct->naccess--;
+            else 
+                remove_this = true;
+        }
+
+        /* check against entry number */
+        if (action & PRUNE_BY_NUMBER)
+        {
+            if (nremoved < ndelelems)
+                remove_this = true;
+            else
+                action &= ~PRUNE_BY_NUMBER; /* satisfied */
+        }
+
+        /* exit if finished */
+        if (action == 0)
+            break;
+
+        /* do the work */
+        if (remove_this)
+        {
+            CatCacheRemoveCTup(cp, ct);
+            nremoved++;
+        }
+    }
+
+    elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d",
+         cp->id, cp->cc_relname, nremoved, nelems_before);
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -1274,6 +1425,12 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /* Update access information for pruning */
+        if (ct->naccess < 2)
+            ct->naccess++;
+        ct->lastaccess = catcacheclock;
+        dlist_move_tail(&cache->cc_lru_list, &ct->lru_node);
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1819,11 +1976,12 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     CatCTup    *ct;
     HeapTuple    dtp;
     MemoryContext oldcxt;
+    int            tupsize = 0;
 
     /* negative entries have no tuple associated */
     if (ntp)
     {
         int            i;
 
         Assert(!negative);
 
@@ -1842,13 +2001,14 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
         /* Allocate memory for CatCTup and the cached tuple in one go */
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
 
-        ct = (CatCTup *) palloc(sizeof(CatCTup) +
-                                MAXIMUM_ALIGNOF + dtp->t_len);
+        tupsize = sizeof(CatCTup) +    MAXIMUM_ALIGNOF + dtp->t_len;
+        ct = (CatCTup *) palloc(tupsize);
         ct->tuple.t_len = dtp->t_len;
         ct->tuple.t_self = dtp->t_self;
         ct->tuple.t_tableOid = dtp->t_tableOid;
         ct->tuple.t_data = (HeapTupleHeader)
             MAXALIGN(((char *) ct) + sizeof(CatCTup));
+        ct->size = tupsize;
         /* copy tuple contents */
         memcpy((char *) ct->tuple.t_data,
                (const char *) dtp->t_data,
@@ -1876,8 +2036,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     {
         Assert(negative);
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
-        ct = (CatCTup *) palloc(sizeof(CatCTup));
-
+        tupsize = sizeof(CatCTup);
+        ct = (CatCTup *) palloc(tupsize);
         /*
          * Store keys - they'll point into separately allocated memory if not
          * by-value.
@@ -1898,18 +2058,34 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
+    dlist_push_tail(&cache->cc_lru_list, &ct->lru_node);
+    ct->size = tupsize;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
     cache->cc_ntup++;
     CacheHdr->ch_ntup++;
+    cache->cc_tupsize += tupsize;
+
+    /* increase refcount so that this survives pruning */
+    ct->refcount++;
 
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, try cleanup by removing
+     * infrequently used entries to make room for the new entry. If that
+     * fails, enlarge the bucket array instead.  Quite arbitrarily, we try
+     * this when fill factor > 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
+    /* we may still want to prune by entry number, check it */
+    else if (cache_entry_limit > 0 && cache->cc_ntup > cache_entry_limit)
+        CatCacheCleanupOldEntries(cache);
+
+    ct->refcount--;
 
     return ct;
 }
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8681ada33a..d4df841982 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -81,6 +81,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/memutils.h"
@@ -2204,6 +2205,58 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"cache_memory_target", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum syscache size to keep."),
+            gettext_noop("The cache is not pruned until it exceeds this size."),
+            GUC_UNIT_KB
+        },
+        &cache_memory_target,
+        0, 0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum unused duration of cache entries before removal."),
+            gettext_noop("Cache entries that have not been used for longer than this number of seconds are considered for removal."),
+            GUC_UNIT_S
+        },
+        &cache_prune_min_age,
+        600, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"cache_entry_limit", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the maximum number of catcache entries."),
+             NULL
+        },
+        &cache_entry_limit,
+        0, 0, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
@@ -3368,6 +3421,16 @@ static struct config_real ConfigureNamesReal[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"cache_entry_limit_prune_ratio", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the fraction of cache_entry_limit down to which catcache entries are pruned when the limit is exceeded."),
+             NULL
+        },
+        &cache_entry_limit_prune_ratio,
+        0.8, 0.0, 1.0,
+        NULL, NULL, NULL
+    },
+
     /* End-of-list marker */
     {
         {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c7f53470df..108d332f2c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -128,6 +128,8 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#cache_memory_target = 0kB    # in kB
+#cache_prune_min_age = 600s    # -1 disables pruning
 #max_stack_depth = 2MB            # min 100kB
 #shared_memory_type = mmap        # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 65d816a583..973a87c2cf 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,8 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
+    dlist_head    cc_lru_list;
+    int            cc_tupsize;        /* total amount of catcache tuples */
 
     /*
      * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,7 +122,10 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
-
+    int            naccess;        /* # of accesses to this entry, up to 2 */
+    TimestampTz    lastaccess;        /* approx. timestamp of the last usage */
+    dlist_node    lru_node;        /* LRU node */
+    int            size;            /* palloc'ed size of this tuple */
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
      * catcache is list-searched with varying numbers of keys, we may have to
@@ -189,6 +195,30 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int cache_prune_min_age;
+extern int cache_memory_target;
+extern int cache_entry_limit;
+extern double cache_entry_limit_prune_ratio;
+
+/* to use as access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/*
+ * SetCatCacheClock - set timestamp for catcache access record
+ */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
+static inline TimestampTz
+GetCatCacheClock(void)
+{
+    return catcacheclock;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.16.3
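
As a side note on the catcache clock exported above, here is a minimal sketch of how it is meant to be driven (the helper name is hypothetical; the actual patch hooks SetCurrentStatementStartTimestamp() in xact.c, as a later version below shows). Advancing the clock only once per statement means every access inside a statement records the same approximate timestamp, so entries used in the current transaction never look old to the pruning logic.

#include "postgres.h"

#include "access/xact.h"
#include "utils/catcache.h"

/* hypothetical helper: advance the catcache clock once per statement */
static void
AdvanceCatCacheClock(void)
{
    /* reuse the statement start timestamp instead of calling gettimeofday() */
    SetCatCacheClock(GetCurrentStatementStartTimestamp());
}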

From 2591d5984d5fb9f2fd4cca0ecb8c68431311790a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 16 Oct 2018 15:48:28 +0900
Subject: [PATCH 3/4] Syscache usage tracking feature.

Collects syscache usage statistics and shows them using the view
pg_stat_syscache. The feature is controlled by the GUC variable
track_syscache_usage_interval.
---
 doc/src/sgml/config.sgml                      |  15 ++
 src/backend/catalog/system_views.sql          |  17 +++
 src/backend/postmaster/pgstat.c               | 201 ++++++++++++++++++++++++--
 src/backend/tcop/postgres.c                   |  23 +++
 src/backend/utils/adt/pgstatfuncs.c           | 134 +++++++++++++++++
 src/backend/utils/cache/catcache.c            |  89 +++++++++---
 src/backend/utils/cache/syscache.c            |  24 +++
 src/backend/utils/init/globals.c              |   1 +
 src/backend/utils/init/postinit.c             |  11 ++
 src/backend/utils/misc/guc.c                  |  10 ++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/catalog/pg_proc.dat               |   9 ++
 src/include/miscadmin.h                       |   1 +
 src/include/pgstat.h                          |   6 +-
 src/include/utils/catcache.h                  |   9 +-
 src/include/utils/syscache.h                  |  19 +++
 src/include/utils/timeout.h                   |   1 +
 src/test/regress/expected/rules.out           |  24 ++-
 18 files changed, 559 insertions(+), 36 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d0d2374944..5ff3ebeb4e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6687,6 +6687,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-syscache-usage-interval" xreflabel="track_syscache_usage_interval">
+      <term><varname>track_syscache_usage_interval</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>track_syscache_usage_interval</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the interval, in milliseconds, at which system cache usage
+        statistics are collected. The default is 0, which disables collection.
+        Only superusers can change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-io-timing" xreflabel="track_io_timing">
       <term><varname>track_io_timing</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 3e229c693c..f5d1aaf96f 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -906,6 +906,22 @@ CREATE VIEW pg_stat_progress_vacuum AS
     FROM pg_stat_get_progress_info('VACUUM') AS S
         LEFT JOIN pg_database D ON S.datid = D.oid;
 
+CREATE VIEW pg_stat_syscache AS
+    SELECT
+        S.pid                AS pid,
+        S.relid::regclass    AS relname,
+        S.indid::regclass    AS cache_name,
+        S.size                AS size,
+        S.ntup                AS ntuples,
+        S.searches            AS searches,
+        S.hits                AS hits,
+        S.neg_hits            AS neg_hits,
+        S.ageclass            AS ageclass,
+        S.last_update        AS last_update
+    FROM pg_stat_activity A
+    JOIN LATERAL (SELECT A.pid, * FROM pg_get_syscache_stats(A.pid)) S
+        ON (A.pid = S.pid);
+
 CREATE VIEW pg_user_mappings AS
     SELECT
         U.oid       AS umid,
@@ -1185,6 +1201,7 @@ GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_get_syscache_stats(int) TO pg_monitor;
 
 GRANT pg_read_all_settings TO pg_monitor;
 GRANT pg_read_all_stats TO pg_monitor;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 81c6499251..a1939958b7 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -66,6 +66,7 @@
 #include "utils/ps_status.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 
@@ -124,6 +125,7 @@
 bool        pgstat_track_activities = false;
 bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
+int            pgstat_track_syscache_usage_interval = 0;
 int            pgstat_track_activity_query_size = 1024;
 
 /* ----------
@@ -236,6 +238,11 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
+/* bitmap symbols to specify target file types to remove */
+#define PGSTAT_REMFILE_DBSTAT    1        /* remove only database stats files */
+#define PGSTAT_REMFILE_SYSCACHE    2        /* remove only syscache stats files */
+#define PGSTAT_REMFILE_ALL        3        /* remove both types of files */
+
 /*
  * Info about current "snapshot" of stats file
  */
@@ -335,6 +342,7 @@ static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
 static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
 static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+static void pgstat_remove_syscache_statsfile(void);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
@@ -630,10 +638,13 @@ startup_failed:
 }
 
 /*
- * subroutine for pgstat_reset_all
+ * remove stats files
+ *
+ * Clean up stats files in the specified directory.  target is one of
+ * PGSTAT_REMFILE_DBSTAT/SYSCACHE/ALL and restricts which files are removed.
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+pgstat_reset_remove_files(const char *directory, int target)
 {
     DIR           *dir;
     struct dirent *entry;
@@ -644,25 +655,39 @@ pgstat_reset_remove_files(const char *directory)
     {
         int            nchars;
         Oid            tmp_oid;
+        int            filetype = 0;
 
         /*
          * Skip directory entries that don't match the file names we write.
          * See get_dbstat_filename for the database-specific pattern.
          */
         if (strncmp(entry->d_name, "global.", 7) == 0)
+        {
+            filetype = PGSTAT_REMFILE_DBSTAT;
             nchars = 7;
+        }
         else
         {
+            char head[2];
+            
             nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
+            (void) sscanf(entry->d_name, "%c%c_%u.%n",
+                          head, head + 1, &tmp_oid, &nchars);
+
             /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
+            if (nchars < 3 || !isdigit(entry->d_name[3]))
                 continue;
+
+            if  (strncmp(head, "db", 2) == 0)
+                filetype = PGSTAT_REMFILE_DBSTAT;
+            else if (strncmp(head, "cc", 2) == 0)
+                filetype = PGSTAT_REMFILE_SYSCACHE;
         }
 
+        /* skip if this is not a target */
+        if ((filetype & target) == 0)
+            continue;
+
         if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
             strcmp(entry->d_name + nchars, "stat") != 0)
             continue;
@@ -683,8 +708,9 @@ pgstat_reset_remove_files(const char *directory)
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_ALL);
+    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY,
+                              PGSTAT_REMFILE_ALL);
 }
 
 #ifdef EXEC_BACKEND
@@ -2963,6 +2989,10 @@ pgstat_beshutdown_hook(int code, Datum arg)
     if (OidIsValid(MyDatabaseId))
         pgstat_report_stat(true);
 
+    /* clear syscache statistics files and temporary settings */
+    if (MyBackendId != InvalidBackendId)
+        pgstat_remove_syscache_statsfile();
+
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
      * before and after.  We use a volatile pointer here to ensure the
@@ -4287,6 +4317,9 @@ PgstatCollectorMain(int argc, char *argv[])
     pgStatRunningInCollector = true;
     pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
 
+    /* Remove left-over syscache stats files */
+    pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_SYSCACHE);
+
     /*
      * Loop to process messages until we get SIGQUIT or detect ungraceful
      * death of our parent postmaster.
@@ -6377,3 +6410,153 @@ pgstat_clip_activity(const char *raw_activity)
 
     return activity;
 }
+
+/*
+ * return the filename for a syscache stat file; filename is the output
+ * buffer, of length len.
+ */
+void
+pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid,
+                                 char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/cc_%u.%s",
+                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
+                       pgstat_stat_directory,
+                       backendid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
+}
+
+/* removes syscache stats files of this backend */
+static void
+pgstat_remove_syscache_statsfile(void)
+{
+    char    fname[MAXPGPATH];
+
+    pgstat_get_syscachestat_filename(false, false, MyBackendId,
+                                     fname, MAXPGPATH);
+    unlink(fname);        /* we don't care about the result */
+}
+
+/*
+ * pgstat_write_syscache_stats() -
+ *        Write the syscache statistics files.
+ *
+ * If 'force' is false, this function skips writing a file and returns the
+ * time remaining in the current interval in milliseconds. If 'force' is true,
+ * it writes a file regardless of the remaining time and resets the interval.
+ */
+long
+pgstat_write_syscache_stats(bool force)
+{
+    static TimestampTz last_report = 0;
+    TimestampTz now;
+    long elapsed;
+    long secs;
+    int     usecs;
+    int    cacheId;
+    FILE    *fpout;
+    char    statfile[MAXPGPATH];
+    char    tmpfile[MAXPGPATH];
+
+    /* Return if we don't want it */
+    if (!force && pgstat_track_syscache_usage_interval <= 0)
+    {
+        /* disabled. remove the statistics file if any */
+        if (last_report > 0)
+        {
+            last_report = 0;
+            pgstat_remove_syscache_statsfile();
+        }
+        return 0;
+    }
+    
+    /* Check against the interval */
+    now = GetCurrentTransactionStopTimestamp();
+    TimestampDifference(last_report, now, &secs, &usecs);
+    elapsed = secs * 1000 + usecs / 1000;
+
+    if (!force && elapsed < pgstat_track_syscache_usage_interval)
+    {
+        /* not yet the time, inform the remaining time to the caller */
+        return pgstat_track_syscache_usage_interval - elapsed;
+    }
+
+    /* now update the stats */
+    last_report = now;
+
+    pgstat_get_syscachestat_filename(false, true,
+                                     MyBackendId, tmpfile, MAXPGPATH);
+    pgstat_get_syscachestat_filename(false, false,
+                                     MyBackendId, statfile, MAXPGPATH);
+
+    /*
+     * This function can be called from ProcessInterrupts(). Hold interrupts
+     * to avoid being entered recursively.
+     */
+    HOLD_INTERRUPTS();
+
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        /*
+         * Failure writing this file is not critical. Just skip this time and
+         * tell caller to wait for the next interval.
+         */
+        RESUME_INTERRUPTS();
+        return pgstat_track_syscache_usage_interval;
+    }
+
+    /* write out stats for every catcache */
+    for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++)
+    {
+        SysCacheStats *stats;
+        
+        stats = SysCacheGetStats(cacheId);
+        Assert (stats);
+
+        /* write error is checked later using ferror() */
+        fputc('T', fpout);
+        (void)fwrite(&cacheId, sizeof(int), 1, fpout);
+        (void)fwrite(&last_report, sizeof(TimestampTz), 1, fpout);
+        (void)fwrite(stats, sizeof(*stats), 1, fpout);
+    }
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write syscache statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close syscache statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename syscache statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+
+    RESUME_INTERRUPTS();
+    return 0;
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 36cfd507b2..fb77a0ce4c 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3157,6 +3157,12 @@ ProcessInterrupts(void)
 
     }
 
+    if (IdleSyscacheStatsUpdateTimeoutPending)
+    {
+        IdleSyscacheStatsUpdateTimeoutPending = false;
+        pgstat_write_syscache_stats(true);
+    }
+
     if (ParallelMessagePending)
         HandleParallelMessages();
 }
@@ -3733,6 +3739,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_catcache_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4173,9 +4180,19 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long timeout;
+
                 ProcessCompletedNotifies();
                 pgstat_report_stat(false);
 
+                timeout = pgstat_write_syscache_stats(false);
+
+                if (timeout > 0)
+                {
+                    disable_idle_catcache_update_timeout = true;
+                    enable_timeout_after(IDLE_CATCACHE_UPDATE_TIMEOUT,
+                                         timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4218,6 +4235,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_catcache_update_timeout)
+        {
+            disable_timeout(IDLE_CATCACHE_UPDATE_TIMEOUT, false);
+            disable_idle_catcache_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index b6ba856ebe..6526cfefb4 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -14,6 +14,8 @@
  */
 #include "postgres.h"
 
+#include <sys/stat.h>
+
 #include "access/htup_details.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
@@ -28,6 +30,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/inet.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
@@ -1899,3 +1902,134 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     PG_RETURN_DATUM(HeapTupleGetDatum(
                                       heap_form_tuple(tupdesc, values, nulls)));
 }
+
+Datum
+pgstat_get_syscache_stats(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SYSCACHE_SIZE 9
+    int                    pid     = PG_GETARG_INT32(0);
+    ReturnSetInfo       *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+    TupleDesc            tupdesc;
+    Tuplestorestate    *tupstore;
+    MemoryContext        per_query_ctx;
+    MemoryContext        oldcontext;
+    PgBackendStatus       *beentry;
+    int                    beid;
+    char                fname[MAXPGPATH];
+    FILE                  *fpin;
+    char c;
+
+    if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("set-valued function called in context that cannot accept a set")));
+    if (!(rsinfo->allowedModes & SFRM_Materialize))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("materialize mode required, but it is not " \
+                        "allowed in this context")));
+
+    /* Build a tuple descriptor for our result type */
+    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+        elog(ERROR, "return type must be a row type");
+    
+
+    per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+
+    oldcontext = MemoryContextSwitchTo(per_query_ctx);
+    tupstore = tuplestore_begin_heap(true, false, work_mem);
+    rsinfo->returnMode = SFRM_Materialize;
+    rsinfo->setResult = tupstore;
+    rsinfo->setDesc = tupdesc;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    /* find beentry for given pid*/
+    beentry = NULL;
+    for (beid = 1;
+         (beentry = pgstat_fetch_stat_beentry(beid)) &&
+             beentry->st_procpid != pid ;
+         beid++);
+
+    /*
+     * we silently return empty result on failure or insufficient privileges
+     */
+    if (!beentry ||
+        (!has_privs_of_role(GetUserId(), beentry->st_userid) &&
+         !is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS)))
+        goto no_data;
+
+    pgstat_get_syscachestat_filename(false, false, beid, fname, MAXPGPATH);
+
+    if ((fpin = AllocateFile(fname, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(WARNING,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            fname)));
+        /* also return empty on no statistics file */
+        goto no_data;
+    }
+
+    /* read the statistics file into tuplestore */
+    while ((c = fgetc(fpin)) == 'T')
+    {
+        TimestampTz last_update;
+        SysCacheStats stats;
+        int cacheid;
+        Datum values[PG_GET_SYSCACHE_SIZE];
+        bool nulls[PG_GET_SYSCACHE_SIZE] = {0};
+        Datum datums[SYSCACHE_STATS_NAGECLASSES * 2];
+        bool arrnulls[SYSCACHE_STATS_NAGECLASSES * 2] = {0};
+        int    dims[] = {SYSCACHE_STATS_NAGECLASSES, 2};
+        int lbs[] = {1, 1};
+        ArrayType *arr;
+        int i, j;
+
+        fread(&cacheid, sizeof(int), 1, fpin);
+        fread(&last_update, sizeof(TimestampTz), 1, fpin);
+        if (fread(&stats, 1, sizeof(stats), fpin) != sizeof(stats))
+        {
+            ereport(WARNING, 
+                    (errmsg("corrupted syscache statistics file \"%s\"",
+                            fname)));
+            goto no_data;
+        }
+
+        i = 0;
+        values[i++] = ObjectIdGetDatum(stats.reloid);
+        values[i++] = ObjectIdGetDatum(stats.indoid);
+        values[i++] = Int64GetDatum(stats.size);
+        values[i++] = Int64GetDatum(stats.ntuples);
+        values[i++] = Int64GetDatum(stats.nsearches);
+        values[i++] = Int64GetDatum(stats.nhits);
+        values[i++] = Int64GetDatum(stats.nneg_hits);
+
+        for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++)
+        {
+            datums[j * 2] = Int32GetDatum((int32) stats.ageclasses[j]);
+            datums[j * 2 + 1] = Int32GetDatum((int32) stats.nclass_entries[j]);
+        }            
+
+        arr = construct_md_array(datums, arrnulls, 2, dims, lbs,
+                              INT4OID, sizeof(int32), true, 'i');
+        values[i++] = PointerGetDatum(arr);
+
+        values[i++] = TimestampTzGetDatum(last_update);
+
+        Assert (i == PG_GET_SYSCACHE_SIZE);
+
+        tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+    }
+
+    /* check for the end of file. abandon the result if file is broken */
+    if (c != 'E' || fgetc(fpin) != EOF)
+        tuplestore_clear(tupstore);
+
+    FreeFile(fpin);
+
+no_data:
+    tuplestore_donestoring(tupstore);
+    return (Datum) 0;
+}
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 0a56390352..bdcc10064f 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -97,6 +97,10 @@ static CatCacheHeader *CacheHdr = NULL;
 /* Timestamp used for any operation on caches. */
 TimestampTz    catcacheclock = 0;
 
+/* age classes for pruning */
+static double ageclass[SYSCACHE_STATS_NAGECLASSES]
+    = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -628,9 +632,7 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue)
             else
                 CatCacheRemoveCTup(cache, ct);
             CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated");
-#ifdef CATCACHE_STATS
             cache->cc_invals++;
-#endif
             /* could be multiple matches, so keep looking! */
         }
     }
@@ -706,9 +708,7 @@ ResetCatalogCache(CatCache *cache)
             }
             else
                 CatCacheRemoveCTup(cache, ct);
-#ifdef CATCACHE_STATS
             cache->cc_invals++;
-#endif
         }
     }
 }
@@ -958,10 +958,10 @@ CatCacheCleanupOldEntries(CatCache *cp)
             int        us;
 
             /*
-             * Calculate the duration from the time of the last access to the
-             * "current" time. Since catcacheclock is not advanced within a
-             * transaction, the entries that are accessed within the current
-             * transaction won't be pruned.
+             * Calculate the duration from the time of the last access to
+             * the "current" time. Since catcacheclock is not advanced within
+             * a transaction, the entries that are accessed within the current
+             * transaction always get 0 as the result.
              */
             TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
 
@@ -1381,9 +1381,7 @@ SearchCatCacheInternal(CatCache *cache,
     if (unlikely(cache->cc_tupdesc == NULL))
         CatalogCacheInitializeCache(cache);
 
-#ifdef CATCACHE_STATS
     cache->cc_searches++;
-#endif
 
     /* Initialize local parameter array */
     arguments[0] = v1;
@@ -1444,9 +1442,7 @@ SearchCatCacheInternal(CatCache *cache,
             CACHE3_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d",
                         cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
             cache->cc_hits++;
-#endif
 
             return &ct->tuple;
         }
@@ -1455,9 +1451,7 @@ SearchCatCacheInternal(CatCache *cache,
             CACHE3_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d",
                         cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
             cache->cc_neg_hits++;
-#endif
 
             return NULL;
         }
@@ -1585,9 +1579,7 @@ SearchCatCacheMiss(CatCache *cache,
     CACHE3_elog(DEBUG2, "SearchCatCache(%s): put in bucket %d",
                 cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
     cache->cc_newloads++;
-#endif
 
     return &ct->tuple;
 }
@@ -1698,9 +1690,7 @@ SearchCatCacheList(CatCache *cache,
 
     Assert(nkeys > 0 && nkeys < cache->cc_nkeys);
 
-#ifdef CATCACHE_STATS
     cache->cc_lsearches++;
-#endif
 
     /* Initialize local parameter array */
     arguments[0] = v1;
@@ -1757,9 +1747,7 @@ SearchCatCacheList(CatCache *cache,
         CACHE2_elog(DEBUG2, "SearchCatCacheList(%s): found list",
                     cache->cc_relname);
 
-#ifdef CATCACHE_STATS
         cache->cc_lhits++;
-#endif
 
         return cl;
     }
@@ -2276,3 +2264,64 @@ PrintCatCacheListLeakWarning(CatCList *list)
          list->my_cache->cc_relname, list->my_cache->id,
          list, list->refcount);
 }
+
+/*
+ * CatCacheGetStats - fill in SysCacheStats struct.
+ *
+ * This is a support routine for SysCacheGetStats and fills in most of the
+ * result. The classification here is based on the same criteria as
+ * CatCacheCleanupOldEntries().
+ */
+void
+CatCacheGetStats(CatCache *cache, SysCacheStats *stats)
+{
+    int    i, j;
+
+    Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0);
+
+    /* fill in the stats struct */
+    stats->size = cache->cc_tupsize + cache->cc_nbuckets * sizeof(dlist_head);
+    stats->ntuples = cache->cc_ntup;
+    stats->nsearches = cache->cc_searches;
+    stats->nhits = cache->cc_hits;
+    stats->nneg_hits = cache->cc_neg_hits;
+
+    /* cache_prune_min_age can change during a session; fill this in every time */
+    for (i = 0 ; i < SYSCACHE_STATS_NAGECLASSES ; i++)
+        stats->ageclasses[i] = (int) (cache_prune_min_age * ageclass[i]);
+
+    /*
+     * The nth element of nclass_entries stores the number of cache entries
+     * that have stayed unaccessed for the corresponding multiple of
+     * cache_prune_min_age given in ageclass.
+     */
+    memset(stats->nclass_entries, 0, sizeof(int) * SYSCACHE_STATS_NAGECLASSES);
+
+    /* Scan the whole hash */
+    for (i = 0; i < cache->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cache->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+            /*
+             * Calculate the duration from the time of the last access to
+             * the "current" time. Since catcacheclock is not advanced within
+             * a transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+            j = 0;
+            while (j < SYSCACHE_STATS_NAGECLASSES - 1 &&
+                   entry_age > stats->ageclasses[j])
+                j++;
+
+            stats->nclass_entries[j]++;
+        }
+    }
+}
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index ac98c19155..7b38a06708 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -20,6 +20,9 @@
  */
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "catalog/indexing.h"
@@ -1534,6 +1537,27 @@ RelationSupportsSysCache(Oid relid)
     return false;
 }
 
+/*
+ * SysCacheGetStats - returns stats of specified syscache
+ *
+ * This routine returns the address of its local static memory.
+ */
+SysCacheStats *
+SysCacheGetStats(int cacheId)
+{
+    static SysCacheStats stats;
+
+    Assert(cacheId >=0 && cacheId < SysCacheSize);
+
+    memset(&stats, 0, sizeof(stats));
+
+    stats.reloid = cacheinfo[cacheId].reloid;
+    stats.indoid = cacheinfo[cacheId].indoid;
+
+    CatCacheGetStats(SysCache[cacheId], &stats);
+
+    return &stats;
+}
 
 /*
  * OID comparator for pg_qsort
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index fd51934aaf..f039ecd805 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index c0b6231458..dee7f19475 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleSyscacheStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_CATCACHE_UPDATE_TIMEOUT,
+                        IdleSyscacheStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1239,6 +1242,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleSyscacheStatsUpdateTimeoutHandler(void)
+{
+    IdleSyscacheStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d4df841982..7bb239a07e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3198,6 +3198,16 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"track_syscache_usage_interval", PGC_SUSET, STATS_COLLECTOR,
+            gettext_noop("Sets the interval between syscache usage collection, in milliseconds. Zero disables syscache usage tracking."),
+            NULL
+        },
+        &pgstat_track_syscache_usage_interval,
+        0, 0, INT_MAX / 2,
+        NULL, NULL, NULL
+    },
+
     {
         {"gin_pending_list_limit", PGC_USERSET, CLIENT_CONN_STATEMENT,
             gettext_noop("Sets the maximum size of the pending list for GIN index."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 108d332f2c..4d4fb42251 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -560,6 +560,7 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
+#track_syscache_usage_interval = 0    # zero disables tracking
 #stats_temp_directory = 'pg_stat_tmp'
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b8de13f03b..6099a828d2 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -9669,6 +9669,15 @@
   proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
  proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}',
   prosrc => 'pg_get_replication_slots' },
+{ oid => '3425',
+  descr => 'syscache statistics',
+  proname => 'pg_get_syscache_stats', prorows => '100', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', prorettype => 'record',
+  proargtypes => 'int4',
+  proallargtypes => '{int4,oid,oid,int8,int8,int8,int8,int8,_int4,timestamptz}',
+  proargmodes => '{i,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}',
+  prosrc => 'pgstat_get_syscache_stats' },
 { oid => '3786', descr => 'set up a logical replication slot',
   proname => 'pg_create_logical_replication_slot', provolatile => 'v',
   proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c9e35003a5..69b9a976f0 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 88a75fb798..b6bfd7d644 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1144,6 +1144,7 @@ extern bool pgstat_track_activities;
 extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
+extern int    pgstat_track_syscache_usage_interval;
 extern char *pgstat_stat_directory;
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
@@ -1228,7 +1229,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
-
+extern void pgstat_get_syscachestat_filename(bool permanent,
+                    bool tempname, int backendid, char *filename, int len);
 /* ----------
  * pgstat_report_wait_start() -
  *
@@ -1363,5 +1365,5 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
-
+extern long pgstat_write_syscache_stats(bool force);
 #endif                            /* PGSTAT_H */
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 973a87c2cf..85fa7bdb86 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -66,10 +66,8 @@ typedef struct catcache
     int            cc_tupsize;        /* total amount of catcache tuples */
 
     /*
-     * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
-     * doesn't break ABI for other modules
+     * Statistics entries
      */
-#ifdef CATCACHE_STATS
     long        cc_searches;    /* total # searches against this cache */
     long        cc_hits;        /* # of matches against existing entry */
     long        cc_neg_hits;    /* # of matches against negative entry */
@@ -82,7 +80,6 @@ typedef struct catcache
     long        cc_invals;        /* # of entries invalidated from cache */
     long        cc_lsearches;    /* total # list-searches */
     long        cc_lhits;        /* # of matches against existing lists */
-#endif
 } CatCache;
 
 
@@ -258,4 +255,8 @@ extern void PrepareToInvalidateCacheTuple(Relation relation,
 extern void PrintCatCacheLeakWarning(HeapTuple tuple);
 extern void PrintCatCacheListLeakWarning(CatCList *list);
 
+/* defined in syscache.h */
+typedef struct syscachestats SysCacheStats;
+extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats);
+
 #endif                            /* CATCACHE_H */
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index 95ee48954e..71b399c902 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -112,6 +112,24 @@ enum SysCacheIdentifier
 #define SysCacheSize (USERMAPPINGUSERSERVER + 1)
 };
 
+#define SYSCACHE_STATS_NAGECLASSES 6
+/* Struct for catcache tracking information */
+typedef struct syscachestats
+{
+    Oid        reloid;            /* target relation */
+    Oid        indoid;            /* index */
+    size_t    size;            /* size of the catcache */
+    int        ntuples;        /* number of tuples residing in the catcache */
+    int        nsearches;        /* number of searches */
+    int        nhits;            /* number of cache hits */
+    int        nneg_hits;        /* number of negative cache hits */
+    /* age classes in seconds */
+    int        ageclasses[SYSCACHE_STATS_NAGECLASSES];
+    /* number of tuples falling into the corresponding age class */
+    int        nclass_entries[SYSCACHE_STATS_NAGECLASSES];
+} SysCacheStats;
+
+
 extern void InitCatalogCache(void);
 extern void InitCatalogCachePhase2(void);
 
@@ -164,6 +182,7 @@ extern void SysCacheInvalidate(int cacheId, uint32 hashValue);
 extern bool RelationInvalidatesSnapshotsOnly(Oid relid);
 extern bool RelationHasSysCache(Oid relid);
 extern bool RelationSupportsSysCache(Oid relid);
+extern SysCacheStats *SysCacheGetStats(int cacheId);
 
 /*
  * The use of the macros below rather than direct calls to the corresponding
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 9244a2a7b7..0ab441a364 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_CATCACHE_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2c8e21baa7..7bd77e9972 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1921,6 +1921,28 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
     pg_stat_all_tables.autoanalyze_count
    FROM pg_stat_all_tables
  WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
 
+pg_stat_syscache| SELECT s.pid,
+    (s.relid)::regclass AS relname,
+    (s.indid)::regclass AS cache_name,
+    s.size,
+    s.ntup AS ntuples,
+    s.searches,
+    s.hits,
+    s.neg_hits,
+    s.ageclass,
+    s.last_update
+   FROM (pg_stat_activity a
+     JOIN LATERAL ( SELECT a.pid,
+            pg_get_syscache_stats.relid,
+            pg_get_syscache_stats.indid,
+            pg_get_syscache_stats.size,
+            pg_get_syscache_stats.ntup,
+            pg_get_syscache_stats.searches,
+            pg_get_syscache_stats.hits,
+            pg_get_syscache_stats.neg_hits,
+            pg_get_syscache_stats.ageclass,
+            pg_get_syscache_stats.last_update
+           FROM pg_get_syscache_stats(a.pid) pg_get_syscache_stats(relid, indid, size, ntup, searches, hits, neg_hits, ageclass, last_update)) s ON ((a.pid = s.pid)));
 
 pg_stat_user_functions| SELECT p.oid AS funcid,
     n.nspname AS schemaname,
     p.proname AS funcname,
@@ -2352,7 +2374,7 @@ pg_settings|pg_settings_n|CREATE RULE pg_settings_n AS
     ON UPDATE TO pg_catalog.pg_settings DO INSTEAD NOTHING;
 pg_settings|pg_settings_u|CREATE RULE pg_settings_u AS
     ON UPDATE TO pg_catalog.pg_settings
-   WHERE (new.name = old.name) DO  SELECT set_config(old.name, new.setting, false) AS set_config;
+   WHERE (new.name = old.name) DO  SELECT set_config(old.name, new.setting, false, false) AS set_config;
 rtest_emp|rtest_emp_del|CREATE RULE rtest_emp_del AS
     ON DELETE TO public.rtest_emp DO  INSERT INTO rtest_emplog (ename, who, action, newsal, oldsal)
   VALUES (old.ename, CURRENT_USER, 'fired'::bpchar, '$0.00'::money, old.salary);
-- 
2.16.3
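
For reference, the per-backend stats file written by pgstat_write_syscache_stats() above has a simple record layout: a 'T' byte followed by the cache id, the report timestamp and a SysCacheStats struct, repeated for each catcache, and terminated by a single 'E' byte. A minimal, hypothetical reader (illustration only, not part of the patch) could look like this:

#include "postgres.h"

#include "utils/syscache.h"
#include "utils/timestamp.h"

/* read back the records written by the patch; fpin is an already-open file */
static void
ReadSyscacheStatsFile(FILE *fpin)
{
    int     c;

    while ((c = fgetc(fpin)) == 'T')
    {
        int             cacheid;
        TimestampTz     last_update;
        SysCacheStats   stats;

        if (fread(&cacheid, sizeof(int), 1, fpin) != 1 ||
            fread(&last_update, sizeof(TimestampTz), 1, fpin) != 1 ||
            fread(&stats, sizeof(stats), 1, fpin) != 1)
            break;              /* truncated file, give up */

        elog(DEBUG1, "cache %d: %d tuples, %zu bytes",
             cacheid, stats.ntuples, (size_t) stats.size);
    }

    if (c != 'E')
        elog(WARNING, "syscache statistics file looks corrupted");
}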


RE: Protect syscache from bloating with negative cache entries

От
"Ideriha, Takeshi"
Дата:
Hi, thanks for the recent rapid work. 

>From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
>At Tue, 5 Feb 2019 19:05:26 -0300, Alvaro Herrera <alvherre@2ndquadrant.com>
>wrote in <20190205220526.GA1442@alvherre.pgsql>
>> On 2019-Feb-05, Tomas Vondra wrote:
>>
>> > I don't think we need to remove the expired entries right away, if
>> > there are only very few of them. The cleanup requires walking the
>> > hash table, which means significant fixed cost. So if there are only
>> > few expired entries (say, less than 25% of the cache), we can just
>> > leave them around and clean them if we happen to stumble on them
>> > (although that may not be possible with dynahash, which has no
>> > concept of expiration) of before enlarging the hash table.
>>
>> I think seqscanning the hash table is going to be too slow;
>> Ideriha-san idea of having a dlist with the entries in LRU order
>> (where each entry is moved to head of list when it is touched) seemed
>> good: it allows you to evict older ones when the time comes, without
>> having to scan the rest of the entries.  Having a dlist means two more
>> pointers on each cache entry AFAIR, so it's not a huge amount of memory.
>
>Ah, I had a separate list in my mind. Sounds reasonable to have pointers in cache entry.
>But I'm not sure how much additional
>dlist_* impact.

Thank you for picking up my comment, Alvaro.
That's what I was thinking about.

>The attached is the new version with the following properties:
>
>- Both prune-by-age and hard limiting feature.
>  (Merged into single function, single scan)
>  Debug tracking feature in CatCacheCleanupOldEntries is removed
>  since it no longer runs a full scan.
It seems to me that adding a hard-limit strategy alongside the prune-by-age one is good
for handling the variety of (contradictory) cases discussed in this thread. I need the hard limit as well.

The hard limit is currently expressed as a number of cache entries,
controlled by both cache_entry_limit and cache_entry_limit_prune_ratio.
Why don't we change it to an amount of memory (bytes)?
An amount of memory is a more direct parameter for a customer who wants to
set the hard limit, and it is easier to tune than a number of cache entries.
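
To make the suggestion concrete, a rough sketch of a bytes-based check follows (not a patch; cache_memory_limit is a hypothetical GUC in kB, while cc_nbuckets and cc_tupsize are the fields already maintained by the posted patch):

#include "postgres.h"

#include "utils/catcache.h"

/* hypothetical GUC, in kB; 0 disables the hard limit */
int     cache_memory_limit = 0;

static bool
CatCacheExceedsMemoryLimit(CatCache *cp)
{
    Size    total;

    if (cache_memory_limit <= 0)
        return false;

    /* bucket array plus the accumulated tuple sizes */
    total = cp->cc_nbuckets * sizeof(dlist_head) + cp->cc_tupsize;

    return total > (Size) cache_memory_limit * 1024L;
}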

>- Using LRU to get rid of full scan.
>
>I added new API dlist_move_to_tail which was needed to construct LRU.

I just thought that dlist_move_head() already exists, so new entries could go
on the head side and old ones on the tail side. But that's no objection to adding
the new API, because depending on the situation, putting new entries at the head
(or the tail) can make for more readable code.

Regards,
Takeshi Ideriha



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
At Thu, 07 Feb 2019 15:24:18 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20190207.152418.139132570.horiguchi.kyotaro@lab.ntt.co.jp>
> I'm going to retake numbers with search-only queries.

Yeah, I was stupid.

I reran the benchmark using "-S -T 30" on a server built with no
assertions and -O2. The numbers are the best of three
successive attempts.  The patched version is running with
cache_memory_target = 0, cache_prune_min_age = 600 and
cache_entry_limit = 0, but pruning does not happen with this workload.

master: 13393 tps
v12   : 12625 tps (-6%)

A significant degradation is found.

Reducing the frequency of dlist_move_tail, by enforcing a 1ms interval
between two successive updates of the same entry, makes the
degradation disappear.

patched  : 13720 tps (+2%)

I think there's no need for even that much frequency. It is 100ms in
the attached patch.

# I'm not sure the name LRU_IGNORANCE_INTERVAL makes sense.
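
For what it's worth, the gating can be written as a tiny helper like the sketch below (illustration only; the helper name is made up and the attached patch may arrange this differently). Since TimestampTz is in microseconds, a plain subtraction is enough:

#include "postgres.h"

#include "utils/catcache.h"

#define LRU_IGNORANCE_INTERVAL  100000      /* 100ms, in microseconds */

/* hypothetical helper called on every cache hit */
static inline void
CatCacheTouchEntry(CatCache *cache, CatCTup *ct)
{
    if (ct->naccess < 2)
        ct->naccess++;

    /* move to the MRU end of the LRU list at most once per interval */
    if (catcacheclock - ct->lastaccess > LRU_IGNORANCE_INTERVAL)
        dlist_move_tail(&cache->cc_lru_list, &ct->lru_node);

    ct->lastaccess = catcacheclock;
}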

The attached 

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 72a569703662b93fb57c55c337b16107ebccfce3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 7 Feb 2019 14:56:07 +0900
Subject: [PATCH 1/4] Add dlist_move_tail

We have dlist_push_head/tail and dlist_move_head but not
dlist_move_tail. Add it.
---
 src/include/lib/ilist.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/src/include/lib/ilist.h b/src/include/lib/ilist.h
index b1a5974ee4..659ab1ac87 100644
--- a/src/include/lib/ilist.h
+++ b/src/include/lib/ilist.h
@@ -394,6 +394,25 @@ dlist_move_head(dlist_head *head, dlist_node *node)
     dlist_check(head);
 }
 
+/*
+ * Move element from its current position in the list to the tail position in
+ * the same list.
+ *
+ * Undefined behaviour if 'node' is not already part of the list.
+ */
+static inline void
+dlist_move_tail(dlist_head *head, dlist_node *node)
+{
+    /* fast path if it's already at the tail */
+    if (head->head.prev == node)
+        return;
+
+    dlist_delete(node);
+    dlist_push_tail(head, node);
+
+    dlist_check(head);
+}
+
 /*
  * Check whether 'node' has a following node.
  * Caution: unreliable if 'node' is not in the list.
-- 
2.16.3
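
A minimal usage sketch for the new function (illustration only; the struct and helper below are made up): a list is kept in least-recently-used order by moving an element to the tail whenever it is touched.

#include "postgres.h"

#include "lib/ilist.h"

typedef struct LruItem
{
    dlist_node  node;
    int         value;
} LruItem;

/* mark 'item' as most recently used; the oldest items stay near the head */
static inline void
lru_touch(dlist_head *lru, LruItem *item)
{
    dlist_move_tail(lru, &item->node);
}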

From 429001e7cbbb88710cfc5589bc46e2490f93d216 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 16 Oct 2018 13:04:30 +0900
Subject: [PATCH 2/4] Remove entries that haven't been used for a certain time

Catcache entries can go unused for long periods, and it is not
desirable that they eat up memory. With this patch, entries that
haven't been used for a certain time are considered for removal
before the hash array is enlarged.

This patch can also put a hard limit on the number of catcache entries.
---
 doc/src/sgml/config.sgml                      |  38 +++++
 src/backend/access/transam/xact.c             |   5 +
 src/backend/utils/cache/catcache.c            | 205 +++++++++++++++++++++++++-
 src/backend/utils/misc/guc.c                  |  63 ++++++++
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/utils/catcache.h                  |  33 ++++-
 6 files changed, 338 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 9b7a7388d5..d0d2374944 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1662,6 +1662,44 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target">
+      <term><varname>syscache_memory_target</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_memory_target</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to which syscache is expanded
+        without pruning. The value defaults to 0, indicating that pruning is
+        always considered. After exceeding this size, syscache pruning is
+        considered according to
+        <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep
+        a certain amount of syscache entries that are accessed only
+        intermittently, try increasing this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age">
+      <term><varname>syscache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the minimum time in seconds for which a syscache entry must
+        remain unused before it is considered for removal. -1 disables
+        syscache pruning entirely. The value defaults to 600 seconds
+        (<literal>10 minutes</literal>). Syscache entries that are not
+        used for this duration can be removed to prevent syscache bloat. This
+        behavior is suppressed until the size of the syscache exceeds
+        <xref linkend="guc-syscache-memory-target"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 92bda87804..ddc433c59e 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -734,7 +734,12 @@ void
 SetCurrentStatementStartTimestamp(void)
 {
     if (!IsParallelWorker())
+    {
         stmtStartTimestamp = GetCurrentTimestamp();
+
+        /* Set this timestamp as the approximate current time */
+        SetCatCacheClock(stmtStartTimestamp);
+    }
     else
         Assert(stmtStartTimestamp != 0);
 }
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 258a1d64cc..c70ce3b745 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -71,9 +71,38 @@
 #define CACHE6_elog(a,b,c,d,e,f,g)
 #endif
 
+/*
+ * GUC variable to define the minimum cache size at which to consider entry eviction.
+ * This variable is shared among various cache mechanisms.
+ */
+int cache_memory_target = 0;
+
+
+/*
+ * GUC for the entry limit. When the number of entries exceeds cache_entry_limit,
+ * they are pruned down to the fraction given by cache_entry_limit_prune_ratio
+ */
+int cache_entry_limit = 0;
+double cache_entry_limit_prune_ratio = 0.8;
+
+/* GUC variable to define the minimum age, in seconds, of entries that will be
+ * considered for eviction. This variable is shared among various cache
+ * mechanisms.
+ */
+int cache_prune_min_age = 600;
+
+/*
+ * Minimum interval between two successive moves of a cache entry in the LRU
+ * list, in microseconds.
+ */
+#define LRU_IGNORANCE_INTERVAL 100000    /* 100ms */
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Timestamp used for any operation on caches. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -481,6 +510,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
 
     /* delink from linked list */
     dlist_delete(&ct->cache_elem);
+    dlist_delete(&ct->lru_node);
 
     /*
      * Free keys when we're dealing with a negative entry, normal entries just
@@ -490,6 +520,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
         CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys,
                          cache->cc_keyno, ct->keys);
 
+    cache->cc_tupsize -= ct->size;
     pfree(ct);
 
     --cache->cc_ntup;
@@ -841,7 +872,9 @@ InitCatCache(int id,
     cp->cc_nkeys = nkeys;
     for (i = 0; i < nkeys; ++i)
         cp->cc_keyno[i] = key[i];
+    cp->cc_tupsize = 0;
 
+    dlist_init(&cp->cc_lru_list);
     /*
      * new cache is initialized as far as we can go for now. print some
      * debugging information, if appropriate.
@@ -858,9 +891,133 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize the catcache reference clock if it hasn't been done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can linger for a long time. We remove them if they have
+ * not been accessed for a certain time to prevent the catcache from
+ * bloating. Eviction uses an algorithm similar to buffer eviction, with an
+ * access counter: entries that have been accessed several times survive
+ * longer than those that have had no access in the same duration.
+ */
+#define PRUNE_BY_AGE    0x01
+#define PRUNE_BY_NUMBER    0x02
+
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int            nremoved = 0;
+    size_t        hash_size;
+    int            nelems_before = cp->cc_ntup;
+    int            ndelelems = 0;
+    int            action = 0;
+    dlist_mutable_iter    iter;
+
+    if (cache_prune_min_age >= 0)
+    {
+        /* prune only if the size of the hash is above the target */
+
+        hash_size = cp->cc_nbuckets * sizeof(dlist_head);
+        if (hash_size + cp->cc_tupsize > (Size) cache_memory_target * 1024L)
+            action |= PRUNE_BY_AGE;
+    }
+
+    if (cache_entry_limit > 0 && nelems_before >= cache_entry_limit)
+    {
+        ndelelems = nelems_before -
+            (int) (cache_entry_limit * cache_entry_limit_prune_ratio);
+
+        if (ndelelems < 256)
+            ndelelems = 256;
+        if (ndelelems > nelems_before)
+            ndelelems = nelems_before;
+
+        action |= PRUNE_BY_NUMBER;
+    }
+
+    /* Return immediately if no pruning is wanted */
+    if (action == 0)
+        return false;
+
+    /* Scan over LRU to find entries to remove */
+    dlist_foreach_modify(iter, &cp->cc_lru_list)
+    {
+        CatCTup    *ct = dlist_container(CatCTup, lru_node, iter.cur);
+        bool        remove_this = false;
+
+        /* We don't remove referenced entry */
+        if (ct->refcount != 0 ||
+            (ct->c_list && ct->c_list->refcount != 0))
+            continue;
+
+        /* check against age */
+        if (action & PRUNE_BY_AGE)
+        {
+            long    entry_age;
+            int        us;
+
+            /*
+             * Calculate the duration from the time of the last access to the
+             * "current" time. Since catcacheclock is not advanced within a
+             * transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+            if (entry_age < cache_prune_min_age)
+            {
+                /* the LRU list is in access order; no further entry can be older */
+                action &= ~PRUNE_BY_AGE;
+                break;
+            }
+
+            /*
+             * Entries that have not been accessed since the last pruning are
+             * removed after cache_prune_min_age seconds, while entries that
+             * have been accessed several times are left alone for up to three
+             * times that duration. We don't try to shrink the buckets since
+             * pruning effectively caps catcache expansion in the long term.
+             */
+            if (ct->naccess > 0)
+                ct->naccess--;
+            else 
+                remove_this = true;
+        }
+
+        /* check against entry number */
+        if (action & PRUNE_BY_NUMBER)
+        {
+            if (nremoved < ndelelems)
+                remove_this = true;
+            else
+                action &= ~PRUNE_BY_NUMBER; /* satisfied */
+        }
+
+        /* exit if finished */
+        if (action == 0)
+            break;
+
+        /* do the work */
+        if (remove_this)
+        {
+            CatCacheRemoveCTup(cp, ct);
+            nremoved++;
+        }
+    }
+
+    elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d",
+         cp->id, cp->cc_relname, nremoved, nelems_before);
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -1274,6 +1431,21 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /* Update access information for pruning */
+        if (ct->naccess < 2)
+            ct->naccess++;
+
+        /*
+         * We don't want to update the LRU too frequently. cache_prune_min_age
+         * can be changed within a session, so we need to maintain the LRU
+         * regardless of the current cache_prune_min_age setting.
+         */
+        if (catcacheclock - ct->lastaccess > LRU_IGNORANCE_INTERVAL)
+        {
+            ct->lastaccess = catcacheclock;
+            dlist_move_tail(&cache->cc_lru_list, &ct->lru_node);
+        }
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1819,11 +1991,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     CatCTup    *ct;
     HeapTuple    dtp;
     MemoryContext oldcxt;
+    int            tupsize = 0;
 
     /* negative entries have no tuple associated */
     if (ntp)
     {
         int            i;
+        /* tupsize is declared at function scope; don't shadow it here */
 
         Assert(!negative);
 
@@ -1842,13 +2016,14 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
         /* Allocate memory for CatCTup and the cached tuple in one go */
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
 
-        ct = (CatCTup *) palloc(sizeof(CatCTup) +
-                                MAXIMUM_ALIGNOF + dtp->t_len);
+        tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len;
+        ct = (CatCTup *) palloc(tupsize);
         ct->tuple.t_len = dtp->t_len;
         ct->tuple.t_self = dtp->t_self;
         ct->tuple.t_tableOid = dtp->t_tableOid;
         ct->tuple.t_data = (HeapTupleHeader)
             MAXALIGN(((char *) ct) + sizeof(CatCTup));
+        ct->size = tupsize;
         /* copy tuple contents */
         memcpy((char *) ct->tuple.t_data,
                (const char *) dtp->t_data,
@@ -1876,8 +2051,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     {
         Assert(negative);
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
-        ct = (CatCTup *) palloc(sizeof(CatCTup));
-
+        tupsize = sizeof(CatCTup);
+        ct = (CatCTup *) palloc(tupsize);
         /*
          * Store keys - they'll point into separately allocated memory if not
          * by-value.
@@ -1898,18 +2073,34 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
+    dlist_push_tail(&cache->cc_lru_list, &ct->lru_node);
+    ct->size = tupsize;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
     cache->cc_ntup++;
     CacheHdr->ch_ntup++;
+    cache->cc_tupsize += tupsize;
+
+    /* increase refcount so that this survives pruning */
+    ct->refcount++;
 
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, try to clean up by removing
+     * infrequently used entries to make room for the new entry. If that
+     * fails, enlarge the bucket array instead.  Quite arbitrarily, we try
+     * this when the fill factor exceeds 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
+    /* we may still want to prune by entry number, check it */
+    else if (cache_entry_limit > 0 && cache->cc_ntup > cache_entry_limit)
+        CatCacheCleanupOldEntries(cache);
+
+    ct->refcount--;
 
     return ct;
 }
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8681ada33a..d4df841982 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -81,6 +81,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/memutils.h"
@@ -2204,6 +2205,58 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"cache_memory_target", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum syscache size to keep."),
+            gettext_noop("Cache is not pruned before exceeding this size."),
+            GUC_UNIT_KB
+        },
+        &cache_memory_target,
+        0, 0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum unused duration of cache entries before removal."),
+            gettext_noop("Cache entries that live unused for longer than this seconds are considered to be
removed."),
+            GUC_UNIT_S
+        },
+        &cache_prune_min_age,
+        600, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"cache_entry_limit", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the maximum entries of catcache."),
+             NULL
+        },
+        &cache_entry_limit,
+        0, 0, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
@@ -3368,6 +3421,16 @@ static struct config_real ConfigureNamesReal[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"cache_entry_limit_prune_ratio", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the maximum entries of catcache."),
+             NULL
+        },
+        &cache_entry_limit_prune_ratio,
+        0.8, 0.0, 1.0,
+        NULL, NULL, NULL
+    },
+
     /* End-of-list marker */
     {
         {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c7f53470df..108d332f2c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -128,6 +128,8 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#cache_memory_target = 0kB    # in kB
+#cache_prune_min_age = 600s    # -1 disables pruning
 #max_stack_depth = 2MB            # min 100kB
 #shared_memory_type = mmap        # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 65d816a583..3c6842e272 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,9 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
+    dlist_head    cc_lru_list;
+    int            cc_tupsize;        /* total size of cached tuples, in bytes */
+    int            cc_nfreeent;    /* # of entries currently not referenced */
 
     /*
      * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,7 +123,10 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
-
+    int            naccess;        /* # of accesses to this entry, capped at 2 */
+    TimestampTz    lastaccess;        /* approx. timestamp of the last usage */
+    dlist_node    lru_node;        /* LRU list node */
+    int            size;            /* palloc'ed size of this tuple */
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
      * catcache is list-searched with varying numbers of keys, we may have to
@@ -189,6 +196,30 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int cache_prune_min_age;
+extern int cache_memory_target;
+extern int cache_entry_limit;
+extern double cache_entry_limit_prune_ratio;
+
+/* to use as access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/*
+ * SetCatCacheClock - set timestamp for catcache access record
+ */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
+static inline TimestampTz
+GetCatCacheClock(void)
+{
+    return catcacheclock;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.16.3

From 251607ff21981f840392387a28ca8f012ef18aab Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 16 Oct 2018 15:48:28 +0900
Subject: [PATCH 3/4] Syscache usage tracking feature.

Collects syscache usage statistics and shows them using the view
pg_stat_syscache. The feature is controlled by the GUC variable
track_syscache_usage_interval.
---
 doc/src/sgml/config.sgml                      |  15 ++
 src/backend/catalog/system_views.sql          |  17 +++
 src/backend/postmaster/pgstat.c               | 201 ++++++++++++++++++++++++--
 src/backend/tcop/postgres.c                   |  23 +++
 src/backend/utils/adt/pgstatfuncs.c           | 134 +++++++++++++++++
 src/backend/utils/cache/catcache.c            |  89 +++++++++---
 src/backend/utils/cache/syscache.c            |  24 +++
 src/backend/utils/init/globals.c              |   1 +
 src/backend/utils/init/postinit.c             |  11 ++
 src/backend/utils/misc/guc.c                  |  10 ++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/catalog/pg_proc.dat               |   9 ++
 src/include/miscadmin.h                       |   1 +
 src/include/pgstat.h                          |   6 +-
 src/include/utils/catcache.h                  |   9 +-
 src/include/utils/syscache.h                  |  19 +++
 src/include/utils/timeout.h                   |   1 +
 src/test/regress/expected/rules.out           |  24 ++-
 18 files changed, 559 insertions(+), 36 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d0d2374944..5ff3ebeb4e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6687,6 +6687,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-syscache-usage-interval" xreflabel="track_syscache_usage_interval">
+      <term><varname>track_syscache_usage_interval</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>track_syscache_usage_interval</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the interval, in milliseconds, at which system cache usage
+        statistics are collected. The default is 0, which disables collection.
+        Only superusers can change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-io-timing" xreflabel="track_io_timing">
       <term><varname>track_io_timing</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 3e229c693c..f5d1aaf96f 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -906,6 +906,22 @@ CREATE VIEW pg_stat_progress_vacuum AS
     FROM pg_stat_get_progress_info('VACUUM') AS S
         LEFT JOIN pg_database D ON S.datid = D.oid;
 
+CREATE VIEW pg_stat_syscache AS
+    SELECT
+        S.pid                AS pid,
+        S.relid::regclass    AS relname,
+        S.indid::regclass    AS cache_name,
+        S.size                AS size,
+        S.ntup                AS ntuples,
+        S.searches            AS searches,
+        S.hits                AS hits,
+        S.neg_hits            AS neg_hits,
+        S.ageclass            AS ageclass,
+        S.last_update        AS last_update
+    FROM pg_stat_activity A
+    JOIN LATERAL (SELECT A.pid, * FROM pg_get_syscache_stats(A.pid)) S
+        ON (A.pid = S.pid);
+
 CREATE VIEW pg_user_mappings AS
     SELECT
         U.oid       AS umid,
@@ -1185,6 +1201,7 @@ GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_get_syscache_stats(int) TO pg_monitor;
 
 GRANT pg_read_all_settings TO pg_monitor;
 GRANT pg_read_all_stats TO pg_monitor;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 81c6499251..a1939958b7 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -66,6 +66,7 @@
 #include "utils/ps_status.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 
@@ -124,6 +125,7 @@
 bool        pgstat_track_activities = false;
 bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
+int            pgstat_track_syscache_usage_interval = 0;
 int            pgstat_track_activity_query_size = 1024;
 
 /* ----------
@@ -236,6 +238,11 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
+/* bitmap symbols to specify the target file types to remove */
+#define PGSTAT_REMFILE_DBSTAT    1        /* remove only database stats files */
+#define PGSTAT_REMFILE_SYSCACHE    2        /* remove only syscache stats files */
+#define PGSTAT_REMFILE_ALL        3        /* remove both type of files */
+
 /*
  * Info about current "snapshot" of stats file
  */
@@ -335,6 +342,7 @@ static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
 static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
 static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+static void pgstat_remove_syscache_statsfile(void);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
@@ -630,10 +638,13 @@ startup_failed:
 }
 
 /*
- * subroutine for pgstat_reset_all
+ * remove stats files
+ *
+ * Clean up stats files in the specified directory. target is one of
+ * PGSTAT_REMFILE_DBSTAT/SYSCACHE/ALL and restricts which files to remove.
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+pgstat_reset_remove_files(const char *directory, int target)
 {
     DIR           *dir;
     struct dirent *entry;
@@ -644,25 +655,39 @@ pgstat_reset_remove_files(const char *directory)
     {
         int            nchars;
         Oid            tmp_oid;
+        int            filetype = 0;
 
         /*
          * Skip directory entries that don't match the file names we write.
          * See get_dbstat_filename for the database-specific pattern.
          */
         if (strncmp(entry->d_name, "global.", 7) == 0)
+        {
+            filetype = PGSTAT_REMFILE_DBSTAT;
             nchars = 7;
+        }
         else
         {
+            char head[2];
+            
             nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
+            (void) sscanf(entry->d_name, "%c%c_%u.%n",
+                          head, head + 1, &tmp_oid, &nchars);
+
             /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
+            if (nchars < 3 || !isdigit(entry->d_name[3]))
                 continue;
+
+            if  (strncmp(head, "db", 2) == 0)
+                filetype = PGSTAT_REMFILE_DBSTAT;
+            else if (strncmp(head, "cc", 2) == 0)
+                filetype = PGSTAT_REMFILE_SYSCACHE;
         }
 
+        /* skip if this is not a target */
+        if ((filetype & target) == 0)
+            continue;
+
         if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
             strcmp(entry->d_name + nchars, "stat") != 0)
             continue;
@@ -683,8 +708,9 @@ pgstat_reset_remove_files(const char *directory)
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_ALL);
+    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY,
+                              PGSTAT_REMFILE_ALL);
 }
 
 #ifdef EXEC_BACKEND
@@ -2963,6 +2989,10 @@ pgstat_beshutdown_hook(int code, Datum arg)
     if (OidIsValid(MyDatabaseId))
         pgstat_report_stat(true);
 
+    /* clear syscache statistics files and temporary settings */
+    if (MyBackendId != InvalidBackendId)
+        pgstat_remove_syscache_statsfile();
+
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
      * before and after.  We use a volatile pointer here to ensure the
@@ -4287,6 +4317,9 @@ PgstatCollectorMain(int argc, char *argv[])
     pgStatRunningInCollector = true;
     pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
 
+    /* Remove left-over syscache stats files */
+    pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_SYSCACHE);
+
     /*
      * Loop to process messages until we get SIGQUIT or detect ungraceful
      * death of our parent postmaster.
@@ -6377,3 +6410,153 @@ pgstat_clip_activity(const char *raw_activity)
 
     return activity;
 }
+
+/*
+ * return the filename for a syscache stat file; filename is the output
+ * buffer, of length len.
+ */
+void
+pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid,
+                                 char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/cc_%u.%s",
+                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
+                       pgstat_stat_directory,
+                       backendid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
+}
+
+/* removes syscache stats files of this backend */
+static void
+pgstat_remove_syscache_statsfile(void)
+{
+    char    fname[MAXPGPATH];
+
+    pgstat_get_syscachestat_filename(false, false, MyBackendId,
+                                     fname, MAXPGPATH);
+    unlink(fname);        /* we don't care about the result */
+}
+
+/*
+ * pgstat_write_syscache_stats() -
+ *        Write the syscache statistics files.
+ *
+ * If 'force' is false, this function skips writing a file and returns the
+ * time remaining in the current interval in milliseconds. If 'force' is true,
+ * it writes a file regardless of the remaining time and resets the interval.
+ */
+long
+pgstat_write_syscache_stats(bool force)
+{
+    static TimestampTz last_report = 0;
+    TimestampTz now;
+    long elapsed;
+    long secs;
+    int     usecs;
+    int    cacheId;
+    FILE    *fpout;
+    char    statfile[MAXPGPATH];
+    char    tmpfile[MAXPGPATH];
+
+    /* Return if we don't want it */
+    if (!force && pgstat_track_syscache_usage_interval <= 0)
+    {
+        /* disabled. remove the statistics file if any */
+        if (last_report > 0)
+        {
+            last_report = 0;
+            pgstat_remove_syscache_statsfile();
+        }
+        return 0;
+    }
+    
+    /* Check against the interval */
+    now = GetCurrentTransactionStopTimestamp();
+    TimestampDifference(last_report, now, &secs, &usecs);
+    elapsed = secs * 1000 + usecs / 1000;
+
+    if (!force && elapsed < pgstat_track_syscache_usage_interval)
+    {
+        /* not yet time; report the remaining time to the caller */
+        return pgstat_track_syscache_usage_interval - elapsed;
+    }
+
+    /* now update the stats */
+    last_report = now;
+
+    pgstat_get_syscachestat_filename(false, true,
+                                     MyBackendId, tmpfile, MAXPGPATH);
+    pgstat_get_syscachestat_filename(false, false,
+                                     MyBackendId, statfile, MAXPGPATH);
+
+    /*
+     * This function can be called from ProcessInterrupts(). Inhibit recursive
+     * interrupts to avoid recursive entry.
+     */
+    HOLD_INTERRUPTS();
+
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        /*
+         * Failure writing this file is not critical. Just skip this time and
+         * tell caller to wait for the next interval.
+         */
+        RESUME_INTERRUPTS();
+        return pgstat_track_syscache_usage_interval;
+    }
+
+    /* write out every catcache stats */
+    for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++)
+    {
+        SysCacheStats *stats;
+        
+        stats = SysCacheGetStats(cacheId);
+        Assert (stats);
+
+        /* write error is checked later using ferror() */
+        fputc('T', fpout);
+        (void)fwrite(&cacheId, sizeof(int), 1, fpout);
+        (void)fwrite(&last_report, sizeof(TimestampTz), 1, fpout);
+        (void)fwrite(stats, sizeof(*stats), 1, fpout);
+    }
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write syscache statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close syscache statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename syscache statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+
+    RESUME_INTERRUPTS();
+    return 0;
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 36cfd507b2..fb77a0ce4c 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3157,6 +3157,12 @@ ProcessInterrupts(void)
 
     }
 
+    if (IdleSyscacheStatsUpdateTimeoutPending)
+    {
+        IdleSyscacheStatsUpdateTimeoutPending = false;
+        pgstat_write_syscache_stats(true);
+    }
+
     if (ParallelMessagePending)
         HandleParallelMessages();
 }
@@ -3733,6 +3739,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_catcache_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4173,9 +4180,19 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long timeout;
+
                 ProcessCompletedNotifies();
                 pgstat_report_stat(false);
 
+                timeout = pgstat_write_syscache_stats(false);
+
+                if (timeout > 0)
+                {
+                    disable_idle_catcache_update_timeout = true;
+                    enable_timeout_after(IDLE_CATCACHE_UPDATE_TIMEOUT,
+                                         timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4218,6 +4235,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_catcache_update_timeout)
+        {
+            disable_timeout(IDLE_CATCACHE_UPDATE_TIMEOUT, false);
+            disable_idle_catcache_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index b6ba856ebe..6526cfefb4 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -14,6 +14,8 @@
  */
 #include "postgres.h"
 
+#include <sys/stat.h>
+
 #include "access/htup_details.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
@@ -28,6 +30,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/inet.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
@@ -1899,3 +1902,134 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     PG_RETURN_DATUM(HeapTupleGetDatum(
                                       heap_form_tuple(tupdesc, values, nulls)));
 }
+
+Datum
+pgstat_get_syscache_stats(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SYSCACHE_SIZE 9
+    int                    pid     = PG_GETARG_INT32(0);
+    ReturnSetInfo       *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+    TupleDesc            tupdesc;
+    Tuplestorestate    *tupstore;
+    MemoryContext        per_query_ctx;
+    MemoryContext        oldcontext;
+    PgBackendStatus       *beentry;
+    int                    beid;
+    char                fname[MAXPGPATH];
+    FILE                  *fpin;
+    char c;
+
+    if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("set-valued function called in context that cannot accept a set")));
+    if (!(rsinfo->allowedModes & SFRM_Materialize))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("materialize mode required, but it is not " \
+                        "allowed in this context")));
+
+    /* Build a tuple descriptor for our result type */
+    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+        elog(ERROR, "return type must be a row type");
+    
+
+    per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+
+    oldcontext = MemoryContextSwitchTo(per_query_ctx);
+    tupstore = tuplestore_begin_heap(true, false, work_mem);
+    rsinfo->returnMode = SFRM_Materialize;
+    rsinfo->setResult = tupstore;
+    rsinfo->setDesc = tupdesc;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    /* find the beentry for the given pid */
+    beentry = NULL;
+    for (beid = 1;
+         (beentry = pgstat_fetch_stat_beentry(beid)) &&
+             beentry->st_procpid != pid ;
+         beid++);
+
+    /*
+     * we silently return empty result on failure or insufficient privileges
+     */
+    if (!beentry ||
+        (!has_privs_of_role(GetUserId(), beentry->st_userid) &&
+         !is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS)))
+        goto no_data;
+
+    pgstat_get_syscachestat_filename(false, false, beid, fname, MAXPGPATH);
+
+    if ((fpin = AllocateFile(fname, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(WARNING,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            fname)));
+        /* also return empty on no statistics file */
+        goto no_data;
+    }
+
+    /* read the statistics file into tuplestore */
+    while ((c = fgetc(fpin)) == 'T')
+    {
+        TimestampTz last_update;
+        SysCacheStats stats;
+        int cacheid;
+        Datum values[PG_GET_SYSCACHE_SIZE];
+        bool nulls[PG_GET_SYSCACHE_SIZE] = {0};
+        Datum datums[SYSCACHE_STATS_NAGECLASSES * 2];
+        bool arrnulls[SYSCACHE_STATS_NAGECLASSES * 2] = {0};
+        int    dims[] = {SYSCACHE_STATS_NAGECLASSES, 2};
+        int lbs[] = {1, 1};
+        ArrayType *arr;
+        int i, j;
+
+        fread(&cacheid, sizeof(int), 1, fpin);
+        fread(&last_update, sizeof(TimestampTz), 1, fpin);
+        if (fread(&stats, 1, sizeof(stats), fpin) != sizeof(stats))
+        {
+            ereport(WARNING, 
+                    (errmsg("corrupted syscache statistics file \"%s\"",
+                            fname)));
+            goto no_data;
+        }
+
+        i = 0;
+        values[i++] = ObjectIdGetDatum(stats.reloid);
+        values[i++] = ObjectIdGetDatum(stats.indoid);
+        values[i++] = Int64GetDatum(stats.size);
+        values[i++] = Int64GetDatum(stats.ntuples);
+        values[i++] = Int64GetDatum(stats.nsearches);
+        values[i++] = Int64GetDatum(stats.nhits);
+        values[i++] = Int64GetDatum(stats.nneg_hits);
+
+        for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++)
+        {
+            datums[j * 2] = Int32GetDatum((int32) stats.ageclasses[j]);
+            datums[j * 2 + 1] = Int32GetDatum((int32) stats.nclass_entries[j]);
+        }            
+
+        arr = construct_md_array(datums, arrnulls, 2, dims, lbs,
+                              INT4OID, sizeof(int32), true, 'i');
+        values[i++] = PointerGetDatum(arr);
+
+        values[i++] = TimestampTzGetDatum(last_update);
+
+        Assert (i == PG_GET_SYSCACHE_SIZE);
+
+        tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+    }
+
+    /* check for the end of file. abandon the result if file is broken */
+    if (c != 'E' || fgetc(fpin) != EOF)
+        tuplestore_clear(tupstore);
+
+    FreeFile(fpin);
+
+no_data:
+    tuplestore_donestoring(tupstore);
+    return (Datum) 0;
+}
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index c70ce3b745..484fe43e09 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -103,6 +103,10 @@ static CatCacheHeader *CacheHdr = NULL;
 /* Timestamp used for any operation on caches. */
 TimestampTz    catcacheclock = 0;
 
+/* age classes for pruning */
+static double ageclass[SYSCACHE_STATS_NAGECLASSES]
+    = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -634,9 +638,7 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue)
             else
                 CatCacheRemoveCTup(cache, ct);
             CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated");
-#ifdef CATCACHE_STATS
             cache->cc_invals++;
-#endif
             /* could be multiple matches, so keep looking! */
         }
     }
@@ -712,9 +714,7 @@ ResetCatalogCache(CatCache *cache)
             }
             else
                 CatCacheRemoveCTup(cache, ct);
-#ifdef CATCACHE_STATS
             cache->cc_invals++;
-#endif
         }
     }
 }
@@ -964,10 +964,10 @@ CatCacheCleanupOldEntries(CatCache *cp)
             int        us;
 
             /*
-             * Calculate the duration from the time of the last access to the
-             * "current" time. Since catcacheclock is not advanced within a
-             * transaction, the entries that are accessed within the current
-             * transaction won't be pruned.
+             * Calculate the duration from the time of the last access to
+             * the "current" time. Since catcacheclock is not advanced within
+             * a transaction, the entries that are accessed within the current
+             * transaction always get 0 as the result.
              */
             TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
 
@@ -1387,9 +1387,7 @@ SearchCatCacheInternal(CatCache *cache,
     if (unlikely(cache->cc_tupdesc == NULL))
         CatalogCacheInitializeCache(cache);
 
-#ifdef CATCACHE_STATS
     cache->cc_searches++;
-#endif
 
     /* Initialize local parameter array */
     arguments[0] = v1;
@@ -1459,9 +1457,7 @@ SearchCatCacheInternal(CatCache *cache,
             CACHE3_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d",
                         cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
             cache->cc_hits++;
-#endif
 
             return &ct->tuple;
         }
@@ -1470,9 +1466,7 @@ SearchCatCacheInternal(CatCache *cache,
             CACHE3_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d",
                         cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
             cache->cc_neg_hits++;
-#endif
 
             return NULL;
         }
@@ -1600,9 +1594,7 @@ SearchCatCacheMiss(CatCache *cache,
     CACHE3_elog(DEBUG2, "SearchCatCache(%s): put in bucket %d",
                 cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
     cache->cc_newloads++;
-#endif
 
     return &ct->tuple;
 }
@@ -1713,9 +1705,7 @@ SearchCatCacheList(CatCache *cache,
 
     Assert(nkeys > 0 && nkeys < cache->cc_nkeys);
 
-#ifdef CATCACHE_STATS
     cache->cc_lsearches++;
-#endif
 
     /* Initialize local parameter array */
     arguments[0] = v1;
@@ -1772,9 +1762,7 @@ SearchCatCacheList(CatCache *cache,
         CACHE2_elog(DEBUG2, "SearchCatCacheList(%s): found list",
                     cache->cc_relname);
 
-#ifdef CATCACHE_STATS
         cache->cc_lhits++;
-#endif
 
         return cl;
     }
@@ -2291,3 +2279,64 @@ PrintCatCacheListLeakWarning(CatCList *list)
          list->my_cache->cc_relname, list->my_cache->id,
          list, list->refcount);
 }
+
+/*
+ * CatCacheGetStats - fill in SysCacheStats struct.
+ *
+ * This is a support routine for SysCacheGetStats, which fills in most of the
+ * result. The classification here is based on the same criteria as
+ * CatCacheCleanupOldEntries().
+ */
+void
+CatCacheGetStats(CatCache *cache, SysCacheStats *stats)
+{
+    int    i, j;
+
+    Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0);
+
+    /* fill in the stats struct */
+    stats->size = cache->cc_tupsize + cache->cc_nbuckets * sizeof(dlist_head);
+    stats->ntuples = cache->cc_ntup;
+    stats->nsearches = cache->cc_searches;
+    stats->nhits = cache->cc_hits;
+    stats->nneg_hits = cache->cc_neg_hits;
+
+    /* cache_prune_min_age can be changed within a session, so fill this every time */
+    for (i = 0 ; i < SYSCACHE_STATS_NAGECLASSES ; i++)
+        stats->ageclasses[i] = (int) (cache_prune_min_age * ageclass[i]);
+
+    /*
+     * The nth element of nclass_entries stores the number of cache entries
+     * that have stayed unaccessed for the corresponding ageclass multiple of
+     * cache_prune_min_age.
+     */
+    memset(stats->nclass_entries, 0, sizeof(int) * SYSCACHE_STATS_NAGECLASSES);
+
+    /* Scan the whole hash */
+    for (i = 0; i < cache->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cache->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+            /*
+             * Calculate the duration from the time of the last access to the
+             * "current" time. Since catcacheclock is not advanced within a
+             * transaction, entries accessed within the current transaction
+             * always get zero age here.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+            j = 0;
+            while (j < SYSCACHE_STATS_NAGECLASSES - 1 &&
+                   entry_age > stats->ageclasses[j])
+                j++;
+
+            stats->nclass_entries[j]++;
+        }
+    }
+}
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index ac98c19155..7b38a06708 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -20,6 +20,9 @@
  */
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "catalog/indexing.h"
@@ -1534,6 +1537,27 @@ RelationSupportsSysCache(Oid relid)
     return false;
 }
 
+/*
+ * SysCacheGetStats - returns stats of specified syscache
+ *
+ * This routine returns the address of its local static memory.
+ */
+SysCacheStats *
+SysCacheGetStats(int cacheId)
+{
+    static SysCacheStats stats;
+
+    Assert(cacheId >=0 && cacheId < SysCacheSize);
+
+    memset(&stats, 0, sizeof(stats));
+
+    stats.reloid = cacheinfo[cacheId].reloid;
+    stats.indoid = cacheinfo[cacheId].indoid;
+
+    CatCacheGetStats(SysCache[cacheId], &stats);
+
+    return &stats;
+}
 
 /*
  * OID comparator for pg_qsort
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index fd51934aaf..f039ecd805 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index c0b6231458..dee7f19475 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleSyscacheStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_CATCACHE_UPDATE_TIMEOUT,
+                        IdleSyscacheStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1239,6 +1242,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleSyscacheStatsUpdateTimeoutHandler(void)
+{
+    IdleSyscacheStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d4df841982..7bb239a07e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3198,6 +3198,16 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"track_syscache_usage_interval", PGC_SUSET, STATS_COLLECTOR,
+            gettext_noop("Sets the interval between syscache usage collection, in milliseconds. Zero disables syscache
usagetracking."),
 
+            NULL
+        },
+        &pgstat_track_syscache_usage_interval,
+        0, 0, INT_MAX / 2,
+        NULL, NULL, NULL
+    },
+
     {
         {"gin_pending_list_limit", PGC_USERSET, CLIENT_CONN_STATEMENT,
             gettext_noop("Sets the maximum size of the pending list for GIN index."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 108d332f2c..4d4fb42251 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -560,6 +560,7 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
+#track_syscache_usage_interval = 0    # zero disables tracking
 #stats_temp_directory = 'pg_stat_tmp'
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b8de13f03b..6099a828d2 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -9669,6 +9669,15 @@
   proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
   proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}',
   prosrc => 'pg_get_replication_slots' },
+{ oid => '3425',
+  descr => 'syscache statistics',
+  proname => 'pg_get_syscache_stats', prorows => '100', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', prorettype => 'record',
+  proargtypes => 'int4',
+  proallargtypes => '{int4,oid,oid,int8,int8,int8,int8,int8,_int4,timestamptz}',
+  proargmodes => '{i,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}',
+  prosrc => 'pgstat_get_syscache_stats' },
 { oid => '3786', descr => 'set up a logical replication slot',
   proname => 'pg_create_logical_replication_slot', provolatile => 'v',
   proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c9e35003a5..69b9a976f0 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 88a75fb798..b6bfd7d644 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1144,6 +1144,7 @@ extern bool pgstat_track_activities;
 extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
+extern int    pgstat_track_syscache_usage_interval;
 extern char *pgstat_stat_directory;
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
@@ -1228,7 +1229,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
-
+extern void pgstat_get_syscachestat_filename(bool permanent,
+                    bool tempname, int backendid, char *filename, int len);
 /* ----------
  * pgstat_report_wait_start() -
  *
@@ -1363,5 +1365,5 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
-
+extern long pgstat_write_syscache_stats(bool force);
 #endif                            /* PGSTAT_H */
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 3c6842e272..9af414b307 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -67,10 +67,8 @@ typedef struct catcache
     int            cc_nfreeent;    /* # of entries currently not referenced */
 
     /*
-     * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
-     * doesn't break ABI for other modules
+     * Statistics entries
      */
-#ifdef CATCACHE_STATS
     long        cc_searches;    /* total # searches against this cache */
     long        cc_hits;        /* # of matches against existing entry */
     long        cc_neg_hits;    /* # of matches against negative entry */
@@ -83,7 +81,6 @@ typedef struct catcache
     long        cc_invals;        /* # of entries invalidated from cache */
     long        cc_lsearches;    /* total # list-searches */
     long        cc_lhits;        /* # of matches against existing lists */
-#endif
 } CatCache;
 
 
@@ -259,4 +256,8 @@ extern void PrepareToInvalidateCacheTuple(Relation relation,
 extern void PrintCatCacheLeakWarning(HeapTuple tuple);
 extern void PrintCatCacheListLeakWarning(CatCList *list);
 
+/* defined in syscache.h */
+typedef struct syscachestats SysCacheStats;
+extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats);
+
 #endif                            /* CATCACHE_H */
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index 95ee48954e..71b399c902 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -112,6 +112,24 @@ enum SysCacheIdentifier
 #define SysCacheSize (USERMAPPINGUSERSERVER + 1)
 };
 
+#define SYSCACHE_STATS_NAGECLASSES 6
+/* Struct for catcache tracking information */
+typedef struct syscachestats
+{
+    Oid        reloid;            /* target relation */
+    Oid        indoid;            /* index */
+    size_t    size;            /* size of the catcache */
+    int        ntuples;        /* number of tuples residing in the catcache */
+    int        nsearches;        /* number of searches */
+    int        nhits;            /* number of cache hits */
+    int        nneg_hits;        /* number of negative cache hits */
+    /* age classes in seconds */
+    int        ageclasses[SYSCACHE_STATS_NAGECLASSES];
+    /* number of tuples falling into the corresponding age class */
+    int        nclass_entries[SYSCACHE_STATS_NAGECLASSES];
+} SysCacheStats;
+
+
 extern void InitCatalogCache(void);
 extern void InitCatalogCachePhase2(void);
 
@@ -164,6 +182,7 @@ extern void SysCacheInvalidate(int cacheId, uint32 hashValue);
 extern bool RelationInvalidatesSnapshotsOnly(Oid relid);
 extern bool RelationHasSysCache(Oid relid);
 extern bool RelationSupportsSysCache(Oid relid);
+extern SysCacheStats *SysCacheGetStats(int cacheId);
 
 /*
  * The use of the macros below rather than direct calls to the corresponding
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 9244a2a7b7..0ab441a364 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_CATCACHE_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2c8e21baa7..7bd77e9972 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1921,6 +1921,28 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
     pg_stat_all_tables.autoanalyze_count
    FROM pg_stat_all_tables
   WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
+pg_stat_syscache| SELECT s.pid,
+    (s.relid)::regclass AS relname,
+    (s.indid)::regclass AS cache_name,
+    s.size,
+    s.ntup AS ntuples,
+    s.searches,
+    s.hits,
+    s.neg_hits,
+    s.ageclass,
+    s.last_update
+   FROM (pg_stat_activity a
+     JOIN LATERAL ( SELECT a.pid,
+            pg_get_syscache_stats.relid,
+            pg_get_syscache_stats.indid,
+            pg_get_syscache_stats.size,
+            pg_get_syscache_stats.ntup,
+            pg_get_syscache_stats.searches,
+            pg_get_syscache_stats.hits,
+            pg_get_syscache_stats.neg_hits,
+            pg_get_syscache_stats.ageclass,
+            pg_get_syscache_stats.last_update
+           FROM pg_get_syscache_stats(a.pid) pg_get_syscache_stats(relid, indid, size, ntup, searches, hits, neg_hits, ageclass, last_update)) s ON ((a.pid = s.pid)));
 pg_stat_user_functions| SELECT p.oid AS funcid,
     n.nspname AS schemaname,
     p.proname AS funcname,
@@ -2352,7 +2374,7 @@ pg_settings|pg_settings_n|CREATE RULE pg_settings_n AS
     ON UPDATE TO pg_catalog.pg_settings DO INSTEAD NOTHING;
 pg_settings|pg_settings_u|CREATE RULE pg_settings_u AS
     ON UPDATE TO pg_catalog.pg_settings
-   WHERE (new.name = old.name) DO  SELECT set_config(old.name, new.setting, false) AS set_config;
+   WHERE (new.name = old.name) DO  SELECT set_config(old.name, new.setting, false, false) AS set_config;
 rtest_emp|rtest_emp_del|CREATE RULE rtest_emp_del AS
     ON DELETE TO public.rtest_emp DO  INSERT INTO rtest_emplog (ename, who, action, newsal, oldsal)
   VALUES (old.ename, CURRENT_USER, 'fired'::bpchar, '$0.00'::money, old.salary);
-- 
2.16.3


RE: Protect syscache from bloating with negative cache entries

От
"Ideriha, Takeshi"
Дата:
>From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
>I made a rerun of benchmark using "-S -T 30" on the server build with no assertion and
>-O2. The numbers are the best of three successive attempts.  The patched version is
>running with cache_target_memory = 0, cache_prune_min_age = 600 and
>cache_entry_limit = 0 but pruning doesn't happen by the workload.
>
>master: 13393 tps
>v12   : 12625 tps (-6%)
>
>Significant degradation is found.
>
>Reducing the frequency of dlist_move_tail by taking a 1ms interval between two
>successive updates on the same entry made the degradation disappear.
>
>patched  : 13720 tps (+2%)

It would be good to introduce some interval.
I followed your benchmark (initialized with scale factor=10, other options the same)
and found the same tendency.
These are averages of 5 trials.
master:    7640.000538
patch_v12: 7417.981378 (3% down against master)
patch_v13: 7645.071787 (almost the same as master)

These cases are workloads where pruning does not happen, as you mentioned.
I'd like to benchmark the cache-pruning case as well.
To demonstrate the cache-pruning case, right now I'm creating hundreds of
partitioned tables and running a SELECT query against each partition
using a pgbench custom file. Maybe using a small cache_prune_min_age or a hard
limit would be better. Is there any good model?

># I'm not sure the name LRU_IGNORANCE_INTERVAL makes sens..
How about MIN_LRU_UPDATE_INTERVAL?

Regards,
Takeshi Ideriha



Re: Protect syscache from bloating with negative cache entries

От
"MauMau"
Дата:
From: Tomas Vondra
> I don't think we need to remove the expired entries right away, if
> there are only very few of them. The cleanup requires walking the hash
> table, which means significant fixed cost. So if there are only few
> expired entries (say, less than 25% of the cache), we can just leave
> them around and clean them if we happen to stumble on them (although
> that may not be possible with dynahash, which has no concept of
> expiration) or before enlarging the hash table.

I agree that we don't need to evict cache entries as long as the
memory permits (within the control of the DBA).

But how does the concept of expiration fit the catcache?  How would
the user determine the expiration time, i.e. setting of
syscache_prune_min_age?  If you set a small value to evict unnecessary
entries faster, necessary entries will also be evicted.  Some access
counter would keep accessed entries longer, but some idle time (e.g.
lunch break) can flush entries that you want to access after the lunch
break.

The idea of expiration applies to the case where we want possibly
stale entries to vanish and load newer data upon the next access.  For
example, the TTL (time-to-live) of Memcached, Redis, DNS, ARP.  Is the
catcache based on the same idea with them?  No.

What we want to do is to evict never or infrequently used cache
entries.  That's naturally the task of LRU, isn't it?  Even the high
performance Memcached and Redis uses LRU when the cache is full.  As
Bruce said, we don't have to be worried about the lock contention or
something, because we're talking about the backend local cache.  Are
we worried about the overhead of manipulating the LRU chain?  The
current catcache already does it on every access; it calls
dlist_move_head() to put the accessed entry to the front of the hash
bucket.
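
For illustration, the per-access bookkeeping an LRU would add on top of that
existing move is roughly the following (a sketch only; cc_lru_list, lru_node
and lastaccess are fields such a patch would have to add, and dlist_move_tail
comes from the 0001 patch posted in this thread):

    static inline void
    CatCacheTouchEntry(CatCache *cache, CatCTup *ct, dlist_head *bucket)
    {
        /* what the catcache already does: move within the hash bucket */
        dlist_move_head(bucket, &ct->cache_elem);

        /* what an LRU adds: keep a cache-wide list in access order
         * (most recently used at the tail) and remember the access time;
         * the real patch uses a cheaper per-statement clock instead */
        dlist_move_tail(&cache->cc_lru_list, &ct->lru_node);
        ct->lastaccess = GetCurrentTimestamp();
    }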


> So if we want to address this case too (and we probably want), we
may
> need to discard the old cache memory context someho (e.g. rebuild
the
> cache in a new one, and copy the non-expired entries). Which is a
nice
> opportunity to do the "full" cleanup, of course.

The straightforward, natural, and familiar way is to limit the cache
size, which I mentioned in some previous mail.  We should give the DBA
the ability to control memory usage, rather than considering what to
do after leaving the memory area grow unnecessarily too large.  That's
what a typical "cache" is, isn't it?

https://en.wikipedia.org/wiki/Cache_(computing)

"To be cost-effective and to enable efficient use of data, caches must
be relatively small."


Another relevant suboptimal idea would be to provide each catcache
with a separate memory context, which is the child of
CacheMemoryContext.  This gives slight optimization by using the slab
context (slab.c) for a catcache with fixed-sized tuples.  But that'd
be a bit complex, I'm afraid for PG 12.
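
Just to illustrate that idea, the context creation in InitCatCache() might
look something like this (a hypothetical sketch; cc_mcxt and fixed_tuple_size
are made-up names, and slab only helps for caches whose entries really are
fixed-size):

    /* one child context per catcache, owned by CacheMemoryContext */
    if (fixed_tuple_size > 0)
        cp->cc_mcxt = SlabContextCreate(CacheMemoryContext,
                                        "catcache entries",
                                        SLAB_DEFAULT_BLOCK_SIZE,
                                        sizeof(CatCTup) + fixed_tuple_size);
    else
        cp->cc_mcxt = AllocSetContextCreate(CacheMemoryContext,
                                            "catcache entries",
                                            ALLOCSET_DEFAULT_SIZES);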


Regards
MauMau




Re: Protect syscache from bloating with negative cache entries

От
"MauMau"
Дата:
From: Alvaro Herrera
> I think seqscanning the hash table is going to be too slow; Ideriha-san
> idea of having a dlist with the entries in LRU order (where each entry
> is moved to head of list when it is touched) seemed good: it allows you
> to evict older ones when the time comes, without having to scan the rest
> of the entries.  Having a dlist means two more pointers on each cache
> entry AFAIR, so it's not a huge amount of memory.

Absolutely.  We should try to avoid unpredictably long response times
caused by an occasional unlucky batch of processing.  That makes
troubleshooting difficult when the user asks why they experience unsteady
response times.

Regards
MauMau







Re: Protect syscache from bloating with negative cache entries

От
Tomas Vondra
Дата:

On 2/8/19 2:27 PM, MauMau wrote:
> From: Tomas Vondra
>> I don't think we need to remove the expired entries right away, if
>> there are only very few of them. The cleanup requires walking the
>> hash table, which means significant fixed cost. So if there are
>> only few expired entries (say, less than 25% of the cache), we can
>> just leave them around and clean them if we happen to stumble on
>> them (although that may not be possible with dynahash, which has no
>> concept of expiration) of before enlarging the hash table.
> 
> I agree in that we don't need to evict cache entries as long as the 
> memory permits (within the control of the DBA.)
> 
> But how does the concept of expiration fit the catcache?  How would 
> the user determine the expiration time, i.e. setting of 
> syscache_prune_min_age?  If you set a small value to evict
> unnecessary entries faster, necessary entries will also be evicted.
> Some access counter would keep accessed entries longer, but some idle
> time (e.g. lunch break) can flush entries that you want to access
> after the lunch break.
> 

I'm not sure what you mean by "necessary" and "unnecessary" here. What
matters is how often an entry is accessed - if it's accessed often, it
makes sense to keep it in the cache. Otherwise evict it. Entries not
accessed for 5 minutes are clearly not accessed very often, so and
getting rid of them will not hurt the cache hit ratio very much.

So I agree with Robert a time-based approach should work well here. It
does not have the issues with setting exact syscache size limit, it's
kinda self-adaptive etc.

In a way, this is exactly what the 5 minute rule [1] says about caching.

[1] http://www.hpl.hp.com/techreports/tandem/TR-86.1.pdf


> The idea of expiration applies to the case where we want possibly 
> stale entries to vanish and load newer data upon the next access.
> For example, the TTL (time-to-live) of Memcached, Redis, DNS, ARP.
> Is the catcache based on the same idea with them?  No.
> 

I'm not sure what has this to do with those other databases.

> What we want to do is to evict never or infrequently used cache
> entries.  That's naturally the task of LRU, isn't it?  Even the high
> performance Memcached and Redis uses LRU when the cache is full.  As
> Bruce said, we don't have to be worried about the lock contention or
> something, because we're talking about the backend local cache.  Are
> we worried about the overhead of manipulating the LRU chain?  The
> current catcache already does it on every access; it calls
> dlist_move_head() to put the accessed entry to the front of the hash
> bucket.
> 

I'm certainly worried about the performance aspect of it. The syscache
is in a plenty of hot paths, so adding overhead may have significant
impact. But that depends on how complex the eviction criteria will be.

And then there may be cases conflicting with the criteria, i.e. running
into just-evicted entries much more often. This is the issue with the
initially proposed hard limits on cache sizes, where it'd be trivial to
under-size it just a little bit.

> 
>> So if we want to address this case too (and we probably want), we 
>> may need to discard the old cache memory context somehow (e.g. 
>> rebuild the cache in a new one, and copy the non-expired entries). 
>> Which is a nice opportunity to do the "full" cleanup, of course.
> 
> The straightforward, natural, and familiar way is to limit the cache 
> size, which I mentioned in some previous mail.  We should give the
> DBA the ability to control memory usage, rather than considering what
> to do after leaving the memory area grow unnecessarily too large.
> That's what a typical "cache" is, isn't it?
> 

Not sure which mail you're referring to - this seems to be the first
e-mail from you in this thread (per our archives).

I personally don't find explicit limit on cache size very attractive,
because it's rather low-level and difficult to tune, and very easy to
get it wrong (at which point you fall from a cliff). All the information
is in backend private memory, so how would you even identify syscache is
the thing you need to tune, or how would you determine the correct size?

> https://en.wikipedia.org/wiki/Cache_(computing)
> 
> "To be cost-effective and to enable efficient use of data, caches must
> be relatively small."
> 

Relatively small compared to what? It's also a question of how expensive
cache misses are.

> 
> Another relevant suboptimal idea would be to provide each catcache
> with a separate memory context, which is the child of
> CacheMemoryContext.  This gives slight optimization by using the slab
> context (slab.c) for a catcache with fixed-sized tuples.  But that'd
> be a bit complex, I'm afraid for PG 12.
> 

I don't know, but that does not seem very attractive. Each memory
context has some overhead, and it does not solve the issue of never
releasing memory to the OS. So we'd still have to rebuild the contexts
at some point, I'm afraid.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Protect syscache from bloating with negative cache entries

От
Tomas Vondra
Дата:
On 2/7/19 1:18 PM, Kyotaro HORIGUCHI wrote:
> At Thu, 07 Feb 2019 15:24:18 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
> in <20190207.152418.139132570.horiguchi.kyotaro@lab.ntt.co.jp>
 
>> I'm going to retake numbers with search-only queries.
> 
> Yeah, I was stupid.
> 
> I made a rerun of benchmark using "-S -T 30" on the server build
> with no assertion and -O2. The numbers are the best of three
> successive attempts.  The patched version is running with
> cache_target_memory = 0, cache_prune_min_age = 600 and
> cache_entry_limit = 0 but pruning doesn't happen by the workload.
> 
> master: 13393 tps
> v12   : 12625 tps (-6%)
> 
> Significant degradation is found.
> 
> Recuded frequency of dlist_move_tail by taking 1ms interval
> between two succesive updates on the same entry let the
> degradation dissapear.
> 
> patched  : 13720 tps (+2%)
> 
> I think there's still no need of such frequency. It is 100ms in
> the attched patch.
> 
> # I'm not sure the name LRU_IGNORANCE_INTERVAL makes sens..
> 

Hi,

I've done a bunch of benchmarks on v13, and I don't see any serious
regression either. Each test creates a number of tables (100, 1k, 10k,
100k and 1M) and then runs SELECT queries on them. The tables are
accessed randomly - with either uniform or exponential distribution. For
each combination there are 5 runs, 60 seconds each (see the attached
shell scripts, it should be pretty obvious).

I've done the tests on two different machines - small one (i5 with 8GB
of RAM) and large one (e5-2620v4 with 64GB RAM), but the behavior is
almost exactly the same (with the exception of 1M tables, which does not
fit into RAM on the smaller one).

On the xeon, the results (throughput compared to master) look like this:


    uniform           100     1000    10000   100000   1000000
   ------------------------------------------------------------
    v13           105.04%  100.28%  102.96%  102.11%   101.54%
    v13 (nodata)   97.05%   98.30%   97.42%   96.60%   107.55%


    exponential       100     1000    10000   100000   1000000
   ------------------------------------------------------------
    v13           100.04%  103.48%  101.70%   98.56%   103.20%
    v13 (nodata)   97.12%   98.43%   98.86%   98.48%   104.94%

The "nodata" case means the tables were empty (so no files created),
while in the other case each table contained 1 row.

Per the results it's mostly break even, and in some cases there is
actually a measurable improvement.

That being said, the question is whether the patch actually reduces
memory usage in a useful way - that's not something this benchmark
validates. I plan to modify the tests to make pgbench script
time-dependent (i.e. to pick a subset of tables depending on time).

A couple of things I've happened to notice during a quick review:

1) The sgml docs in 0002 talk about "syscache_memory_target" and
"syscache_prune_min_age", but those options were renamed to just
"cache_memory_target" and "cache_prune_min_age".

2) "cache_entry_limit" is not mentioned in sgml docs at all, and it's
defined three times in guc.c for some reason.

3) I don't see why to define PRUNE_BY_AGE and PRUNE_BY_NUMBER, instead
of just using two bool variables prune_by_age and prune_by_number doing
the same thing.

4) I'm not entirely sure about using stmtStartTimestamp. Doesn't that
pretty much mean long-running statements will set the lastaccess to very
old timestamp? Also, it means that long-running statements (like a PL
function accessing a bunch of tables) won't do any eviction at all, no?
AFAICS we'll set the timestamp only once, at the very beginning.

I wonder whether using some other timestamp source (like a timestamp
updated regularly from a timer, or something like that) would be better.

5) There are two fread() calls in 0003 triggering a compiler warning
about unused return value.
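
The usual fix is to check the result explicitly, e.g. something along these
lines (a sketch only - the file handle and variable names here are made up,
not taken from the 0003 patch):

    if (fread(&entry, sizeof(entry), 1, fp) != 1)
        elog(ERROR, "could not read catcache statistics file");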


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

RE: Protect syscache from bloating with negative cache entries

От
"Tsunakawa, Takayuki"
Дата:
From: Tomas Vondra <tomas.vondra@2ndquadrant.com>
> I'm not sure what you mean by "necessary" and "unnecessary" here. What
> matters is how often an entry is accessed - if it's accessed often, it makes sense
> to keep it in the cache. Otherwise evict it. Entries not accessed for 5 minutes are
> clearly not accessed very often, so and getting rid of them will not hurt the
> cache hit ratio very much.

Right, "necessary" and "unnecessary" were imprecise, and it matters how frequent the entries are accessed.  What made
me say "unnecessary" is the pg_statistic entry left by CREATE/DROP TEMP TABLE which is never accessed again.
 


> So I agree with Robert a time-based approach should work well here. It does
> not have the issues with setting exact syscache size limit, it's kinda self-adaptive
> etc.
> 
> In a way, this is exactly what the 5 minute rule [1] says about caching.
> 
> [1] http://www.hpl.hp.com/techreports/tandem/TR-86.1.pdf

Then, can we just set 5min to syscache_prune_min_age?  Otherwise, how can users set the expiration period?


> > The idea of expiration applies to the case where we want possibly
> > stale entries to vanish and load newer data upon the next access.
> > For example, the TTL (time-to-live) of Memcached, Redis, DNS, ARP.
> > Is the catcache based on the same idea with them?  No.
> >
> 
> I'm not sure what has this to do with those other databases.

I meant that the time-based eviction is not very good, because it could cause less frequently entries to vanish even
when memory is not short.  Time-based eviction reminds me of Memcached, Redis, DNS, etc. that evicts long-lived entries
to avoid stale data, not to free space for other entries.  I think size-based eviction is sufficient like
shared_buffers, OS page cache, CPU cache, disk cache, etc.
 


> I'm certainly worried about the performance aspect of it. The syscache is in a
> plenty of hot paths, so adding overhead may have significant impact. But that
> depends on how complex the eviction criteria will be.

The LRU chain manipulation, dlist_move_head() in SearchCatCacheInternal(), may certainly incur some overhead.  If it
has visible impact, then we can do the manipulation only when the user set an upper limit on the cache size.
 

> And then there may be cases conflicting with the criteria, i.e. running into
> just-evicted entries much more often. This is the issue with the initially
> proposed hard limits on cache sizes, where it'd be trivial to under-size it just a
> little bit.

In that case, the user can just enlarge the catcache.


> Not sure which mail you're referring to - this seems to be the first e-mail from
> you in this thread (per our archives).

Sorry, MauMau is me, Takayuki Tsunakawa.


> I personally don't find explicit limit on cache size very attractive, because it's
> rather low-level and difficult to tune, and very easy to get it wrong (at which
> point you fall from a cliff). All the information is in backend private memory, so
> how would you even identify syscache is the thing you need to tune, or how
> would you determine the correct size?

Just like other caches, we can present a view that shows the hits, misses, and the hit ratio of the entire catcaches.
If the hit ratio is low, the user can enlarge the catcache size.  That's what Oracle and MySQL do as I referred to in
this thread.  The tuning parameter is the size.  That's all.  Besides, the v13 patch has as many as 4 parameters:
cache_memory_target, cache_prune_min_age, cache_entry_limit, cache_entry_limit_prune_ratio.  I don't think I can give
the user good intuitive advice on how to tune these.
 


> > https://en.wikipedia.org/wiki/Cache_(computing)
> >
> > "To be cost-effective and to enable efficient use of data, caches must
> > be relatively small."
> >
> 
> Relatively small compared to what? It's also a question of how expensive cache
> misses are.

I guess the author meant that the cache is "relatively small" compared to the underlying storage: CPU cache is smaller
than DRAM, DRAM is smaller than SSD/HDD.  In our case, we have to pay more attention to limit the catcache memory
consumption, especially because they are duplicated in multiple backend processes.
 


> I don't know, but that does not seem very attractive. Each memory context has
> some overhead, and it does not solve the issue of never releasing memory to
> the OS. So we'd still have to rebuild the contexts at some point, I'm afraid.

I think there is little additional overhead on each catcache access -- processing overhead is the same as when using
aset, and the memory overhead is as much as several dozens (which is the number of catcaches) of MemoryContext
structure. The slab context (slab.c) returns empty blocks to OS unlike the allocation context (aset.c).
 


Regards
Takayuki Tsunakawa


RE: Protect syscache from bloating with negative cache entries

От
"Tsunakawa, Takayuki"
Дата:
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> Recuded frequency of dlist_move_tail by taking 1ms interval between two
> succesive updates on the same entry let the degradation dissapear.
> 
> patched  : 13720 tps (+2%)

What do you think contributed to this performance increase?  Or do you think this is just a measurement variation?

Most of my previous comments also seem to apply to v13, so let me repost them below:


(1)

(1)
+/* GUC variable to define the minimum age of entries that will be cosidered to
+    /* initilize catcache reference clock if haven't done yet */

cosidered -> considered
initilize -> initialize

I remember I saw some other wrong spelling and/or missing words, which I forgot (sorry).


(2)
Only the doc prefixes "sys" to the new parameter names.  Other places don't have it.  I think we should prefix sys,
because relcache and plancache should be configurable separately because of their different usage patterns/lifecycle.
 


(3)
The doc doesn't describe the unit of syscache_memory_target.  Kilobytes?


(4)
+    hash_size = cp->cc_nbuckets * sizeof(dlist_head);
+        tupsize = sizeof(CatCTup) +    MAXIMUM_ALIGNOF + dtp->t_len;
+        tupsize = sizeof(CatCTup);

GetMemoryChunkSpace() should be used to include the memory context overhead.  That's what the files in
src/backend/utils/sort/ do.
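
In other words, something along these lines (only a sketch, using the field
names from the patch):

    CatCTup *ct = (CatCTup *) palloc(sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len);

    /* account for what the allocator actually reserved, not just the request */
    cache->cc_memusage += GetMemoryChunkSpace(ct);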
 


(5)
+            if (entry_age > cache_prune_min_age)

">=" instead of ">"?


(6)
+                    if (!ct->c_list || ct->c_list->refcount == 0)
+                    {
+                        CatCacheRemoveCTup(cp, ct);

It's better to write "ct->c_list == NULL" to follow the style in this file.

"ct->refcount == 0" should also be checked prior to removing the catcache tuple, just in case the tuple hasn't been
released for a long time, which might hardly happen.
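
That is, roughly (sketch):

    if (ct->refcount == 0 &&
        (ct->c_list == NULL || ct->c_list->refcount == 0))
        CatCacheRemoveCTup(cp, ct);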
 


(7)
CatalogCacheCreateEntry

+    int            tupsize = 0;
     if (ntp)
     {
         int            i;
+        int            tupsize;

tupsize is defined twice.



(8)
CatalogCacheCreateEntry

In the negative entry case, the memory allocated by CatCacheCopyKeys() is not counted.  I'm afraid that's not
negligible.


(9)
The memory for CatCList is not taken into account for syscache_memory_target.


Regards
Takayuki Tsunakawa




Re: Protect syscache from bloating with negative cache entries

От
Tomas Vondra
Дата:

On 2/12/19 1:49 AM, Tsunakawa, Takayuki wrote:
> From: Tomas Vondra <tomas.vondra@2ndquadrant.com>
>> I'm not sure what you mean by "necessary" and "unnecessary" here. What
>> matters is how often an entry is accessed - if it's accessed often, it makes sense
>> to keep it in the cache. Otherwise evict it. Entries not accessed for 5 minutes are
>> clearly not accessed very often, so and getting rid of them will not hurt the
>> cache hit ratio very much.
> 
> Right, "necessary" and "unnecessary" were imprecise, and it matters
> how frequent the entries are accessed.  What made me say "unnecessary"
> is the pg_statistic entry left by CREATE/DROP TEMP TABLE which is never
> accessed again.
> 

OK, understood.

>> So I agree with Robert a time-based approach should work well here. It does
>> not have the issues with setting exact syscache size limit, it's kinda self-adaptive
>> etc.
>>
>> In a way, this is exactly what the 5 minute rule [1] says about caching.
>>
>> [1] http://www.hpl.hp.com/techreports/tandem/TR-86.1.pdf
> 
> Then, can we just set 5min to syscache_prune_min_age?  Otherwise,
> how can users set the expiration period?
> 

I believe so.

>>> The idea of expiration applies to the case where we want possibly
>>> stale entries to vanish and load newer data upon the next access.
>>> For example, the TTL (time-to-live) of Memcached, Redis, DNS, ARP.
>>> Is the catcache based on the same idea with them?  No.
>>>
>>
>> I'm not sure what has this to do with those other databases.
> 
> I meant that the time-based eviction is not very good, because it
> could cause less frequently entries to vanish even when memory is not
> short.  Time-based eviction reminds me of Memcached, Redis, DNS, etc.
> that evicts long-lived entries to avoid stale data, not to free space
> for other entries.  I think size-based eviction is sufficient like
> shared_buffers, OS page cache, CPU cache, disk cache, etc.
> 

Right. But the logic behind time-based approach is that evicting such
entries should not cause any issues exactly because they are accessed
infrequently. It might incur some latency when we need them for the
first time after the eviction, but IMHO that's acceptable (although I
see Andres did not like that).

FWIW we might even evict entries after some time passes since inserting
them into the cache - that's what memcached et al do, IIRC. The logic is
that frequently accessed entries will get immediately loaded back (thus
keeping cache hit ratio high). But there are reasons why the other dbs
do that - like not having any cache invalidation (unlike us).

That being said, having a "minimal size" threshold before starting with
the time-based eviction may be a good idea.

>> I'm certainly worried about the performance aspect of it. The syscache is in a
>> plenty of hot paths, so adding overhead may have significant impact. But that
>> depends on how complex the eviction criteria will be.
> 
> The LRU chain manipulation, dlist_move_head() in
> SearchCatCacheInternal(), may certainly incur some overhead.  If it has
> visible impact, then we can do the manipulation only when the user set
> an upper limit on the cache size.
> 

I think the benchmarks done so far suggest the extra overhead is within
noise. So unless we manage to make it much more expensive, we should be
OK I think.

>> And then there may be cases conflicting with the criteria, i.e. running into
>> just-evicted entries much more often. This is the issue with the initially
>> proposed hard limits on cache sizes, where it'd be trivial to under-size it just a
>> little bit.
> 
> In that case, the user can just enlarge the catcache.
> 

IMHO the main issues with this are

(a) It's not quite clear how to determine the appropriate limit. I can
probably apply a bit of perf+gdb, but I doubt that's very nice.

(b) It's not adaptive, so systems that grow over time (e.g. by adding
schemas and other objects) will keep hitting the limit over and over.

> 
>> Not sure which mail you're referring to - this seems to be the first e-mail from
>> you in this thread (per our archives).
> 
> Sorry, MauMau is me, Takayuki Tsunakawa.
> 

Ah, of course!

> 
>> I personally don't find explicit limit on cache size very attractive, because it's
>> rather low-level and difficult to tune, and very easy to get it wrong (at which
>> point you fall from a cliff). All the information is in backend private memory, so
>> how would you even identify syscache is the thing you need to tune, or how
>> would you determine the correct size?
> 
> Just like other caches, we can present a view that shows the hits, misses, and the hit ratio of the entire catcaches.
> If the hit ratio is low, the user can enlarge the catcache size.  That's what Oracle and MySQL do as I referred to in
> this thread.  The tuning parameter is the size.  That's all.
 

How will that work, considering the caches are in private backend
memory? And each backend may have quite different characteristics, even
if they are connected to the same database?

>  Besides, the v13 patch has as many as 4 parameters: cache_memory_target, cache_prune_min_age, cache_entry_limit,
> cache_entry_limit_prune_ratio. I don't think I can give the user good intuitive advice on how to tune these.
 
> 

Isn't that more an argument for not having 4 parameters?

> 
>>> https://en.wikipedia.org/wiki/Cache_(computing)
>>>
>>> "To be cost-effective and to enable efficient use of data, caches must
>>> be relatively small."
>>>
>>
>> Relatively small compared to what? It's also a question of how expensive cache
>> misses are.
> 
> I guess the author meant that the cache is "relatively small" compared to the underlying storage: CPU cache is
> smaller than DRAM, DRAM is smaller than SSD/HDD.  In our case, we have to pay more attention to limit the catcache
> memory consumption, especially because they are duplicated in multiple backend processes.
 
> 

I don't think so. IMHO the focus there is on "cost-effective", i.e.
caches are generally more expensive than the storage, so to make them
worth it you need to make them much smaller than the main storage.
That's pretty much what the 5 minute rule is about, I think.

But I don't see how this applies to the problem at hand, because the
system is already split into storage + cache (represented by RAM). The
challenge is how to use RAM to cache various pieces of data to get the
best behavior. The problem is, we don't have a unified cache, but
multiple smaller ones (shared buffers, page cache, syscache) competing
for the same resource.

Of course, having multiple (different) copies of syscache makes it even
more difficult.

(Does this make sense, or am I just babbling nonsense?)

> 
>> I don't know, but that does not seem very attractive. Each memory context has
>> some overhead, and it does not solve the issue of never releasing memory to
>> the OS. So we'd still have to rebuild the contexts at some point, I'm afraid.
> 
> I think there is little additional overhead on each catcache access
> -- processing overhead is the same as when using aset, and the memory
> overhead is as much as several dozens (which is the number of catcaches)
> of MemoryContext structure.

Hmmm. That doesn't seem particularly terrible, I guess.

> The slab context (slab.c) returns empty blocks to OS unlike the
> allocation context (aset.c).

Slab can do that, but it requires certain allocation pattern, and I very
much doubt syscache has it. It'll be trivial to end with one active
entry on each block (which means slab can't release it).

BTW doesn't syscache store the full on-disk tuple? That doesn't seem
like a fixed-length entry, which is a requirement for slab. No?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


RE: Protect syscache from bloating with negative cache entries

От
"Tsunakawa, Takayuki"
Дата:
From: Tomas Vondra [mailto:tomas.vondra@2ndquadrant.com]
> > I meant that the time-based eviction is not very good, because it
> > could cause less frequently entries to vanish even when memory is not
> > short.  Time-based eviction reminds me of Memcached, Redis, DNS, etc.
> > that evicts long-lived entries to avoid stale data, not to free space
> > for other entries.  I think size-based eviction is sufficient like
> > shared_buffers, OS page cache, CPU cache, disk cache, etc.
> >
> 
> Right. But the logic behind time-based approach is that evicting such
> entries should not cause any issues exactly because they are accessed
> infrequently. It might incur some latency when we need them for the
> first time after the eviction, but IMHO that's acceptable (although I
> see Andres did not like that).

Yes, that's what I expressed.  That is, I'm probably with Andres.

> FWIW we might even evict entries after some time passes since inserting
> them into the cache - that's what memcached et al do, IIRC. The logic is
> that frequently accessed entries will get immediately loaded back (thus
> keeping cache hit ratio high). But there are reasons why the other dbs
> do that - like not having any cache invalidation (unlike us).

These are what Memcached and Redis do:

1. Evict entries that have lived longer than their TTLs.
This is independent of the cache size.  This is to avoid keeping stale data in the cache when the underlying data (such
as in the database) is modified.  This doesn't apply to PostgreSQL.
 

2. Evict the least recently accessed entries.
This is to make room for new entries when the cache is full.  This is similar to or the same as what PostgreSQL and other DBMSs
do for their database cache.  Oracle and MySQL also do this for their dictionary caches, where "dictionary cache"
corresponds to syscache in PostgreSQL.
 

Here's my sketch for this feature.  Although it may not meet all (contradictory) requirements as you said, it's simple
and familiar for those who have used PostgreSQL and other DBMSs.  What do you think?  The points are simplicity,
familiarity, and memory consumption control for the DBA.
 

* Add a GUC parameter syscache_size which imposes the upper limit on the total size of all catcaches, not on individual
catcache.
The naming follows effective_cache_size.  It can be syscache_mem to follow work_mem and maintenance_work_mem.
The default value is 0, which doesn't limit the cache size as now.

* A new member variable in CatCacheHeader tracks the total size of all cached entries.

* A single new LRU list in CatCacheHeader links all cache tuples in LRU order.  Each cache access,
SearchCatCacheInternal(), puts the found entry on its front.
 

* Insertion of a new catcache entry adds the entry size to the total cache size.  If the total size exceeds the limit
defined by syscache_size, most infrequently accessed entries are removed until the total cache size gets below the
limit.
This eviction results in slight overhead when the cache is full, but the response time is steady.  On the other hand,
with the proposed approach, users will wonder about mysterious long response time due to bulk entry deletions.
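
To make that concrete, the insertion path could look roughly like this (just
a sketch; ch_total_size and ch_lru_list are hypothetical names for the new
CatCacheHeader members, syscache_size is the proposed GUC, and entry_size is
a placeholder for the size of the newly inserted entry):

    /* after adding a new entry of size 'entry_size' to some catcache */
    CacheHdr->ch_total_size += entry_size;

    while (syscache_size > 0 &&
           CacheHdr->ch_total_size > (Size) syscache_size * 1024L &&
           !dlist_is_empty(&CacheHdr->ch_lru_list))
    {
        /* the least recently used entry sits at the tail of the LRU list */
        CatCTup *victim = dlist_tail_element(CatCTup, lru_node,
                                             &CacheHdr->ch_lru_list);

        /* never evict entries that are currently referenced */
        if (victim->refcount != 0 ||
            (victim->c_list && victim->c_list->refcount != 0))
            break;

        CacheHdr->ch_total_size -= victim->size;
        CatCacheRemoveCTup(victim->my_cache, victim);
    }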
 


> > In that case, the user can just enlarge the catcache.
> >
> 
> IMHO the main issues with this are
> 
> (a) It's not quite clear how to determine the appropriate limit. I can
> probably apply a bit of perf+gdb, but I doubt that's very nice.

Like Oracle and MySQL, the user should be able to see the cache hit ratio with a statistics view.


> (b) It's not adaptive, so systems that grow over time (e.g. by adding
> schemas and other objects) will keep hitting the limit over and over.

The user needs to restart the database instance to enlarge the syscache.  That's also true for shared buffers: to
accommodate a growing amount of data, the user needs to increase shared_buffers and restart the server.
 

But the current syscache is local memory, so the server may not need restart.


> > Just like other caches, we can present a view that shows the hits, misses,
> and the hit ratio of the entire catcaches.  If the hit ratio is low, the
> user can enlarge the catcache size.  That's what Oracle and MySQL do as
> I referred to in this thread.  The tuning parameter is the size.  That's
> all.
> 
> How will that work, considering the caches are in private backend
> memory? And each backend may have quite different characteristics, even
> if they are connected to the same database?

Assuming that pg_stat_syscache (pid, cache_name, hits, misses) gives the statistics, the statistics data can be stored
on the shared memory, because the number of backends and the number of catcaches are fixed.
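
A minimal sketch of such a fixed-size shared area (hypothetical names, only
to illustrate why the fixed dimensions make this easy; the usual #includes
are omitted):

    /* one slot per (backend, catcache) pair */
    typedef struct SyscacheStatsSlot
    {
        pg_atomic_uint64    hits;
        pg_atomic_uint64    misses;
    } SyscacheStatsSlot;

    /* sized once at startup: MaxBackends * SysCacheSize slots */
    Size
    SyscacheStatsShmemSize(void)
    {
        return mul_size(mul_size(MaxBackends, SysCacheSize),
                        sizeof(SyscacheStatsSlot));
    }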
 


> > I guess the author meant that the cache is "relatively small" compared
> to the underlying storage: CPU cache is smaller than DRAM, DRAM is smaller
> than SSD/HDD.  In our case, we have to pay more attention to limit the
> catcache memory consumption, especially because they are duplicated in
> multiple backend processes.
> >
> 
> I don't think so. IMHO the focus there is on "cost-effective", i.e.
> caches are generally more expensive than the storage, so to make them
> worth it you need to make them much smaller than the main storage.

I think we're saying the same thing.  Perhaps my English is not good enough.


> But I don't see how this applies to the problem at hand, because the
> system is already split into storage + cache (represented by RAM). The
> challenge is how to use RAM to cache various pieces of data to get the
> best behavior. The problem is, we don't have a unified cache, but
> multiple smaller ones (shared buffers, page cache, syscache) competing
> for the same resource.

You're right.  On the other hand, we can consider syscache, shared buffers, and page cache as different tiers of
storage, even though they are all on DRAM.  syscache caches some data from shared buffers for efficient access.  If we
use much memory for syscache, there's less memory for caching user data in shared buffers and page cache.  That's a
normal tradeoff of caches.
 


> Slab can do that, but it requires certain allocation pattern, and I very
> much doubt syscache has it. It'll be trivial to end with one active
> entry on each block (which means slab can't release it).

I expect so, too, although slab context makes efforts to mitigate that possibility like this:

 *  This also allows various optimizations - for example when searching for
 *  free chunk, the allocator reuses space from the fullest blocks first, in
 *  the hope that some of the less full blocks will get completely empty (and
 *  returned back to the OS).


> BTW doesn't syscache store the full on-disk tuple? That doesn't seem
> like a fixed-length entry, which is a requirement for slab. No?

Some system catalogs are fixed in size like pg_am and pg_amop.  But I guess the number of such catalogs is small.
Dominant catalogs like pg_class and pg_attribute are variable size.  So using different memory contexts for limited
catalogs might not show any visible performance improvement nor memory reduction.
 


Regards
Takayuki Tsunakawa




Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
At Fri, 8 Feb 2019 09:42:20 +0000, "Ideriha, Takeshi" <ideriha.takeshi@jp.fujitsu.com> wrote in
<4E72940DA2BF16479384A86D54D0988A6F41EDD1@G01JPEXMBKW04>
> >From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> >I made a rerun of benchmark using "-S -T 30" on the server build with no assertion and
> >-O2. The numbers are the best of three successive attempts.  The patched version is
> >running with cache_target_memory = 0, cache_prune_min_age = 600 and
> >cache_entry_limit = 0 but pruning doesn't happen by the workload.
> >
> >master: 13393 tps
> >v12   : 12625 tps (-6%)
> >
> >Significant degradation is found.
> >
> >Recuded frequency of dlist_move_tail by taking 1ms interval between two succesive
> >updates on the same entry let the degradation dissapear.
> >
> >patched  : 13720 tps (+2%)
> 
> It would be good to introduce some interval.
> I followed your benchmark (initialized scale factor=10 and others are same option) 
> and found the same tendency. 
> These are average of 5 trials.
> master:   7640.000538 
> patch_v12:7417.981378 (3 % down against master)
> patch_v13:7645.071787 (almost same as master)

Thank you for cross checking.

> These cases are not pruning happen workload as you mentioned.
> I'd like to do benchmark of cache-pruning-case as well.
> To demonstrate cache-pruning-case
> right now I'm making hundreds of partitioned table and run select query for each partitioned table
> using pgbench custom file. Maybe using small number of cache_prune_min_age or hard limit would be better.
> Are there any good model?

As per Tomas' benchmark, it doesn't seem to do any harm in that case.

> ># I'm not sure the name LRU_IGNORANCE_INTERVAL makes sens..
> How about MIN_LRU_UPDATE_INTERVAL?

Looks fine. Fixed in the next version. Thank you for the suggestion.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
Thank you for testing and the commits, Tomas.

At Sat, 9 Feb 2019 19:09:59 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<74386116-0bc5-84f2-e614-0cff19aca2de@2ndquadrant.com>
> On 2/7/19 1:18 PM, Kyotaro HORIGUCHI wrote:
> > At Thu, 07 Feb 2019 15:24:18 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
> > in <20190207.152418.139132570.horiguchi.kyotaro@lab.ntt.co.jp>
 
> I've done a bunch of benchmarks on v13, and I don't see any serious
> regression either. Each test creates a number of tables (100, 1k, 10k,
> 100k and 1M) and then runs SELECT queries on them. The tables are
> accessed randomly - with either uniform or exponential distribution. For
> each combination there are 5 runs, 60 seconds each (see the attached
> shell scripts, it should be pretty obvious).
> 
> I've done the tests on two different machines - small one (i5 with 8GB
> of RAM) and large one (e5-2620v4 with 64GB RAM), but the behavior is
> almost exactly the same (with the exception of 1M tables, which does not
> fit into RAM on the smaller one).
> 
> On the xeon, the results (throughput compared to master) look like this:
> 
> 
>     uniform           100     1000    10000   100000   1000000
>    ------------------------------------------------------------
>     v13           105.04%  100.28%  102.96%  102.11%   101.54%
>     v13 (nodata)   97.05%   98.30%   97.42%   96.60%   107.55%
> 
> 
>     exponential       100     1000    10000   100000   1000000
>    ------------------------------------------------------------
>     v13           100.04%  103.48%  101.70%   98.56%   103.20%
>     v13 (nodata)   97.12%   98.43%   98.86%   98.48%   104.94%
> 
> The "nodata" case means the tables were empty (so no files created),
> while in the other case each table contained 1 row.
> 
> Per the results it's mostly break even, and in some cases there is
> actually a measurable improvement.

Great! I guess it comes from reduced size of hash?

> That being said, the question is whether the patch actually reduces
> memory usage in a useful way - that's not something this benchmark
> validates. I plan to modify the tests to make pgbench script
> time-dependent (i.e. to pick a subset of tables depending on time).

Thank you.

> A couple of things I've happened to notice during a quick review:
> 
> 1) The sgml docs in 0002 talk about "syscache_memory_target" and
> "syscache_prune_min_age", but those options were renamed to just
> "cache_memory_target" and "cache_prune_min_age".

I'm at a loss as to what to call the syscache for users. I think it is
"catalog cache". The most basic component is called catcache, which is
covered by the syscache layer; neither of them is revealed to
users, and it is shown to users as "catalog cache".

Do "catalog_cache_prune_min_age", "catalog_cache_memory_target", (if
it exists) "catalog_cache_entry_limit" and
"catalog_cache_prune_ratio" make sense?

> 2) "cache_entry_limit" is not mentioned in sgml docs at all, and it's
> defined three times in guc.c for some reason.

It is just a PoC, added to show how it looks. (The multiple
instances must be a result of a convulsion of my fingers..) I
think this is not useful unless it can be specified on a per-relation
or per-cache basis. I'll remove the GUC and add reloptions for
the purpose. (But it won't work for pg_class and pg_attribute
for now.)

> 3) I don't see why to define PRUNE_BY_AGE and PRUNE_BY_NUMBER, instead
> of just using two bool variables prune_by_age and prune_by_number doing
> the same thing.

Agreed. That was a bit of memory-stinginess, which is useless there.

> 4) I'm not entirely sure about using stmtStartTimestamp. Doesn't that
> pretty much mean long-running statements will set the lastaccess to very
> old timestamp? Also, it means that long-running statements (like a PL
> function accessing a bunch of tables) won't do any eviction at all, no?
> AFAICS we'll set the timestamp only once, at the very beginning.
> 
> I wonder whether using some other timestamp source (like a timestamp
> updated regularly from a timer, or something like that) would be better.

I didn't consider planning that happens within a function. If
5min is the default for catalog_cache_prune_min_age, 10% of it
(30s) seems enough, and gettimeofday() at such intervals wouldn't
affect foreground jobs. I'd choose catalog_c_p_m_age/10 rather
than a fixed value of 30s, with 1s as the minimum.

I observed significant degradation by setting up a timer at every
statement start. The patch does the following to get rid of
the degradation.

(1) Every statement updates the catcache timestamp as currently
    does.  (SetCatCacheClock)

(2) The timestamp is also updated periodically using a timer,
   separately from (1). The timer is started at (1) if it is not
   already running.  (SetCatCacheClock, UpdateCatCacheClock)

(3) Statement end and transaction end don't stop the timer, to
   avoid the overhead of setting it up again.

(4) But it stops on error. I chose not to change the behavior in
    PostgresMain that it kills all timers on error.

(5) Also changing the GUC catalog_cache_prune_min_age kills the
   timer, in order to reflect the change quickly especially when
   it is shortened.

> 5) There are two fread() calls in 0003 triggering a compiler warning
> about unused return value.

Ugg. It's in PoC style... (But my compiler didn't complain about
it) Maybe fixed.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
At Tue, 12 Feb 2019 01:02:39 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in
<0A3221C70F24FB45833433255569204D1FB972A6@G01JPEXMBYT05>
> From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> > Recuded frequency of dlist_move_tail by taking 1ms interval between two
> > succesive updates on the same entry let the degradation dissapear.
> > 
> > patched  : 13720 tps (+2%)
> 
> What do you think contributed to this performance increase?  Or do you think this is just a measurement variation?
> 
> Most of my previous comments also seem to apply to v13, so let me repost them below:
> 
> 
> (1)
> 
> (1)
> +/* GUC variable to define the minimum age of entries that will be cosidered to
> +    /* initilize catcache reference clock if haven't done yet */
> 
> cosidered -> considered
> initilize -> initialize

Fixed. I found "databsae", "temprary", "resturns",
"If'force'"(missing space), "aginst", "maintan". And all fixed.

> I remember I saw some other wrong spelling and/or missing words, which I forgot (sorry).

Thank you for pointing some of them.

> (2)
> Only the doc prefixes "sys" to the new parameter names.  Other places don't have it.  I think we should prefix sys,
> because relcache and plancache should be configurable separately because of their different usage patterns/lifecycle.
 

I tend to agree. They are already removed in this patchset. The
names are changed to "catalog_cache_*" in the new version.

> (3)
> The doc doesn't describe the unit of syscache_memory_target.  Kilobytes?

Added.

> (4)
> +    hash_size = cp->cc_nbuckets * sizeof(dlist_head);
> +        tupsize = sizeof(CatCTup) +    MAXIMUM_ALIGNOF + dtp->t_len;
> +        tupsize = sizeof(CatCTup);
> 
> GetMemoryChunkSpace() should be used to include the memory context overhead.  That's what the files in
> src/backend/utils/sort/ do.
 

Thanks. Done. It now includes the bucket and cache header parts but still
excludes clists.  Renamed from tupsize to memusage.

> (5)
> +            if (entry_age > cache_prune_min_age)
> 
> ">=" instead of ">"?

I didn't think it was serious, but it is better. Fixed.

> (6)
> +                    if (!ct->c_list || ct->c_list->refcount == 0)
> +                    {
> +                        CatCacheRemoveCTup(cp, ct);
> 
> It's better to write "ct->c_list == NULL" to follow the style in this file.
> 
> "ct->refcount == 0" should also be checked prior to removing the catcache tuple, just in case the tuple hasn't been
> released for a long time, which might hardly happen.
 

Yeah, I fixed it in v12. This no longer removes an entry in
use. (if (c_list) is used in the file.)

> (7)
> CatalogCacheCreateEntry
> 
> +    int            tupsize = 0;
>      if (ntp)
>      {
>          int            i;
> +        int            tupsize;
> 
> tupsize is defined twice.

The second tupsize was bogus, but the first is removed in this
version. Now memory usage of an entry is calculated as a chunk
size.

> (8)
> CatalogCacheCreateEntry
> 
> In the negative entry case, the memory allocated by CatCacheCopyKeys() is not counted.  I'm afraid that's not
negligible.

Right. Fixed.

> (9)
> The memory for CatCList is not taken into account for syscache_memory_target.

Yeah, this is intentional since a CatCList is short-lived. Comment added.

|     * Don't waste a time by counting the list in catcache memory usage,
|     * since a list doesn't persist for a long time
|     */
|    cl = (CatCList *)
|      palloc(offsetof(CatCList, members) + nmembers * sizeof(CatCTup *));


Please find the attached, which is the new version v14 addressing
Tomas', Ideriha-san's and your comments.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 3b24233b1891b967ccac65a4d21ed0207037578b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 7 Feb 2019 14:56:07 +0900
Subject: [PATCH 1/3] Add dlist_move_tail

We have dlist_push_head/tail and dlist_move_head but not
dlist_move_tail. Add it.
---
 src/include/lib/ilist.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/src/include/lib/ilist.h b/src/include/lib/ilist.h
index b1a5974ee4..659ab1ac87 100644
--- a/src/include/lib/ilist.h
+++ b/src/include/lib/ilist.h
@@ -394,6 +394,25 @@ dlist_move_head(dlist_head *head, dlist_node *node)
     dlist_check(head);
 }
 
+/*
+ * Move element from its current position in the list to the tail position in
+ * the same list.
+ *
+ * Undefined behaviour if 'node' is not already part of the list.
+ */
+static inline void
+dlist_move_tail(dlist_head *head, dlist_node *node)
+{
+    /* fast path if it's already at the tail */
+    if (head->head.prev == node)
+        return;
+
+    dlist_delete(node);
+    dlist_push_tail(head, node);
+
+    dlist_check(head);
+}
+
 /*
  * Check whether 'node' has a following node.
  * Caution: unreliable if 'node' is not in the list.
-- 
2.16.3

From 5031833af1777c4c6a6bf8daf32b6a3f428ccd79 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 16 Oct 2018 13:04:30 +0900
Subject: [PATCH 2/3] Remove entries that haven't been used for a certain time

Catcache entries can be left alone for several reasons, and it is not
desirable that they eat up memory. This patch adds consideration of
removing entries that haven't been used for a certain time before
enlarging the hash array.

This also can put a hard limit on the number of catcache entries.
---
 doc/src/sgml/config.sgml                      |  41 ++++
 src/backend/tcop/postgres.c                   |  13 ++
 src/backend/utils/cache/catcache.c            | 285 +++++++++++++++++++++++++-
 src/backend/utils/init/globals.c              |   1 +
 src/backend/utils/init/postinit.c             |  11 +
 src/backend/utils/misc/guc.c                  |  43 ++++
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/miscadmin.h                       |   1 +
 src/include/utils/catcache.h                  |  49 ++++-
 src/include/utils/timeout.h                   |   1 +
 10 files changed, 440 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 07b847a8e9..71d784b6fe 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1661,6 +1661,47 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-catalog-cache-prune-min-age" xreflabel="catalog_cache_prune_min_age">
+      <term><varname>catalog_cache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>catalog_cache_prune_min_age</varname> configuration
+       parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+
+        Specifies the minimum amount of unused time in seconds at which a
+        catalog cache entry is considered to be removed. -1 indicates that
+        this feature is disabled at all. The value defaults to 300 seconds
+        (<literal>5 minutes</literal>). The catalog cache entries that are
+        not used for the duration can be removed to prevent bloat. This
+        behavior is suppressed until the size of a catalog cache exceeds
+        <xref linkend="guc-catalog-cache-memory-target"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-catalog-cache-memory-target" xreflabel="catalog_cache_memory_target">
+      <term><varname>catalog_cache_memory_target</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>catalog_cache_memory_target</varname> configuration
+       parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory, in kilobytes, to which the
+        catalog cache can grow without any pruning. The value defaults to 0,
+        indicating that pruning is always considered. After exceeding this
+        size, catalog cache pruning is considered according to
+        <xref linkend="guc-catalog-cache-prune-min-age"/>. If you need to keep
+        a certain amount of catalog cache entries with intermittent usage, try
+        increasing this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 36cfd507b2..f192ee2ca6 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -71,6 +71,7 @@
 #include "tcop/pquery.h"
 #include "tcop/tcopprot.h"
 #include "tcop/utility.h"
+#include "utils/catcache.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -2584,6 +2585,7 @@ start_xact_command(void)
      * not desired, the timeout has to be disabled explicitly.
      */
     enable_statement_timeout();
+    SetCatCacheClock(GetCurrentStatementStartTimestamp());
 }
 
 static void
@@ -3159,6 +3161,14 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (CatcacheClockTimeoutPending)
+    {
+        CatcacheClockTimeoutPending = 0;
+
+        /* Update the timestamp, then set up the next timeout */
+        UpdateCatCacheClock();
+    }
 }
 
 
@@ -4021,6 +4031,9 @@ PostgresMain(int argc, char *argv[],
         QueryCancelPending = false; /* second to avoid race condition */
         stmt_timeout_active = false;
 
+        /* get sync with the timer state */
+        catcache_clock_timeout_active = false;
+
         /* Not reading from the client anymore. */
         DoingCommandRead = false;
 
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 258a1d64cc..0195e19976 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -39,6 +39,7 @@
 #include "utils/rel.h"
 #include "utils/resowner_private.h"
 #include "utils/syscache.h"
+#include "utils/timeout.h"
 
 
  /* #define CACHEDEBUG */    /* turns DEBUG elogs on */
@@ -71,9 +72,43 @@
 #define CACHE6_elog(a,b,c,d,e,f,g)
 #endif
 
+/* GUC variable to define the minimum age, in seconds, of entries that will
+ * be considered for eviction. This variable is shared among various cache
+ * mechanisms.
+ */
+int catalog_cache_prune_min_age = 300;
+
+/*
+ * GUC variable to define the minimum size of hash to consider entry eviction.
+ * This variable is shared among various cache mechanisms.
+ */
+int catalog_cache_memory_target = 0;
+
+/*
+ * GUC to limit the number of entries. Entries are removed when their number
+ * goes above catalog_cache_entry_limit, leaving the newer entries by
+ * the ratio specified by catalog_cache_prune_ratio.
+ */
+int catalog_cache_entry_limit = 0;
+double catalog_cache_prune_ratio = 0.8;
+
+/*
+ * Flag to keep track of whether catcache clock timer is active.
+ */
+bool catcache_clock_timeout_active = false;
+
+/*
+ * Minimum interval between two successive moves of a cache entry in the LRU list,
+ * in microseconds.
+ */
+#define MIN_LRU_UPDATE_INTERVAL 100000    /* 100ms */
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Clock used to record the last accessed time of a catcache record. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -481,6 +516,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
 
     /* delink from linked list */
     dlist_delete(&ct->cache_elem);
+    dlist_delete(&ct->lru_node);
 
     /*
      * Free keys when we're dealing with a negative entry, normal entries just
@@ -490,6 +526,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
         CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys,
                          cache->cc_keyno, ct->keys);
 
+    cache->cc_memusage -= ct->size;
     pfree(ct);
 
     --cache->cc_ntup;
@@ -841,7 +878,13 @@ InitCatCache(int id,
     cp->cc_nkeys = nkeys;
     for (i = 0; i < nkeys; ++i)
         cp->cc_keyno[i] = key[i];
+    cp->cc_memusage =
+        CacheMemoryContext->methods->get_chunk_space(CacheMemoryContext,
+                                                     cp) +
+        CacheMemoryContext->methods->get_chunk_space(CacheMemoryContext,
+                                                     cp->cc_bucket);
 
+    dlist_init(&cp->cc_lru_list);
     /*
      * new cache is initialized as far as we can go for now. print some
      * debugging information, if appropriate.
@@ -858,9 +901,191 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize catcache reference clock if haven't done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * Helper routine for SetCatCacheClock and UpdateCatCacheClock.
+ *
+ * We need to maintain the catcache clock during a long query.
+ */
+void
+SetupCatCacheClockTimer(void)
+{
+    long delay;
+
+    /* stop timer if not needed */
+    if (catalog_cache_prune_min_age == 0)
+    {
+        catcache_clock_timeout_active = false;
+        return;
+    }
+
+    /* One 10th of the variable, in milliseconds */
+    delay  = catalog_cache_prune_min_age * 1000/10;
+
+    /* Lower limit is 1 second */
+    if (delay < 1000)
+        delay = 1000;
+
+    enable_timeout_after(CATCACHE_CLOCK_TIMEOUT, delay);
+
+    catcache_clock_timeout_active = true;
+}
+
+/*
+ * Update catcacheclock: this is intended to be called from
+ * CATCACHE_CLOCK_TIMEOUT. The interval is expected to be more than 1 second (see
+ * above), so calling GetCurrentTimestamp() doesn't hurt.
+ */
+void
+UpdateCatCacheClock(void)
+{
+    catcacheclock = GetCurrentTimestamp();
+    SetupCatCacheClockTimer();
+}
+
+/*
+ * It may take an unexpectedly long time before the next clock update when
+ * catalog_cache_prune_min_age gets shorter. Disabling the current timer lets
+ * the next update happen at the expected interval. We don't necessarily
+ * require this when the age is increased, but there is no need to avoid
+ * disabling it either.
+ */
+void
+assign_catalog_cache_prune_min_age(int newval, void *extra)
+{
+    if (catcache_clock_timeout_active)
+        disable_timeout(CATCACHE_CLOCK_TIMEOUT, false);
+
+    catcache_clock_timeout_active = false;
+}
+
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left alone for several reasons. We remove them if
+ * they are not accessed for a certain time to prevent catcache from
+ * bloating. The eviction is performed with an algorithm similar to buffer
+ * eviction, using an access counter. Entries that are accessed several times
+ * can live longer than those that have had less access in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int            nremoved = 0;
+    size_t        hash_size;
+    int            nelems_before = cp->cc_ntup;
+    int            ndelelems = 0;
+    bool        prune_by_age = false;
+    bool        prune_by_number = false;
+    dlist_mutable_iter    iter;
+
+    if (catalog_cache_prune_min_age >= 0)
+    {
+        /* prune only if the size of the hash is above the target */
+
+        hash_size = cp->cc_nbuckets * sizeof(dlist_head);
+        if (hash_size + cp->cc_memusage >
+            (Size) catalog_cache_memory_target * 1024L)
+            prune_by_age = true;
+    }
+
+    if (catalog_cache_entry_limit > 0 &&
+        nelems_before >= catalog_cache_entry_limit)
+    {
+        ndelelems = nelems_before -
+            (int) (catalog_cache_entry_limit * catalog_cache_prune_ratio);
+
+        /* an arbitrary lower limit.. */
+        if (ndelelems < 256)
+            ndelelems = 256;
+        if (ndelelems > nelems_before)
+            ndelelems = nelems_before;
+
+        prune_by_number = true;
+    }
+
+    /* Return immediately if no pruning is wanted */
+    if (!prune_by_age && !prune_by_number)
+        return false;
+
+    /* Scan over LRU to find entries to remove */
+    dlist_foreach_modify(iter, &cp->cc_lru_list)
+    {
+        CatCTup    *ct = dlist_container(CatCTup, lru_node, iter.cur);
+        bool        remove_this = false;
+
+        /* We don't remove referenced entries */
+        if (ct->refcount != 0 ||
+            (ct->c_list && ct->c_list->refcount != 0))
+            continue;
+
+        /* check against age */
+        if (prune_by_age)
+        {
+            long    entry_age;
+            int        us;
+
+            /*
+             * Calculate the duration from the time of the last access to the
+             * "current" time. Since catcacheclock is not advanced within a
+             * transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+            if (entry_age < catalog_cache_prune_min_age)
+            {
+                /* remaining entries are all recent enough; stop scanning */
+                prune_by_age = false;
+                break;
+            }
+            /*
+             * Entries that have not been accessed since the last pruning are
+             * removed after that many seconds, while entries that have been
+             * accessed several times are left alone for up to three times
+             * that duration before removal. We don't try to shrink the
+             * buckets, since pruning effectively caps catcache expansion in
+             * the long term.
+             */
+            if (ct->naccess > 0)
+                ct->naccess--;
+            else 
+                remove_this = true;
+        }
+
+        /* check against entry number */
+        if (prune_by_number)
+        {
+            if (nremoved < ndelelems)
+                remove_this = true;
+            else
+                prune_by_number = false; /* we're satisfied */
+        }
+
+        /* exit immediately if all finished */
+        if (!prune_by_age && !prune_by_number)
+            break;
+
+        /* do the work */
+        if (remove_this)
+        {
+            CatCacheRemoveCTup(cp, ct);
+            nremoved++;
+        }
+    }
+
+    if (nremoved > 0)
+        elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d",
+             cp->id, cp->cc_relname, nremoved, nelems_before);
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -878,6 +1103,13 @@ RehashCatCache(CatCache *cp)
     newnbuckets = cp->cc_nbuckets * 2;
     newbucket = (dlist_head *) MemoryContextAllocZero(CacheMemoryContext, newnbuckets * sizeof(dlist_head));
 
+    /* recalculate memory usage from the first */
+    cp->cc_memusage =
+        CacheMemoryContext->methods->get_chunk_space(CacheMemoryContext,
+                                                     cp) +
+        CacheMemoryContext->methods->get_chunk_space(CacheMemoryContext,
+                                                     newbucket);
+
     /* Move all entries from old hash table to new. */
     for (i = 0; i < cp->cc_nbuckets; i++)
     {
@@ -890,6 +1122,7 @@ RehashCatCache(CatCache *cp)
 
             dlist_delete(iter.cur);
             dlist_push_head(&newbucket[hashIndex], &ct->cache_elem);
+            cp->cc_memusage += ct->size;
         }
     }
 
@@ -1274,6 +1507,21 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /* Update access information for pruning */
+        if (ct->naccess < 2)
+            ct->naccess++;
+
+        /*
+         * Avoid updating the LRU too frequently. catalog_cache_prune_min_age
+         * can be changed within a session, so the LRU must be maintained
+         * regardless of its current value.
+         */
+        if (catcacheclock - ct->lastaccess > MIN_LRU_UPDATE_INTERVAL)
+        {
+            ct->lastaccess = catcacheclock;
+            dlist_move_tail(&cache->cc_lru_list, &ct->lru_node);
+        }
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1709,6 +1957,11 @@ SearchCatCacheList(CatCache *cache,
         /* Now we can build the CatCList entry. */
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
         nmembers = list_length(ctlist);
+
+        /*
+         * Don't bother counting the list in the catcache memory usage, since
+         * lists are short-lived.
+         */
         cl = (CatCList *)
             palloc(offsetof(CatCList, members) + nmembers * sizeof(CatCTup *));
 
@@ -1824,6 +2077,7 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     if (ntp)
     {
         int            i;
+        int            tupsize;
 
         Assert(!negative);
 
@@ -1842,8 +2096,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
         /* Allocate memory for CatCTup and the cached tuple in one go */
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
 
-        ct = (CatCTup *) palloc(sizeof(CatCTup) +
-                                MAXIMUM_ALIGNOF + dtp->t_len);
+        tupsize = sizeof(CatCTup) +    MAXIMUM_ALIGNOF + dtp->t_len;
+        ct = (CatCTup *) palloc(tupsize);
         ct->tuple.t_len = dtp->t_len;
         ct->tuple.t_self = dtp->t_self;
         ct->tuple.t_tableOid = dtp->t_tableOid;
@@ -1877,7 +2131,6 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
         Assert(negative);
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
         ct = (CatCTup *) palloc(sizeof(CatCTup));
-
         /*
          * Store keys - they'll point into separately allocated memory if not
          * by-value.
@@ -1898,18 +2151,38 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
+    dlist_push_tail(&cache->cc_lru_list, &ct->lru_node);
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
     cache->cc_ntup++;
     CacheHdr->ch_ntup++;
 
+    ct->size = 
+        CacheMemoryContext->methods->get_chunk_space(CacheMemoryContext,
+                                                         ct);
+    cache->cc_memusage += ct->size;
+
+    /* increase refcount so that this survives pruning */
+    ct->refcount++;
+
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, try cleanup by removing
+     * infrequently used entries to make a room for the new entry. If it
+     * failed, enlarge the bucket array instead.  Quite arbitrarily, we try
+     * this when fill factor > 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
+    /* we may still want to prune by entry number, check it */
+    else if (catalog_cache_entry_limit > 0 &&
+             cache->cc_ntup > catalog_cache_entry_limit)
+        CatCacheCleanupOldEntries(cache);
+
+    ct->refcount--;
 
     return ct;
 }
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index fd51934aaf..0e8b972a29 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t CatcacheClockTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index a5ee209f91..9eb50e9676 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void CatcacheClockTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(CATCACHE_CLOCK_TIMEOUT,
+                        CatcacheClockTimeoutHandler);
     }
 
     /*
@@ -1238,6 +1241,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+CatcacheClockTimeoutHandler(void)
+{
+    CatcacheClockTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 41d477165c..c62d5ad8b8 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -81,6 +81,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/memutils.h"
@@ -2205,6 +2206,38 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum unused duration of cache entries before removal."),
+            gettext_noop("Catalog cache entries that live unused for longer than this seconds are considered to be
removed."),
+            GUC_UNIT_S
+        },
+        &catalog_cache_prune_min_age,
+        300, -1, INT_MAX,
+        NULL, assign_catalog_cache_prune_min_age, NULL
+    },
+
+    {
+        {"catalog_cache_memory_target", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum syscache size to keep."),
+            gettext_noop("Time-based cache pruning starts working after exceeding this size."),
+            GUC_UNIT_KB
+        },
+        &catalog_cache_memory_target,
+        0, 0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"catalog_cache_entry_limit", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the maximum entries of catcache."),
+             NULL
+        },
+        &catalog_cache_entry_limit,
+        0, 0, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
@@ -3368,6 +3401,16 @@ static struct config_real ConfigureNamesReal[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"catalog_cache_prune_ratio", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Reduce ratio of pruning caused by catalog_cache_entry_limit."),
+             NULL
+        },
+        &catalog_cache_prune_ratio,
+        0.8, 0.0, 1.0,
+        NULL, NULL, NULL
+    },
+
     /* End-of-list marker */
     {
         {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ad6c436f93..aeb5968e75 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -128,6 +128,8 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#catalog_cache_memory_target = 0kB    # in kB
+#catalog_cache_prune_min_age = 300s    # -1 disables pruning
 #max_stack_depth = 2MB            # min 100kB
 #shared_memory_type = mmap        # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c9e35003a5..33b800e80f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t CatcacheClockTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 65d816a583..0425fc0786 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,10 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
+    dlist_head    cc_lru_list;
+    int            cc_memusage;    /* memory usage of this catcache (excluding
+                                 * header part) */
+    int            cc_nfreeent;    /* # of entries currently not referenced */
 
     /*
      * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,7 +124,10 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
-
+    int            naccess;        /* # of accesses to this entry, capped at 2 */
+    TimestampTz    lastaccess;        /* approx. timestamp of the last usage */
+    dlist_node    lru_node;        /* LRU node */
+    int            size;            /* palloc'ed size of this tuple */
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
      * catcache is list-searched with varying numbers of keys, we may have to
@@ -189,6 +197,45 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int catalog_cache_prune_min_age;
+extern int catalog_cache_memory_target;
+extern int catalog_cache_entry_limit;
+extern double catalog_cache_prune_ratio;
+
+/* to use as access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/*
+ * Flag to keep track of whether catcache timestamp timer is active.
+ */
+extern bool catcache_clock_timeout_active;
+
+/* catcache prune time helper functions  */
+extern void SetupCatCacheClockTimer(void);
+extern void UpdateCatCacheClock(void);
+
+/*
+ * SetCatCacheClock - set the timestamp used to record catcache accesses and
+ * start the maintenance timer if needed. We keep updating the clock even
+ * while pruning is disabled so that we never work with a bogus clock value.
+ */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+
+    if (!catcache_clock_timeout_active && catalog_cache_prune_min_age > 0)
+        SetupCatCacheClockTimer();
+}
+
+static inline TimestampTz
+GetCatCacheClock(void)
+{
+    return catcacheclock;
+}
+
+extern void assign_catalog_cache_prune_min_age(int newval, void *extra);
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 9244a2a7b7..b2d97b4f7b 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    CATCACHE_CLOCK_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.16.3
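
As a quick reference for reviewers, the knobs added by the patch above can be
exercised per session like this (all four GUCs are PGC_USERSET in the guc.c
hunk; the values below are purely illustrative, not recommendations):

SET catalog_cache_prune_min_age = '300s';  -- entries unused this long become prunable
SET catalog_cache_memory_target = '10MB';  -- time-based pruning only kicks in above this size
SET catalog_cache_entry_limit = 10000;     -- cap on the number of entries per catcache
SET catalog_cache_prune_ratio = 0.8;       -- prune down to roughly 80% of the limit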

From ea9d43f623d093bc1276fd1d5480e5cff6097d60 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 12 Feb 2019 20:31:16 +0900
Subject: [PATCH 3/3] Syscache usage tracking feature

Collects syscache usage statistics and shows them in the view
pg_stat_syscache. The feature is controlled by the GUC variable
track_catalog_cache_usage_interval.
---
 doc/src/sgml/config.sgml                      |  16 ++
 src/backend/catalog/system_views.sql          |  17 +++
 src/backend/postmaster/pgstat.c               | 201 ++++++++++++++++++++++++--
 src/backend/tcop/postgres.c                   |  23 +++
 src/backend/utils/adt/pgstatfuncs.c           | 134 +++++++++++++++++
 src/backend/utils/cache/catcache.c            |  93 +++++++++---
 src/backend/utils/cache/syscache.c            |  24 +++
 src/backend/utils/init/globals.c              |   1 +
 src/backend/utils/init/postinit.c             |  11 ++
 src/backend/utils/misc/guc.c                  |  10 ++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/catalog/pg_proc.dat               |   9 ++
 src/include/miscadmin.h                       |   1 +
 src/include/pgstat.h                          |   6 +-
 src/include/utils/catcache.h                  |   9 +-
 src/include/utils/syscache.h                  |  19 +++
 src/include/utils/timeout.h                   |   1 +
 src/test/regress/expected/rules.out           |  24 ++-
 18 files changed, 564 insertions(+), 36 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 71d784b6fe..2eceec1d94 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6703,6 +6703,22 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-catalog-cache-usage-interval" xreflabel="track_catalog_cache_usage_interval">
+      <term><varname>track_catalog_cache_usage_interval</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>track_catalog_cache_usage_interval</varname>
+       configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the interval, in milliseconds, at which catalog cache usage
+        statistics are collected for the session. The default is 0, which
+        disables collection.  Only superusers can change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-io-timing" xreflabel="track_io_timing">
       <term><varname>track_io_timing</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 3e229c693c..f5d1aaf96f 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -906,6 +906,22 @@ CREATE VIEW pg_stat_progress_vacuum AS
     FROM pg_stat_get_progress_info('VACUUM') AS S
         LEFT JOIN pg_database D ON S.datid = D.oid;
 
+CREATE VIEW pg_stat_syscache AS
+    SELECT
+        S.pid                AS pid,
+        S.relid::regclass    AS relname,
+        S.indid::regclass    AS cache_name,
+        S.size                AS size,
+        S.ntup                AS ntuples,
+        S.searches            AS searches,
+        S.hits                AS hits,
+        S.neg_hits            AS neg_hits,
+        S.ageclass            AS ageclass,
+        S.last_update        AS last_update
+    FROM pg_stat_activity A
+    JOIN LATERAL (SELECT A.pid, * FROM pg_get_syscache_stats(A.pid)) S
+        ON (A.pid = S.pid);
+
 CREATE VIEW pg_user_mappings AS
     SELECT
         U.oid       AS umid,
@@ -1185,6 +1201,7 @@ GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_get_syscache_stats(int) TO pg_monitor;
 
 GRANT pg_read_all_settings TO pg_monitor;
 GRANT pg_read_all_stats TO pg_monitor;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 81c6499251..8c4ab0aef9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -66,6 +66,7 @@
 #include "utils/ps_status.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 
@@ -124,6 +125,7 @@
 bool        pgstat_track_activities = false;
 bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
+int            pgstat_track_syscache_usage_interval = 0;
 int            pgstat_track_activity_query_size = 1024;
 
 /* ----------
@@ -236,6 +238,11 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
+/* bitmap symbols to specify which types of stats files to remove */
+#define PGSTAT_REMFILE_DBSTAT    1        /* remove only database stats files */
+#define PGSTAT_REMFILE_SYSCACHE    2        /* remove only syscache stats files */
+#define PGSTAT_REMFILE_ALL        3        /* remove both type of files */
+
 /*
  * Info about current "snapshot" of stats file
  */
@@ -335,6 +342,7 @@ static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
 static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
 static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+static void pgstat_remove_syscache_statsfile(void);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
@@ -630,10 +638,13 @@ startup_failed:
 }
 
 /*
- * subroutine for pgstat_reset_all
+ * remove stats files
+ *
+ * Clean up stats files in the specified directory. target is one of
+ * PGSTAT_REMFILE_DBSTAT/SYSCACHE/ALL and restricts which files are removed.
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+pgstat_reset_remove_files(const char *directory, int target)
 {
     DIR           *dir;
     struct dirent *entry;
@@ -644,25 +655,39 @@ pgstat_reset_remove_files(const char *directory)
     {
         int            nchars;
         Oid            tmp_oid;
+        int            filetype = 0;
 
         /*
          * Skip directory entries that don't match the file names we write.
          * See get_dbstat_filename for the database-specific pattern.
          */
         if (strncmp(entry->d_name, "global.", 7) == 0)
+        {
+            filetype = PGSTAT_REMFILE_DBSTAT;
             nchars = 7;
+        }
         else
         {
+            char head[2];
+            
             nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
+            (void) sscanf(entry->d_name, "%c%c_%u.%n",
+                          head, head + 1, &tmp_oid, &nchars);
+
             /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
+            if (nchars < 3 || !isdigit(entry->d_name[3]))
                 continue;
+
+            if  (strncmp(head, "db", 2) == 0)
+                filetype = PGSTAT_REMFILE_DBSTAT;
+            else if (strncmp(head, "cc", 2) == 0)
+                filetype = PGSTAT_REMFILE_SYSCACHE;
         }
 
+        /* skip if this is not a target */
+        if ((filetype & target) == 0)
+            continue;
+
         if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
             strcmp(entry->d_name + nchars, "stat") != 0)
             continue;
@@ -683,8 +708,9 @@ pgstat_reset_remove_files(const char *directory)
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_ALL);
+    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY,
+                              PGSTAT_REMFILE_ALL);
 }
 
 #ifdef EXEC_BACKEND
@@ -2963,6 +2989,10 @@ pgstat_beshutdown_hook(int code, Datum arg)
     if (OidIsValid(MyDatabaseId))
         pgstat_report_stat(true);
 
+    /* remove this backend's syscache statistics file, if any */
+    if (MyBackendId != InvalidBackendId)
+        pgstat_remove_syscache_statsfile();
+
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
      * before and after.  We use a volatile pointer here to ensure the
@@ -4287,6 +4317,9 @@ PgstatCollectorMain(int argc, char *argv[])
     pgStatRunningInCollector = true;
     pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
 
+    /* Remove left-over syscache stats files */
+    pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_SYSCACHE);
+
     /*
      * Loop to process messages until we get SIGQUIT or detect ungraceful
      * death of our parent postmaster.
@@ -6377,3 +6410,153 @@ pgstat_clip_activity(const char *raw_activity)
 
     return activity;
 }
+
+/*
+ * return the filename for a syscache stat file; filename is the output
+ * buffer, of length len.
+ */
+void
+pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid,
+                                 char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/cc_%u.%s",
+                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
+                       pgstat_stat_directory,
+                       backendid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
+}
+
+/* removes syscache stats files of this backend */
+static void
+pgstat_remove_syscache_statsfile(void)
+{
+    char    fname[MAXPGPATH];
+
+    pgstat_get_syscachestat_filename(false, false, MyBackendId,
+                                     fname, MAXPGPATH);
+    unlink(fname);        /* don't care about the result */
+}
+
+/*
+ * pgstat_write_syscache_stats() -
+ *        Write the syscache statistics files.
+ *
+ * If 'force' is false, the file is written only when the configured interval
+ * has elapsed; otherwise the remaining time in milliseconds is returned. If
+ * 'force' is true, the file is written regardless of the remaining time and
+ * the interval is reset.
+ */
+long
+pgstat_write_syscache_stats(bool force)
+{
+    static TimestampTz last_report = 0;
+    TimestampTz now;
+    long elapsed;
+    long secs;
+    int     usecs;
+    int    cacheId;
+    FILE    *fpout;
+    char    statfile[MAXPGPATH];
+    char    tmpfile[MAXPGPATH];
+
+    /* If tracking is disabled and not forced, clean up any leftover file and return */
+    if (!force && pgstat_track_syscache_usage_interval <= 0)
+    {
+        /* disabled. remove the statistics file if any */
+        if (last_report > 0)
+        {
+            last_report = 0;
+            pgstat_remove_syscache_statsfile();
+        }
+        return 0;
+    }
+    
+    /* Check against the interval */
+    now = GetCurrentTransactionStopTimestamp();
+    TimestampDifference(last_report, now, &secs, &usecs);
+    elapsed = secs * 1000 + usecs / 1000;
+
+    if (!force && elapsed < pgstat_track_syscache_usage_interval)
+    {
+        /* not time yet; report the remaining time to the caller */
+        return pgstat_track_syscache_usage_interval - elapsed;
+    }
+
+    /* now update the stats */
+    last_report = now;
+
+    pgstat_get_syscachestat_filename(false, true,
+                                     MyBackendId, tmpfile, MAXPGPATH);
+    pgstat_get_syscachestat_filename(false, false,
+                                     MyBackendId, statfile, MAXPGPATH);
+
+    /*
+     * This function can be called from ProcessInterrupts(). Hold off
+     * interrupts to avoid recursive entry.
+     */
+    HOLD_INTERRUPTS();
+
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        /*
+         * Failure writing this file is not critical. Just skip this time and
+         * tell caller to wait for the next interval.
+         */
+        RESUME_INTERRUPTS();
+        return pgstat_track_syscache_usage_interval;
+    }
+
+    /* write out the stats of every catcache */
+    for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++)
+    {
+        SysCacheStats *stats;
+        
+        stats = SysCacheGetStats(cacheId);
+        Assert (stats);
+
+        /* write error is checked later using ferror() */
+        fputc('T', fpout);
+        (void)fwrite(&cacheId, sizeof(int), 1, fpout);
+        (void)fwrite(&last_report, sizeof(TimestampTz), 1, fpout);
+        (void)fwrite(stats, sizeof(*stats), 1, fpout);
+    }
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write syscache statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close syscache statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename syscache statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+
+    RESUME_INTERRUPTS();
+    return 0;
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index f192ee2ca6..d0afee189f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3159,6 +3159,12 @@ ProcessInterrupts(void)
 
     }
 
+    if (IdleSyscacheStatsUpdateTimeoutPending)
+    {
+        IdleSyscacheStatsUpdateTimeoutPending = false;
+        pgstat_write_syscache_stats(true);
+    }
+
     if (ParallelMessagePending)
         HandleParallelMessages();
 
@@ -3743,6 +3749,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_syscache_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4186,9 +4193,19 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long timeout;
+
                 ProcessCompletedNotifies();
                 pgstat_report_stat(false);
 
+                timeout = pgstat_write_syscache_stats(false);
+
+                if (timeout > 0)
+                {
+                    disable_idle_syscache_update_timeout = true;
+                    enable_timeout_after(IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT,
+                                         timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4231,6 +4248,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_syscache_update_timeout)
+        {
+            disable_timeout(IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_syscache_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index b6ba856ebe..a314f431c6 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -14,6 +14,8 @@
  */
 #include "postgres.h"
 
+#include <sys/stat.h>
+
 #include "access/htup_details.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
@@ -28,6 +30,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/inet.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
@@ -1899,3 +1902,134 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     PG_RETURN_DATUM(HeapTupleGetDatum(
                                       heap_form_tuple(tupdesc, values, nulls)));
 }
+
+Datum
+pgstat_get_syscache_stats(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SYSCACHE_SIZE 9
+    int                    pid     = PG_GETARG_INT32(0);
+    ReturnSetInfo       *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+    TupleDesc            tupdesc;
+    Tuplestorestate    *tupstore;
+    MemoryContext        per_query_ctx;
+    MemoryContext        oldcontext;
+    PgBackendStatus       *beentry;
+    int                    beid;
+    char                fname[MAXPGPATH];
+    FILE                  *fpin;
+    char c;
+
+    if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("set-valued function called in context that cannot accept a set")));
+    if (!(rsinfo->allowedModes & SFRM_Materialize))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("materialize mode required, but it is not " \
+                        "allowed in this context")));
+
+    /* Build a tuple descriptor for our result type */
+    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+        elog(ERROR, "return type must be a row type");
+    
+
+    per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+
+    oldcontext = MemoryContextSwitchTo(per_query_ctx);
+    tupstore = tuplestore_begin_heap(true, false, work_mem);
+    rsinfo->returnMode = SFRM_Materialize;
+    rsinfo->setResult = tupstore;
+    rsinfo->setDesc = tupdesc;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    /* find the beentry for the given pid */
+    beentry = NULL;
+    for (beid = 1;
+         (beentry = pgstat_fetch_stat_beentry(beid)) &&
+             beentry->st_procpid != pid ;
+         beid++);
+
+    /*
+     * We silently return an empty result on failure or insufficient privileges.
+     */
+    if (!beentry ||
+        (!has_privs_of_role(GetUserId(), beentry->st_userid) &&
+         !is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS)))
+        goto no_data;
+
+    pgstat_get_syscachestat_filename(false, false, beid, fname, MAXPGPATH);
+
+    if ((fpin = AllocateFile(fname, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(WARNING,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            fname)));
+        /* also return empty on no statistics file */
+        goto no_data;
+    }
+
+    /* read the statistics file into tuplestore */
+    while ((c = fgetc(fpin)) == 'T')
+    {
+        TimestampTz last_update;
+        SysCacheStats stats;
+        int cacheid;
+        Datum values[PG_GET_SYSCACHE_SIZE];
+        bool nulls[PG_GET_SYSCACHE_SIZE] = {0};
+        Datum datums[SYSCACHE_STATS_NAGECLASSES * 2];
+        bool arrnulls[SYSCACHE_STATS_NAGECLASSES * 2] = {0};
+        int    dims[] = {SYSCACHE_STATS_NAGECLASSES, 2};
+        int lbs[] = {1, 1};
+        ArrayType *arr;
+        int i, j;
+
+        if (fread(&cacheid, sizeof(int), 1, fpin) != 1 ||
+            fread(&last_update, sizeof(TimestampTz), 1, fpin) != 1 ||
+            fread(&stats, 1, sizeof(stats), fpin) != sizeof(stats))
+        {
+            ereport(WARNING, 
+                    (errmsg("corrupted syscache statistics file \"%s\"",
+                            fname)));
+            goto no_data;
+        }
+
+        i = 0;
+        values[i++] = ObjectIdGetDatum(stats.reloid);
+        values[i++] = ObjectIdGetDatum(stats.indoid);
+        values[i++] = Int64GetDatum(stats.size);
+        values[i++] = Int64GetDatum(stats.ntuples);
+        values[i++] = Int64GetDatum(stats.nsearches);
+        values[i++] = Int64GetDatum(stats.nhits);
+        values[i++] = Int64GetDatum(stats.nneg_hits);
+
+        for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++)
+        {
+            datums[j * 2] = Int32GetDatum((int32) stats.ageclasses[j]);
+            datums[j * 2 + 1] = Int32GetDatum((int32) stats.nclass_entries[j]);
+        }            
+
+        arr = construct_md_array(datums, arrnulls, 2, dims, lbs,
+                              INT4OID, sizeof(int32), true, 'i');
+        values[i++] = PointerGetDatum(arr);
+
+        values[i++] = TimestampTzGetDatum(last_update);
+
+        Assert (i == PG_GET_SYSCACHE_SIZE);
+
+        tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+    }
+
+    /* check for the end of file. abandon the result if file is broken */
+    if (c != 'E' || fgetc(fpin) != EOF)
+        tuplestore_clear(tupstore);
+
+    FreeFile(fpin);
+
+no_data:
+    tuplestore_donestoring(tupstore);
+    return (Datum) 0;
+}
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 0195e19976..fd84e35a6a 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -109,6 +109,10 @@ static CatCacheHeader *CacheHdr = NULL;
 /* Clock used to record the last accessed time of a catcache record. */
 TimestampTz    catcacheclock = 0;
 
+/* age class boundaries used to classify entries for usage statistics */
+static double ageclass[SYSCACHE_STATS_NAGECLASSES]
+    = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -640,9 +644,7 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue)
             else
                 CatCacheRemoveCTup(cache, ct);
             CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated");
-#ifdef CATCACHE_STATS
             cache->cc_invals++;
-#endif
             /* could be multiple matches, so keep looking! */
         }
     }
@@ -718,9 +720,7 @@ ResetCatalogCache(CatCache *cache)
             }
             else
                 CatCacheRemoveCTup(cache, ct);
-#ifdef CATCACHE_STATS
             cache->cc_invals++;
-#endif
         }
     }
 }
@@ -1032,10 +1032,10 @@ CatCacheCleanupOldEntries(CatCache *cp)
             int        us;
 
             /*
-             * Calculate the duration from the time of the last access to the
-             * "current" time. Since catcacheclock is not advanced within a
-             * transaction, the entries that are accessed within the current
-             * transaction won't be pruned.
+             * Calculate the duration from the time of the last access to
+             * the "current" time. Since catcacheclock is not advanced within
+             * a transaction, the entries that are accessed within the current
+             * transaction always get 0 as the result.
              */
             TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
 
@@ -1463,9 +1463,7 @@ SearchCatCacheInternal(CatCache *cache,
     if (unlikely(cache->cc_tupdesc == NULL))
         CatalogCacheInitializeCache(cache);
 
-#ifdef CATCACHE_STATS
     cache->cc_searches++;
-#endif
 
     /* Initialize local parameter array */
     arguments[0] = v1;
@@ -1535,9 +1533,7 @@ SearchCatCacheInternal(CatCache *cache,
             CACHE3_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d",
                         cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
             cache->cc_hits++;
-#endif
 
             return &ct->tuple;
         }
@@ -1546,9 +1542,7 @@ SearchCatCacheInternal(CatCache *cache,
             CACHE3_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d",
                         cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
             cache->cc_neg_hits++;
-#endif
 
             return NULL;
         }
@@ -1676,9 +1670,7 @@ SearchCatCacheMiss(CatCache *cache,
     CACHE3_elog(DEBUG2, "SearchCatCache(%s): put in bucket %d",
                 cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
     cache->cc_newloads++;
-#endif
 
     return &ct->tuple;
 }
@@ -1789,9 +1781,7 @@ SearchCatCacheList(CatCache *cache,
 
     Assert(nkeys > 0 && nkeys < cache->cc_nkeys);
 
-#ifdef CATCACHE_STATS
     cache->cc_lsearches++;
-#endif
 
     /* Initialize local parameter array */
     arguments[0] = v1;
@@ -1848,9 +1838,7 @@ SearchCatCacheList(CatCache *cache,
         CACHE2_elog(DEBUG2, "SearchCatCacheList(%s): found list",
                     cache->cc_relname);
 
-#ifdef CATCACHE_STATS
         cache->cc_lhits++;
-#endif
 
         return cl;
     }
@@ -2373,3 +2361,68 @@ PrintCatCacheListLeakWarning(CatCList *list)
          list->my_cache->cc_relname, list->my_cache->id,
          list, list->refcount);
 }
+
+/*
+ * CatCacheGetStats - fill in SysCacheStats struct.
+ *
+ * This is a support routine for SysCacheGetStats and fills in most of the
+ * result. The age classification here uses the same criteria as
+ * CatCacheCleanupOldEntries().
+ */
+void
+CatCacheGetStats(CatCache *cache, SysCacheStats *stats)
+{
+    int    i, j;
+
+    Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0);
+
+    /* fill in the stats struct */
+    stats->size = cache->cc_memusage;
+    stats->ntuples = cache->cc_ntup;
+    stats->nsearches = cache->cc_searches;
+    stats->nhits = cache->cc_hits;
+    stats->nneg_hits = cache->cc_neg_hits;
+
+    /*
+     * catalog_cache_prune_min_age can be changed within a session, so fill
+     * this in every time.
+     */
+    for (i = 0 ; i < SYSCACHE_STATS_NAGECLASSES ; i++)
+        stats->ageclasses[i] =
+            (int) (catalog_cache_prune_min_age * ageclass[i]);
+
+    /*
+     * The nth element of nclass_entries stores the number of cache entries
+     * that have stayed unaccessed for the corresponding multiple (given by
+     * ageclass) of catalog_cache_prune_min_age.
+     */
+    memset(stats->nclass_entries, 0, sizeof(int) * SYSCACHE_STATS_NAGECLASSES);
+
+    /* Scan the whole hash */
+    for (i = 0; i < cache->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cache->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+            /*
+             * Calculate the duration from the time of the last access to
+             * the "current" time. Since catcacheclock is not advanced within
+             * a transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+            j = 0;
+            while (j < SYSCACHE_STATS_NAGECLASSES - 1 &&
+                   entry_age > stats->ageclasses[j])
+                j++;
+
+            stats->nclass_entries[j]++;
+        }
+    }
+}
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index ac98c19155..7b38a06708 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -20,6 +20,9 @@
  */
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "catalog/indexing.h"
@@ -1534,6 +1537,27 @@ RelationSupportsSysCache(Oid relid)
     return false;
 }
 
+/*
+ * SysCacheGetStats - return stats of the specified syscache
+ *
+ * This routine returns the address of its local static memory.
+ */
+SysCacheStats *
+SysCacheGetStats(int cacheId)
+{
+    static SysCacheStats stats;
+
+    Assert(cacheId >=0 && cacheId < SysCacheSize);
+
+    memset(&stats, 0, sizeof(stats));
+
+    stats.reloid = cacheinfo[cacheId].reloid;
+    stats.indoid = cacheinfo[cacheId].indoid;
+
+    CatCacheGetStats(SysCache[cacheId], &stats);
+
+    return &stats;
+}
 
 /*
  * OID comparator for pg_qsort
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 0e8b972a29..b7c647b5e0 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t CatcacheClockTimeoutPending = false;
+volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 9eb50e9676..2f3251e8d5 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
 static void CatcacheClockTimeoutHandler(void);
+static void IdleSyscacheStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
                         IdleInTransactionSessionTimeoutHandler);
         RegisterTimeout(CATCACHE_CLOCK_TIMEOUT,
                         CatcacheClockTimeoutHandler);
+        RegisterTimeout(IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT,
+                        IdleSyscacheStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1249,6 +1252,14 @@ CatcacheClockTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleSyscacheStatsUpdateTimeoutHandler(void)
+{
+    IdleSyscacheStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c62d5ad8b8..7f1670fa5b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3178,6 +3178,16 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"track_catalog_cache_usage_interval", PGC_SUSET, STATS_COLLECTOR,
+            gettext_noop("Sets the interval between syscache usage collection, in milliseconds. Zero disables syscache
usagetracking."),
 
+            NULL
+        },
+        &pgstat_track_syscache_usage_interval,
+        0, 0, INT_MAX / 2,
+        NULL, NULL, NULL
+    },
+
     {
         {"gin_pending_list_limit", PGC_USERSET, CLIENT_CONN_STATEMENT,
             gettext_noop("Sets the maximum size of the pending list for GIN index."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index aeb5968e75..797f52fa2a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -556,6 +556,7 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
+#track_catalog_cache_usage_interval = 0    # zero disables tracking
 #stats_temp_directory = 'pg_stat_tmp'
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 24f99f7fc4..fc35b6be47 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -9689,6 +9689,15 @@
   proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
   proargnames =>
'{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}',
   prosrc => 'pg_get_replication_slots' },
+{ oid => '3425',
+  descr => 'syscache statistics',
+  proname => 'pg_get_syscache_stats', prorows => '100', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', prorettype => 'record',
+  proargtypes => 'int4',
+  proallargtypes => '{int4,oid,oid,int8,int8,int8,int8,int8,_int4,timestamptz}',
+  proargmodes => '{i,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}',
+  prosrc => 'pgstat_get_syscache_stats' },
 { oid => '3786', descr => 'set up a logical replication slot',
   proname => 'pg_create_logical_replication_slot', provolatile => 'v',
   proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 33b800e80f..767c94a63c 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,7 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t CatcacheClockTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 88a75fb798..b6bfd7d644 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1144,6 +1144,7 @@ extern bool pgstat_track_activities;
 extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
+extern int    pgstat_track_syscache_usage_interval;
 extern char *pgstat_stat_directory;
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
@@ -1228,7 +1229,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
-
+extern void pgstat_get_syscachestat_filename(bool permanent,
+                    bool tempname, int backendid, char *filename, int len);
 /* ----------
  * pgstat_report_wait_start() -
  *
@@ -1363,5 +1365,5 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
-
+extern long pgstat_write_syscache_stats(bool force);
 #endif                            /* PGSTAT_H */
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 0425fc0786..8e477090e2 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -68,10 +68,8 @@ typedef struct catcache
     int            cc_nfreeent;    /* # of entries currently not referenced */
 
     /*
-     * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
-     * doesn't break ABI for other modules
+     * Statistics entries
      */
-#ifdef CATCACHE_STATS
     long        cc_searches;    /* total # searches against this cache */
     long        cc_hits;        /* # of matches against existing entry */
     long        cc_neg_hits;    /* # of matches against negative entry */
@@ -84,7 +82,6 @@ typedef struct catcache
     long        cc_invals;        /* # of entries invalidated from cache */
     long        cc_lsearches;    /* total # list-searches */
     long        cc_lhits;        /* # of matches against existing lists */
-#endif
 } CatCache;
 
 
@@ -275,4 +272,8 @@ extern void PrepareToInvalidateCacheTuple(Relation relation,
 extern void PrintCatCacheLeakWarning(HeapTuple tuple);
 extern void PrintCatCacheListLeakWarning(CatCList *list);
 
+/* defined in syscache.h */
+typedef struct syscachestats SysCacheStats;
+extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats);
+
 #endif                            /* CATCACHE_H */
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index 95ee48954e..71b399c902 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -112,6 +112,24 @@ enum SysCacheIdentifier
 #define SysCacheSize (USERMAPPINGUSERSERVER + 1)
 };
 
+#define SYSCACHE_STATS_NAGECLASSES 6
+/* Struct for catcache tracking information */
+typedef struct syscachestats
+{
+    Oid        reloid;            /* target relation */
+    Oid        indoid;            /* index */
+    size_t    size;            /* size of the catcache */
+    int        ntuples;        /* number of tuples resides in the catcache */
+    int        nsearches;        /* number of searches */
+    int        nhits;            /* number of cache hits */
+    int        nneg_hits;        /* number of negative cache hits */
+    /* age classes in seconds */
+    int        ageclasses[SYSCACHE_STATS_NAGECLASSES];
+    /* number of tuples fall into the corresponding age class */
+    int        nclass_entries[SYSCACHE_STATS_NAGECLASSES];
+} SysCacheStats;
+
+
 extern void InitCatalogCache(void);
 extern void InitCatalogCachePhase2(void);
 
@@ -164,6 +182,7 @@ extern void SysCacheInvalidate(int cacheId, uint32 hashValue);
 extern bool RelationInvalidatesSnapshotsOnly(Oid relid);
 extern bool RelationHasSysCache(Oid relid);
 extern bool RelationSupportsSysCache(Oid relid);
+extern SysCacheStats *SysCacheGetStats(int cacheId);
 
 /*
  * The use of the macros below rather than direct calls to the corresponding
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index b2d97b4f7b..0677978923 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -32,6 +32,7 @@ typedef enum TimeoutId
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
     CATCACHE_CLOCK_TIMEOUT,
+    IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2c8e21baa7..7bd77e9972 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1921,6 +1921,28 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
     pg_stat_all_tables.autoanalyze_count
    FROM pg_stat_all_tables
   WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR
(pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
 
+pg_stat_syscache| SELECT s.pid,
+    (s.relid)::regclass AS relname,
+    (s.indid)::regclass AS cache_name,
+    s.size,
+    s.ntup AS ntuples,
+    s.searches,
+    s.hits,
+    s.neg_hits,
+    s.ageclass,
+    s.last_update
+   FROM (pg_stat_activity a
+     JOIN LATERAL ( SELECT a.pid,
+            pg_get_syscache_stats.relid,
+            pg_get_syscache_stats.indid,
+            pg_get_syscache_stats.size,
+            pg_get_syscache_stats.ntup,
+            pg_get_syscache_stats.searches,
+            pg_get_syscache_stats.hits,
+            pg_get_syscache_stats.neg_hits,
+            pg_get_syscache_stats.ageclass,
+            pg_get_syscache_stats.last_update
+           FROM pg_get_syscache_stats(a.pid) pg_get_syscache_stats(relid, indid, size, ntup, searches, hits, neg_hits, ageclass, last_update)) s ON ((a.pid = s.pid)));
 
 pg_stat_user_functions| SELECT p.oid AS funcid,
     n.nspname AS schemaname,
     p.proname AS funcname,
@@ -2352,7 +2374,7 @@ pg_settings|pg_settings_n|CREATE RULE pg_settings_n AS
     ON UPDATE TO pg_catalog.pg_settings DO INSTEAD NOTHING;
 pg_settings|pg_settings_u|CREATE RULE pg_settings_u AS
     ON UPDATE TO pg_catalog.pg_settings
-   WHERE (new.name = old.name) DO  SELECT set_config(old.name, new.setting, false) AS set_config;
+   WHERE (new.name = old.name) DO  SELECT set_config(old.name, new.setting, false, false) AS set_config;
 rtest_emp|rtest_emp_del|CREATE RULE rtest_emp_del AS
     ON DELETE TO public.rtest_emp DO  INSERT INTO rtest_emplog (ename, who, action, newsal, oldsal)
   VALUES (old.ename, CURRENT_USER, 'fired'::bpchar, '$0.00'::money, old.salary);
-- 
2.16.3


Re: Protect syscache from bloating with negative cache entries

From
Tomas Vondra
Date:
On 2/12/19 12:35 PM, Kyotaro HORIGUCHI wrote:
> Thank you for testing and the commits, Tomas.
> 
> At Sat, 9 Feb 2019 19:09:59 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in <74386116-0bc5-84f2-e614-0cff19aca2de@2ndquadrant.com>
>> On 2/7/19 1:18 PM, Kyotaro HORIGUCHI wrote:
>>> At Thu, 07 Feb 2019 15:24:18 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190207.152418.139132570.horiguchi.kyotaro@lab.ntt.co.jp>
 
>> I've done a bunch of benchmarks on v13, and I don't see any serious
>> regression either. Each test creates a number of tables (100, 1k, 10k,
>> 100k and 1M) and then runs SELECT queries on them. The tables are
>> accessed randomly - with either uniform or exponential distribution. For
>> each combination there are 5 runs, 60 seconds each (see the attached
>> shell scripts, it should be pretty obvious).
>>
>> I've done the tests on two different machines - small one (i5 with 8GB
>> of RAM) and large one (e5-2620v4 with 64GB RAM), but the behavior is
>> almost exactly the same (with the exception of 1M tables, which does not
>> fit into RAM on the smaller one).
>>
>> On the xeon, the results (throughput compared to master) look like this:
>>
>>
>>     uniform           100     1000    10000   100000   1000000
>>    ------------------------------------------------------------
>>     v13           105.04%  100.28%  102.96%  102.11%   101.54%
>>     v13 (nodata)   97.05%   98.30%   97.42%   96.60%   107.55%
>>
>>
>>     exponential       100     1000    10000   100000   1000000
>>    ------------------------------------------------------------
>>     v13           100.04%  103.48%  101.70%   98.56%   103.20%
>>     v13 (nodata)   97.12%   98.43%   98.86%   98.48%   104.94%
>>
>> The "nodata" case means the tables were empty (so no files created),
>> while in the other case each table contained 1 row.
>>
>> Per the results it's mostly break even, and in some cases there is
>> actually a measurable improvement.
> 
> Great! I guess it comes from the reduced hash size?
> 

Not sure about that. I haven't actually verified that it reduces the
cache size at all - I was measuring the overhead of the extra work. And
I don't think the syscache actually shrunk significantly, because the
throughput was quite high (~15-30k tps, IIRC) so pretty much everything
was touched within the default 600 seconds.

>> That being said, the question is whether the patch actually reduces
>> memory usage in a useful way - that's not something this benchmark
>> validates. I plan to modify the tests to make pgbench script
>> time-dependent (i.e. to pick a subset of tables depending on time).
> 
> Thank you.
> 
>> A couple of things I've happened to notice during a quick review:
>>
>> 1) The sgml docs in 0002 talk about "syscache_memory_target" and
>> "syscache_prune_min_age", but those options were renamed to just
>> "cache_memory_target" and "cache_prune_min_age".
> 
> I'm at a loss as to what to call syscache for users. I think it is "catalog
> cache". The most basic component is called catcache, which is
> covered by the syscache layer; neither of them is exposed to
> users, and it is shown to users as "catalog cache".
> 
> Do "catalog_cache_prune_min_age", "catalog_cache_memory_target", (if
> it exists) "catalog_cache_entry_limit" and
> "catalog_cache_prune_ratio" make sense?
> 

I think "catalog_cache" sounds about right, although my point was simply
that there's a discrepancy between sgml docs and code.

>> 2) "cache_entry_limit" is not mentioned in sgml docs at all, and it's
>> defined three times in guc.c for some reason.
> 
> It is just a PoC, added to show how it looks. (The multiple
> instances must be a result of a convulsion of my fingers..) I
> think this is not useful unless it can be specified on a
> per-relation or per-cache basis. I'll remove the GUC and add
> reloptions for the purpose. (But it won't work for pg_class and
> pg_attribute for now.)
> 

OK, although I'd just keep it as simple as possible. TBH I can't really
imagine users tuning limits for individual caches in any meaningful way.

>> 3) I don't see why to define PRUNE_BY_AGE and PRUNE_BY_NUMBER, instead
>> of just using two bool variables prune_by_age and prune_by_number doing
>> the same thing.
> 
> Agreed. It was a bit of memory-stinginess that is useless there.
> 
>> 4) I'm not entirely sure about using stmtStartTimestamp. Doesn't that
>> pretty much mean long-running statements will set the lastaccess to a very
>> old timestamp? Also, it means that long-running statements (like a PL
>> function accessing a bunch of tables) won't do any eviction at all, no?
>> AFAICS we'll set the timestamp only once, at the very beginning.
>>
>> I wonder whether using some other timestamp source (like a timestamp
>> updated regularly from a timer, or something like that) would be better.
> 
> I didn't consider planning that happens within a function. If
> 5min is the default for catalog_cache_prune_min_age, 10% of it
> (30s) seems enough and gettimeofday() at such intervals wouldn't
> affect foreground jobs. I'd choose catalog_c_p_m_age/10 rather
> than a fixed value of 30s, with 1s as the minimum.
> 

Actually, I see CatCacheCleanupOldEntries contains this comment:

/*
 * Calculate the duration from the time of the last access to the
 * "current" time. Since catcacheclock is not advanced within a
 * transaction, the entries that are accessed within the current
 * transaction won't be pruned.
 */

which I think is pretty much what I've been saying ... But the question
is whether we need to do something about it.

> I observed significant degradation from setting up a timer at every
> statement start. The patch does the following to get rid of
> the degradation.
> 
> (1) Every statement updates the catcache timestamp, as it currently
>     does.  (SetCatCacheClock)
> 
> (2) The timestamp is also updated periodically using a timer,
>    separately from (1). The timer is started at (1) if it is not
>    already running.  (SetCatCacheClock, UpdateCatCacheClock)
> 
> (3) Statement end and transaction end don't stop the timer, to
>    avoid the overhead of repeatedly setting up a timer.
> 
> (4) But the timer is stopped on error. I chose not to change the
>     behavior of PostgresMain that kills all timers on error.
> 
> (5) Also, changing the GUC catalog_cache_prune_min_age kills the
>    timer, in order to reflect the change quickly, especially when
>    it is shortened.
> 
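
For reference, a condensed sketch of the flow described in (1)-(5), based on the functions in the v15-0003 patch posted later in this thread (SetCatCacheClock, UpdateCatCacheClock, SetupCatCacheClockTimer); this is an illustration, not the verbatim patch code:

/*
 * Condensed sketch of the flow above; see the 0003 patch for the real code.
 */

/* (1) start_xact_command() stamps the clock at every statement start and
 * lazily arms the CATCACHE_CLOCK_TIMEOUT timer. */
static inline void
SetCatCacheClock(TimestampTz ts)
{
    catcacheclock = ts;
    if (!catcache_clock_timeout_active && catalog_cache_prune_min_age > 0)
        SetupCatCacheClockTimer();      /* (2) arm the periodic timer */
}

/* (2) when the timeout fires, ProcessInterrupts() calls this to refresh the
 * clock during long statements and to re-arm the timer; (3) nothing else
 * stops it. */
void
UpdateCatCacheClock(void)
{
    catcacheclock = GetCurrentTimestamp();
    SetupCatCacheClockTimer();
}

/*
 * (4) error recovery in PostgresMain() kills all timers and clears
 * catcache_clock_timeout_active; (5) the assign hook of
 * catalog_cache_prune_min_age disables the timer so that the next
 * SetCatCacheClock() re-arms it with the new (possibly shorter) interval.
 */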

Interesting. What was the frequency of the timer / how often was it
executed? Can you share the code somehow?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


RE: Protect syscache from bloating with negative cache entries

From
"Tsunakawa, Takayuki"
Date:
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> I'm at a loss as to what to call syscache for users. I think it is "catalog
> cache". The most basic component is called catcache, which is
> covered by the syscache layer; neither of them is exposed to
> users, and it is shown to users as "catalog cache".
> 
> Do "catalog_cache_prune_min_age", "catalog_cache_memory_target", (if
> it exists) "catalog_cache_entry_limit" and
> "catalog_cache_prune_ratio" make sense?

PostgreSQL documentation uses "system catalog" in its table of contents, so syscat_cache_xxx would be a bit more
familiar? I'm fine with either catalog_ or syscat_, but what name shall we use for the relation cache?  The catcache
and relcache have different element sizes and possibly different usage patterns, so they may as well have different
parameters, just like MySQL does.  If we follow that idea, then the name would be relation_cache_xxx.  However, from
the user's viewpoint, the relation cache is also created from system catalogs like pg_class and pg_attribute...
 


Regards
Takayuki Tsunakawa






RE: Protect syscache from bloating with negative cache entries

From
"Tsunakawa, Takayuki"
Date:
From: Tomas Vondra [mailto:tomas.vondra@2ndquadrant.com]
> > I didn't consider planning that happens within a function. If
> > 5min is the default for catalog_cache_prune_min_age, 10% of it
> > (30s) seems enough and gettimeofday() at such intervals wouldn't
> > affect foreground jobs. I'd choose catalog_c_p_m_age/10 rather
> > than a fixed value of 30s, with 1s as the minimum.
> >
> 
> Actually, I see CatCacheCleanupOldEntries contains this comment:
> 
> /*
>  * Calculate the duration from the time of the last access to the
>  * "current" time. Since catcacheclock is not advanced within a
>  * transaction, the entries that are accessed within the current
>  * transaction won't be pruned.
>  */
> 
> which I think is pretty much what I've been saying ... But the question
> is whether we need to do something about it.

Hmm, I'm surprised by the v14 patch in this regard.  I remember that previous patches renewed the cache clock on every
statement, and that is correct.  If the cache clock is only updated at the beginning of a transaction, the following
TODO item would not be solved:
 

https://wiki.postgresql.org/wiki/Todo

" Reduce memory use when analyzing many tables in a single command by making catcache and syscache flushable or
bounded."

Also, Tom mentioned pg_dump in this thread (protect syscache...).  pg_dump runs in a single transaction, touching all
system catalogs.  That may result in OOM, and this patch could rescue such cases.
 


Regards
Takayuki Tsunakawa





Re: Protect syscache from bloating with negative cache entries

From
Kyotaro HORIGUCHI
Date:
At Tue, 12 Feb 2019 20:36:28 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20190212.203628.118792892.horiguchi.kyotaro@lab.ntt.co.jp>
> > (4)
> > +    hash_size = cp->cc_nbuckets * sizeof(dlist_head);
> > +        tupsize = sizeof(CatCTup) +    MAXIMUM_ALIGNOF + dtp->t_len;
> > +        tupsize = sizeof(CatCTup);
> > 
> > GetMemoryChunkSpace() should be used to include the memory context overhead.  That's what the files in
> > src/backend/utils/sort/ do.
 
> 
> Thanks. Done. It now includes the bucket and cache header parts but still
> excludes clists.  Renamed from tupsize to memusage.

It is too complex, as I was afraid. The indirect calls cause
significant degradation. (Anyway, the previous code was bogus in
that it passed a CACHELINEALIGN'ed pointer to get_chunk_size..)

Instead, I added an accounting(?) interface function.

| MemoryContextGetConsumption(MemoryContext cxt);

The API returns the current consumption in this memory
context. This allows "real" memory accounting almost without
overhead.
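
As a minimal usage sketch (not patch code), a caller can snapshot the counter around an allocation and take the difference as the "(almost) real" allocated size, which is essentially how the 0003 patch below measures per-entry size; CacheMemoryContext and MemoryContextAlloc are the stock backend facilities, MemoryContextGetConsumption is the new macro from 0002:

#include "postgres.h"
#include "nodes/memnodes.h"        /* MemoryContextGetConsumption() (new in 0002) */
#include "utils/memutils.h"        /* CacheMemoryContext */

static Size
measured_chunk_size(Size request)
{
    uint64      before = MemoryContextGetConsumption(CacheMemoryContext);
    void       *p = MemoryContextAlloc(CacheMemoryContext, request);
    uint64      after = MemoryContextGetConsumption(CacheMemoryContext);

    pfree(p);

    /* the difference includes the chunk header and rounding done by aset.c */
    return (Size) (after - before);
}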

(1) New patch v15-0002 adds an accounting feature to MemoryContext.
  (It adds this feature only to AllocSet; if this is acceptable
  it can be extended to other allocators.)

(2) Another new patch, v15-0005, on top of the previous
  limit-by-number-per-cache design, converts it into a
  limit-by-total-size-over-all-caches feature, which I think is
  what Tsunakawa-san wanted.

As far as I can see, no significant degradation is found in the usual
code paths (as long as pruning doesn't happen).

Regarding the new global-size based eviction (2), cache entry
creation becomes slower after the total size reaches the limit,
since every new entry evicts one or more old (=
not-recently-used) entries. Because it doesn't need knobs for each
cache, it becomes far more realistic. So I added documentation for
"catalog_cache_max_size" in 0005.

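For illustration only (v15-0005 has the real code), a rough sketch of the size-based eviction described above, reusing the LRU list and CatCacheRemoveCTup() from 0003; the GUC name catalog_cache_max_size (in kilobytes) and the exact call site are assumed here:

/*
 * Rough sketch, as if inside catcache.c: once total consumption exceeds
 * catalog_cache_max_size (kB), walk the LRU list from the cold end and evict
 * unreferenced entries until we are back under the limit.
 */
static void
EnforceCatCacheSizeLimit(CatCache *cp, uint64 total_consumption)
{
    dlist_mutable_iter iter;

    if (catalog_cache_max_size <= 0 ||
        total_consumption <= (uint64) catalog_cache_max_size * 1024)
        return;

    dlist_foreach_modify(iter, &cp->cc_lru_list)
    {
        CatCTup    *ct = dlist_container(CatCTup, lru_node, iter.cur);

        /* never evict entries that are still referenced */
        if (ct->refcount != 0 || (ct->c_list && ct->c_list->refcount != 0))
            continue;

        total_consumption -= ct->size;
        CatCacheRemoveCTup(cp, ct);     /* unlinks from bucket and LRU */

        if (total_consumption <= (uint64) catalog_cache_max_size * 1024)
            break;
    }
}
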
Regarding the age-based eviction, the bulk eviction seems to take a
bit of a long time, but it happens instead of hash resizing, so the
user doesn't observe an additional slowdown. On the contrary, the
pruning can avoid the rehashing that scans the whole cache. I think
that is the gain seen in Tomas' experiment.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 3b24233b1891b967ccac65a4d21ed0207037578b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 7 Feb 2019 14:56:07 +0900
Subject: [PATCH 1/5] Add dlist_move_tail

We have dlist_push_head/tail and dlist_move_head but not
dlist_move_tail. Add it.
---
 src/include/lib/ilist.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/src/include/lib/ilist.h b/src/include/lib/ilist.h
index b1a5974ee4..659ab1ac87 100644
--- a/src/include/lib/ilist.h
+++ b/src/include/lib/ilist.h
@@ -394,6 +394,25 @@ dlist_move_head(dlist_head *head, dlist_node *node)
     dlist_check(head);
 }
 
+/*
+ * Move element from its current position in the list to the tail position in
+ * the same list.
+ *
+ * Undefined behaviour if 'node' is not already part of the list.
+ */
+static inline void
+dlist_move_tail(dlist_head *head, dlist_node *node)
+{
+    /* fast path if it's already at the tail */
+    if (head->head.prev == node)
+        return;
+
+    dlist_delete(node);
+    dlist_push_tail(head, node);
+
+    dlist_check(head);
+}
+
 /*
  * Check whether 'node' has a following node.
  * Caution: unreliable if 'node' is not in the list.
-- 
2.16.3

From ade1f6bf389d834cd4428f302a5cc4deaf66be9e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 13 Feb 2019 13:36:38 +0900
Subject: [PATCH 2/5] Memory consumption reporting feature for MemoryContext

This adds a feature that counts memory consumption (in other words,
the internally allocated size of each chunk) and lets other code read
it.  This allows other features to know the "(almost) real" memory consumption.
---
 src/backend/utils/mmgr/aset.c | 13 +++++++++++++
 src/backend/utils/mmgr/mcxt.c |  1 +
 src/include/nodes/memnodes.h  |  4 ++++
 3 files changed, 18 insertions(+)

diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 08aff333a4..3c5798734c 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -614,6 +614,9 @@ AllocSetReset(MemoryContext context)
 
     /* Reset block size allocation sequence, too */
     set->nextBlockSize = set->initBlockSize;
+
+    /* Reset consumption account */
+    set->header.consumption = 0;
 }
 
 /*
@@ -778,6 +781,8 @@ AllocSetAlloc(MemoryContext context, Size size)
         /* Disallow external access to private part of chunk header. */
         VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOCCHUNK_PRIVATE_LEN);
 
+        context->consumption += chunk_size;
+
         return AllocChunkGetPointer(chunk);
     }
 
@@ -817,6 +822,8 @@ AllocSetAlloc(MemoryContext context, Size size)
         /* Disallow external access to private part of chunk header. */
         VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOCCHUNK_PRIVATE_LEN);
 
+        context->consumption += chunk->size;
+
         return AllocChunkGetPointer(chunk);
     }
 
@@ -976,6 +983,8 @@ AllocSetAlloc(MemoryContext context, Size size)
     /* Disallow external access to private part of chunk header. */
     VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOCCHUNK_PRIVATE_LEN);
 
+    context->consumption += chunk_size;
+
     return AllocChunkGetPointer(chunk);
 }
 
@@ -1022,6 +1031,7 @@ AllocSetFree(MemoryContext context, void *pointer)
             elog(ERROR, "could not find block containing chunk %p", chunk);
 
         /* OK, remove block from aset's list and free it */
+        context->consumption -= chunk->size;
         if (block->prev)
             block->prev->next = block->next;
         else
@@ -1039,6 +1049,7 @@ AllocSetFree(MemoryContext context, void *pointer)
         int            fidx = AllocSetFreeIndex(chunk->size);
 
         chunk->aset = (void *) set->freelist[fidx];
+        context->consumption -= chunk->size;
 
 #ifdef CLOBBER_FREED_MEMORY
         wipe_mem(pointer, chunk->size);
@@ -1159,6 +1170,7 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
         /* Do the realloc */
         chksize = MAXALIGN(size);
         blksize = chksize + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
+        context->consumption -= oldsize;
         block = (AllocBlock) realloc(block, blksize);
         if (block == NULL)
         {
@@ -1178,6 +1190,7 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
         if (block->next)
             block->next->prev = block;
         chunk->size = chksize;
+        context->consumption += chksize;
 
 #ifdef MEMORY_CONTEXT_CHECKING
 #ifdef RANDOMIZE_ALLOCATED_MEMORY
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index 43c58c351b..395fca9e5d 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -740,6 +740,7 @@ MemoryContextCreate(MemoryContext node,
     node->name = name;
     node->ident = NULL;
     node->reset_cbs = NULL;
+    node->consumption = 0;
 
     /* OK to link node into context tree */
     if (parent)
diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h
index dbae98d3d9..cb0f23bac7 100644
--- a/src/include/nodes/memnodes.h
+++ b/src/include/nodes/memnodes.h
@@ -87,6 +87,7 @@ typedef struct MemoryContextData
     const char *name;            /* context name (just for debugging) */
     const char *ident;            /* context ID if any (just for debugging) */
     MemoryContextCallback *reset_cbs;    /* list of reset/delete callbacks */
+    uint64        consumption;    /* accumulates consumed memory size */
 } MemoryContextData;
 
 /* utils/palloc.h contains typedef struct MemoryContextData *MemoryContext */
@@ -105,3 +106,6 @@ typedef struct MemoryContextData
       IsA((context), GenerationContext)))
 
 #endif                            /* MEMNODES_H */
+
+/* Interface routines for memory consumption-based accounting */
+#define MemoryContextGetConsumption(c)  ((c)->consumption)
-- 
2.16.3

From 92c2a6f0c0696d1cef617115a199d09ae1fc0e76 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 16 Oct 2018 13:04:30 +0900
Subject: [PATCH 3/5] Remove entries that haven't been used for a certain time

Catcache entries can be left unused for a long time for several
reasons, and it is not desirable that they eat up memory. With this
patch, entries that haven't been used for a certain time are considered
for removal before the hash array is enlarged.

This also makes it possible to put a hard limit on the number of catcache entries.
---
 doc/src/sgml/config.sgml                      |  40 ++++
 src/backend/tcop/postgres.c                   |  13 ++
 src/backend/utils/cache/catcache.c            | 283 +++++++++++++++++++++++++-
 src/backend/utils/init/globals.c              |   1 +
 src/backend/utils/init/postinit.c             |  11 +
 src/backend/utils/misc/guc.c                  |  43 ++++
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/miscadmin.h                       |   1 +
 src/include/utils/catcache.h                  |  50 ++++-
 src/include/utils/timeout.h                   |   1 +
 10 files changed, 436 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 07b847a8e9..4749ad61a9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1661,6 +1661,46 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-catalog-cache-prune-min-age" xreflabel="catalog_cache_prune_min_age">
+      <term><varname>catalog_cache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>catalog_cache_prune_min_age</varname> configuration
+       parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the minimum amount of time in seconds that a system catalog
+        cache entry must remain unused before it can be removed. -1 means that
+        this feature is disabled entirely. The value defaults to 300 seconds
+        (<literal>5 minutes</literal>). Catalog cache entries that are not
+        used for this duration can be removed to prevent the cache from being
+        filled up with useless entries. This behaviour has no effect until the
+        size of a catalog cache exceeds <xref linkend="guc-catalog-cache-memory-target"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-catalog-cache-memory-target" xreflabel="catalog_cache_memory_target">
+      <term><varname>catalog_cache_memory_target</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>catalog_cache_memory_target</varname> configuration
+       parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory, in kilobytes, to which a
+        system catalog cache can expand without pruning. The value defaults to
+        0, meaning that age-based pruning is always considered. After
+        exceeding this size, the catalog cache starts pruning according to
+        <xref linkend="guc-catalog-cache-prune-min-age"/>. If you need to keep
+        a certain amount of intermittently used catalog cache entries, try
+        increasing this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 36cfd507b2..f192ee2ca6 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -71,6 +71,7 @@
 #include "tcop/pquery.h"
 #include "tcop/tcopprot.h"
 #include "tcop/utility.h"
+#include "utils/catcache.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -2584,6 +2585,7 @@ start_xact_command(void)
      * not desired, the timeout has to be disabled explicitly.
      */
     enable_statement_timeout();
+    SetCatCacheClock(GetCurrentStatementStartTimestamp());
 }
 
 static void
@@ -3159,6 +3161,14 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (CatcacheClockTimeoutPending)
+    {
+        CatcacheClockTimeoutPending = 0;
+
+        /* Update the timestamp, then set up the next timeout */
+        UpdateCatCacheClock();
+    }
 }
 
 
@@ -4021,6 +4031,9 @@ PostgresMain(int argc, char *argv[],
         QueryCancelPending = false; /* second to avoid race condition */
         stmt_timeout_active = false;
 
+        /* get in sync with the timer state */
+        catcache_clock_timeout_active = false;
+
         /* Not reading from the client anymore. */
         DoingCommandRead = false;
 
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 258a1d64cc..04a60a490a 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -39,6 +39,7 @@
 #include "utils/rel.h"
 #include "utils/resowner_private.h"
 #include "utils/syscache.h"
+#include "utils/timeout.h"
 
 
  /* #define CACHEDEBUG */    /* turns DEBUG elogs on */
@@ -71,9 +72,43 @@
 #define CACHE6_elog(a,b,c,d,e,f,g)
 #endif
 
+/*
+ * GUC variable to define the minimum age of entries, in seconds, at which
+ * they become candidates for eviction. Shared among various cache mechanisms.
+ */
+int catalog_cache_prune_min_age = 300;
+
+/*
+ * GUC variable to define the minimum cache size above which entry eviction is considered.
+ * This variable is shared among various cache mechanisms.
+ */
+int catalog_cache_memory_target = 0;
+
+/*
+ * GUC to limit the number of entries. Entries are removed when their number
+ * exceeds catalog_cache_entry_limit, keeping the newest entries at the ratio
+ * specified by catalog_cache_prune_ratio.
+ */
+int catalog_cache_entry_limit = 0;
+double catalog_cache_prune_ratio = 0.8;
+
+/*
+ * Flag to keep track of whether catcache clock timer is active.
+ */
+bool catcache_clock_timeout_active = false;
+
+/*
+ * Minimum interval between two successive moves of a cache entry in the LRU
+ * list, in microseconds.
+ */
+#define MIN_LRU_UPDATE_INTERVAL 100000    /* 100ms */
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Clock used to record the last accessed time of a catcache record. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -481,6 +516,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
 
     /* delink from linked list */
     dlist_delete(&ct->cache_elem);
+    dlist_delete(&ct->lru_node);
 
     /*
      * Free keys when we're dealing with a negative entry, normal entries just
@@ -490,6 +526,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
         CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys,
                          cache->cc_keyno, ct->keys);
 
+    cache->cc_memusage -= ct->size;
     pfree(ct);
 
     --cache->cc_ntup;
@@ -779,6 +816,7 @@ InitCatCache(int id,
     MemoryContext oldcxt;
     size_t        sz;
     int            i;
+    uint64        base_size;
 
     /*
      * nbuckets is the initial number of hash buckets to use in this catcache.
@@ -821,8 +859,12 @@ InitCatCache(int id,
      *
      * Note: we rely on zeroing to initialize all the dlist headers correctly
      */
+    base_size = MemoryContextGetConsumption(CacheMemoryContext);
     sz = sizeof(CatCache) + PG_CACHE_LINE_SIZE;
     cp = (CatCache *) CACHELINEALIGN(palloc0(sz));
+    cp->cc_head_alloc_size =
+        MemoryContextGetConsumption(CacheMemoryContext) - base_size;
+
     cp->cc_bucket = palloc0(nbuckets * sizeof(dlist_head));
 
     /*
@@ -842,6 +884,11 @@ InitCatCache(int id,
     for (i = 0; i < nkeys; ++i)
         cp->cc_keyno[i] = key[i];
 
+    /* cc_head_alloc_size + consumed size for cc_bucket */
+    cp->cc_memusage =
+        MemoryContextGetConsumption(CacheMemoryContext) - base_size;
+
+    dlist_init(&cp->cc_lru_list);
     /*
      * new cache is initialized as far as we can go for now. print some
      * debugging information, if appropriate.
@@ -858,9 +905,185 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize catcache reference clock if haven't done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * Helper routine for SetCatCacheClock and UpdateCatCacheClock.
+ *
+ * We need to maintain the catcache clock during a long query.
+ */
+void
+SetupCatCacheClockTimer(void)
+{
+    long delay;
+
+    /* stop timer if not needed */
+    if (catalog_cache_prune_min_age == 0)
+    {
+        catcache_clock_timeout_active = false;
+        return;
+    }
+
+    /* One 10th of the variable, in milliseconds */
+    delay  = catalog_cache_prune_min_age * 1000/10;
+
+    /* Lower limit is 1 second */
+    if (delay < 1000)
+        delay = 1000;
+
+    enable_timeout_after(CATCACHE_CLOCK_TIMEOUT, delay);
+
+    catcache_clock_timeout_active = true;
+}
+
+/*
+ * Update catcacheclock: this is intended to be called from
+ * CATCACHE_CLOCK_TIMEOUT. The interval is expected to be at least 1 second
+ * (see above), so calling GetCurrentTimestamp() here doesn't hurt.
+ */
+void
+UpdateCatCacheClock(void)
+{
+    catcacheclock = GetCurrentTimestamp();
+    SetupCatCacheClockTimer();
+}
+
+/*
+ * It may take an unexpectedly long time before the next clock update when
+ * catalog_cache_prune_min_age is shortened. Disabling the current timer lets
+ * the next update happen at the expected interval. This isn't strictly
+ * necessary when the age is increased, but there is no harm in disabling
+ * the timer in that case either.
+ */
+void
+assign_catalog_cache_prune_min_age(int newval, void *extra)
+{
+    if (catcache_clock_timeout_active)
+        disable_timeout(CATCACHE_CLOCK_TIMEOUT, false);
+
+    catcache_clock_timeout_active = false;
+}
+
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left unused for several reasons. We remove them if
+ * they are not accessed for a certain time to prevent the catcache from
+ * bloating. The eviction uses an algorithm similar to buffer eviction, based
+ * on an access counter. Entries that are accessed several times can live
+ * longer than those that have had fewer accesses over the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int            nremoved = 0;
+    int            nelems_before = cp->cc_ntup;
+    int            ndelelems = 0;
+    bool        prune_by_age = false;
+    bool        prune_by_number = false;
+    dlist_mutable_iter    iter;
+
+    /* prune by age only if the cache's memory usage is above the target */
+    if (catalog_cache_prune_min_age >= 0 &&
+        cp->cc_memusage > (Size) catalog_cache_memory_target * 1024L)
+        prune_by_age = true;
+
+    if (catalog_cache_entry_limit > 0 &&
+        nelems_before >= catalog_cache_entry_limit)
+    {
+        ndelelems = nelems_before -
+            (int) (catalog_cache_entry_limit * catalog_cache_prune_ratio);
+
+        /* an arbitrary lower limit.. */
+        if (ndelelems < 256)
+            ndelelems = 256;
+        if (ndelelems > nelems_before)
+            ndelelems = nelems_before;
+
+        prune_by_number = true;
+    }
+
+    /* Return immediately if no pruning is wanted */
+    if (!prune_by_age && !prune_by_number)
+        return false;
+
+    /* Scan over LRU to find entries to remove */
+    dlist_foreach_modify(iter, &cp->cc_lru_list)
+    {
+        CatCTup    *ct = dlist_container(CatCTup, lru_node, iter.cur);
+        bool        remove_this = false;
+
+        /* We don't remove referenced entry */
+        if (ct->refcount != 0 ||
+            (ct->c_list && ct->c_list->refcount != 0))
+            continue;
+
+        /* check against age */
+        if (prune_by_age)
+        {
+            long    entry_age;
+            int        us;
+
+            /*
+             * Calculate the duration from the time of the last access to the
+             * "current" time. Since catcacheclock is not advanced within a
+             * transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+            if (entry_age < catalog_cache_prune_min_age)
+            {
+                /* no need to look at further entries, exit */
+                prune_by_age = false;
+                break;
+            }
+            /*
+             * Entries that have not been accessed since the last pruning are
+             * removed within that duration, and entries that have been accessed
+             * several times are removed after being left alone for up to three
+             * times that duration. We don't try to shrink the buckets since
+             * pruning effectively caps catcache expansion in the long term.
+             */
+            if (ct->naccess > 0)
+                ct->naccess--;
+            else 
+                remove_this = true;
+        }
+
+        /* check against entry number */
+        if (prune_by_number)
+        {
+            if (nremoved < ndelelems)
+                remove_this = true;
+            else
+                prune_by_number = false; /* we're satisfied */
+        }
+
+        /* exit immediately if all finished */
+        if (!prune_by_age && !prune_by_number)
+            break;
+
+        /* do the work */
+        if (remove_this)
+        {
+            CatCacheRemoveCTup(cp, ct);
+            nremoved++;
+        }
+    }
+
+    if (nremoved > 0)
+        elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d",
+             cp->id, cp->cc_relname, nremoved, nelems_before);
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -870,6 +1093,7 @@ RehashCatCache(CatCache *cp)
     dlist_head *newbucket;
     int            newnbuckets;
     int            i;
+    uint64        base_size = MemoryContextGetConsumption(CacheMemoryContext);
 
     elog(DEBUG1, "rehashing catalog cache id %d for %s; %d tups, %d buckets",
          cp->id, cp->cc_relname, cp->cc_ntup, cp->cc_nbuckets);
@@ -878,6 +1102,10 @@ RehashCatCache(CatCache *cp)
     newnbuckets = cp->cc_nbuckets * 2;
     newbucket = (dlist_head *) MemoryContextAllocZero(CacheMemoryContext, newnbuckets * sizeof(dlist_head));
 
+    /* recalculate memory usage from the first */
+    cp->cc_memusage = cp->cc_head_alloc_size +
+        MemoryContextGetConsumption(CacheMemoryContext) - base_size;
+
     /* Move all entries from old hash table to new. */
     for (i = 0; i < cp->cc_nbuckets; i++)
     {
@@ -890,6 +1118,7 @@ RehashCatCache(CatCache *cp)
 
             dlist_delete(iter.cur);
             dlist_push_head(&newbucket[hashIndex], &ct->cache_elem);
+            cp->cc_memusage += ct->size;
         }
     }
 
@@ -1274,6 +1503,21 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /* Update access information for pruning */
+        if (ct->naccess < 2)
+            ct->naccess++;
+
+        /*
+         * We don't want to update the LRU too frequently.  Note that
+         * catalog_cache_prune_min_age can be changed within a session, so we
+         * need to maintain the LRU regardless of its current value.
+         */
+        if (catcacheclock - ct->lastaccess > MIN_LRU_UPDATE_INTERVAL)
+        {
+            ct->lastaccess = catcacheclock;
+            dlist_move_tail(&cache->cc_lru_list, &ct->lru_node);
+        }
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1709,6 +1953,11 @@ SearchCatCacheList(CatCache *cache,
         /* Now we can build the CatCList entry. */
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
         nmembers = list_length(ctlist);
+
+        /*
+         * Don't waste time counting the list in the catcache memory usage,
+         * since it is not long-lived.
+         */
         cl = (CatCList *)
             palloc(offsetof(CatCList, members) + nmembers * sizeof(CatCTup *));
 
@@ -1819,11 +2068,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     CatCTup    *ct;
     HeapTuple    dtp;
     MemoryContext oldcxt;
+    uint64        base_size = MemoryContextGetConsumption(CacheMemoryContext);
 
     /* negative entries have no tuple associated */
     if (ntp)
     {
         int            i;
+        int            tupsize;
 
         Assert(!negative);
 
@@ -1842,8 +2093,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
         /* Allocate memory for CatCTup and the cached tuple in one go */
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
 
-        ct = (CatCTup *) palloc(sizeof(CatCTup) +
-                                MAXIMUM_ALIGNOF + dtp->t_len);
+        tupsize = sizeof(CatCTup) +    MAXIMUM_ALIGNOF + dtp->t_len;
+        ct = (CatCTup *) palloc(tupsize);
         ct->tuple.t_len = dtp->t_len;
         ct->tuple.t_self = dtp->t_self;
         ct->tuple.t_tableOid = dtp->t_tableOid;
@@ -1877,7 +2128,6 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
         Assert(negative);
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
         ct = (CatCTup *) palloc(sizeof(CatCTup));
-
         /*
          * Store keys - they'll point into separately allocated memory if not
          * by-value.
@@ -1898,18 +2148,36 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
+    dlist_push_tail(&cache->cc_lru_list, &ct->lru_node);
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
     cache->cc_ntup++;
     CacheHdr->ch_ntup++;
 
+    ct->size = MemoryContextGetConsumption(CacheMemoryContext) - base_size;
+    cache->cc_memusage += ct->size;
+
+    /* increase refcount so that this survives pruning */
+    ct->refcount++;
+
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, try cleanup by removing
+     * infrequently used entries to make room for the new entry. If that
+     * fails, enlarge the bucket array instead.  Quite arbitrarily, we try
+     * this when fill factor > 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
+    /* we may still want to prune by entry number, check it */
+    else if (catalog_cache_entry_limit > 0 &&
+             cache->cc_ntup > catalog_cache_entry_limit)
+        CatCacheCleanupOldEntries(cache);
+
+    ct->refcount--;
 
     return ct;
 }
@@ -1940,7 +2208,7 @@ CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *keys)
 /*
  * Helper routine that copies the keys in the srckeys array into the dstkeys
  * one, guaranteeing that the datums are fully allocated in the current memory
- * context.
+ * context.
  */
 static void
 CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
@@ -1976,7 +2244,6 @@ CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
                                att->attbyval,
                                att->attlen);
     }
-
 }
 
 /*
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index fd51934aaf..0e8b972a29 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t CatcacheClockTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index a5ee209f91..9eb50e9676 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void CatcacheClockTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(CATCACHE_CLOCK_TIMEOUT,
+                        CatcacheClockTimeoutHandler);
     }
 
     /*
@@ -1238,6 +1241,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+CatcacheClockTimeoutHandler(void)
+{
+    CatcacheClockTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 41d477165c..c62d5ad8b8 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -81,6 +81,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/memutils.h"
@@ -2205,6 +2206,38 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum time that cache entries must stay unused before removal."),
+            gettext_noop("Catalog cache entries that stay unused for longer than this many seconds are considered for removal."),
+            GUC_UNIT_S
+        },
+        &catalog_cache_prune_min_age,
+        300, -1, INT_MAX,
+        NULL, assign_catalog_cache_prune_min_age, NULL
+    },
+
+    {
+        {"catalog_cache_memory_target", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum catalog cache size to keep."),
+            gettext_noop("Time-based cache pruning starts working after exceeding this size."),
+            GUC_UNIT_KB
+        },
+        &catalog_cache_memory_target,
+        0, 0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"catalog_cache_entry_limit", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the maximum number of catalog cache entries."),
+             NULL
+        },
+        &catalog_cache_entry_limit,
+        0, 0, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
@@ -3368,6 +3401,16 @@ static struct config_real ConfigureNamesReal[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"catalog_cache_prune_ratio", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the fraction of catalog_cache_entry_limit to which pruning reduces the cache."),
+             NULL
+        },
+        &catalog_cache_prune_ratio,
+        0.8, 0.0, 1.0,
+        NULL, NULL, NULL
+    },
+
     /* End-of-list marker */
     {
         {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ad6c436f93..aeb5968e75 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -128,6 +128,8 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#catalog_cache_memory_target = 0kB    # in kB
+#catalog_cache_prune_min_age = 300s    # -1 disables pruning
 #max_stack_depth = 2MB            # min 100kB
 #shared_memory_type = mmap        # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c9e35003a5..33b800e80f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t CatcacheClockTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 65d816a583..0a714bf514 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,11 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
+    dlist_head    cc_lru_list;
+    int            cc_head_alloc_size; /* memory consumed to allocate this struct */
+    int            cc_memusage;    /* memory usage of this catcache (excluding
+                                 * header part) */
+    int            cc_nfreeent;    /* # of entries currently not referenced */
 
     /*
      * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,7 +125,10 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
-
+    int            naccess;        /* # of accesses to this entry, capped at 2 */
+    TimestampTz    lastaccess;        /* approx. timestamp of the last usage */
+    dlist_node    lru_node;        /* LRU node */
+    int            size;            /* palloc'ed size of this tuple */
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
      * catcache is list-searched with varying numbers of keys, we may have to
@@ -189,6 +198,45 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int catalog_cache_prune_min_age;
+extern int catalog_cache_memory_target;
+extern int catalog_cache_entry_limit;
+extern double catalog_cache_prune_ratio;
+
+/* to use as access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/*
+ * Flag to keep track of whether catcache timestamp timer is active.
+ */
+extern bool catcache_clock_timeout_active;
+
+/* catcache prune time helper functions  */
+extern void SetupCatCacheClockTimer(void);
+extern void UpdateCatCacheClock(void);
+
+/*
+ * SetCatCacheClock - set the timestamp used to record catcache accesses and
+ * start the maintenance timer if needed. We keep updating the clock even
+ * while pruning is disabled so that we are not confused by a bogus clock value.
+ */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+
+    if (!catcache_clock_timeout_active && catalog_cache_prune_min_age > 0)
+        SetupCatCacheClockTimer();
+}
+
+static inline TimestampTz
+GetCatCacheClock(void)
+{
+    return catcacheclock;
+}
+
+extern void assign_catalog_cache_prune_min_age(int newval, void *extra);
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 9244a2a7b7..b2d97b4f7b 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    CATCACHE_CLOCK_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.16.3

From e4269e14958596676c2c1f0303ca171a88ae83f7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 12 Feb 2019 20:31:16 +0900
Subject: [PATCH 4/5] Syscache usage tracking feature

Collects syscache usage statistics and shows them using the view
pg_stat_syscache. The feature is controlled by the GUC variable
track_syscache_usage_interval.
---
 doc/src/sgml/config.sgml                      |  16 ++
 src/backend/catalog/system_views.sql          |  17 +++
 src/backend/postmaster/pgstat.c               | 201 ++++++++++++++++++++++++--
 src/backend/tcop/postgres.c                   |  23 +++
 src/backend/utils/adt/pgstatfuncs.c           | 134 +++++++++++++++++
 src/backend/utils/cache/catcache.c            |  93 +++++++++---
 src/backend/utils/cache/syscache.c            |  24 +++
 src/backend/utils/init/globals.c              |   1 +
 src/backend/utils/init/postinit.c             |  11 ++
 src/backend/utils/misc/guc.c                  |  10 ++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/catalog/pg_proc.dat               |   9 ++
 src/include/miscadmin.h                       |   1 +
 src/include/pgstat.h                          |   6 +-
 src/include/utils/catcache.h                  |   9 +-
 src/include/utils/syscache.h                  |  19 +++
 src/include/utils/timeout.h                   |   1 +
 src/test/regress/expected/rules.out           |  24 ++-
 18 files changed, 564 insertions(+), 36 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4749ad61a9..bc2bef0878 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6702,6 +6702,22 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-catalog-cache-usage-interval" xreflabel="track_catalog_cache_usage_interval">
+      <term><varname>track_catalog_cache_usage_interval</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>track_catalog_cache_usage_interval</varname>
+       configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the interval, in milliseconds, at which catalog cache usage
+        statistics are collected for a session. This parameter is 0 by
+        default, which means disabled.  Only superusers can change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-io-timing" xreflabel="track_io_timing">
       <term><varname>track_io_timing</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 3e229c693c..f5d1aaf96f 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -906,6 +906,22 @@ CREATE VIEW pg_stat_progress_vacuum AS
     FROM pg_stat_get_progress_info('VACUUM') AS S
         LEFT JOIN pg_database D ON S.datid = D.oid;
 
+CREATE VIEW pg_stat_syscache AS
+    SELECT
+        S.pid                AS pid,
+        S.relid::regclass    AS relname,
+        S.indid::regclass    AS cache_name,
+        S.size                AS size,
+        S.ntup                AS ntuples,
+        S.searches            AS searches,
+        S.hits                AS hits,
+        S.neg_hits            AS neg_hits,
+        S.ageclass            AS ageclass,
+        S.last_update        AS last_update
+    FROM pg_stat_activity A
+    JOIN LATERAL (SELECT A.pid, * FROM pg_get_syscache_stats(A.pid)) S
+        ON (A.pid = S.pid);
+
 CREATE VIEW pg_user_mappings AS
     SELECT
         U.oid       AS umid,
@@ -1185,6 +1201,7 @@ GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_get_syscache_stats(int) TO pg_monitor;
 
 GRANT pg_read_all_settings TO pg_monitor;
 GRANT pg_read_all_stats TO pg_monitor;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 81c6499251..8c4ab0aef9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -66,6 +66,7 @@
 #include "utils/ps_status.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 
@@ -124,6 +125,7 @@
 bool        pgstat_track_activities = false;
 bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
+int            pgstat_track_syscache_usage_interval = 0;
 int            pgstat_track_activity_query_size = 1024;
 
 /* ----------
@@ -236,6 +238,11 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
+/* bitmap symbols to specify which types of stats files to remove */
+#define PGSTAT_REMFILE_DBSTAT    1        /* remove only database stats files */
+#define PGSTAT_REMFILE_SYSCACHE    2        /* remove only syscache stats files */
+#define PGSTAT_REMFILE_ALL        3        /* remove both type of files */
+
 /*
  * Info about current "snapshot" of stats file
  */
@@ -335,6 +342,7 @@ static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
 static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
 static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+static void pgstat_remove_syscache_statsfile(void);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
@@ -630,10 +638,13 @@ startup_failed:
 }
 
 /*
- * subroutine for pgstat_reset_all
+ * remove stats files
+ *
+ * Clean up stats files in the specified directory. target is one of
+ * PGSTAT_REMFILE_DBSTAT/SYSCACHE/ALL and restricts which files are removed.
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+pgstat_reset_remove_files(const char *directory, int target)
 {
     DIR           *dir;
     struct dirent *entry;
@@ -644,25 +655,39 @@ pgstat_reset_remove_files(const char *directory)
     {
         int            nchars;
         Oid            tmp_oid;
+        int            filetype = 0;
 
         /*
          * Skip directory entries that don't match the file names we write.
          * See get_dbstat_filename for the database-specific pattern.
          */
         if (strncmp(entry->d_name, "global.", 7) == 0)
+        {
+            filetype = PGSTAT_REMFILE_DBSTAT;
             nchars = 7;
+        }
         else
         {
+            char head[2];
+            
             nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
+            (void) sscanf(entry->d_name, "%c%c_%u.%n",
+                          head, head + 1, &tmp_oid, &nchars);
+
             /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
+            if (nchars < 3 || !isdigit(entry->d_name[3]))
                 continue;
+
+            if  (strncmp(head, "db", 2) == 0)
+                filetype = PGSTAT_REMFILE_DBSTAT;
+            else if (strncmp(head, "cc", 2) == 0)
+                filetype = PGSTAT_REMFILE_SYSCACHE;
         }
 
+        /* skip if this is not a target */
+        if ((filetype & target) == 0)
+            continue;
+
         if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
             strcmp(entry->d_name + nchars, "stat") != 0)
             continue;
@@ -683,8 +708,9 @@ pgstat_reset_remove_files(const char *directory)
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_ALL);
+    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY,
+                              PGSTAT_REMFILE_ALL);
 }
 
 #ifdef EXEC_BACKEND
@@ -2963,6 +2989,10 @@ pgstat_beshutdown_hook(int code, Datum arg)
     if (OidIsValid(MyDatabaseId))
         pgstat_report_stat(true);
 
+    /* clear syscache statistics files and temporary settings */
+    if (MyBackendId != InvalidBackendId)
+        pgstat_remove_syscache_statsfile();
+
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
      * before and after.  We use a volatile pointer here to ensure the
@@ -4287,6 +4317,9 @@ PgstatCollectorMain(int argc, char *argv[])
     pgStatRunningInCollector = true;
     pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
 
+    /* Remove left-over syscache stats files */
+    pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_SYSCACHE);
+
     /*
      * Loop to process messages until we get SIGQUIT or detect ungraceful
      * death of our parent postmaster.
@@ -6377,3 +6410,153 @@ pgstat_clip_activity(const char *raw_activity)
 
     return activity;
 }
+
+/*
+ * return the filename for a syscache stat file; filename is the output
+ * buffer, of length len.
+ */
+void
+pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid,
+                                 char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/cc_%u.%s",
+                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
+                       pgstat_stat_directory,
+                       backendid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
+}
+
+/* removes syscache stats files of this backend */
+static void
+pgstat_remove_syscache_statsfile(void)
+{
+    char    fname[MAXPGPATH];
+
+    pgstat_get_syscachestat_filename(false, false, MyBackendId,
+                                     fname, MAXPGPATH);
+    unlink(fname);        /* don't care about the result */
+}
+
+/*
+ * pgstat_write_syscache_stats() -
+ *        Write the syscache statistics files.
+ *
+ * If 'force' is false, this function skips writing a file and returns the
+ * time remaining in the current interval in milliseconds. If 'force' is true,
+ * writes a file regardless of the remaining time and resets the interval.
+ */
+long
+pgstat_write_syscache_stats(bool force)
+{
+    static TimestampTz last_report = 0;
+    TimestampTz now;
+    long elapsed;
+    long secs;
+    int     usecs;
+    int    cacheId;
+    FILE    *fpout;
+    char    statfile[MAXPGPATH];
+    char    tmpfile[MAXPGPATH];
+
+    /* Return if we don't want it */
+    if (!force && pgstat_track_syscache_usage_interval <= 0)
+    {
+        /* disabled. remove the statistics file if any */
+        if (last_report > 0)
+        {
+            last_report = 0;
+            pgstat_remove_syscache_statsfile();
+        }
+        return 0;
+    }
+    
+    /* Check against the interval */
+    now = GetCurrentTransactionStopTimestamp();
+    TimestampDifference(last_report, now, &secs, &usecs);
+    elapsed = secs * 1000 + usecs / 1000;
+
+    if (!force && elapsed < pgstat_track_syscache_usage_interval)
+    {
+        /* not yet the time, inform the remaining time to the caller */
+        return pgstat_track_syscache_usage_interval - elapsed;
+    }
+
+    /* now update the stats */
+    last_report = now;
+
+    pgstat_get_syscachestat_filename(false, true,
+                                     MyBackendId, tmpfile, MAXPGPATH);
+    pgstat_get_syscachestat_filename(false, false,
+                                     MyBackendId, statfile, MAXPGPATH);
+
+    /*
+     * This function can be called from ProcessInterrupts(). Inhibit recursive
+     * interrupts to avoid recursive entry.
+     */
+    HOLD_INTERRUPTS();
+
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        /*
+         * Failure writing this file is not critical. Just skip this time and
+         * tell caller to wait for the next interval.
+         */
+        RESUME_INTERRUPTS();
+        return pgstat_track_syscache_usage_interval;
+    }
+
+    /* write out every catcache stats */
+    for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++)
+    {
+        SysCacheStats *stats;
+        
+        stats = SysCacheGetStats(cacheId);
+        Assert (stats);
+
+        /* write error is checked later using ferror() */
+        fputc('T', fpout);
+        (void)fwrite(&cacheId, sizeof(int), 1, fpout);
+        (void)fwrite(&last_report, sizeof(TimestampTz), 1, fpout);
+        (void)fwrite(stats, sizeof(*stats), 1, fpout);
+    }
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write syscache statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close syscache statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename syscache statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+
+    RESUME_INTERRUPTS();
+    return 0;
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index f192ee2ca6..d0afee189f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3159,6 +3159,12 @@ ProcessInterrupts(void)
 
     }
 
+    if (IdleSyscacheStatsUpdateTimeoutPending)
+    {
+        IdleSyscacheStatsUpdateTimeoutPending = false;
+        pgstat_write_syscache_stats(true);
+    }
+
     if (ParallelMessagePending)
         HandleParallelMessages();
 
@@ -3743,6 +3749,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_syscache_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4186,9 +4193,19 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long timeout;
+
                 ProcessCompletedNotifies();
                 pgstat_report_stat(false);
 
+                timeout = pgstat_write_syscache_stats(false);
+
+                if (timeout > 0)
+                {
+                    disable_idle_syscache_update_timeout = true;
+                    enable_timeout_after(IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT,
+                                         timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4231,6 +4248,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_syscache_update_timeout)
+        {
+            disable_timeout(IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_syscache_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index b6ba856ebe..a314f431c6 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -14,6 +14,8 @@
  */
 #include "postgres.h"
 
+#include <sys/stat.h>
+
 #include "access/htup_details.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
@@ -28,6 +30,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/inet.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
@@ -1899,3 +1902,134 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     PG_RETURN_DATUM(HeapTupleGetDatum(
                                       heap_form_tuple(tupdesc, values, nulls)));
 }
+
+Datum
+pgstat_get_syscache_stats(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SYSCACHE_SIZE 9
+    int                    pid     = PG_GETARG_INT32(0);
+    ReturnSetInfo       *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+    TupleDesc            tupdesc;
+    Tuplestorestate    *tupstore;
+    MemoryContext        per_query_ctx;
+    MemoryContext        oldcontext;
+    PgBackendStatus       *beentry;
+    int                    beid;
+    char                fname[MAXPGPATH];
+    FILE                  *fpin;
+    char c;
+
+    if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("set-valued function called in context that cannot accept a set")));
+    if (!(rsinfo->allowedModes & SFRM_Materialize))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("materialize mode required, but it is not " \
+                        "allowed in this context")));
+
+    /* Build a tuple descriptor for our result type */
+    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+        elog(ERROR, "return type must be a row type");
+    
+
+    per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+
+    oldcontext = MemoryContextSwitchTo(per_query_ctx);
+    tupstore = tuplestore_begin_heap(true, false, work_mem);
+    rsinfo->returnMode = SFRM_Materialize;
+    rsinfo->setResult = tupstore;
+    rsinfo->setDesc = tupdesc;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    /* find beentry for given pid*/
+    beentry = NULL;
+    for (beid = 1;
+         (beentry = pgstat_fetch_stat_beentry(beid)) &&
+             beentry->st_procpid != pid ;
+         beid++);
+
+    /*
+     * we silently return empty result on failure or insufficient privileges
+     */
+    if (!beentry ||
+        (!has_privs_of_role(GetUserId(), beentry->st_userid) &&
+         !is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS)))
+        goto no_data;
+
+    pgstat_get_syscachestat_filename(false, false, beid, fname, MAXPGPATH);
+
+    if ((fpin = AllocateFile(fname, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(WARNING,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            fname)));
+        /* also return empty on no statistics file */
+        goto no_data;
+    }
+
+    /* read the statistics file into tuplestore */
+    while ((c = fgetc(fpin)) == 'T')
+    {
+        TimestampTz last_update;
+        SysCacheStats stats;
+        int cacheid;
+        Datum values[PG_GET_SYSCACHE_SIZE];
+        bool nulls[PG_GET_SYSCACHE_SIZE] = {0};
+        Datum datums[SYSCACHE_STATS_NAGECLASSES * 2];
+        bool arrnulls[SYSCACHE_STATS_NAGECLASSES * 2] = {0};
+        int    dims[] = {SYSCACHE_STATS_NAGECLASSES, 2};
+        int lbs[] = {1, 1};
+        ArrayType *arr;
+        int i, j;
+
+        if (fread(&cacheid, sizeof(int), 1, fpin) != 1 ||
+            fread(&last_update, sizeof(TimestampTz), 1, fpin) != 1 ||
+            fread(&stats, 1, sizeof(stats), fpin) != sizeof(stats))
+        {
+            ereport(WARNING, 
+                    (errmsg("corrupted syscache statistics file \"%s\"",
+                            fname)));
+            goto no_data;
+        }
+
+        i = 0;
+        values[i++] = ObjectIdGetDatum(stats.reloid);
+        values[i++] = ObjectIdGetDatum(stats.indoid);
+        values[i++] = Int64GetDatum(stats.size);
+        values[i++] = Int64GetDatum(stats.ntuples);
+        values[i++] = Int64GetDatum(stats.nsearches);
+        values[i++] = Int64GetDatum(stats.nhits);
+        values[i++] = Int64GetDatum(stats.nneg_hits);
+
+        for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++)
+        {
+            datums[j * 2] = Int32GetDatum((int32) stats.ageclasses[j]);
+            datums[j * 2 + 1] = Int32GetDatum((int32) stats.nclass_entries[j]);
+        }            
+
+        arr = construct_md_array(datums, arrnulls, 2, dims, lbs,
+                              INT4OID, sizeof(int32), true, 'i');
+        values[i++] = PointerGetDatum(arr);
+
+        values[i++] = TimestampTzGetDatum(last_update);
+
+        Assert (i == PG_GET_SYSCACHE_SIZE);
+
+        tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+    }
+
+    /* check for the end of file. abandon the result if file is broken */
+    if (c != 'E' || fgetc(fpin) != EOF)
+        tuplestore_clear(tupstore);
+
+    FreeFile(fpin);
+
+no_data:
+    tuplestore_donestoring(tupstore);
+    return (Datum) 0;
+}
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 04a60a490a..fa0d19a9c3 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -109,6 +109,10 @@ static CatCacheHeader *CacheHdr = NULL;
 /* Clock used to record the last accessed time of a catcache record. */
 TimestampTz    catcacheclock = 0;
 
+/* age classes for pruning */
+static double ageclass[SYSCACHE_STATS_NAGECLASSES]
+    = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -640,9 +644,7 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue)
             else
                 CatCacheRemoveCTup(cache, ct);
             CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated");
-#ifdef CATCACHE_STATS
             cache->cc_invals++;
-#endif
             /* could be multiple matches, so keep looking! */
         }
     }
@@ -718,9 +720,7 @@ ResetCatalogCache(CatCache *cache)
             }
             else
                 CatCacheRemoveCTup(cache, ct);
-#ifdef CATCACHE_STATS
             cache->cc_invals++;
-#endif
         }
     }
 }
@@ -1030,10 +1030,10 @@ CatCacheCleanupOldEntries(CatCache *cp)
             int        us;
 
             /*
-             * Calculate the duration from the time of the last access to the
-             * "current" time. Since catcacheclock is not advanced within a
-             * transaction, the entries that are accessed within the current
-             * transaction won't be pruned.
+             * Calculate the duration from the time of the last access to
+             * the "current" time. Since catcacheclock is not advanced within
+             * a transaction, the entries that are accessed within the current
+             * transaction always get 0 as the result.
              */
             TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
 
@@ -1459,9 +1459,7 @@ SearchCatCacheInternal(CatCache *cache,
     if (unlikely(cache->cc_tupdesc == NULL))
         CatalogCacheInitializeCache(cache);
 
-#ifdef CATCACHE_STATS
     cache->cc_searches++;
-#endif
 
     /* Initialize local parameter array */
     arguments[0] = v1;
@@ -1531,9 +1529,7 @@ SearchCatCacheInternal(CatCache *cache,
             CACHE3_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d",
                         cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
             cache->cc_hits++;
-#endif
 
             return &ct->tuple;
         }
@@ -1542,9 +1538,7 @@ SearchCatCacheInternal(CatCache *cache,
             CACHE3_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d",
                         cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
             cache->cc_neg_hits++;
-#endif
 
             return NULL;
         }
@@ -1672,9 +1666,7 @@ SearchCatCacheMiss(CatCache *cache,
     CACHE3_elog(DEBUG2, "SearchCatCache(%s): put in bucket %d",
                 cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
     cache->cc_newloads++;
-#endif
 
     return &ct->tuple;
 }
@@ -1785,9 +1777,7 @@ SearchCatCacheList(CatCache *cache,
 
     Assert(nkeys > 0 && nkeys < cache->cc_nkeys);
 
-#ifdef CATCACHE_STATS
     cache->cc_lsearches++;
-#endif
 
     /* Initialize local parameter array */
     arguments[0] = v1;
@@ -1844,9 +1834,7 @@ SearchCatCacheList(CatCache *cache,
         CACHE2_elog(DEBUG2, "SearchCatCacheList(%s): found list",
                     cache->cc_relname);
 
-#ifdef CATCACHE_STATS
         cache->cc_lhits++;
-#endif
 
         return cl;
     }
@@ -2367,3 +2355,68 @@ PrintCatCacheListLeakWarning(CatCList *list)
          list->my_cache->cc_relname, list->my_cache->id,
          list, list->refcount);
 }
+
+/*
+ * CatCacheGetStats - fill in SysCacheStats struct.
+ *
+ * This is a support routine for SysCacheGetStats; it substantially fills in
+ * the result. The classification here is based on the same criteria as
+ * CatCacheCleanupOldEntries().
+ */
+void
+CatCacheGetStats(CatCache *cache, SysCacheStats *stats)
+{
+    int    i, j;
+
+    Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0);
+
+    /* fill in the stats struct */
+    stats->size = cache->cc_memusage;
+    stats->ntuples = cache->cc_ntup;
+    stats->nsearches = cache->cc_searches;
+    stats->nhits = cache->cc_hits;
+    stats->nneg_hits = cache->cc_neg_hits;
+
+    /*
+     * catalog_cache_prune_min_age can be changed within a session, so fill
+     * it in every time.
+     */
+    for (i = 0 ; i < SYSCACHE_STATS_NAGECLASSES ; i++)
+        stats->ageclasses[i] =
+            (int) (catalog_cache_prune_min_age * ageclass[i]);
+
+    /*
+     * The nth element of nclass_entries stores the number of cache entries
+     * that have gone unaccessed for the corresponding ageclass multiple of
+     * catalog_cache_prune_min_age.
+     */
+    memset(stats->nclass_entries, 0, sizeof(int) * SYSCACHE_STATS_NAGECLASSES);
+
+    /* Scan the whole hash */
+    for (i = 0; i < cache->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cache->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+            /*
+             * Calculate the duration from the time of the last access to
+             * the "current" time. Since catcacheclock is not advanced within
+             * a transaction, the entries that are accessed within the current
+             * transaction won't be pruned.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+            j = 0;
+            while (j < SYSCACHE_STATS_NAGECLASSES - 1 &&
+                   entry_age > stats->ageclasses[j])
+                j++;
+
+            stats->nclass_entries[j]++;
+        }
+    }
+}
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index ac98c19155..7b38a06708 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -20,6 +20,9 @@
  */
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "catalog/indexing.h"
@@ -1534,6 +1537,27 @@ RelationSupportsSysCache(Oid relid)
     return false;
 }
 
+/*
+ * SysCacheGetStats - returns stats of specified syscache
+ *
+ * This routine returns the address of its local static memory.
+ */
+SysCacheStats *
+SysCacheGetStats(int cacheId)
+{
+    static SysCacheStats stats;
+
+    Assert(cacheId >=0 && cacheId < SysCacheSize);
+
+    memset(&stats, 0, sizeof(stats));
+
+    stats.reloid = cacheinfo[cacheId].reloid;
+    stats.indoid = cacheinfo[cacheId].indoid;
+
+    CatCacheGetStats(SysCache[cacheId], &stats);
+
+    return &stats;
+}
 
 /*
  * OID comparator for pg_qsort
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 0e8b972a29..b7c647b5e0 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t CatcacheClockTimeoutPending = false;
+volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 9eb50e9676..2f3251e8d5 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
 static void CatcacheClockTimeoutHandler(void);
+static void IdleSyscacheStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
                         IdleInTransactionSessionTimeoutHandler);
         RegisterTimeout(CATCACHE_CLOCK_TIMEOUT,
                         CatcacheClockTimeoutHandler);
+        RegisterTimeout(IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT,
+                        IdleSyscacheStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1249,6 +1252,14 @@ CatcacheClockTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleSyscacheStatsUpdateTimeoutHandler(void)
+{
+    IdleSyscacheStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c62d5ad8b8..7f1670fa5b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3178,6 +3178,16 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"track_catalog_cache_usage_interval", PGC_SUSET, STATS_COLLECTOR,
+            gettext_noop("Sets the interval between syscache usage collection, in milliseconds. Zero disables syscache usage tracking."),
 
+            NULL
+        },
+        &pgstat_track_syscache_usage_interval,
+        0, 0, INT_MAX / 2,
+        NULL, NULL, NULL
+    },
+
     {
         {"gin_pending_list_limit", PGC_USERSET, CLIENT_CONN_STATEMENT,
             gettext_noop("Sets the maximum size of the pending list for GIN index."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index aeb5968e75..797f52fa2a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -556,6 +556,7 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
+#track_catalog_cache_usage_interval = 0    # zero disables tracking
 #stats_temp_directory = 'pg_stat_tmp'
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 24f99f7fc4..fc35b6be47 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -9689,6 +9689,15 @@
   proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
  proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}',
   prosrc => 'pg_get_replication_slots' },
+{ oid => '3425',
+  descr => 'syscache statistics',
+  proname => 'pg_get_syscache_stats', prorows => '100', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', prorettype => 'record',
+  proargtypes => 'int4',
+  proallargtypes => '{int4,oid,oid,int8,int8,int8,int8,int8,_int4,timestamptz}',
+  proargmodes => '{i,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}',
+  prosrc => 'pgstat_get_syscache_stats' },
 { oid => '3786', descr => 'set up a logical replication slot',
   proname => 'pg_create_logical_replication_slot', provolatile => 'v',
   proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 33b800e80f..767c94a63c 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,7 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t CatcacheClockTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 88a75fb798..b6bfd7d644 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1144,6 +1144,7 @@ extern bool pgstat_track_activities;
 extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
+extern int    pgstat_track_syscache_usage_interval;
 extern char *pgstat_stat_directory;
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
@@ -1228,7 +1229,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
-
+extern void pgstat_get_syscachestat_filename(bool permanent,
+                    bool tempname, int backendid, char *filename, int len);
 /* ----------
  * pgstat_report_wait_start() -
  *
@@ -1363,5 +1365,5 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
-
+extern long pgstat_write_syscache_stats(bool force);
 #endif                            /* PGSTAT_H */
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 0a714bf514..95cd885c16 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -69,10 +69,8 @@ typedef struct catcache
     int            cc_nfreeent;    /* # of entries currently not referenced */
 
     /*
-     * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
-     * doesn't break ABI for other modules
+     * Statistics entries
      */
-#ifdef CATCACHE_STATS
     long        cc_searches;    /* total # searches against this cache */
     long        cc_hits;        /* # of matches against existing entry */
     long        cc_neg_hits;    /* # of matches against negative entry */
@@ -85,7 +83,6 @@ typedef struct catcache
     long        cc_invals;        /* # of entries invalidated from cache */
     long        cc_lsearches;    /* total # list-searches */
     long        cc_lhits;        /* # of matches against existing lists */
-#endif
 } CatCache;
 
 
@@ -276,4 +273,8 @@ extern void PrepareToInvalidateCacheTuple(Relation relation,
 extern void PrintCatCacheLeakWarning(HeapTuple tuple);
 extern void PrintCatCacheListLeakWarning(CatCList *list);
 
+/* defined in syscache.h */
+typedef struct syscachestats SysCacheStats;
+extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats);
+
 #endif                            /* CATCACHE_H */
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index 95ee48954e..71b399c902 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -112,6 +112,24 @@ enum SysCacheIdentifier
 #define SysCacheSize (USERMAPPINGUSERSERVER + 1)
 };
 
+#define SYSCACHE_STATS_NAGECLASSES 6
+/* Struct for catcache tracking information */
+typedef struct syscachestats
+{
+    Oid        reloid;            /* target relation */
+    Oid        indoid;            /* index */
+    size_t    size;            /* size of the catcache */
+    int        ntuples;        /* number of tuples residing in the catcache */
+    int        nsearches;        /* number of searches */
+    int        nhits;            /* number of cache hits */
+    int        nneg_hits;        /* number of negative cache hits */
+    /* age classes in seconds */
+    int        ageclasses[SYSCACHE_STATS_NAGECLASSES];
+    /* number of tuples falling into the corresponding age class */
+    int        nclass_entries[SYSCACHE_STATS_NAGECLASSES];
+} SysCacheStats;
+
+
 extern void InitCatalogCache(void);
 extern void InitCatalogCachePhase2(void);
 
@@ -164,6 +182,7 @@ extern void SysCacheInvalidate(int cacheId, uint32 hashValue);
 extern bool RelationInvalidatesSnapshotsOnly(Oid relid);
 extern bool RelationHasSysCache(Oid relid);
 extern bool RelationSupportsSysCache(Oid relid);
+extern SysCacheStats *SysCacheGetStats(int cacheId);
 
 /*
  * The use of the macros below rather than direct calls to the corresponding
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index b2d97b4f7b..0677978923 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -32,6 +32,7 @@ typedef enum TimeoutId
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
     CATCACHE_CLOCK_TIMEOUT,
+    IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2c8e21baa7..7bd77e9972 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1921,6 +1921,28 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
     pg_stat_all_tables.autoanalyze_count
    FROM pg_stat_all_tables
  WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
 
+pg_stat_syscache| SELECT s.pid,
+    (s.relid)::regclass AS relname,
+    (s.indid)::regclass AS cache_name,
+    s.size,
+    s.ntup AS ntuples,
+    s.searches,
+    s.hits,
+    s.neg_hits,
+    s.ageclass,
+    s.last_update
+   FROM (pg_stat_activity a
+     JOIN LATERAL ( SELECT a.pid,
+            pg_get_syscache_stats.relid,
+            pg_get_syscache_stats.indid,
+            pg_get_syscache_stats.size,
+            pg_get_syscache_stats.ntup,
+            pg_get_syscache_stats.searches,
+            pg_get_syscache_stats.hits,
+            pg_get_syscache_stats.neg_hits,
+            pg_get_syscache_stats.ageclass,
+            pg_get_syscache_stats.last_update
+           FROM pg_get_syscache_stats(a.pid) pg_get_syscache_stats(relid, indid, size, ntup, searches, hits, neg_hits, ageclass, last_update)) s ON ((a.pid = s.pid)));
 
 pg_stat_user_functions| SELECT p.oid AS funcid,
     n.nspname AS schemaname,
     p.proname AS funcname,
@@ -2352,7 +2374,7 @@ pg_settings|pg_settings_n|CREATE RULE pg_settings_n AS
     ON UPDATE TO pg_catalog.pg_settings DO INSTEAD NOTHING;
 pg_settings|pg_settings_u|CREATE RULE pg_settings_u AS
     ON UPDATE TO pg_catalog.pg_settings
-   WHERE (new.name = old.name) DO  SELECT set_config(old.name, new.setting, false) AS set_config;
+   WHERE (new.name = old.name) DO  SELECT set_config(old.name, new.setting, false, false) AS set_config;
 rtest_emp|rtest_emp_del|CREATE RULE rtest_emp_del AS
     ON DELETE TO public.rtest_emp DO  INSERT INTO rtest_emplog (ename, who, action, newsal, oldsal)
   VALUES (old.ename, CURRENT_USER, 'fired'::bpchar, '$0.00'::money, old.salary);
-- 
2.16.3

From 05a75bff3a48007f393bf5f99e354ec0619d00c9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 13 Feb 2019 14:34:46 +0900
Subject: [PATCH 5/5] Global LRU based cache pruning.

This adds a feature that removes the least recently used cache entries
among all catcaches when the total memory amount goes above
catalog_cache_max_size.
---
 doc/src/sgml/config.sgml           |  20 +++++++
 src/backend/utils/cache/catcache.c | 106 +++++++++++++++++++++++--------------
 src/backend/utils/misc/guc.c       |  21 +++-----
 src/include/utils/catcache.h       |   5 +-
 4 files changed, 94 insertions(+), 58 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index bc2bef0878..daa6085693 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1701,6 +1701,26 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-catalog-cache-max-size" xreflabel="catalog_cache_max_size">
+      <term><varname>catalog_cache_max_size</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>catalog_cache_max_size</varname> configuration
+       parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum total amount of memory allowed for all system
+        catalog caches in kilobytes. The value defaults to 0, indicating that
+        catalog caches in kilobytes. The value defaults to 0, indicating that
+        pruning by this parameter is disabled. Once the amount of memory used
+        by all catalog caches exceeds this size, creating a new cache entry
+        removes one or more not-recently-used cache entries. This means that
+        frequent creation of new cache entries may lead to a slight slowdown
+        of queries.
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index fa0d19a9c3..3336ff6dc3 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -86,11 +86,9 @@ int catalog_cache_memory_target = 0;
 
 /*
  * GUC for limit by the number of entries. Entries are removed when the number
- * of them goes above catalog_cache_entry_limit and leaving newer entries by
- * the ratio specified by catalog_cache_prune_ratio.
+ * of them goes above catalog_cache_max_size in kilobytes
  */
-int catalog_cache_entry_limit = 0;
-double catalog_cache_prune_ratio = 0.8;
+int catalog_cache_max_size = 0;
 
 /*
  * Flag to keep track of whether catcache clock timer is active.
@@ -108,6 +106,8 @@ static CatCacheHeader *CacheHdr = NULL;
 
 /* Clock used to record the last accessed time of a catcache record. */
 TimestampTz    catcacheclock = 0;
+dlist_head    cc_lru_list = {0};
+Size        global_size = 0;
 
 /* age classes for pruning */
 static double ageclass[SYSCACHE_STATS_NAGECLASSES]
@@ -531,6 +531,8 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
                          cache->cc_keyno, ct->keys);
 
     cache->cc_memusage -= ct->size;
+    global_size -= ct->size;
+
     pfree(ct);
 
     --cache->cc_ntup;
@@ -887,8 +889,12 @@ InitCatCache(int id,
     /* cc_head_alloc_size + consumed size for cc_bucket */
     cp->cc_memusage =
         MemoryContextGetConsumption(CacheMemoryContext) - base_size;
+    global_size += cp->cc_memusage;
+
+    /* initialize global LRU if not yet */
+    if (cc_lru_list.head.next == NULL)
+        dlist_init(&cc_lru_list);
 
-    dlist_init(&cp->cc_lru_list);
     /*
      * new cache is initialized as far as we can go for now. print some
      * debugging information, if appropriate.
@@ -981,39 +987,27 @@ assign_catalog_cache_prune_min_age(int newval, void *extra)
 static bool
 CatCacheCleanupOldEntries(CatCache *cp)
 {
+    static TimestampTz prev_warn_emit = 0;
     int            nremoved = 0;
     int            nelems_before = cp->cc_ntup;
-    int            ndelelems = 0;
     bool        prune_by_age = false;
-    bool        prune_by_number = false;
+    bool        prune_by_size = false;
     dlist_mutable_iter    iter;
 
-    /* prune only if the size of the hash is above the target */
     if (catalog_cache_prune_min_age >= 0 &&
         cp->cc_memusage > (Size) catalog_cache_memory_target * 1024L)
         prune_by_age = true;
 
-    if (catalog_cache_entry_limit > 0 &&
-        nelems_before >= catalog_cache_entry_limit)
-    {
-        ndelelems = nelems_before -
-            (int) (catalog_cache_entry_limit * catalog_cache_prune_ratio);
-
-        /* an arbitrary lower limit.. */
-        if (ndelelems < 256)
-            ndelelems = 256;
-        if (ndelelems > nelems_before)
-            ndelelems = nelems_before;
-
-        prune_by_number = true;
-    }
+    if (catalog_cache_max_size > 0 &&
+        global_size >= (Size) catalog_cache_max_size * 1024)
+        prune_by_size = true;
 
     /* Return immediately if no pruning is wanted */
-    if (!prune_by_age && !prune_by_number)
+    if (!prune_by_age && !prune_by_size)
         return false;
 
     /* Scan over LRU to find entries to remove */
-    dlist_foreach_modify(iter, &cp->cc_lru_list)
+    dlist_foreach_modify(iter, &cc_lru_list)
     {
         CatCTup    *ct = dlist_container(CatCTup, lru_node, iter.cur);
         bool        remove_this = false;
@@ -1023,8 +1017,8 @@ CatCacheCleanupOldEntries(CatCache *cp)
             (ct->c_list && ct->c_list->refcount != 0))
             continue;
 
-        /* check against age */
-        if (prune_by_age)
+        /* check against age. prune within this cache */
+        if (prune_by_age && ct->owner == cp)
         {
             long    entry_age;
             int        us;
@@ -1056,31 +1050,58 @@ CatCacheCleanupOldEntries(CatCache *cp)
                 remove_this = true;
         }
 
-        /* check against entry number */
-        if (prune_by_number)
+        /* check against global size. removes from all cache */
+        if (prune_by_size && !remove_this)
         {
-            if (nremoved < ndelelems)
+            if (global_size >= (Size) catalog_cache_max_size * 1024)
                 remove_this = true;
             else
-                prune_by_number = false; /* we're satisfied */
+                prune_by_size = false; /* we're satisfied */
         }
 
+        if (!remove_this)
+            continue;
+
         /* exit immediately if all finished */
-        if (!prune_by_age && !prune_by_number)
+        if (!prune_by_age && !prune_by_size)
             break;
 
         /* do the work */
-        if (remove_this)
-        {
-            CatCacheRemoveCTup(cp, ct);
-            nremoved++;
-        }
+        CatCacheRemoveCTup(ct->owner, ct);
+        nremoved++;
     }
 
     if (nremoved > 0)
         elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d",
              cp->id, cp->cc_relname, nremoved, nelems_before);
 
+    /*
+     * Warn of too small setting of catalog_cache_max_size. Take 5 seconds
+     * between messages, using statement start timestamp to avoid frequent
+     * gettimeofday().
+     */
+    if (prune_by_size &&
+        (prev_warn_emit == 0 ||
+         GetCurrentStatementStartTimestamp() - prev_warn_emit > 5000000))
+    {
+        ErrorContextCallback *oldcb;
+
+        /* cancel error context callbacks  */
+        oldcb = error_context_stack;
+        error_context_stack = NULL;
+        
+        ereport(LOG, (
+                    errmsg ("cannot reduce cache size to %d kilobytes, reduced to %d kilobytes",
+                            catalog_cache_max_size,    (int)(global_size / 1024)),
+                    errdetail ("Consider increasing the configuration parameter \"catalog_cache_max_size\"."),
+                    errhidecontext(true),
+                    errhidestmt(true)));
+
+        error_context_stack = oldcb;
+
+        prev_warn_emit = GetCurrentStatementStartTimestamp();
+    }
+
     return nremoved > 0;
 }
 
@@ -1103,6 +1124,7 @@ RehashCatCache(CatCache *cp)
     newbucket = (dlist_head *) MemoryContextAllocZero(CacheMemoryContext, newnbuckets * sizeof(dlist_head));
 
     /* recalculate memory usage from the first */
+    global_size -= cp->cc_memusage;
     cp->cc_memusage = cp->cc_head_alloc_size +
         MemoryContextGetConsumption(CacheMemoryContext) - base_size;
 
@@ -1122,6 +1144,8 @@ RehashCatCache(CatCache *cp)
         }
     }
 
+    global_size += cp->cc_memusage;
+
     /* Switch to the new array. */
     pfree(cp->cc_bucket);
     cp->cc_nbuckets = newnbuckets;
@@ -1513,7 +1537,7 @@ SearchCatCacheInternal(CatCache *cache,
         if (catcacheclock - ct->lastaccess > MIN_LRU_UPDATE_INTERVAL)
         {
             ct->lastaccess = catcacheclock;
-            dlist_move_tail(&cache->cc_lru_list, &ct->lru_node);
+            dlist_move_tail(&cc_lru_list, &ct->lru_node);
         }
 
         /*
@@ -2138,7 +2162,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->hash_value = hashValue;
     ct->naccess = 0;
     ct->lastaccess = catcacheclock;
-    dlist_push_tail(&cache->cc_lru_list, &ct->lru_node);
+    ct->owner = cache;
+    dlist_push_tail(&cc_lru_list, &ct->lru_node);
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
@@ -2147,6 +2172,7 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
 
     ct->size = MemoryContextGetConsumption(CacheMemoryContext) - base_size;
     cache->cc_memusage += ct->size;
+    global_size += ct->size;
 
     /* increase refcount so that this survives pruning */
     ct->refcount++;
@@ -2161,8 +2187,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
         !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
     /* we may still want to prune by entry number, check it */
-    else if (catalog_cache_entry_limit > 0 &&
-             cache->cc_ntup > catalog_cache_entry_limit)
+    else if (catalog_cache_max_size > 0 &&
+             global_size > catalog_cache_max_size * 1024)
         CatCacheCleanupOldEntries(cache);
 
     ct->refcount--;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 7f1670fa5b..7a52c70649 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2229,12 +2229,13 @@ static struct config_int ConfigureNamesInt[] =
     },
 
     {
-        {"catalog_cache_entry_limit", PGC_USERSET, RESOURCES_MEM,
-            gettext_noop("Sets the maximum entries of catcache."),
-             NULL
+        {"catalog_cache_max_size", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the maximum size of catcache in kilobytes."),
+             NULL,
+             GUC_UNIT_KB
         },
-        &catalog_cache_entry_limit,
-        0, 0, INT_MAX,
+        &catalog_cache_max_size,
+        0, 0, MAX_KILOBYTES,
         NULL, NULL, NULL
     },
 
@@ -3411,16 +3412,6 @@ static struct config_real ConfigureNamesReal[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"catalog_cache_prune_ratio", PGC_USERSET, RESOURCES_MEM,
-            gettext_noop("Reduce ratio of pruning caused by catalog_cache_entry_limit."),
-             NULL
-        },
-        &catalog_cache_prune_ratio,
-        0.8, 0.0, 1.0,
-        NULL, NULL, NULL
-    },
-
     /* End-of-list marker */
     {
         {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 95cd885c16..1e2d6e7bd7 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -62,7 +62,6 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
-    dlist_head    cc_lru_list;
     int            cc_head_alloc_size;/* consumed memory to allocate this struct */
     int            cc_memusage;    /* memory usage of this catcache (excluding
                                  * header part) */
@@ -125,6 +124,7 @@ typedef struct catctup
     int            naccess;        /* # of access to this entry, up to 2  */
     TimestampTz    lastaccess;        /* approx. timestamp of the last usage */
     dlist_node    lru_node;        /* LRU node */
+    CatCache   *owner;            /* owner catcache */
     int            size;            /* palloc'ed size off this tuple */
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
@@ -198,8 +198,7 @@ extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 /* for guc.c, not PGDLLPMPORT'ed */
 extern int catalog_cache_prune_min_age;
 extern int catalog_cache_memory_target;
-extern int catalog_cache_entry_limit;
-extern double catalog_cache_prune_ratio;
+extern int catalog_cache_max_size;
 
 /* to use as access timestamp of catcache entries */
 extern TimestampTz catcacheclock;
-- 
2.16.3


Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
At Wed, 13 Feb 2019 02:15:42 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in
<0A3221C70F24FB45833433255569204D1FB97CF1@G01JPEXMBYT05>
> From: Tomas Vondra [mailto:tomas.vondra@2ndquadrant.com]
> > > I didn't consider planning that happens within a function. If
> > > 5min is the default for catalog_cache_prune_min_age, 10% of it
> > > (30s) seems enough, and gettimeofday() at such intervals wouldn't
> > > affect foreground jobs. I'd choose catalog_c_p_m_age/10 rather
> > > than the fixed value 30s, with 1s as the minimum.
> > >
> > 
> > Actually, I see CatCacheCleanupOldEntries contains this comment:
> > 
> > /*
> >  * Calculate the duration from the time of the last access to the
> >  * "current" time. Since catcacheclock is not advanced within a
> >  * transaction, the entries that are accessed within the current
> >  * transaction won't be pruned.
> >  */
> > 
> > which I think is pretty much what I've been saying ... But the question
> > is whether we need to do something about it.
> 
> Hmm, I'm surprised at v14 patch about this.  I remember that previous patches renewed the cache clock on every
> statement, and it is correct.  If the cache clock is only updated at the beginning of a transaction, the following
> TODO item would not be solved:
 
> 
> https://wiki.postgresql.org/wiki/Todo

Sorry, it's just a stale comment. In v15 it is already.... ouch!
It is still left alone. (Actually CatCacheGetStats doesn't perform
pruning.)  I'll remove it in the next version. The clock update is
done in start_xact_command, which is called per statement and is
provided with the statement timestamp.


> /*
>  * Calculate the duration from the time of the last access to
>  * the "current" time. catcacheclock is updated on a per-statement
>  * basis and additionally updated periodically during a long
>  * running query.
>  */
> TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);


> " Reduce memory use when analyzing many tables in a single command by making catcache and syscache flushable or
bounded."

In v14 and v15, in addition to that, a timer that fires at an
interval of catalog_cache_prune_min_age/10 (30s when the
parameter is 5min) updates the catcache clock using
gettimeofday(), which in turn is the source of the LRU timestamp.
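
A standalone sketch of that clock-update scheme (illustrative only; the
function names, the 5-minute default, and the clamping below are my own
stand-ins, not the patch itself). The point it shows is that gettimeofday()
is paid at most once per timer interval rather than once per cache access:

#include <stdio.h>
#include <time.h>

/* models catcacheclock and catalog_cache_prune_min_age */
static time_t catcache_clock;
static int    catalog_cache_prune_min_age = 300;    /* seconds, assumed default */

/* called at statement start: the statement timestamp drives the LRU clock */
static void
set_catcache_clock(time_t stmt_start)
{
    catcache_clock = stmt_start;
}

/* called from a periodic timeout handler during long-running statements */
static void
update_catcache_clock(void)
{
    catcache_clock = time(NULL);
}

/* timer interval: one tenth of the pruning age, clamped to at least 1s */
static int
clock_update_interval(void)
{
    int interval = catalog_cache_prune_min_age / 10;    /* 30s for a 5min setting */

    return (interval < 1) ? 1 : interval;
}

int
main(void)
{
    set_catcache_clock(time(NULL));
    printf("timer interval = %ds, clock = %ld\n",
           clock_update_interval(), (long) catcache_clock);
    update_catcache_clock();
    return 0;
}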

> Also, Tom mentioned pg_dump in this thread (protect syscache...).  pg_dump runs in a single transaction, touching all
> system catalogs.  That may result in OOM, and this patch can rescue it.
 

So, all of the problems will be addressed in v14.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
At Tue, 12 Feb 2019 18:33:46 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<d3b291ff-d993-78d1-8d28-61bcf72793d6@2ndquadrant.com>
> > "catalog_cache_prune_min_age", "catalog_cache_memory_target", (if
> > exists) "catalog_cache_entry_limit" and
> > "catalog_cache_prune_ratio" make sense?
> > 
> 
> I think "catalog_cache" sounds about right, although my point was simply
> that there's a discrepancy between sgml docs and code.

system_catalog_cache is too long a prefix for parameter names, so I
named the parameters "catalog_cache_*" and use "system catalog cache"
or "catalog cache" in the documentation.

> >> 2) "cache_entry_limit" is not mentioned in sgml docs at all, and it's
> >> defined three times in guc.c for some reason.
> > 
> > It is just a PoC, added to show how it looks. (The multiple
> > instances must be a result of a convulsion of my fingers..) I
> > think this is not useful unless it can be specified on a
> > per-relation or per-cache basis. I'll remove the GUC and add
> > reloptions for the purpose. (But it won't work for pg_class and
> > pg_attribute for now.)
> > 
> 
> OK, although I'd just keep it as simple as possible. TBH I can't really
> imagine users tuning limits for individual caches in any meaningful way.

I also feel that way, but anyway (:p), in v15 it has evolved into
a feature that limits the total cache size based on a
global LRU list.
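
As a rough standalone model of that global-LRU-by-total-size idea
(illustrative only; the entry sizes, the limit, and the names below are
made up, not taken from the patch): every new entry is appended to a
single LRU list shared by all caches, and entry creation evicts from the
list head while the global total exceeds the limit.

#include <stdio.h>
#include <stdlib.h>

typedef struct Entry
{
    struct Entry *prev;
    struct Entry *next;          /* global LRU links, oldest at the head */
    size_t        size;
} Entry;

static Entry  *lru_head;
static Entry  *lru_tail;
static size_t  total_size;
static size_t  max_size = 4096;  /* models catalog_cache_max_size (bytes here) */

static void
lru_append(Entry *e)
{
    e->prev = lru_tail;
    e->next = NULL;
    if (lru_tail)
        lru_tail->next = e;
    else
        lru_head = e;
    lru_tail = e;
}

/* create an entry, then evict least-recently-used entries while over the limit */
static Entry *
create_entry(size_t size)
{
    Entry *e = malloc(sizeof(Entry));

    e->size = size;
    lru_append(e);
    total_size += size;

    while (total_size > max_size && lru_head != e)
    {
        Entry *victim = lru_head;

        lru_head = victim->next;
        if (lru_head)
            lru_head->prev = NULL;
        else
            lru_tail = NULL;
        total_size -= victim->size;
        free(victim);
    }
    return e;
}

int
main(void)
{
    int i;

    for (i = 0; i < 100; i++)
        create_entry(128);
    printf("total cache size: %zu bytes (limit %zu)\n", total_size, max_size);
    return 0;
}

The patch itself additionally skips entries that are still referenced and
charges each entry's real palloc'd size; the model above omits both.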

> > I didn't consider planning that happens within a function. If
> > 5min is the default for catalog_cache_prune_min_age, 10% of it
> > (30s) seems enough, and gettimeofday() at such intervals wouldn't
> > affect foreground jobs. I'd choose catalog_c_p_m_age/10 rather
> > than the fixed value 30s, with 1s as the minimum.
> > 
> 
> Actually, I see CatCacheCleanupOldEntries contains this comment:
> 
> /*
>  * Calculate the duration from the time of the last access to the
>  * "current" time. Since catcacheclock is not advanced within a
>  * transaction, the entries that are accessed within the current
>  * transaction won't be pruned.
>  */
> 
> which I think is pretty much what I've been saying ... But the question
> is whether we need to do something about it.

As I wrote in the message just sent in reply to Tsunakawa-san, it is
just a bogus comment. The correct one is the following; I'll replace
it in the next version.

> * Calculate the duration from the time of the last access to
> * the "current" time. catcacheclock is updated on a per-statement
> * basis and additionally updated periodically during a long
> * running query.

> > I observed significant degradation from setting up a timer at every
> > statement start. The patch does the following to get rid of
> > the degradation.
> > 
> > (1) Every statement updates the catcache timestamp as it currently
> >     does.  (SetCatCacheClock)
> > 
> > (2) The timestamp is also updated periodically using a timer,
> >    separately from (1). The timer is started at the time of (1)
> >    if it is not already running.  (SetCatCacheClock, UpdateCatCacheClock)
> > 
> > (3) Statement end and transaction end don't stop the timer, to
> >    avoid the overhead of setting up a timer.
> > 
> > (4) But it stops on error. I chose not to change the fact that
> >     PostgresMain kills all timers on error.
> > 
> > (5) Also, changing the GUC catalog_cache_prune_min_age kills the
> >    timer, in order to reflect the change quickly, especially when
> >    it is shortened.
> > 
> 
> Interesting. What was the frequency of the timer / how often was it
> executed? Can you share the code somehow?

Please find it in v14 [1] or v15 [2], which contain the same code
for the purpose.

[1] https://www.postgresql.org/message-id/20190212.203628.118792892.horiguchi.kyotaro@lab.ntt.co.jp

[2] https://www.postgresql.org/message-id/20190213.153114.239737674.horiguchi.kyotaro%40lab.ntt.co.jp

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

От
Bruce Momjian
Дата:
On Tue, Feb 12, 2019 at 02:53:40AM +0100, Tomas Vondra wrote:
> Right. But the logic behind time-based approach is that evicting such
> entries should not cause any issues exactly because they are accessed
> infrequently. It might incur some latency when we need them for the
> first time after the eviction, but IMHO that's acceptable (although I
> see Andres did not like that).
> 
> FWIW we might even evict entries after some time passes since inserting
> them into the cache - that's what memcached et al do, IIRC. The logic is
> that frequently accessed entries will get immediately loaded back (thus
> keeping cache hit ratio high). But there are reasons why the other dbs
> do that - like not having any cache invalidation (unlike us).

Agreed.  If this fixes 90% of the issues people will have, and it
applies to the 99.9% of users who will never tune this, it is a clear
win.  If we want to add something that requires tuning later, we can
consider it once the non-tuning solution is done.

> That being said, having a "minimal size" threshold before starting with
> the time-based eviction may be a good idea.

Agreed.  I see the minimal size as a way to keep the system tables in
cache, which we know we will need for the next query.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


RE: Protect syscache from bloating with negative cache entries

От
"Tsunakawa, Takayuki"
Дата:
From: Bruce Momjian [mailto:bruce@momjian.us]
> > That being said, having a "minimal size" threshold before starting with
> > the time-based eviction may be a good idea.
> 
> > Agreed.  I see the minimal size as a way to keep the system tables
> cache, which we know we will need for the next query.

Isn't it the maximum size, not the minimal size?  A maximum size allows us to keep the desired amount of system tables
in memory as well as to control memory consumption to avoid out-of-memory errors (OS crash!).  I'm wondering why people
want to take a different approach to the catcache, unlike other PostgreSQL memory such as shared_buffers, temp_buffers,
SLRU buffers, work_mem, and unlike other DBMSs.
 


Regards
Takayuki Tsunakawa





RE: Protect syscache from bloating with negative cache entries

От
"Tsunakawa, Takayuki"
Дата:
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> It is too complex, as I was afraid. The indirect calls cause significant
> degradation. (Anyway, the previous code was bogus in that it passes a
> CACHELINEALIGN'ed pointer to get_chunk_size..)
> 
> Instead, I added an accounting(?) interface function.
> 
> | MemoryContextGetConsumption(MemoryContext cxt);
> 
> The API returns the current consumption in this memory context. This allows
> "real" memory accounting almost without overhead.

That looks like a great idea!  Actually, I was thinking of using MemoryContextStats() or its new lightweight variant to
get the used amount, but I was afraid it would be too costly to call in catcache code.  You are smarter, and I was just
stupid.
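
A toy model of that consumption-delta idea (not PostgreSQL code; the
counter below merely stands in for what MemoryContextGetConsumption() is
described as returning): read the context's running consumption before
and after building an entry and charge the difference to the entry,
mirroring the "MemoryContextGetConsumption(CacheMemoryContext) - base_size"
pattern visible in the patch above.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* stand-in for a memory context's running consumption counter */
static size_t consumption;

static void *
acct_alloc(size_t n)
{
    consumption += n;            /* the "context" remembers what it handed out */
    return malloc(n);
}

int
main(void)
{
    size_t  base_size = consumption;        /* reading taken before the entry */
    char   *key = acct_alloc(32);
    char   *tuple = acct_alloc(128);
    size_t  entry_size;

    strcpy(key, "pg_class:12345");
    memset(tuple, 0, 128);

    entry_size = consumption - base_size;   /* charged to this cache entry */
    printf("entry accounted as %zu bytes\n", entry_size);

    free(tuple);
    free(key);
    return 0;
}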


> (2) Another new patch v15-0005 on top of the previous design of the
>   limit-by-number-of-a-cache feature converts it to a
>   limit-by-size-on-all-caches feature, which I think is what
>   Tsunakawa-san wanted.

Thank you very, very much!  I look forward to reviewing v15.  I'll be away from the office tomorrow, so I'd like to
review it this weekend or at the beginning of next week.  I've confirmed and am sure that 0001 can be committed.
 




> As far as I can see, no significant degradation is found in the usual
> (as long as pruning doesn't happen) code paths.
> 
> About the new global-size based eviction (2), cache entry creation becomes
> slow after the total size reaches the limit, since every new entry
> evicts one or more old (=
> not-recently-used) entries. Because it doesn't need knobs for each cache,
> it becomes far more realistic. So I added documentation of
> "catalog_cache_max_size" in 0005.

Could you show us a comparison of performance before and after pruning starts, if you already have it?  If you lost the
data, I'm OK with seeing the data after the code review.
 


Regards
Takayuki Tsunakawa




RE: Protect syscache from bloating with negative cache entries

От
"Ideriha, Takeshi"
Дата:
>From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
>
>
>(2) Another new patch v15-0005 on top of the previous design of the
>  limit-by-number-of-a-cache feature converts it to a
>  limit-by-size-on-all-caches feature, which I think is what
>  Tsunakawa-san wanted.
Yeah, size looks better to me.

>As far as I can see, no significant degradation is found in the usual (as long as pruning
>doesn't happen) code paths.
>
>About the new global-size based eviction (2), cache entry creation becomes slow after
>the total size reaches the limit, since every new entry evicts one or more old (=
>not-recently-used) entries. Because it doesn't need knobs for each cache, it becomes
>far more realistic. So I added documentation of "catalog_cache_max_size" in 0005.

Now I'm also trying to benchmark, which will be posted in another email.

Here are things I noticed:

[1] compiler warning (a minimal illustration of the fix follows after this list)
catcache.c:109:1: warning: missing braces around initializer [-Wmissing-braces]
 dlist_head cc_lru_list = {0};
 ^
catcache.c:109:1: warning: (near initialization for ‘cc_lru_list.head’) [-Wmissing-braces]

[2] catalog_cache_max_size does not appear in postgresql.conf.sample

[3] the global LRU list and global size can be included in CatCacheHeader, which seems to me
    a good place because this structure contains global cache information regardless of the kind of CatCache

[4] when applying the patches with git am, there are several warnings about trailing whitespace in v15-0003
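
On item [1]: dlist_head wraps a nested struct, so a bare {0} initializer
can draw -Wmissing-braces from some gcc versions. A minimal standalone
illustration (the struct names are simplified stand-ins, not the real
ilist.h definitions):

#include <stddef.h>

/* simplified stand-ins for dlist_node / dlist_head */
typedef struct toy_node
{
    struct toy_node *prev;
    struct toy_node *next;
} toy_node;

typedef struct toy_head
{
    toy_node    head;
} toy_head;

static toy_head lru_warns = {0};            /* may draw -Wmissing-braces, as in [1] */
static toy_head lru_quiet = {{NULL, NULL}}; /* nested braces, no warning */

int
main(void)
{
    (void) lru_warns;
    (void) lru_quiet;
    return 0;
}

In the real code, writing {{NULL, NULL}} (or, if preferred, DLIST_STATIC_INIT
from ilist.h, which would also leave the list in a valid empty state so the
lazy dlist_init() check would not be needed) should silence the warning.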

Regards,
Takeshi Ideriha


Re: Protect syscache from bloating with negative cache entries

От
Andres Freund
Дата:
Hi,

On 2019-02-13 15:31:14 +0900, Kyotaro HORIGUCHI wrote:
> Instead, I added an accounting(?) interface function.
> 
> | MemoryContextGetConsumption(MemoryContext cxt);
> 
> The API returns the current consumption in this memory
> context. This allows "real" memory accounting almost without
> overhead.

That's definitely *NOT* almost without overhead. This adds additional
instructions to one of postgres' hottest sets of codepaths.

I think you're not working incrementally enough here. I strongly suggest
solving the negative cache entry problem, and then incrementally go from
there after that's committed. The likelihood of this patch ever getting
merged otherwise seems extremely small.

Greetings,

Andres Freund


Re: Protect syscache from bloating with negative cache entries

From
Bruce Momjian
Date:
On Thu, Feb 14, 2019 at 12:40:10AM -0800, Andres Freund wrote:
> Hi,
> 
> On 2019-02-13 15:31:14 +0900, Kyotaro HORIGUCHI wrote:
> > Instead, I added an accounting(?) interface function.
> > 
> > | MemoryContextGetConsumption(MemoryContext cxt);
> > 
> > The API returns the current consumption in this memory
> > context. This allows "real" memory accounting almost without
> > overhead.
> 
> That's definitely *NOT* almost without overhead. This adds additional
> instructions to one of postgres' hottest sets of codepaths.
> 
> I think you're not working incrementally enough here. I strongly suggest
> solving the negative cache entry problem, and then incrementally go from
> there after that's committed. The likelihood of this patch ever getting
> merged otherwise seems extremely small.

Agreed --- the patch is going in the wrong direction.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


Re: Protect syscache from bloating with negative cache entries

From
'Bruce Momjian'
Date:
On Thu, Feb 14, 2019 at 01:31:49AM +0000, Tsunakawa, Takayuki wrote:
> From: Bruce Momjian [mailto:bruce@momjian.us]
> > > That being said, having a "minimal size" threshold before starting
> > > with the time-based eviction may be a good idea.
> >
> > Agreed.  I see the minimal size as a way to keep the systems tables
> > in cache, which we know we will need for the next query.
>
> Isn't it the maximum size, not minimal size?  Maximum size allows us
> to keep the desired amount of system tables in memory as well as to
> control memory consumption to avoid out-of-memory errors (OS crash!).
> I'm wondering why people want to take a different approach to
> catcache, unlike other PostgreSQL memory, e.g. shared_buffers,
> temp_buffers, SLRU buffers, work_mem, and other DBMSs.

Well, that is an _excellent_ question, and one I had to think about.

I think, in general, smaller is better, as long as making something
smaller doesn't remove data that is frequently accessed.  Having a timer
to expire only old entries seems like it accomplishes this goal.

Having a minimum size and not taking it to zero makes sense if we
know we will need certain entries like pg_class in the next query.
However, if the session is idle for hours, we should probably just
remove everything, so maybe the minimum doesn't make sense --- just
remove everything.

As for why we don't do this with everything --- we can't do it with
shared_buffers since we can't change its size while the server is
running.  For work_mem, we assume all the work_mem data is for the
current query, and therefore frequently accessed.  Also, work_mem is not
memory we can just free if it is not used since it contains intermediate
results required by the current query.  I think temp_buffers, since it
can be resized in the session, actually could use a similar minimizing
feature, though that would mean it behaves slightly differently from
shared_buffers, and it might not be worth it.  Also, I assume the value
of temp_buffers was mostly for use by the current query --- yes, it can
be used for cross-query caching, but I am not sure if that is its
primary purpose.  I thought its goal was to prevent shared_buffers from
being populated with temporary per-session buffers.

I don't think other DBMSs are a good model since they have a reputation
for requiring a lot of tuning --- tuning that we have often automated.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


RE: Protect syscache from bloating with negative cache entries

From
"Ideriha, Takeshi"
Date:
>From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]

>>About the new global-size based eviction (2), cache entry creation
>>becomes slow after the total size reaches the limit, since every
>>new entry evicts one or more old (= not-recently-used) entries.
>>Because of not needing knobs for each cache, it becomes far more
>>realistic. So I added documentation of "catalog_cache_max_size" in 0005.
>
>Now I'm also trying to benchmark, which will be posted in another email.

According to the recent comments by Andres and Bruce,
maybe we should address negative cache bloat step by step,
for example by reviewing Tom's patch.

But at the same time, I did some benchmarking with only the hard limit option enabled
and the time-related option disabled, because figures for this case have not been provided in this thread.
So let me share it.

I did two experiments. One is to show that negative cache bloat is suppressed.
This thread originated from the issue that the negative cache of pg_statistic
bloats as temp tables are repeatedly created and dropped.
https://www.postgresql.org/message-id/20161219.201505.11562604.horiguchi.kyotaro%40lab.ntt.co.jp  
Using the script attached to the first email in this thread, I repeated create-and-drop of a temp table 10000 times.
(The experiment was repeated 5 times; catalog_cache_max_size = 500kB;
 master branch compared against the patch with the hard memory limit.)

Here are TPS and CacheMemoryContext 'used' memory (total - freespace) calculated by MemoryContextPrintStats()
at 100, 1000, and 10000 create-and-drop transactions. The result shows that cache bloating is suppressed
after exceeding the limit (at 10000), but tps declines regardless of the limit.

number of tx (create and drop)       | 100  |1000    |10000 
-----------------------------------------------------------
used CacheMemoryContext  (master) |610296|2029256 |15909024
used CacheMemoryContext  (patch)  |755176|880552  |880592
-----------------------------------------------------------
TPS (master)                         |414   |407     |399
TPS (patch)                           |242   |225     |220


Another experiment uses Tomas's script posted a while ago.
The scenario is to run 'select 1' against multiple tables chosen randomly (uniform distribution).
(The experiment was repeated 5 times; catalog_cache_max_size = 10MB;
 master branch compared against the patch with only the hard memory limit enabled.)

Before doing the benchmark, I checked with a debug option that pruning happens only with 10000 tables.
The result shows degradation both before and after pruning starts.
I personally still need a hard size limit, but I'm surprised that the difference is so significant.

number of tables   | 100  |1000    |10000 
-----------------------------------------------------------
TPS (master)       |10966  |10654 |9099
TPS (patch)        |4491   |2099 |378

Regards,
Takeshi Ideriha



Re: Protect syscache from bloating with negative cache entries

From
Tomas Vondra
Date:


On 2/13/19 1:23 AM, Tsunakawa, Takayuki wrote:
> From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
>> I'm at a loss how to call syscache for users. I think it is "catalog
>> cache". The most basic component is called catcache, which is
>> covered by the syscache layer; both of them are not revealed to
>> users, and it is shown to users as "catalog cache".
>>
>> "catalog_cache_prune_min_age", "catalog_cache_memory_target", (if
>> exists) "catalog_cache_entry_limit" and
>> "catalog_cache_prune_ratio" make sense?
> 
> PostgreSQL documentation uses "system catalog" in its table of contents, so syscat_cache_xxx would be a bit more
> familiar?  I'm fine with either catalog_ or syscat_, but what name shall we use for the relation cache?  catcache and
> relcache have different element sizes and possibly different usage patterns, so they may as well have different
> parameters, just like MySQL does.  If we follow that idea, then the name would be relation_cache_xxx.  However, from the
> user's viewpoint, the relation cache is also created from the system catalog, like pg_class and pg_attribute...
> 

I think "catalog_cache_..." is fine. If we end up with a similar
patchfor relcache, we can probably call it "relation_cache_".

I'd be OK even with "system_catalog_cache_..." - I don't think it's
overly long (better to have a longer but descriptive name), and "syscat"
just seems like unnecessary abbreviation.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Protect syscache from bloating with negative cache entries

From
Tomas Vondra
Date:

On 2/14/19 3:46 PM, Bruce Momjian wrote:
> On Thu, Feb 14, 2019 at 12:40:10AM -0800, Andres Freund wrote:
>> Hi,
>>
>> On 2019-02-13 15:31:14 +0900, Kyotaro HORIGUCHI wrote:
>>> Instead, I added an accounting(?) interface function.
>>>
>>> | MemoryContextGetConsumption(MemoryContext cxt);
>>>
>>> The API returns the current consumption in this memory
>>> context. This allows "real" memory accounting almost without
>>> overhead.
>>
>> That's definitely *NOT* almost without overhead. This adds additional
>> instructions to one of postgres' hottest sets of codepaths.
>>
>> I think you're not working incrementally enough here. I strongly suggest
>> solving the negative cache entry problem, and then incrementally go from
>> there after that's committed. The likelihood of this patch ever getting
>> merged otherwise seems extremely small.
> 
> Agreed --- the patch is going in the wrong direction.
> 

I recall endless discussions about memory accounting in the
"memory-bounded hash-aggregate" patch a couple of years ago, and the
overhead was one of the main issues there. So yeah, trying to solve that
problem here is likely to kill this patch (or at least significantly
delay it).

ISTM there's a couple of ways to deal with that:

1) Ignore the memory amounts entirely, and do just time-based eviction.

2) If we want some size thresholds (e.g. to disable eviction for
backends with small caches etc.), use the number of entries instead. I
don't think that's particularly worse than specifying the size in MB.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Protect syscache from bloating with negative cache entries

From
Tomas Vondra
Date:

On 2/14/19 4:49 PM, 'Bruce Momjian' wrote:
> On Thu, Feb 14, 2019 at 01:31:49AM +0000, Tsunakawa, Takayuki wrote:
>> From: Bruce Momjian [mailto:bruce@momjian.us]
>>>> That being said, having a "minimal size" threshold before starting
>>>> with the time-based eviction may be a good idea.
>>>
>>> Agreed.  I see the minimal size as a way to keep the systems tables
>>> in cache, which we know we will need for the next query.
>>
>> Isn't it the maximum size, not minimal size?  Maximum size allows us
>> to keep the desired amount of system tables in memory as well as to
>> control memory consumption to avoid out-of-memory errors (OS crash!).
>> I'm wondering why people want to take a different approach to
>> catcache, unlike other PostgreSQL memory, e.g. shared_buffers,
>> temp_buffers, SLRU buffers, work_mem, and other DBMSs.
> 
> Well, that is an _excellent_ question, and one I had to think about.
> 

I think we're talking about two different concepts here:

1) minimal size - We don't do any extra eviction at all until we reach
this cache size, so on systems that don't have issues we don't get any
extra overhead from it.

2) maximal size - We ensure the cache size is below this threshold. If
there's more data, we evict enough entries to get below it.

My proposal is essentially to do just (1), so the cache can grow very
large if needed but then it shrinks again after a while.
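
As an illustration only (the names below are placeholders, not the GUCs from the
posted patches), the two thresholds could show up in a pruning check roughly like this:

#include <stdbool.h>
#include <stddef.h>

/*
 * Illustrative sketch of the two concepts above; minimal_size, maximal_size
 * and entry_is_old are placeholders, not names from the posted patches.
 */
bool
should_evict(size_t cache_size, size_t minimal_size, size_t maximal_size,
             bool entry_is_old)
{
    /* (1) minimal size: below this, do no eviction work at all */
    if (cache_size <= minimal_size)
        return false;

    /* (2) maximal size: above this, evict regardless of entry age */
    if (maximal_size > 0 && cache_size > maximal_size)
        return true;

    /* in between, only time-based eviction of old entries applies */
    return entry_is_old;
}

Doing just (1) means only the first branch and the time-based check are needed.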

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


RE: Protect syscache from bloating with negative cache entries

From
"Tsunakawa, Takayuki"
Date:
From: Tomas Vondra [mailto:tomas.vondra@2ndquadrant.com]
> I think "catalog_cache_..." is fine. If we end up with a similar
> patch for relcache, we can probably call it "relation_cache_".

Agreed, those are not too long or too short, and they are sufficiently descriptive.


Regards
Takayuki Tsunakawa




Re: Protect syscache from bloating with negative cache entries

From
Alvaro Herrera
Date:
On 2019-Feb-15, Tomas Vondra wrote:

> ISTM there's a couple of ways to deal with that:
> 
> 1) Ignore the memory amounts entirely, and do just time-based eviction.
> 
> 2) If we want some size thresholds (e.g. to disable eviction for
> backends with small caches etc.), use the number of entries instead. I
> don't think that's particularly worse than specifying the size in MB.

Why is there a *need* for size-based eviction?  Seems that time-based
should be sufficient.  Is the proposed approach to avoid eviction at all
until the size threshold has been reached?  I'm not sure I see the point
of that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


RE: Protect syscache from bloating with negative cache entries

From
"Tsunakawa, Takayuki"
Date:
Hi Horiguchi-san,

I've looked through your patches.  This is the first part of my review results.  Let me post the rest after my other
work today.

BTW, how about merging 0003 and 0005, and separating and deferring 0004 to another thread?  That may help to relieve
other community members by making this patch set not so large and complex.



[Bottleneck investigation]
Ideriha-san and I are trying to find the bottleneck.  My first try shows there's little overhead.  Here's what I did:

<postgresql.conf>
shared_buffers = 1GB
catalog_cache_prune_min_age = -1
catalog_cache_max_size = 10MB

<benchmark>
$ pgbench -i -s 10
$ pg_ctl stop and then start
$ cache all data in shared buffers by running pg_prewarm on branches, tellers, accounts, and their indexes
$ pgbench --select-only -c 1 -T 60

<result>
master : 8612 tps
patched: 8553 tps (-0.7%)

There's little (0.7%) performance overhead with:
* one additional dlist_move_tail() in every catcache access
* memory usage accounting in operations other than catcache access (relevant catcache entries should be cached in the
first pgbench transaction)

I'll check other patterns to find out how big overhead there is.


[Source code review]
Below are my findings on the patch set v15:

(1) patch 0001
All right.


(2) patch 0002
@@ -87,6 +87,7 @@ typedef struct MemoryContextData
     const char *name;            /* context name (just for debugging) */
     const char *ident;            /* context ID if any (just for debugging) */
     MemoryContextCallback *reset_cbs;    /* list of reset/delete callbacks */
+    uint64        consumption;    /* accumulates consumed memory size */
 } MemoryContextData;

Size is more appropriate as a data type than uint64 because other places use Size for memory size variables.

How about "usedspace" instead of "consumption"?  Because that aligns better with the naming used for
MemoryContextCounters'smember variables, totalspace and freespace.
 


(3) patch 0002
+        context->consumption += chunk_size;
(and similar sites)

The used space should include the size of the context-type-specific chunk header, so that the count is closer to the
actual memory size seen by the user.

Here, let's reach consensus on what the used space represents.  Which of the following is it?

a) The total space allocated from the OS, i.e., the sum of the malloc()ed regions for a given memory context.
b) The total space of all chunks, including their headers, of a given memory context.

a) is better because that's the actual memory usage from the DBA's standpoint.  But a) cannot be used because
CacheMemoryContext is used for various things.  So we have to compromise on b).  Is this OK?

One possible future improvement is to use a separate memory context exclusively for the catcache, which is a child of
CacheMemoryContext. That way, we can adopt a).
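
To make b) concrete, here is a toy sketch (not PostgreSQL code; every name in it is
invented for illustration) of accounting that charges each chunk plus its header to
the owning context:

#include <stdio.h>
#include <stdlib.h>

/* Toy chunk header; stands in for the context-type-specific chunk header. */
typedef struct ToyChunkHeader
{
    size_t      size;           /* requested size, excluding this header */
} ToyChunkHeader;

/* Toy context; "used" follows definition b): chunks plus their headers. */
typedef struct ToyContext
{
    size_t      used;
} ToyContext;

static void *
toy_alloc(ToyContext *cxt, size_t size)
{
    ToyChunkHeader *hdr = malloc(sizeof(ToyChunkHeader) + size);

    if (hdr == NULL)
        return NULL;
    hdr->size = size;
    cxt->used += sizeof(ToyChunkHeader) + size; /* count the header too */
    return hdr + 1;             /* hand out the space after the header */
}

static void
toy_free(ToyContext *cxt, void *ptr)
{
    ToyChunkHeader *hdr = (ToyChunkHeader *) ptr - 1;

    cxt->used -= sizeof(ToyChunkHeader) + hdr->size;
    free(hdr);
}

int
main(void)
{
    ToyContext  cxt = {0};
    void       *p = toy_alloc(&cxt, 100);

    printf("used after alloc: %zu\n", cxt.used);
    toy_free(&cxt, p);
    printf("used after free:  %zu\n", cxt.used);
    return 0;
}

The point of b) is that the catcache's own chunks can be charged to it even though
they live inside the shared CacheMemoryContext.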
 



(4) patch 0002
@@ -614,6 +614,9 @@ AllocSetReset(MemoryContext context)
+    set->header.consumption = 0;

This can be put in MemoryContextResetOnly() instead of context-type-specific reset functions.


Regards
Takayuki Tsunakawa




RE: Protect syscache from bloating with negative cache entries

From
"Tsunakawa, Takayuki"
Date:
Hi Horiguchi-san,

This is the rest of my review comments.



(5) patch 0003
        CatcacheClockTimeoutPending = 0;
+
+        /* Update timetamp then set up the next timeout */
+

false is better than 0, to follow other **Pending variables.

timetamp -> timestamp


(6) patch 0003
GetCatCacheClock() is not used now.  Why don't we add it when the need arises?


(7) patch 0003
Why don't we remove the catcache timer (Setup/UpdateCatCacheClockTimer), unless we need it by all means?  That
simplifies the code.

Long-running queries can be thought as follows:

* A single lengthy SQL statement, e.g. SELECT for reporting/analytics, COPY for data loading, and UPDATE/DELETE for
batch processing, should only require a small number of catalog entries during query analysis/planning.  They won't
suffer from cache eviction during query execution.

* We do not have to evict cache entries while executing a long-running stored procedure, because its constituent SQL
statements may access the same tables.  If the stored procedure accesses so many tables that you are worried about
catcache memory overuse, then catalog_cache_max_size can be used.  Another natural idea would be to update the cache
clock when SPI executes each SQL statement.


(8) patch 0003
+    uint64        base_size;
+    uint64        base_size = MemoryContextGetConsumption(CacheMemoryContext);

This may as well be Size, not uint64.


(9) patch 0003
@@ -1940,7 +2208,7 @@ CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *keys)
 /*
  * Helper routine that copies the keys in the srckeys array into the dstkeys
  * one, guaranteeing that the datums are fully allocated in the current memory
- * context.
+ * context. Returns allocated memory size.
  */
 static void
 CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
@@ -1976,7 +2244,6 @@ CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
                                att->attbyval,
                                att->attlen);
     }
-
 }

This change seems to be no longer necessary thanks to the memory accounting.


(10) patch 0004
How about separating this in another thread, so that the rest of the patch set becomes easier to review and commit?

Regarding the design, I'm inclined to avoid each backend writing the file.  To simplify the code, I think we can take
advantage of the fortunate situation -- the number of backends and catcaches are fixed at server startup.  My rough
sketch is:

* Allocate an array of statistics entries in shared memory, whose element is (pid or backend id, catcache id or name,
hits, misses, ...).  The number of array elements is MaxBackends * number of catcaches (some dozens).

* Each backend updates its own entry in the shared memory during query execution.

* Stats collector periodically scans the array and writes it to the stats file.
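
A rough sketch of what one element of such an array might look like (the struct and
field names are hypothetical, not taken from the posted patches):

#include <stdint.h>

/*
 * Hypothetical per-backend, per-catcache statistics slot.  The shared
 * array would hold MaxBackends * (number of catcaches) of these,
 * allocated once at server startup.
 */
typedef struct CatCacheStatsEntry
{
    int         backend_id;     /* owning backend's slot number */
    int         cache_id;       /* which syscache the counters are for */
    uint64_t    hits;           /* hits on positive entries */
    uint64_t    neg_hits;       /* hits on negative entries */
    uint64_t    misses;         /* lookups that had to go to the catalog */
    uint64_t    evictions;      /* entries pruned from this cache */
} CatCacheStatsEntry;

Each backend would update only its own slots during query execution, and the stats
collector would periodically read the whole array when writing the stats file, as in
the bullets above.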


(11) patch 0005
+dlist_head    cc_lru_list = {0};
+Size        global_size = 0;

It is better to put these in CatCacheHeader.  That way, backends that do not access the catcache (archiver, stats
collector, etc.) do not have to waste memory for these global variables.


(12) patch 0005
+    else if (catalog_cache_max_size > 0 &&
+             global_size > catalog_cache_max_size * 1024)
         CatCacheCleanupOldEntries(cache);

On the second line, catalog_cache_max_size should be cast to Size to avoid overflow.
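
To illustrate the concern, a standalone sketch (the 4GB figure is just an example):

#include <stddef.h>
#include <stdio.h>

int
main(void)
{
    int     max_kb = 4 * 1024 * 1024;   /* a 4GB limit expressed in kB */

    /*
     * Without the cast, "max_kb * 1024" is evaluated in int arithmetic and
     * overflows for settings this large; promoting to a wide type first
     * (Size in PostgreSQL, size_t here) keeps the comparison value correct
     * on 64-bit builds.
     */
    size_t  limit = (size_t) max_kb * 1024;

    printf("limit in bytes: %zu\n", limit);
    return 0;
}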


(13) patch 0005
+            gettext_noop("Sets the maximum size of catcache in kilobytes."),

catcache -> catalog cache


(14) patch 0005
+    CatCache   *owner;            /* owner catcache */

CatCTup already has my_cache member.


(15) patch 0005
     if (nremoved > 0)
         elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d",
              cp->id, cp->cc_relname, nremoved, nelems_before);

In the prune-by-size case, this elog doesn't print very meaningful data.  How about dividing this function into two,
one for prune-by-age and another for prune-by-size?  I suppose that would make the functions easier to understand.


Regards
Takayuki Tsunakawa




RE: Protect syscache from bloating with negative cache entries

From
"Tsunakawa, Takayuki"
Date:
From: 'Bruce Momjian' [mailto:bruce@momjian.us]
> I think, in general, smaller is better, as long as making something
> smaller doesn't remove data that is frequently accessed.  Having a timer
> to expire only old entries seems like it accomplished this goal.
> 
> Having a minimum size and not taking it to zero size makes sense if we
> know we will need certain entries like pg_class in the next query.
> However, if the session is idle for hours, we should just probably
> remove everything, so maybe the minimum doesn't make sense --- just
> remove everything.

That's another interesting idea.  A somewhat relevant feature is Oracle's "ALTER SYSTEM FLUSH SHARED_POOL".  It flushes
all dictionary cache, library cache, and SQL plan entries.  The purpose is different: not to release memory, but to
defragment the shared memory.


> I don't think other DBMSs are a good model since they have a reputation
> for requiring a lot of tuning --- tuning that we have often automated.

Yeah, I agree that PostgreSQL is easier to use in many aspects.

On the other hand, although I hesitate to say this (please don't get upset...), I feel PostgreSQL is a bit too loose
about memory usage.  From my memory, PostgreSQL has crashed the OS due to OOM in our users' environments in these cases:

* Creating and dropping temp tables repeatedly in a stored PL/pgSQL function.  This results in infinite
CacheMemoryContext bloat.  This is referred to at the beginning of this mail thread.
Oracle and MySQL can limit the size of the dictionary cache.

* Each pair of SAVEPOINT/RELEASE leaves 8KB of CurTransactionContext.  The customer used psqlODBC to run a batch app,
which ran millions of SQL statements in a transaction.  psqlODBC wraps each SQL statement with SAVEPOINT and RELEASE by
default.
I guess this is what caused the crash of AWS Aurora on last year's Amazon Prime Day.

* Setting a large value for work_mem, and then running many concurrent large queries.
Oracle can limit the total size of all sessions' memory with the PGA_AGGREGATE_TARGET parameter.


We all have to manage things within resource constraints.  The DBA wants to make sure the server doesn't overuse memory,
to avoid a crash or slowdown due to swapping.  Oracle does it, and another open source database, MySQL, does it too.
PostgreSQL does it with shared_buffers, wal_buffers, and work_mem (within a single session).  So I thought it's
natural to do it with the catcache/relcache/plancache as well.
 


Regards
Takayuki Tsunakawa






Re: Protect syscache from bloating with negative cache entries

From
Tomas Vondra
Date:
On 2/19/19 12:43 AM, Tsunakawa, Takayuki wrote:
> Hi Horiguchi-san,
> 
> I've looked through your patches.  This is the first part of my review results.  Let me post the rest after my other
> work today.
> 
> BTW, how about merging 0003 and 0005, and separating and deferring 0004 to another thread?  That may help to relieve
> other community members by making this patch set not so large and complex.
> 
> 
> 
> [Bottleneck investigation]
> Ideriha-san and I are trying to find the bottleneck.  My first try shows there's little overhead.  Here's what I
did:
> 
> <postgresql.conf>
> shared_buffers = 1GB
> catalog_cache_prune_min_age = -1
> catalog_cache_max_size = 10MB
> 
> <benchmark>
> $ pgbench -i -s 10
> $ pg_ctl stop and then start
> $ cache all data in shared buffers by running pg_prewarm on branches, tellers, accounts, and their indexes
> $ pgbench --select-only -c 1 -T 60
> 
> <result>
> master : 8612 tps
> patched: 8553 tps (-0.7%)
> 
> There's little (0.7%) performance overhead with:
> * one additional dlist_move_tail() in every catcache access
> * memory usage accounting in operations other than catcache access (relevant catcache entries should be cached in the
> first pgbench transaction)
> 
> I'll check other patterns to find out how big overhead there is.
> 

0.7% may easily be just noise, possibly due to differences in the layout
of the binary. How many runs? What was the variability of the results
between runs? What hardware was this tested on?

FWIW I doubt tests with such a small schema prove anything -
the cache/lists are likely tiny. That's why I tested with a much larger
number of relations.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


RE: Protect syscache from bloating with negative cache entries

From
"Ideriha, Takeshi"
Date:
>From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
>But at the same time, I did some benchmark with only hard limit option enabled and
>time-related option disabled, because the figures of this case are not provided in this
>thread.
>So let me share it.

I'm sorry, but I'm taking back the results for the patch and correcting them.
I configured postgresql (master) with only CFLAGS=-O2,
but I misconfigured postgres (patch applied) with
--enable-cassert --enable-debug --enable-tap-tests 'CFLAGS=-O0'.
These debug options (especially --enable-cassert) caused enormous overhead.
(I thought I had checked the configure options.. I was maybe tired.)
So I changed these to only 'CFLAGS=-O2' and re-measured.

>I did two experiments. One is to show negative cache bloat is suppressed.
>This thread originated from the issue that negative cache of pg_statistics is bloating as
>creating and dropping temp table is repeatedly executed.
>https://www.postgresql.org/message-id/20161219.201505.11562604.horiguchi.kyot
>aro%40lab.ntt.co.jp
>Using the script attached the first email in this thread, I repeated create and drop
>temp table at 10000 times.
>(experiment is repeated 5 times. catalog_cache_max_size = 500kB.
> compared master branch and patch with hard memory limit)
>
>Here are TPS and CacheMemoryContext 'used' memory (total - freespace) calculated
>by MemoryContextPrintStats() at 100, 1000, 10000 times of create-and-drop
>transaction. The result shows cache bloating is suppressed after exceeding the limit
>(at 10000) but tps declines regardless of the limit.
>
>number of tx (create and drop)       | 100  |1000    |10000
>-----------------------------------------------------------
>used CacheMemoryContext (master)    |610296|2029256 |15909024
>used CacheMemoryContext (patch)     |755176|880552  |880592
>-----------------------------------------------------------
>TPS (master)                         |414   |407     |399
>TPS (patch)                           |242   |225     |220

Corrected one:
number of tx (create and drop)       | 100  |1000    |10000
-----------------------------------------------------------
TPS (master)                         |414   |407     |399
TPS (patch)                           |447   |415     |409

The results for master and the patch are almost the same.


>Another experiment is using Tomas's script posted while ago, The scenario is do select
>1 from multiple tables randomly (uniform distribution).
>(experiment is repeated 5 times. catalog_cache_max_size = 10MB.
> compared master branch and patch with only hard memory limit enabled)
>
>Before doing the benchmark, I checked pruning is happened only at 10000 tables using
>debug option. The result shows degradation regardless of before or after pruning.
>I personally still need hard size limitation but I'm surprised that the difference is so
>significant.
>
>number of tables   | 100  |1000    |10000
>-----------------------------------------------------------
>TPS (master)       |10966  |10654 |9099
>TPS (patch)        |4491   |2099 |378

Corrected one:
number of tables   | 100  |1000    |10000
-----------------------------------------------------------
TPS (master)       |10966  |10654 |9099
TPS (patch)        | 11137 (+1%) |10710 (+0%) |772 (-91%)

It seems that before the cache exceeds the limit (no pruning at 100 and 1000),
the results are almost the same as master, but after exceeding the limit (at 10000)
the decline happens.


Regards,
Takeshi Ideriha


RE: Protect syscache from bloating with negative cache entries

From
"Tsunakawa, Takayuki"
Date:
From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
> number of tables   | 100  |1000    |10000
> -----------------------------------------------------------
> TPS (master)       |10966  |10654 |9099
> TPS (patch)        | 11137 (+1%) |10710 (+0%) |772 (-91%)
> 
> It seems that before cache exceeding the limit (no pruning at 100 and 1000),
> the results are almost same with master but after exceeding the limit (at
> 10000)
> the decline happens.

How many concurrent clients?

Can you show perf call graph sampling profiles of both the unpatched and patched versions, to confirm that the
bottleneck is around catcache eviction and refill?


Regards
Takayuki Tsunakawa



Re: Protect syscache from bloating with negative cache entries

From
Kyotaro HORIGUCHI
Date:
At Thu, 14 Feb 2019 00:40:10 -0800, Andres Freund <andres@anarazel.de> wrote in
<20190214084010.bdn6tmba2j7szo3m@alap3.anarazel.de>
> Hi,
> 
> On 2019-02-13 15:31:14 +0900, Kyotaro HORIGUCHI wrote:
> > Instead, I added an accounting(?) interface function.
> > 
> > | MemoryContextGetConsumption(MemoryContext cxt);
> > 
> > The API returns the current consumption in this memory
> > context. This allows "real" memory accounting almost without
> > overhead.
> 
> That's definitely *NOT* almost without overhead. This adds additional
> instructions to one of postgres' hottest sets of codepaths.

I'm not sure how much the two instructions in AllocSetAlloc
actually impact, but I agree that it is doubtful that the
size-limit feature is worth any possible slowdown at all.

# I faintly remember that I tried the same thing before..

> I think you're not working incrementally enough here. I strongly suggest
> solving the negative cache entry problem, and then incrementally go from
> there after that's committed. The likelihood of this patch ever getting
> merged otherwise seems extremely small.

Mmm. Scoping down to the negative-cache problem, my very first patch,
posted two years ago, does that based on invalidation for pg_statistic
and pg_class, as I think Tom has suggested somewhere in this
thread.

https://www.postgresql.org/message-id/20161219.201505.11562604.horiguchi.kyotaro@lab.ntt.co.jp

This is a completely different approach from the current shape, and
it would become useless after pruning is introduced. So I'd like to
go for the generic pruning by age.

Difference from v15:

  Removed AllocSet accounting stuff. We use approximate memory
  size for catcache.

  Removed prune-by-number(or size) stuff.

  Addressing comments from Tsunakawa-san and Ideriha-san.

  Separated catcache monitoring feature. (Removed from this set)
    (But it is crucial to check this feature...)


Is this small enough ?

regards.  

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 191496e02abd4d7b261705e8d2a0ef4aed5827c7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 7 Feb 2019 14:56:07 +0900
Subject: [PATCH 1/2] Add dlist_move_tail

We have dlist_push_head/tail and dlist_move_head but not
dlist_move_tail. Add it.
---
 src/include/lib/ilist.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/src/include/lib/ilist.h b/src/include/lib/ilist.h
index b1a5974ee4..659ab1ac87 100644
--- a/src/include/lib/ilist.h
+++ b/src/include/lib/ilist.h
@@ -394,6 +394,25 @@ dlist_move_head(dlist_head *head, dlist_node *node)
     dlist_check(head);
 }
 
+/*
+ * Move element from its current position in the list to the tail position in
+ * the same list.
+ *
+ * Undefined behaviour if 'node' is not already part of the list.
+ */
+static inline void
+dlist_move_tail(dlist_head *head, dlist_node *node)
+{
+    /* fast path if it's already at the tail */
+    if (head->head.prev == node)
+        return;
+
+    dlist_delete(node);
+    dlist_push_tail(head, node);
+
+    dlist_check(head);
+}
+
 /*
  * Check whether 'node' has a following node.
  * Caution: unreliable if 'node' is not in the list.
-- 
2.16.3

From 59f53da08abb70398611b33f635b46bda87a7534 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 16 Oct 2018 13:04:30 +0900
Subject: [PATCH 2/2] Remove entries that haven't been used for a certain time

Catcache entries can be left alone for several reasons. It is not
desirable that they eat up memory. This patch adds consideration of
removing entries that haven't been used for a certain time before
enlarging the hash array.

This also can put a hard limit on the number of catcache entries.
---
 doc/src/sgml/config.sgml                      |  40 +++++
 src/backend/tcop/postgres.c                   |  13 ++
 src/backend/utils/cache/catcache.c            | 243 ++++++++++++++++++++++++--
 src/backend/utils/init/globals.c              |   1 +
 src/backend/utils/init/postinit.c             |  11 ++
 src/backend/utils/misc/guc.c                  |  23 +++
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/include/miscadmin.h                       |   1 +
 src/include/utils/catcache.h                  |  43 ++++-
 src/include/utils/timeout.h                   |   1 +
 10 files changed, 364 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8bd57f376b..7a93aef659 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1661,6 +1661,46 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-catalog-cache-prune-min-age" xreflabel="catalog_cache_prune_min_age">
+      <term><varname>catalog_cache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>catalog_cache_prune_min_age</varname> configuration
+       parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the minimum amount of unused time in seconds at which a
+        system catalog cache entry is removed. -1 indicates that this feature
+        is disabled entirely. The value defaults to 300 seconds (<literal>5
+        minutes</literal>). The catalog cache entries that are not used for
+        the duration can be removed to prevent it from being filled up with
+        useless entries. This behaviour is muted until the size of a catalog
+        cache exceeds <xref linkend="guc-catalog-cache-memory-target"/>.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-catalog-cache-memory-target" xreflabel="catalog_cache_memory_target">
+      <term><varname>catalog_cache_memory_target</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>catalog_cache_memory_target</varname> configuration
+       parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum amount of memory to which a system catalog cache
+        can expand without pruning in kilobytes. The value defaults to 0,
+        indicating that age-based pruning is always considered. After
+        exceeding this size, catalog cache starts pruning according to
+        <xref linkend="guc-catalog-cache-prune-min-age"/>. If you need to keep
+        a certain amount of catalog cache entries with intermittent usage, try
+        increasing this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8b4d94c9a1..d9a54ed37f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -71,6 +71,7 @@
 #include "tcop/pquery.h"
 #include "tcop/tcopprot.h"
 #include "tcop/utility.h"
+#include "utils/catcache.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -2584,6 +2585,7 @@ start_xact_command(void)
      * not desired, the timeout has to be disabled explicitly.
      */
     enable_statement_timeout();
+    SetCatCacheClock(GetCurrentStatementStartTimestamp());
 }
 
 static void
@@ -3159,6 +3161,14 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (CatcacheClockTimeoutPending)
+    {
+        CatcacheClockTimeoutPending = false;
+
+        /* Update timestamp then set up the next timeout */
+        UpdateCatCacheClock();
+    }
 }
 
 
@@ -4021,6 +4031,9 @@ PostgresMain(int argc, char *argv[],
         QueryCancelPending = false; /* second to avoid race condition */
         stmt_timeout_active = false;
 
+        /* get sync with the timer state */
+        catcache_clock_timeout_active = false;
+
         /* Not reading from the client anymore. */
         DoingCommandRead = false;
 
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 78dd5714fa..30ab710aaa 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -39,6 +39,7 @@
 #include "utils/rel.h"
 #include "utils/resowner_private.h"
 #include "utils/syscache.h"
+#include "utils/timeout.h"
 
 
  /* #define CACHEDEBUG */    /* turns DEBUG elogs on */
@@ -61,9 +62,35 @@
 #define CACHE_elog(...)
 #endif
 
+/* GUC variable to define the minimum age of entries that will be considered to
+ * be evicted in seconds. This variable is shared among various cache
+ * mechanisms.
+ */
+int catalog_cache_prune_min_age = 300;
+
+/*
+ * GUC variable to define the minimum size of hash to consider entry eviction.
+ * This variable is shared among various cache mechanisms.
+ */
+int catalog_cache_memory_target = 0;
+
+/*
+ * Flag to keep track of whether catcache clock timer is active.
+ */
+bool catcache_clock_timeout_active = false;
+
+/*
+ * Minimum interval between two successive moves of a cache entry in the LRU list,
+ * in microseconds.
+ */
+#define MIN_LRU_UPDATE_INTERVAL 100000    /* 100ms */
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Clock used to record the last accessed time of a catcache record. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -97,7 +124,7 @@ static CatCTup *CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp,
 
 static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos,
                  Datum *keys);
-static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
+static size_t CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
                  Datum *srckeys, Datum *dstkeys);
 
 
@@ -469,6 +496,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
 
     /* delink from linked list */
     dlist_delete(&ct->cache_elem);
+    dlist_delete(&ct->lru_node);
 
     /*
      * Free keys when we're dealing with a negative entry, normal entries just
@@ -478,6 +506,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
         CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys,
                          cache->cc_keyno, ct->keys);
 
+    cache->cc_memusage -= ct->size;
     pfree(ct);
 
     --cache->cc_ntup;
@@ -811,7 +840,9 @@ InitCatCache(int id,
      */
     sz = sizeof(CatCache) + PG_CACHE_LINE_SIZE;
     cp = (CatCache *) CACHELINEALIGN(palloc0(sz));
-    cp->cc_bucket = palloc0(nbuckets * sizeof(dlist_head));
+    cp->cc_head_size = sz;
+    sz = nbuckets * sizeof(dlist_head);
+    cp->cc_bucket = palloc0(sz);
 
     /*
      * initialize the cache's relation information for the relation
@@ -830,6 +861,9 @@ InitCatCache(int id,
     for (i = 0; i < nkeys; ++i)
         cp->cc_keyno[i] = key[i];
 
+    cp->cc_memusage = cp->cc_head_size + sz;
+
+    dlist_init(&cp->cc_lru_list);
     /*
      * new cache is initialized as far as we can go for now. print some
      * debugging information, if appropriate.
@@ -846,9 +880,143 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize catcache reference clock if haven't done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * helper routine for SetCatCacheClock and UpdateCatCacheClockTimer.
+ *
+ * We need to maintain the catcache clock during a long query.
+ */
+void
+SetupCatCacheClockTimer(void)
+{
+    long delay;
+
+    /* stop timer if not needed */
+    if (catalog_cache_prune_min_age == 0)
+    {
+        catcache_clock_timeout_active = false;
+        return;
+    }
+
+    /* One 10th of the variable, in milliseconds */
+    delay  = catalog_cache_prune_min_age * 1000/10;
+
+    /* Lower limit is 1 second */
+    if (delay < 1000)
+        delay = 1000;
+
+    enable_timeout_after(CATCACHE_CLOCK_TIMEOUT, delay);
+
+    catcache_clock_timeout_active = true;
+}
+
+/*
+ * Update catcacheclock: this is intended to be called from
+ * CATCACHE_CLOCK_TIMEOUT. The interval is expected more than 1 second (see
+ * above), so GetCurrentTime() doesn't harm.
+ */
+void
+UpdateCatCacheClock(void)
+{
+    catcacheclock = GetCurrentTimestamp();
+    SetupCatCacheClockTimer();
+}
+
+/*
+ * It may take an unexpectedly long time before the next clock update when
+ * catalog_cache_prune_min_age gets shorter. Disabling the current timer lets
+ * the next update happen at the expected interval. We don't necessarily
+ * require this when increasing the age, but there is no need to avoid
+ * disabling it either.
+ */
+void
+assign_catalog_cache_prune_min_age(int newval, void *extra)
+{
+    if (catcache_clock_timeout_active)
+        disable_timeout(CATCACHE_CLOCK_TIMEOUT, false);
+
+    catcache_clock_timeout_active = false;
+}
+
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left alone for several reasons. We remove them if
+ * they are not accessed for a certain time to prevent catcache from
+ * bloating. The eviction is performed with an algorithm similar to buffer
+ * eviction, using an access counter. Entries that are accessed several times
+ * can live longer than those that have had fewer accesses in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int            nremoved = 0;
+    dlist_mutable_iter    iter;
+
+    /* Return immediately if no pruning is wanted */
+    if (catalog_cache_prune_min_age == 0 ||
+        cp->cc_memusage <= (Size) catalog_cache_memory_target * 1024L)
+        return false;
+
+    /* Scan over LRU to find entries to remove */
+    dlist_foreach_modify(iter, &cp->cc_lru_list)
+    {
+        CatCTup    *ct = dlist_container(CatCTup, lru_node, iter.cur);
+        long        entry_age;
+        int            us;
+
+        /* We don't remove referenced entry */
+        if (ct->refcount != 0 ||
+            (ct->c_list && ct->c_list->refcount != 0))
+            continue;
+
+        /*
+         * Calculate the duration from the last access to
+         * the "current" time. catcacheclock is updated on a per-statement
+         * basis and additionally updated periodically during a long
+         * running query.
+         */
+        TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+        if (entry_age < catalog_cache_prune_min_age)
+        {
+            /*
+             * no longer have a business with further entries, exit.  At least
+             * one removal is enough to prevent rehashing this time.
+             */
+            return nremoved > 0;
+        }
+
+        /*
+         * Entries that are not accessed after the last pruning are removed
+         * in that duration, and entries that have been accessed several
+         * times are removed after being left alone for up to three times
+         * that duration. We don't try to shrink buckets since pruning
+         * effectively caps catcache expansion in the long term.
+         */
+        if (ct->naccess > 0)
+            ct->naccess--;
+        else
+        {
+            /* remove this entry */
+            CatCacheRemoveCTup(cp, ct);
+            nremoved++;
+        }
+    }
+
+    if (nremoved > 0)
+        elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d",
+             cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved);
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -858,13 +1026,18 @@ RehashCatCache(CatCache *cp)
     dlist_head *newbucket;
     int            newnbuckets;
     int            i;
+    size_t        sz;
 
     elog(DEBUG1, "rehashing catalog cache id %d for %s; %d tups, %d buckets",
          cp->id, cp->cc_relname, cp->cc_ntup, cp->cc_nbuckets);
 
     /* Allocate a new, larger, hash table. */
     newnbuckets = cp->cc_nbuckets * 2;
-    newbucket = (dlist_head *) MemoryContextAllocZero(CacheMemoryContext, newnbuckets * sizeof(dlist_head));
+    sz = newnbuckets * sizeof(dlist_head);
+    newbucket = (dlist_head *) MemoryContextAllocZero(CacheMemoryContext, sz);
+
+    /* reset memory usage */
+    cp->cc_memusage = cp->cc_head_size + sz;
 
     /* Move all entries from old hash table to new. */
     for (i = 0; i < cp->cc_nbuckets; i++)
@@ -878,6 +1051,7 @@ RehashCatCache(CatCache *cp)
 
             dlist_delete(iter.cur);
             dlist_push_head(&newbucket[hashIndex], &ct->cache_elem);
+            cp->cc_memusage += ct->size;
         }
     }
 
@@ -1260,6 +1434,21 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /* Update access information for pruning */
+        if (ct->naccess < 2)
+            ct->naccess++;
+
+        /*
+         * We don't want too frequent update of
+         * LRU. catalog_cache_prune_min_age can be changed on-session so we
+         * need to maintain the LRU regardless of catalog_cache_prune_min_age.
+         */
+        if (catcacheclock - ct->lastaccess > MIN_LRU_UPDATE_INTERVAL)
+        {
+            ct->lastaccess = catcacheclock;
+            dlist_move_tail(&cache->cc_lru_list, &ct->lru_node);
+        }
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1695,6 +1884,11 @@ SearchCatCacheList(CatCache *cache,
         /* Now we can build the CatCList entry. */
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
         nmembers = list_length(ctlist);
+
+        /*
+         * Don't waste a time by counting the list in catcache memory usage,
+         * since it doesn't live a long life.
+         */
         cl = (CatCList *)
             palloc(offsetof(CatCList, members) + nmembers * sizeof(CatCTup *));
 
@@ -1805,6 +1999,7 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     CatCTup    *ct;
     HeapTuple    dtp;
     MemoryContext oldcxt;
+    int            tupsize;
 
     /* negative entries have no tuple associated */
     if (ntp)
@@ -1828,8 +2023,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
         /* Allocate memory for CatCTup and the cached tuple in one go */
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
 
-        ct = (CatCTup *) palloc(sizeof(CatCTup) +
-                                MAXIMUM_ALIGNOF + dtp->t_len);
+        tupsize = sizeof(CatCTup) +    MAXIMUM_ALIGNOF + dtp->t_len;
+        ct = (CatCTup *) palloc(tupsize);
         ct->tuple.t_len = dtp->t_len;
         ct->tuple.t_self = dtp->t_self;
         ct->tuple.t_tableOid = dtp->t_tableOid;
@@ -1862,14 +2057,16 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     {
         Assert(negative);
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
-        ct = (CatCTup *) palloc(sizeof(CatCTup));
+        tupsize = sizeof(CatCTup);
+        ct = (CatCTup *) palloc(tupsize);
 
         /*
          * Store keys - they'll point into separately allocated memory if not
          * by-value.
          */
-        CatCacheCopyKeys(cache->cc_tupdesc, cache->cc_nkeys, cache->cc_keyno,
-                         arguments, ct->keys);
+        tupsize +=
+            CatCacheCopyKeys(cache->cc_tupdesc, cache->cc_nkeys,
+                             cache->cc_keyno, arguments, ct->keys);
         MemoryContextSwitchTo(oldcxt);
     }
 
@@ -1884,19 +2081,33 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
+    dlist_push_tail(&cache->cc_lru_list, &ct->lru_node);
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
     cache->cc_ntup++;
     CacheHdr->ch_ntup++;
 
+    ct->size = tupsize;
+    cache->cc_memusage += ct->size;
+
+    /* increase refcount so that this survives pruning */
+    ct->refcount++;
+
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, try cleanup by removing
+     * infrequently used entries to make a room for the new entry. If it
+     * failed, enlarge the bucket array instead.  Quite arbitrarily, we try
+     * this when fill factor > 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
 
+    ct->refcount--;
+
     return ct;
 }
 
@@ -1926,13 +2137,14 @@ CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *keys)
 /*
  * Helper routine that copies the keys in the srckeys array into the dstkeys
  * one, guaranteeing that the datums are fully allocated in the current memory
- * context.
+ * context. Returns allocated memory size.
  */
-static void
+static size_t
 CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
                  Datum *srckeys, Datum *dstkeys)
 {
     int            i;
+    size_t        sz = 0;
 
     /*
      * XXX: memory and lookup performance could possibly be improved by
@@ -1961,8 +2173,13 @@ CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
         dstkeys[i] = datumCopy(src,
                                att->attbyval,
                                att->attlen);
+
+        /* approximate size */
+        if (!att->attbyval)
+            sz += VARHDRSZ + att->attlen;
     }
 
+    return sz;
 }
 
 /*
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index fd51934aaf..0e8b972a29 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t CatcacheClockTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index a5ee209f91..9eb50e9676 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void CatcacheClockTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(CATCACHE_CLOCK_TIMEOUT,
+                        CatcacheClockTimeoutHandler);
     }
 
     /*
@@ -1238,6 +1241,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+CatcacheClockTimeoutHandler(void)
+{
+    CatcacheClockTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 156d147c85..d863c8dec8 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -81,6 +81,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/memutils.h"
@@ -2205,6 +2206,28 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum unused duration of cache entries before removal."),
+            gettext_noop("Catalog cache entries that live unused for longer than this many seconds are considered for removal."),
+            GUC_UNIT_S
+        },
+        &catalog_cache_prune_min_age,
+        300, -1, INT_MAX,
+        NULL, assign_catalog_cache_prune_min_age, NULL
+    },
+
+    {
+        {"catalog_cache_memory_target", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("Sets the minimum syscache size to keep."),
+            gettext_noop("Time-based cache pruning starts working after exceeding this size."),
+            GUC_UNIT_KB
+        },
+        &catalog_cache_memory_target,
+        0, 0, MAX_KILOBYTES,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 194f312096..7c82b0eca7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -128,6 +128,8 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#catalog_cache_memory_target = 0kB    # in kB
+#catalog_cache_prune_min_age = 300s    # -1 disables pruning
 #max_stack_depth = 2MB            # min 100kB
 #shared_memory_type = mmap        # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c9e35003a5..33b800e80f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t CatcacheClockTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 65d816a583..1ae49b4819 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,10 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
+    dlist_head    cc_lru_list;
+    int            cc_head_size;    /* memory usage of catcache header */
+    int            cc_memusage;    /* total memory usage of this catcache  */
+    int            cc_nfreeent;    /* # of entries currently not referenced */
 
     /*
      * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,7 +124,10 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
-
+    int            naccess;        /* # of access to this entry, up to 2  */
+    TimestampTz    lastaccess;        /* approx. timestamp of the last usage */
+    dlist_node    lru_node;        /* LRU node */
+    int            size;            /* palloc'ed size of this tuple */
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
      * catcache is list-searched with varying numbers of keys, we may have to
@@ -189,6 +197,39 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int catalog_cache_prune_min_age;
+extern int catalog_cache_memory_target;
+extern int catalog_cache_entry_limit;
+extern double catalog_cache_prune_ratio;
+
+/* to use as access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/*
+ * Flag to keep track of whether catcache timestamp timer is active.
+ */
+extern bool catcache_clock_timeout_active;
+
+/* catcache prune time helper functions  */
+extern void SetupCatCacheClockTimer(void);
+extern void UpdateCatCacheClock(void);
+
+/*
+ * SetCatCacheClock - set timestamp for catcache access record and start
+ * maintenance timer if needed. We keep updating the clock even while pruning
+ * is disabled so that we are not confused by a bogus clock value.
+ */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+
+    if (!catcache_clock_timeout_active && catalog_cache_prune_min_age > 0)
+        SetupCatCacheClockTimer();
+}
+
+extern void assign_catalog_cache_prune_min_age(int newval, void *extra);
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 9244a2a7b7..b2d97b4f7b 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    CATCACHE_CLOCK_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.16.3


Re: Protect syscache from bloating with negative cache entries

From: Robert Haas
On Tue, Feb 19, 2019 at 11:15 PM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Difference from v15:
>
>   Removed AllocSet accounting stuff. We use approximate memory
>   size for catcache.
>
>   Removed prune-by-number(or size) stuff.
>
>   Adressing comments from Tsunakawa-san and Ideriha-san .
>
>   Separated catcache monitoring feature. (Removed from this set)
>     (But it is crucial to check this feature...)
>
> Is this small enough ?

The commit message in 0002 says 'This also can put a hard limit on the
number of catcache entries.' but neither of the GUCs that you've
documented have that effect.  Is that a leftover from a previous
version?

I'd like to see some evidence that catalog_cache_memory_target has any
value, vs. just always setting it to zero.  I came up with the
following somewhat artificial example that shows that it might have
value.

rhaas=# create table foo (a int primary key, b text) partition by hash (a);
[rhaas pgsql]$ perl -e 'for (0..9999) { print "CREATE TABLE foo$_
PARTITION OF foo FOR VALUES WITH (MODULUS 10000, REMAINDER $_);\n"; }'
| psql

First execution of 'select * from foo' in a brand new session takes
about 1.9 seconds; subsequent executions take about 0.7 seconds.  So,
if catalog_cache_memory_target were set to a high enough value to
allow all of that stuff to remain in cache, we could possibly save
about 1.2 seconds coming off the blocks after a long idle period.
That might be enough to justify having the parameter.  But I'm not
quite sure how high the value would need to be set to actually get the
benefit in a case like that, or what happens if you set it to a value
that's not quite high enough.  I think it might be good to play around
some more with cases like this, just to get a feeling for how much
time you can save in exchange for how much memory.
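For anyone who wants to repeat that experiment, a minimal psql sketch (the timings above are Robert's; this session is only illustrative and assumes the 10000-partition table foo created above):

\timing on
SELECT * FROM foo;   -- first run in a fresh session: pays the catalog-cache misses for every partition
SELECT * FROM foo;   -- subsequent runs: served from the already-populated caches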

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


RE: Protect syscache from bloating with negative cache entries

From: "Ideriha, Takeshi"
>From: Tsunakawa, Takayuki
>>From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
>> number of tables   | 100  |1000    |10000
>> -----------------------------------------------------------
>> TPS (master)       |10966  |10654 |9099
>> TPS (patch)        | 11137 (+1%) |10710 (+0%) |772 (-91%)
>>
>> It seems that before cache exceeding the limit (no pruning at 100 and
>> 1000), the results are almost same with master but after exceeding the
>> limit (at
>> 10000)
>> the decline happens.
>
>How many concurrent clients?
One client (default setting). 

>Can you show the perf's call graph sampling profiles of both the unpatched and
>patched version, to confirm that the bottleneck is around catcache eviction and refill?

I checked it with perf record -avg and perf report. 
The following shows top 20 symbols during benchmark including kernel space.
The main difference between master (unpatched) and patched one seems that
patched one consumes cpu catcache-evict-and-refill functions including 
SearchCatCacheMiss(),  CatalogCacheCreateEntry(), CatCacheCleanupOldEntries().
So it seems to me that these functions needs further inspection 
to suppress the performace decline as much as possible 

master (%)    symbol    |    patched (%)    symbol
51.25%    cpu_startup_entry    |    51.45%    cpu_startup_entry
51.13%    arch_cpu_idle    |    51.19%    arch_cpu_idle
51.13%    default_idle    |    51.19%    default_idle
51.13%    native_safe_halt    |    50.95%    native_safe_halt
36.27%    PostmasterMain    |    46.98%    PostmasterMain
36.27%    main    |    46.98%    main
36.27%    __libc_start_main    |    46.98%    __libc_start_main
36.07%    ServerLoop    |    46.93%    ServerLoop
35.75%    PostgresMain    |    46.89%    PostgresMain
26.03%    exec_simple_query    |    45.99%    exec_simple_query
26.00%    rest_init    |    43.40%    SearchCatCacheMiss
26.00%    start_kernel    |    42.80%    CatalogCacheCreateEntry
26.00%    x86_64_start_reservations    |    42.75%    CatCacheCleanupOldEntries
26.00%    x86_64_start_kernel    |    27.04%    rest_init
25.26%    start_secondary    |    27.04%    start_kernel
10.25%    pg_plan_queries    |    27.04%    x86_64_start_reservations
10.17%    pg_plan_query    |    27.04%    x86_64_start_kernel
10.16%    main    |    24.42%    start_secondary
10.16%    __libc_start_main    |    22.35%    pg_analyze_and_rewrite
10.03%    standard_planner    |    22.35%    parse_analyze

Regards,
Takeshi Ideriha


RE: Protect syscache from bloating with negative cache entries

From: "Tsunakawa, Takayuki"
From: Ideriha, Takeshi/出利葉 健
> I checked it with perf record -avg and perf report.
> The following shows top 20 symbols during benchmark including kernel space.
> The main difference between master (unpatched) and patched one seems that
> patched one consumes cpu catcache-evict-and-refill functions including
> SearchCatCacheMiss(),  CatalogCacheCreateEntry(),
> CatCacheCleanupOldEntries().
> So it seems to me that these functions needs further inspection
> to suppress the performace decline as much as possible

Thank you.  It's good to see the expected functions, rather than strange behavior.  The performance drop is natural,
just as when the database cache's hit ratio is low.  The remedy for performance by the user is also the same as for the
database cache -- increase the catalog cache.
 


Regards
Takayuki Tsunakawa





RE: Protect syscache from bloating with negative cache entries

From: "Tsunakawa, Takayuki"
From: Robert Haas [mailto:robertmhaas@gmail.com]
> That might be enough to justify having the parameter.  But I'm not 
> quite sure how high the value would need to be set to actually get the 
> benefit in a case like that, or what happens if you set it to a value 
> that's not quite high enough.  I think it might be good to play around 
> some more with cases like this, just to get a feeling for how much 
> time you can save in exchange for how much memory.

Why don't we consider this just like the database cache and other DBMS's dictionary caches?  That is,

* If you want to avoid infinite memory bloat, set the upper limit on size.

* To find a better limit, check the hit ratio with the statistics view (based on Horiguchi-san's original 0004 patch,
although that seems to need modification anyway; a rough sketch follows)
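As a rough sketch of that workflow with the separately posted monitoring patch (only sum(size) and the track_catalog_cache_usage_interval GUC appear later in this thread; any other columns of the view would be assumptions, so they are not shown here):

SET track_catalog_cache_usage_interval = 1000;
-- per-backend catalog cache memory, per the monitoring patch
SELECT sum(size) FROM pg_stat_syscache;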
 


Why do people try to get away from a familiar idea...  Am I missing something?

Ideriha-san,
Could you try simplifying the v15 patch set to see how simple the code would look or not?  That is:

* 0001: add dlist_push_tail() ... as is
* 0002: memory accounting, with correction based on feedback
* 0003: merge the original 0003 and 0005, with correction based on feedback


Regards
Takayuki Tsunakawa


Re: Protect syscache from bloating with negative cache entries

From: 'Bruce Momjian'
On Tue, Feb 19, 2019 at 07:08:14AM +0000, Tsunakawa, Takayuki wrote:
> We all have to manage things within resource constraints.  The DBA
> wants to make sure the server doesn't overuse memory to avoid crash
> or slowdown due to swapping.  Oracle does it, and another open source
> database, MySQL, does it too.  PostgreSQL does it with shared_buffers,
> wal_buffers, and work_mem (within a single session).  Then, I thought
> it's natural to do it with catcache/relcache/plancache.

I already addressed these questions in an email from Feb 14:

    https://www.postgresql.org/message-id/20190214154955.GB19578@momjian.us

I understand the operational needs of limiting resources in some cases,
but there is also the history of OS's using working set to allocate
things, which didn't work too well:

    https://en.wikipedia.org/wiki/Working_set

I think we need to address the most pressing problem of unlimited cache size
bloat and then take a holistic look at all memory allocation.  If we
are going to address that in a global way, I don't see the relation
cache as the place to start.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


RE: Protect syscache from bloating with negative cache entries

From: "Ideriha, Takeshi"
>From: Tsunakawa, Takayuki
>Ideriha-san,
>Could you try simplifying the v15 patch set to see how simple the code would look or
>not?  That is:
>
>* 0001: add dlist_push_tail() ... as is
>* 0002: memory accounting, with correction based on feedback
>* 0003: merge the original 0003 and 0005, with correction based on feedback

Attached is a simpler version based on Horiguchi-san's v15 patch,
which means the cache is pruned by both time and size.
(The cleanup function is still complex, but it gets much simpler.)

Regards,
Takeshi Ideriha

Attachments

Re: Protect syscache from bloating with negative cache entries

From: Robert Haas
On Thu, Feb 21, 2019 at 1:38 AM Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> Why don't we consider this just like the database cache and other DBMS's dictionary caches?  That is,
>
> * If you want to avoid infinite memory bloat, set the upper limit on size.
>
> * To find a better limit, check the hit ratio with the statistics view (based on Horiguchi-san's original 0004 patch,
> although that seems to need modification anyway)
 
>
> Why do people try to get away from a familiar idea...  Am I missing something?

I don't understand the idea that we would add something to PostgreSQL
without proving that it has value.  Sure, other systems have somewhat
similar systems, and they have knobs to tune them.  But, first, we
don't know that those other systems made all the right decisions, and
second, even if they are, that doesn't mean that we'll derive similar
benefits in a system with a completely different code base and many
other internal differences.

You need to demonstrate that each and every GUC you propose to add has
a real, measurable benefit in some plausible scenario.  You can't just
argue that other people have something kinda like this so we should
have it too.  Or, well, you can argue that, but if you do, then -1
from me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Protect syscache from bloating with negative cache entries

From: Kyotaro HORIGUCHI
At Wed, 20 Feb 2019 13:09:08 -0500, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoZXw+SwK_9Tp=wLqZDstW_X+Ant=rd7K+q4zmYONPuL=w@mail.gmail.com>
> On Tue, Feb 19, 2019 at 11:15 PM Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Difference from v15:
> >
> >   Removed AllocSet accounting stuff. We use approximate memory
> >   size for catcache.
> >
> >   Removed prune-by-number(or size) stuff.
> >
> >   Adressing comments from Tsunakawa-san and Ideriha-san .
> >
> >   Separated catcache monitoring feature. (Removed from this set)
> >     (But it is crucial to check this feature...)
> >
> > Is this small enough ?
> 
> The commit message in 0002 says 'This also can put a hard limit on the
> number of catcache entries.' but neither of the GUCs that you've
> documented have that effect.  Is that a leftover from a previous
> version?

Mmm. Right. Thank you for pointing that out, and sorry about that. I fixed
it, along with another mistake in the commit message, in my repo. It
will appear in the next version.

| Remove entries that haven't been used for a certain time
| 
| Catcache entries can be left alone for several reasons. It is not
| desirable that they eat up memory. With this patch, entries that
| haven't been used for a certain time are considered to be removed
| before enlarging hash array.

> I'd like to see some evidence that catalog_cache_memory_target has any
> value, vs. just always setting it to zero.  I came up with the
> following somewhat artificial example that shows that it might have
> value.
> 
> rhaas=# create table foo (a int primary key, b text) partition by hash (a);
> [rhaas pgsql]$ perl -e 'for (0..9999) { print "CREATE TABLE foo$_
> PARTITION OF foo FOR VALUES WITH (MODULUS 10000, REMAINDER $_);\n"; }'
> | psql
> 
> First execution of 'select * from foo' in a brand new session takes
> about 1.9 seconds; subsequent executions take about 0.7 seconds.  So,
> if catalog_cache_memory_target were set to a high enough value to
> allow all of that stuff to remain in cache, we could possibly save
> about 1.2 seconds coming off the blocks after a long idle period.
> That might be enough to justify having the parameter.  But I'm not
> quite sure how high the value would need to be set to actually get the
> benefit in a case like that, or what happens if you set it to a value
> that's not quite high enough.

It is artificial (or actually won't be repeatedly executed in a
session), but anyway, what can benefit from
catalog_cache_memory_target would be a kind of extreme case.

I think the two parameters are to be tuned in the following
steps.

- If the default setting satisfies you, leave it alone. (as a
  general suggestion)

- If you find your (syscache-sensitive) queries are executed
  with rather long intervals, say 10-30 minutes, and they get
  slower than at shorter intervals, consider increasing
  catalog_cache_prune_min_age to about the query interval. If you
  don't suffer process bloat, that's fine.

- If you find the process bloats too much and you (intuitively)
  suspect the cause is the system cache, set it to a certain shorter
  value, say 1 minute, and set catalog_cache_memory_target
  to the allowable amount of memory for each process. The memory
  usage will be stable at some (un)certain amount above the target.


Or, if you want to determine the settings beforehand with a rather
strict limit, and if the monitoring feature were a part of this
patchset, a user could check how much memory is used for the query.

$ perl -e 'print "set track_catalog_cache_usage_interval = 1000;\n"; for (0..9999) { print "CREATE TABLE foo$_
PARTITIONOF foo FOR VALUES WITH (MODULUS 10000, REMAINDER $_);\n"; } print "select sum(size) from pg_stat_syscache";' |
psql

   sum   
---------
 7088523

In this case, set catalog_cache_memory_target to 7MB and
catalog_cache_prune_min_age to '1min'. Since the target doesn't
work strictly (checked only at every resizing time), possibly
you need further tuning.
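Concretely, applying those suggested values would look something like the following (shown per session here; the GUCs are PGC_USERSET in this patch set, so postgresql.conf would work as well -- the numbers themselves are only the rough guesses from the measurement above):

SET catalog_cache_memory_target = '7MB';
SET catalog_cache_prune_min_age = '1min';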

> that's not quite high enough.  I think it might be good to play around
> some more with cases like this, just to get a feeling for how much
> time you can save in exchange for how much memory.

All kinds of tuning are something of that kind, I think.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

From: Kyotaro HORIGUCHI
At Mon, 25 Feb 2019 15:23:22 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20190225.152322.104148315.horiguchi.kyotaro@lab.ntt.co.jp>
> I think the two parameters are to be tuned in the following
> steps.
> 
> - If the default setting sutisfies you, leave it alone. (as a
>   general suggestion)
> 
> - If you find your (syscache-sensitive) query are to be executed
>   with rather longer intervals, say 10-30 minutes, and it gets
>   slower than shorter intervals, consider increase
>   catalog_cache_prune_min_age to about the query interval. If you
>   don't suffer process-bloat, that's fine.
> 
> - If you find the process too much "bloat"s and you (intuirively)
>   suspect the cause is system cache, set it to certain shorter
>   value, say 1 minutes, and set the catalog_cache_memory_target
>   to allowable amount of memory for each process. The memory
>   usage will be stable at (un)certain amount above the target.
> 
> 
> Or, if you want determine the setting previously with rather
> strict limit, and if the monitoring feature were a part of this
> patchset, a user can check how much memory is used for the query.
> 
> $ perl -e 'print "set track_catalog_cache_usage_interval = 1000;\n"; for (0..9999) { print "CREATE TABLE foo$_
PARTITIONOF foo FOR VALUES WITH (MODULUS 10000, REMAINDER $_);\n"; } print "select sum(size) from pg_stat_syscache";' |
psql
> 
>    sum   
> ---------
>  7088523


It's not substantial, but that number is for
catalog_cache_prune_min_age = 300s; I had 12MB when it is
disabled.

perl -e 'print "set catalog_cache_prune_min_age to 0; set track_catalog_cache_usage_interval = 1000;\n"; for (0..9999)
{print "CREATE TABLE foo$_ PARTITION OF foo FOR VALUES WITH (MODULUS 10000, REMAINDER $_);\n"; } print "select
sum(size)from pg_stat_syscache";' | psql
 

   sum    
----------
 12642321

> In this case, set catalog_cache_memory_target to 7MB and
> catalog_cache_prune_min_age to '1min'. Since the target doesn't
> work strictly (checked only at every resizing time), possibly
> you need further tuning.

regards.

- 
Kyotaro Horiguchi
NTT Open Source Software Center



RE: Protect syscache from bloating with negative cache entries

From: "Tsunakawa, Takayuki"
From: Robert Haas [mailto:robertmhaas@gmail.com]
> I don't understand the idea that we would add something to PostgreSQL
> without proving that it has value.  Sure, other systems have somewhat
> similar systems, and they have knobs to tune them.  But, first, we
> don't know that those other systems made all the right decisions, and
> second, even they are, that doesn't mean that we'll derive similar
> benefits in a system with a completely different code base and many
> other internal differences.

I understand that general idea.  So, I don't see why the proposed approach, eviction based only on elapsed
time and only at hash table expansion, is better for PostgreSQL's code base and other internal differences...
 


> You need to demonstrate that each and every GUC you propose to add has
> a real, measurable benefit in some plausible scenario.  You can't just
> argue that other people have something kinda like this so we should
> have it too.  Or, well, you can argue that, but if you do, then -1
> from me.

The benefits of the size limit are:
* Controllable and predictable memory usage.  The DBA can be sure that OOM won't happen.
* Smoothed (non-abnormal) transaction response time.  This is due to the elimination of bulk eviction of cache entries.


I'm not sure how to tune catalog_cache_prune_min_age and catalog_cache_memory_target.  Let me pick up a test scenario
in a later mail in response to Horiguchi-san.
 


Regards
Takayuki Tsunakawa



RE: Protect syscache from bloating with negative cache entries

From: "Tsunakawa, Takayuki"
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> - If you find the process too much "bloat"s and you (intuirively)
>   suspect the cause is system cache, set it to certain shorter
>   value, say 1 minutes, and set the catalog_cache_memory_target
>   to allowable amount of memory for each process. The memory
>   usage will be stable at (un)certain amount above the target.

Could you guide me on how to tune these parameters in an example scenario?  Let me take the original problematic case
referenced at the beginning of this thread.  That is:
 

* A PL/pgSQL function that creates a temp table, accesses it, (accesses other non-temp tables), and drops the temp table (a minimal sketch follows below).
* An application repeatedly begins a transaction, calls the stored function, and commits the transaction.
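A rough sketch of such a function (the body and the names are illustrative guesses; the customer's actual code is not shown in this thread):

CREATE OR REPLACE FUNCTION use_temp_table() RETURNS void AS $$
BEGIN
    CREATE TEMP TABLE tmp_work (id int, note text);
    INSERT INTO tmp_work VALUES (1, 'x');
    PERFORM count(*) FROM tmp_work;   -- the access populates catcache entries for the temp table
    DROP TABLE tmp_work;              -- ... which become useless once the table is gone
END;
$$ LANGUAGE plpgsql;

Feeding "SELECT use_temp_table();" to psql over and over (autocommit, so each call is its own transaction) reproduces the kind of growth shown below.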

With the v16 patch applied, and leaving the catalog_cache_xxx parameters set to their defaults, CacheMemoryContext
continued to increase as follows:
 

CacheMemoryContext: 1065016 total in 9 blocks; 104168 free (17 chunks); 960848 used
CacheMemoryContext: 8519736 total in 12 blocks; 3765504 free (19 chunks); 4754232 used
CacheMemoryContext: 25690168 total in 14 blocks; 8372096 free (21 chunks); 17318072 used
CacheMemoryContext: 42991672 total in 16 blocks; 11741024 free (21761 chunks); 31250648 used

How can I make sure that this context won't exceed, say, 10 MB to avoid OOM?

I'm afraid that once the catcache hash table becomes large in a short period, the eviction would happen less
frequently, leading to memory bloat.
 


Regards
Takayuki Tsunakawa




RE: Protect syscache from bloating with negative cache entries

From: "Ideriha, Takeshi"
>>From: Tsunakawa, Takayuki
>>Ideriha-san,
>>Could you try simplifying the v15 patch set to see how simple the code
>>would look or not?  That is:
>>
>>* 0001: add dlist_push_tail() ... as is
>>* 0002: memory accounting, with correction based on feedback
>>* 0003: merge the original 0003 and 0005, with correction based on
>>feedback
>
>Attached are simpler version based on Horiguchi san's ver15 patch, which means
>cache is pruned by both time and size.
>(Still cleanup function is complex but it gets much simpler.)

I don't mean to disregard what Horiguchi-san and others have developed and discussed.
But I refactored the v15 patch again to reduce its complexity,
because it seems to me that one of the reasons for dropping the prune-by-size feature
stems from code complexity.

Another thing is that the memory accounting overhead has been discussed, but
its effect hasn't been measured in this thread. So I'd like to measure it.

Regards,
Takeshi Ideriha

Attachments

Re: Protect syscache from bloating with negative cache entries

From: Robert Haas
On Mon, Feb 25, 2019 at 3:50 AM Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> How can I make sure that this context won't exceed, say, 10 MB to avoid OOM?

As Tom has said before and will probably say again, I don't think you
actually want that.  We know that PostgreSQL gets roughly 100x slower
with the system caches disabled - try running with
CLOBBER_CACHE_ALWAYS.  If you are accessing the same system cache
entries repeatedly in a loop - which is not at all an unlikely
scenario, just run the same query or sequence of queries in a loop -
and if the number of entries exceeds 10MB even, perhaps especially, by
just a tiny bit, you are going to see a massive performance hit.
Maybe it won't be 100x because some more-commonly-used entries will
always stay cached, but it's going to be really big, I think.

Now you could say - well it's still better than running out of memory.
However, memory usage is quite unpredictable.  It depends on how many
backends are active and how many copies of work_mem and/or
maintenance_work_mem are in use, among other things.  I don't think we
can say that just imposing a limit on the size of the system caches is
going to be enough to reliably prevent an out of memory condition
unless the other use of memory on the machine happens to be extremely
stable.

So I think what's going to happen if you try to impose a hard-limit on
the size of the system cache is that you will cause some workloads to
slow down by 3x or more without actually preventing out of memory
conditions.  What you need to do is accept that system caches need to
grow as big as they need to grow, and if that causes you to run out of
memory, either buy more memory or reduce the number of concurrent
sessions you allow.  It would be fine to instead limit the cache
memory if those cache entries only had a mild effect on performance,
but I don't think that's the case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Protect syscache from bloating with negative cache entries

From: Robert Haas
On Mon, Feb 25, 2019 at 1:27 AM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > I'd like to see some evidence that catalog_cache_memory_target has any
> > value, vs. just always setting it to zero.
>
> It is artificial (or acutually wont't be repeatedly executed in a
> session) but anyway what can get benefit from
> catalog_cache_memory_target would be a kind of extreme.

I agree.  So then let's not have it.

We shouldn't add more mechanism here than actually has value.  It
seems pretty clear that keeping cache entries that go unused for long
periods can't be that important; even if we need them again
eventually, reloading them every 5 or 10 minutes can't hurt that much.
On the other hand, I think it's also pretty clear that evicting cache
entries that are being used frequently will have disastrous effects on
performance; as I noted in the other email I just sent, consider the
effects of CLOBBER_CACHE_ALWAYS.  No reasonable user is going to want
to incur a massive slowdown to save a little bit of memory.

I see that *in theory* there is a value to
catalog_cache_memory_target, because *maybe* there is a workload where
tuning that GUC will lead to better performance at lower memory usage
than any competing proposal.  But unless we can actually see an
example of such a workload, which so far I don't, we're adding a knob
that everybody has to think about how to tune when in fact we have no
idea how to tune it or whether it even needs to be tuned.  That
doesn't make sense.  We have to be able to document the parameters we
have and explain to users how they should be used.  And as far as this
parameter is concerned I think we are not at that point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


RE: Protect syscache from bloating with negative cache entries

From: "Ideriha, Takeshi"
>From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
>>>* 0001: add dlist_push_tail() ... as is
>>>* 0002: memory accounting, with correction based on feedback
>>>* 0003: merge the original 0003 and 0005, with correction based on
>>>feedback
>>
>>Attached are simpler version based on Horiguchi san's ver15 patch,
>>which means cache is pruned by both time and size.
>>(Still cleanup function is complex but it gets much simpler.)
>
>I don't mean to disregard what Horiguchi san and others have developed and
>discussed.
>But I refactored again the v15 patch to reduce complexity of v15 patch because it
>seems to me one of the reason for dropping feature for pruning by size stems from
>code complexity.
>
>Another thing is there's been discussed about over memory accounting overhead but
>the overhead effect hasn't been measured in this thread. So I'd like to measure it.

I measured the memory context accounting overhead using Tomas's tool palloc_bench,
which he made a while ago in a similar discussion.
https://www.postgresql.org/message-id/53F7E83C.3020304@fuzzy.cz 

This tool is a little bit outdated, so I fixed it, but basically I followed his approach.
Things I did:
- make one MemoryContext
- run both palloc() and pfree() on a 32kB area 1,000,000 times
- measure the elapsed time

The result shows that master is about 30 times faster than the patched one.
So, as Andres mentioned upthread, it seems to have overhead.

[master (without v15 patch)]
61.52 ms
60.96 ms
61.40 ms
61.42 ms
61.14 ms

[with v15 patch]
1838.02 ms
1754.84 ms
1755.83 ms
1789.69 ms
1789.44 ms

Regards,
Takeshi Ideriha

RE: Protect syscache from bloating with negative cache entries

From: "Ideriha, Takeshi"
>From: Robert Haas [mailto:robertmhaas@gmail.com]
>
>On Mon, Feb 25, 2019 at 3:50 AM Tsunakawa, Takayuki
><tsunakawa.takay@jp.fujitsu.com> wrote:
>> How can I make sure that this context won't exceed, say, 10 MB to avoid OOM?
>
>As Tom has said before and will probably say again, I don't think you actually want that.
>We know that PostgreSQL gets roughly 100x slower with the system caches disabled
>- try running with CLOBBER_CACHE_ALWAYS.  If you are accessing the same system
>cache entries repeatedly in a loop - which is not at all an unlikely scenario, just run the
>same query or sequence of queries in a loop - and if the number of entries exceeds
>10MB even, perhaps especially, by just a tiny bit, you are going to see a massive
>performance hit.
>Maybe it won't be 100x because some more-commonly-used entries will always stay
>cached, but it's going to be really big, I think.
>
>Now you could say - well it's still better than running out of memory.
>However, memory usage is quite unpredictable.  It depends on how many backends
>are active and how many copies of work_mem and/or maintenance_work_mem are in
>use, among other things.  I don't think we can say that just imposing a limit on the
>size of the system caches is going to be enough to reliably prevent an out of memory
>condition unless the other use of memory on the machine happens to be extremely
>stable.

>So I think what's going to happen if you try to impose a hard-limit on the size of the
>system cache is that you will cause some workloads to slow down by 3x or more
>without actually preventing out of memory conditions.  What you need to do is accept
>that system caches need to grow as big as they need to grow, and if that causes you
>to run out of memory, either buy more memory or reduce the number of concurrent
>sessions you allow.  It would be fine to instead limit the cache memory if those cache
>entries only had a mild effect on performance, but I don't think that's the case.


I'm afraid I may be quibbling about it.
What about users who understand the performance drop but don't want to
add memory or decrease concurrency?
I think PostgreSQL has parameters that most users don't mind and
use at their defaults, but that a few users want to change.
In this case, as you said, introducing a hard-limit parameter can decrease
performance significantly, so how about adding a detailed caution
to the documentation, like the planner cost parameters?

Regards,
Takeshi Ideriha

RE: Protect syscache from bloating with negative cache entries

From: "Tsunakawa, Takayuki"
From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
> I measured the memory context accounting overhead using Tomas's tool
> palloc_bench,
> which he made it a while ago in the similar discussion.
> https://www.postgresql.org/message-id/53F7E83C.3020304@fuzzy.cz
> 
> This tool is a little bit outdated so I fixed it but basically I followed
> him.
> Things I did:
> - make one MemoryContext
> - run both palloc() and pfree() for 32kB area 1,000,000 times.
> - And measure this time
> 
> The result shows that master is 30 times faster than patched one.
> So as Andres mentioned in upper thread it seems it has overhead.
> 
> [master (without v15 patch)]
> 61.52 ms
> 60.96 ms
> 61.40 ms
> 61.42 ms
> 61.14 ms
> 
> [with v15 patch]
> 1838.02 ms
> 1754.84 ms
> 1755.83 ms
> 1789.69 ms
> 1789.44 ms
> 

I'm afraid the measurement is not correct.  First, the older discussion below shows that the accounting overhead is
much, much smaller, even with a more complex accounting.
 

9.5: Better memory accounting, towards memory-bounded HashAg
https://www.postgresql.org/message-id/flat/1407012053.15301.53.camel%40jeff-desktop

Second, allocation/free of memory > 8 KB calls malloc()/free().  I guess the accounting overhead will be more likely to
be hidden under the overhead of malloc() and free().  What we'd like to know is the overhead when malloc() and free()
are not called.
 

And are you sure you didn't enable assert checking?


Regards
Takayuki Tsunakawa



RE: Protect syscache from bloating with negative cache entries

From: "Ideriha, Takeshi"
>From: Tsunakawa, Takayuki [mailto:tsunakawa.takay@jp.fujitsu.com]
>From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
>> I measured the memory context accounting overhead using Tomas's tool
>> palloc_bench, which he made it a while ago in the similar discussion.
>> https://www.postgresql.org/message-id/53F7E83C.3020304@fuzzy.cz
>>
>> This tool is a little bit outdated so I fixed it but basically I
>> followed him.
>> Things I did:
>> - make one MemoryContext
>> - run both palloc() and pfree() for 32kB area 1,000,000 times.
>> - And measure this time

>And are you sure you didn't enable assert checking?
Ah, sorry.. I misconfigured it. 

>I'm afraid the measurement is not correct.  First, the older discussion below shows
>that the accounting overhead is much, much smaller, even with a more complex
>accounting.
>Second, allocation/free of memory > 8 KB calls malloc()/free().  I guess the
>accounting overhead will be more likely to be hidden under the overhead of malloc()
>and free().  What we'd like to know the overhead when malloc() and free() are not
>called.

Here is the average of 50 measurements:
palloc/pfree of an 800-byte area 1,000,000 times, and of a 32kB area 1,000,000 times.
I checked with gdb that malloc is not called at size=800.

[Size=800, iter=1,000,000]
Master |15.763
Patched|16.262 (+3%)

[Size=32768, iter=1,000,000]
Master |61.3076
Patched|62.9566 (+2%)

At least compared to the previous HashAgg-thread version, the overhead is smaller.
It has some overhead, but the increase is only around 2 or 3%?

Regards,
Takeshi Ideriha

Re: Protect syscache from bloating with negative cache entries

From: Robert Haas
On Wed, Feb 27, 2019 at 3:16 AM Ideriha, Takeshi
<ideriha.takeshi@jp.fujitsu.com> wrote:
> I'm afraid I may be quibbling about it.
> What about users who understand performance drops but don't want to
> add memory or decrease concurrency?
> I think that PostgreSQL has a parameter
> which most of users don't mind and use is as default
> but a few of users want to change it.
> In this case as you said, introducing hard limit parameter causes
> performance decrease significantly so how about adding detailed caution
> to the document like planner cost parameter?

There's nothing wrong with a parameter that is useful to some people
and harmless to everyone else, but the people who are proposing that
parameter still have to demonstrate that it has those properties.
This email thread is really short on clear demonstrations that X or Y
is useful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


RE: Protect syscache from bloating with negative cache entries

From: "Tsunakawa, Takayuki"
From: Ideriha, Takeshi/出利葉 健
> [Size=800, iter=1,000,000]
> Master |15.763
> Patched|16.262 (+3%)
> 
> [Size=32768, iter=1,000,000]
> Master |61.3076
> Patched|62.9566 (+2%)

What's the unit, second or millisecond?
Why does the number of digits to the right of the decimal point differ?

Is the measurement correct?  I'm wondering because the difference is larger in the latter case.  Isn't the accounting
processing almost the same in both cases?

* former: 16.262 - 15.763 = 0.499
* latter: 62.956 - 61.307 = 1.649


> At least compared to previous HashAg version, the overhead is smaller.
> It has some overhead but is increase by 2 or 3% a little bit?

I think the overhead is sufficiently small.  It may get even smaller with a trivial tweak.

You added the new member usedspace at the end of MemoryContextData.  The original size of MemoryContextData is 72
bytes, and Intel Xeon's cache line is 64 bytes.  So, the new member will be on a separate cache line.  Try putting
usedspace before the name member.
 


Regards
Takayuki Tsunakawa


Re: Protect syscache from bloating with negative cache entries

From: Vladimir Sitnikov
Robert> This email thread is really short on clear demonstrations that X or Y
Robert> is useful.

It is useful when the whole database does **not** crash, isn't it?

Case A (== current PostgreSQL mode): syscache grows, then the OOM killer
chimes in, kills the database process, and it leads to a complete
cluster failure (all other PG processes terminate themselves).

Case B (== limit syscache by 10MiB or whatever, as Tsunakawa, Takayuki
asks): a single ill-behaved process works a bit slower and/or
consumes more CPU than the other ones. The whole DB is still alive.

I'm quite sure "case B" is much better for the end users and for the
database administrators.

So, +1 to Tsunakawa, Takayuki, it would be so great if there was a way
to limit the memory consumption of a single process (e.g. syscache,
workmem, etc, etc).

Robert> However, memory usage is quite unpredictable.  It depends on how many
Robert> backends are active

The number of backends can be limited by ensuring proper limits at the
application connection pool level and/or pgbouncer and/or things like
that.

Robert>how many copies of work_mem and/or
Robert> maintenance_work_mem are in use

There might be other patches to cap the total use of
work_mem/maintenance_work_mem.

Robert>I don't think we
Robert> can say that just imposing a limit on the size of the system caches is
Robert> going to be enough to reliably prevent an out of memory condition

The fewer possibilities there are for OOM, the better. Quite often it is
much better to fail a single SQL statement rather than kill all the DB
processes.

Vladimir


Re: Protect syscache from bloating with negative cache entries

From: Kyotaro HORIGUCHI
At Tue, 26 Feb 2019 10:55:18 -0500, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+Tgmoa2b-LUF9h3wugD9ZA5MP0xyu2kJYHC9L6sdLywNSmhBQ@mail.gmail.com>
> On Mon, Feb 25, 2019 at 1:27 AM Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > > I'd like to see some evidence that catalog_cache_memory_target has any
> > > value, vs. just always setting it to zero.
> >
> > It is artificial (or acutually wont't be repeatedly executed in a
> > session) but anyway what can get benefit from
> > catalog_cache_memory_target would be a kind of extreme.
> 
> I agree.  So then let's not have it.

Ah... Yeah! I see. Andres' concern was that crucial syscache
entries might be blown away during a long idle time. If that
happens, it's enough to just turn off pruning in almost all of such
cases.

We no longer need to count memory usage without that feature. That
stuff is moved to the monitoring feature, which is out of the scope
of the current state of this patch.

> We shouldn't add more mechanism here than actually has value.  It
> seems pretty clear that keeping cache entries that go unused for long
> periods can't be that important; even if we need them again
> eventually, reloading them every 5 or 10 minutes can't hurt that much.
> On the other hand, I think it's also pretty clear that evicting cache
> entries that are being used frequently will have disastrous effects on
> performance; as I noted in the other email I just sent, consider the
> effects of CLOBBER_CACHE_ALWAYS.  No reasonable user is going to want
> to incur a massive slowdown to save a little bit of memory.
> 
> I see that *in theory* there is a value to
> catalog_cache_memory_target, because *maybe* there is a workload where
> tuning that GUC will lead to better performance at lower memory usage
> than any competing proposal.  But unless we can actually see an
> example of such a workload, which so far I don't, we're adding a knob
> that everybody has to think about how to tune when in fact we have no
> idea how to tune it or whether it even needs to be tuned.  That
> doesn't make sense.  We have to be able to document the parameters we
> have and explain to users how they should be used.  And as far as this
> parameter is concerned I think we are not at that point.

In the attached v18:
   catalog_cache_memory_target is removed,
   some leftovers of removing the hard-limit feature are removed,
   the catcache clock update during a query is separated into 0003, and
   0004 (the monitoring part) is attached just to see how it is working.

v18-0001-Add-dlist_move_tail:
  Just adds dlist_move_tail

v18-0002-Remove-entries-that-haven-t-been-used-for-a-certain-:
  Revised pruning feature.

====
v18-0003-Asynchronous-update-of-catcache-clock:
  Separated catcache clock update feature.

v18-0004-Syscache-usage-tracking-feature:
  Usage tracking feature.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 54388a7452eda1faadaa108e1bc21d51844f9224 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 7 Feb 2019 14:56:07 +0900
Subject: [PATCH 1/6] Add dlist_move_tail

We have dlist_push_head/tail and dlist_move_head but not
dlist_move_tail. Add it.
---
 src/include/lib/ilist.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/src/include/lib/ilist.h b/src/include/lib/ilist.h
index b1a5974ee4..659ab1ac87 100644
--- a/src/include/lib/ilist.h
+++ b/src/include/lib/ilist.h
@@ -394,6 +394,25 @@ dlist_move_head(dlist_head *head, dlist_node *node)
     dlist_check(head);
 }
 
+/*
+ * Move element from its current position in the list to the tail position in
+ * the same list.
+ *
+ * Undefined behaviour if 'node' is not already part of the list.
+ */
+static inline void
+dlist_move_tail(dlist_head *head, dlist_node *node)
+{
+    /* fast path if it's already at the tail */
+    if (head->head.prev == node)
+        return;
+
+    dlist_delete(node);
+    dlist_push_tail(head, node);
+
+    dlist_check(head);
+}
+
 /*
  * Check whether 'node' has a following node.
  * Caution: unreliable if 'node' is not in the list.
-- 
2.16.3

From c79d5fc86f45e6545cbc257040e46125ffc5cb92 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 1 Mar 2019 13:32:51 +0900
Subject: [PATCH 2/6] Remove entries that haven't been used for a certain time

Catcache entries happen to be left alone for several reasons. It is
not desirable that such useless entries eat up memory. Catcache
pruning feature removes entries that haven't been accessed for a
certain time before enlarging hash array.
---
 doc/src/sgml/config.sgml                      |  19 ++++
 src/backend/tcop/postgres.c                   |   2 +
 src/backend/utils/cache/catcache.c            | 122 +++++++++++++++++++++++++-
 src/backend/utils/misc/guc.c                  |  12 +++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/utils/catcache.h                  |  18 ++++
 6 files changed, 171 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6d42b7afe7..737a156bb4 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1661,6 +1661,25 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-catalog-cache-prune-min-age" xreflabel="catalog_cache_prune_min_age">
+      <term><varname>catalog_cache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>catalog_cache_prune_min_age</varname> configuration
+       parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+         Specifies the minimum amount of unused time in seconds at which a
+         system catalog cache entry is removed. -1 indicates that this feature
+         is disabled. The value defaults to 300 seconds (<literal>5
+         minutes</literal>). The entries that are not used for the duration
+         can be removed to prevent catalog cache from bloating with useless
+         entries.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8b4d94c9a1..02b9ef98aa 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -71,6 +71,7 @@
 #include "tcop/pquery.h"
 #include "tcop/tcopprot.h"
 #include "tcop/utility.h"
+#include "utils/catcache.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -2584,6 +2585,7 @@ start_xact_command(void)
      * not desired, the timeout has to be disabled explicitly.
      */
     enable_statement_timeout();
+    SetCatCacheClock(GetCurrentStatementStartTimestamp());
 }
 
 static void
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 78dd5714fa..4386957497 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -39,6 +39,7 @@
 #include "utils/rel.h"
 #include "utils/resowner_private.h"
 #include "utils/syscache.h"
+#include "utils/timeout.h"
 
 
  /* #define CACHEDEBUG */    /* turns DEBUG elogs on */
@@ -61,9 +62,24 @@
 #define CACHE_elog(...)
 #endif
 
+/*
+ * GUC variable to define the minimum age of entries that will be considered
+ * to be evicted in seconds. -1 to disable the feature.
+ */
+int catalog_cache_prune_min_age = 300;
+
+/*
+ * Minimum interval between two successive moves of a cache entry in LRU list,
+ * in microseconds.
+ */
+#define MIN_LRU_UPDATE_INTERVAL 100000    /* 100ms */
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Clock for the last accessed time of a catcache entry. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -469,6 +485,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
 
     /* delink from linked list */
     dlist_delete(&ct->cache_elem);
+    dlist_delete(&ct->lru_node);
 
     /*
      * Free keys when we're dealing with a negative entry, normal entries just
@@ -829,6 +846,7 @@ InitCatCache(int id,
     cp->cc_nkeys = nkeys;
     for (i = 0; i < nkeys; ++i)
         cp->cc_keyno[i] = key[i];
+    dlist_init(&cp->cc_lru_list);
 
     /*
      * new cache is initialized as far as we can go for now. print some
@@ -846,9 +864,83 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize catcache reference clock if haven't done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries happen to be left unused for a long time for several
+ * reasons. Remove such entries to prevent catcache from bloating. It is based
+ * on the similar algorithm with buffer eviction. Entries that are accessed
+ * several times in a certain period live longer than those that have had less
+ * access in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int            nremoved = 0;
+    dlist_mutable_iter    iter;
+
+    /* Return immediately if disabled */
+    if (catalog_cache_prune_min_age == 0)
+        return false;
+
+    /* Scan over LRU to find entries to remove */
+    dlist_foreach_modify(iter, &cp->cc_lru_list)
+    {
+        CatCTup    *ct = dlist_container(CatCTup, lru_node, iter.cur);
+        long        entry_age;
+        int            us;
+
+        /* Don't remove referenced entries */
+        if (ct->refcount != 0 ||
+            (ct->c_list && ct->c_list->refcount != 0))
+            continue;
+
+        /*
+         * Calculate the duration from the time from the last access to
+         * the "current" time. catcacheclock is updated per-statement
+         * basis.
+         */
+        TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+        if (entry_age < catalog_cache_prune_min_age)
+        {
+            /*
+             * We don't have older entries, exit.  At least one removal
+             * prevents rehashing this time.
+             */
+            break;
+        }
+
+        /*
+         * Entries that are not accessed after the last pruning are removed in
+         * that seconds, and their lives are prolonged according to how many
+         * times they are accessed up to three times of the duration. We don't
+         * try shrink buckets since pruning effectively caps catcache
+         * expansion in the long term.
+         */
+        if (ct->naccess > 0)
+            ct->naccess--;
+        else
+        {
+            CatCacheRemoveCTup(cp, ct);
+            nremoved++;
+        }
+    }
+
+    if (nremoved > 0)
+        elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d",
+             cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved);
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -1260,6 +1352,20 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /* prolong life of this entry */
+        if (ct->naccess < 2)
+            ct->naccess++;
+
+        /*
+         * Don't update LRU too frequently. We need to maintain the LRU even
+         * if pruning is inactive since it can be turned on on-session.
+         */
+        if (catcacheclock - ct->lastaccess > MIN_LRU_UPDATE_INTERVAL)
+        {
+            ct->lastaccess = catcacheclock;
+            dlist_move_tail(&cache->cc_lru_list, &ct->lru_node);
+        }
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1884,19 +1990,29 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
+    dlist_push_tail(&cache->cc_lru_list, &ct->lru_node);
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
     cache->cc_ntup++;
     CacheHdr->ch_ntup++;
 
+    /* increase refcount so that the new entry survives pruning */
+    ct->refcount++;
+
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, try removing infrequently used
+     * entries to make a room for the new entry. If failed, enlarge the bucket
+     * array instead.  Quite arbitrarily, we try this when fill factor > 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
 
+    ct->refcount--;
+
     return ct;
 }
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 156d147c85..3acc86cd07 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -81,6 +81,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/memutils.h"
@@ -2205,6 +2206,17 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("System catalog cache entries that live unused for longer than this seconds are considered
forremoval."),
 
+            gettext_noop("The value of -1 turns off pruning."),
+            GUC_UNIT_S
+        },
+        &catalog_cache_prune_min_age,
+        300, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index bd6ea65d0c..e9e3acc903 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -128,6 +128,7 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#catalog_cache_prune_min_age = 300s    # -1 disables pruning
 #max_stack_depth = 2MB            # min 100kB
 #shared_memory_type = mmap        # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 65d816a583..a21c53644a 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,7 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
+    dlist_head    cc_lru_list;
 
     /*
      * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,6 +121,9 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
+    int            naccess;        /* # of access to this entry, up to 2  */
+    TimestampTz    lastaccess;        /* timestamp of the last usage */
+    dlist_node    lru_node;        /* LRU node */
 
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
@@ -189,6 +194,19 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int catalog_cache_prune_min_age;
+
+/* source clock for access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/* SetCatCacheClock - set catcache timestamp source clock */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.16.3

From 5c6357cc575bf0f1d03740c2f2e94d3d79a53f4e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 1 Mar 2019 14:16:55 +0900
Subject: [PATCH 3/6] Asynchronous update of catcache clock

The catcache pruning feature fails to work while a long running query
executes many commands and fetches many syscache entries. This patch
asynchronously updates the catcache clock to make the pruning work
even in the case.
---
 src/backend/tcop/postgres.c        | 11 +++++++
 src/backend/utils/cache/catcache.c | 65 ++++++++++++++++++++++++++++++++++++--
 src/backend/utils/init/globals.c   |  1 +
 src/backend/utils/init/postinit.c  | 14 ++++++++
 src/backend/utils/misc/guc.c       |  2 +-
 src/include/miscadmin.h            |  1 +
 src/include/utils/catcache.h       | 23 +++++++++++++-
 src/include/utils/timeout.h        |  1 +
 8 files changed, 114 insertions(+), 4 deletions(-)

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 02b9ef98aa..d9a54ed37f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3161,6 +3161,14 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (CatcacheClockTimeoutPending)
+    {
+        CatcacheClockTimeoutPending = false;
+
+        /* Update timestamp then set up the next timeout */
+        UpdateCatCacheClock();
+    }
 }
 
 
@@ -4023,6 +4031,9 @@ PostgresMain(int argc, char *argv[],
         QueryCancelPending = false; /* second to avoid race condition */
         stmt_timeout_active = false;
 
+        /* get in sync with the timer state */
+        catcache_clock_timeout_active = false;
+
         /* Not reading from the client anymore. */
         DoingCommandRead = false;
 
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 4386957497..e0ecfe09d4 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -69,7 +69,12 @@
 int catalog_cache_prune_min_age = 300;
 
 /*
- * Minimum interval between two successive moves of a cache entry in LRU list,
+ * Flag to keep track of whether catcache clock timer is active.
+ */
+bool catcache_clock_timeout_active = false;
+
+/*
+ * Minimum interval between two successive moves of a cache entry in LRU list,
  * in microseconds.
  */
 #define MIN_LRU_UPDATE_INTERVAL 100000    /* 100ms */
@@ -871,6 +876,61 @@ InitCatCache(int id,
     return cp;
 }
 
+/*
+ * Helper routine for SetCatCacheClock and UpdateCatCacheClock.
+ *
+ * Arms the timer that keeps the catcache clock fresh during a long query.
+ */
+void
+SetupCatCacheClockTimer(void)
+{
+    long delay;
+
+    /* stop timer if no longer needed */
+    if (catalog_cache_prune_min_age == 0)
+    {
+        catcache_clock_timeout_active = false;
+        return;
+    }
+
+    /* One tenth of the prune age, in milliseconds */
+    delay = catalog_cache_prune_min_age * 1000 / 10;
+
+    /* We don't need to update the clock so frequently. */
+    if (delay < 1000)
+        delay = 1000;
+
+    enable_timeout_after(CATCACHE_CLOCK_TIMEOUT, delay);
+
+    catcache_clock_timeout_active = true;
+}
+
+/*
+ * Update catcacheclock:
+ *
+ * Intended to be called when CATCACHE_CLOCK_TIMEOUT fires. The interval is
+ * expected to be at least one second (see above), so GetCurrentTimestamp() does no harm.
+ */
+void
+UpdateCatCacheClock(void)
+{
+    catcacheclock = GetCurrentTimestamp();
+    SetupCatCacheClockTimer();
+}
+
+/*
+ * A change of catalog_cache_prune_min_age requires the timer to be rearmed.
+ * Just disable it here; it will be rearmed later as needed.
+ */
+void
+assign_catalog_cache_prune_min_age(int newval, void *extra)
+{
+    if (catcache_clock_timeout_active)
+        disable_timeout(CATCACHE_CLOCK_TIMEOUT, false);
+
+    catcache_clock_timeout_active = false;
+}
+
 /*
  * CatCacheCleanupOldEntries - Remove infrequently-used entries
  *
@@ -905,7 +965,8 @@ CatCacheCleanupOldEntries(CatCache *cp)
         /*
          * Calculate the duration from the time from the last access to
          * the "current" time. catcacheclock is updated per-statement
-         * basis.
+         * basis and is additionally updated periodically during a
+         * long-running query.
          */
         TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
 
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index fd51934aaf..0e8b972a29 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t CatcacheClockTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index a5ee209f91..eb17103595 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void CatcacheClockTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(CATCACHE_CLOCK_TIMEOUT,
+                        CatcacheClockTimeoutHandler);
     }
 
     /*
@@ -1238,6 +1241,17 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+/*
+ * CATCACHE_CLOCK_TIMEOUT handler: trigger a catcache source clock update
+ */
+static void
+CatcacheClockTimeoutHandler(void)
+{
+    CatcacheClockTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3acc86cd07..0bdea0c383 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2214,7 +2214,7 @@ static struct config_int ConfigureNamesInt[] =
         },
         &catalog_cache_prune_min_age,
         300, -1, INT_MAX,
-        NULL, NULL, NULL
+        NULL, assign_catalog_cache_prune_min_age, NULL
     },
 
     /*
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c9e35003a5..33b800e80f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t CatcacheClockTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index a21c53644a..5141f57bac 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -200,13 +200,34 @@ extern int catalog_cache_prune_min_age;
 /* source clock for access timestamp of catcache entries */
 extern TimestampTz catcacheclock;
 
-/* SetCatCacheClock - set catcache timestamp source clock */
+/*
+ * Flag to keep track of whether catcache timestamp timer is active.
+ */
+extern bool catcache_clock_timeout_active;
+
+/* catcache prune time helper functions  */
+extern void SetupCatCacheClockTimer(void);
+extern void UpdateCatCacheClock(void);
+
+/*
+ * SetCatCacheClock - set catcache timestamp source clock
+ *
+ * The clock is passively updated on a per-query basis. We need to update it
+ * asynchronously when a long-running query executes many commands, so set up
+ * a timeout to do that. Arming a timeout is costly enough that we don't want
+ * to do it at every query start, so the timer keeps running until
+ * catalog_cache_prune_min_age is changed. See UpdateCatCacheClock().
+ */
 static inline void
 SetCatCacheClock(TimestampTz ts)
 {
     catcacheclock = ts;
+
+    if (!catcache_clock_timeout_active && catalog_cache_prune_min_age > 0)
+        SetupCatCacheClockTimer();
 }
 
+extern void assign_catalog_cache_prune_min_age(int newval, void *extra);
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 9244a2a7b7..b2d97b4f7b 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    CATCACHE_CLOCK_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.16.3
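
For readers skimming the patch above, the control flow it adds can be condensed
into a small standalone sketch (plain C, no PostgreSQL APIs; alarm()/pause()
and all names here are illustrative stand-ins, not the patch's actual code):
the timer handler only sets a flag, and the "interrupt processing" step in the
main loop updates the clock and re-arms the timer.

#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* illustrative stand-ins for the pending flag, timer state and GUC */
static volatile sig_atomic_t clock_timeout_pending = 0;
static bool timeout_active = false;
static time_t cache_clock = 0;
static int prune_min_age = 300;             /* seconds */

static void
timeout_handler(int signo)
{
    /* the handler only sets a flag; real work happens in the main loop */
    (void) signo;
    clock_timeout_pending = 1;
}

static void
setup_clock_timer(void)
{
    unsigned int delay;

    if (prune_min_age <= 0)
    {
        timeout_active = false;             /* pruning disabled, stop re-arming */
        return;
    }

    delay = (unsigned int) prune_min_age / 10;  /* one tenth of the prune age */
    if (delay < 1)
        delay = 1;                          /* at most one update per second */

    alarm(delay);                           /* stand-in for enable_timeout_after() */
    timeout_active = true;
}

static void
update_cache_clock(void)
{
    cache_clock = time(NULL);               /* stand-in for GetCurrentTimestamp() */
    setup_clock_timer();                    /* re-arm for the next update */
}

int
main(void)
{
    signal(SIGALRM, timeout_handler);

    cache_clock = time(NULL);               /* per-statement update at query start */
    if (!timeout_active)
        setup_clock_timer();

    for (;;)                                /* stand-in for a long-running query */
    {
        pause();                            /* wait for the next signal */
        if (clock_timeout_pending)          /* the "ProcessInterrupts" step */
        {
            clock_timeout_pending = 0;
            update_cache_clock();
            printf("catcache clock advanced to %ld\n", (long) cache_clock);
        }
    }
}

The point of re-arming inside update_cache_clock() is that a single long query
keeps getting fresh clock values without paying the timer-setup cost at every
statement start.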

From 89f64ee52ea4656b8397524d511abbdf793521b9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 1 Mar 2019 12:00:26 +0900
Subject: [PATCH 4/6] Syscache usage tracking feature

Collects syscache usage statistics and exposes them through the view
pg_stat_syscache. The feature is controlled by the GUC variable
track_catalog_cache_usage_interval.
---
 doc/src/sgml/config.sgml                      |  16 ++
 src/backend/catalog/system_views.sql          |  17 +++
 src/backend/postmaster/pgstat.c               | 201 ++++++++++++++++++++++++--
 src/backend/tcop/postgres.c                   |  23 +++
 src/backend/utils/adt/pgstatfuncs.c           | 133 +++++++++++++++++
 src/backend/utils/cache/catcache.c            | 145 +++++++++++++++----
 src/backend/utils/cache/syscache.c            |  24 +++
 src/backend/utils/init/globals.c              |   1 +
 src/backend/utils/init/postinit.c             |  11 ++
 src/backend/utils/misc/guc.c                  |  10 ++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/catalog/pg_proc.dat               |   9 ++
 src/include/miscadmin.h                       |   1 +
 src/include/pgstat.h                          |   4 +
 src/include/utils/catcache.h                  |  13 +-
 src/include/utils/syscache.h                  |  19 +++
 src/include/utils/timeout.h                   |   1 +
 src/test/regress/expected/rules.out           |  24 ++-
 18 files changed, 612 insertions(+), 41 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 737a156bb4..850fe4ea90 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6689,6 +6689,22 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-catalog-cache-usage-interval" xreflabel="track_catalog_cache_usage_interval">
+      <term><varname>track_catalog_cache_usage_interval</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>track_catalog_cache_usage_interval</varname>
+       configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the interval, in milliseconds, at which catalog cache usage
+        statistics are collected for the session. The default is 0, which
+        disables collection.  Only superusers can change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-io-timing" xreflabel="track_io_timing">
       <term><varname>track_io_timing</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 3e229c693c..f5d1aaf96f 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -906,6 +906,22 @@ CREATE VIEW pg_stat_progress_vacuum AS
     FROM pg_stat_get_progress_info('VACUUM') AS S
         LEFT JOIN pg_database D ON S.datid = D.oid;
 
+CREATE VIEW pg_stat_syscache AS
+    SELECT
+        S.pid                AS pid,
+        S.relid::regclass    AS relname,
+        S.indid::regclass    AS cache_name,
+        S.size                AS size,
+        S.ntup                AS ntuples,
+        S.searches            AS searches,
+        S.hits                AS hits,
+        S.neg_hits            AS neg_hits,
+        S.ageclass            AS ageclass,
+        S.last_update        AS last_update
+    FROM pg_stat_activity A
+    JOIN LATERAL (SELECT A.pid, * FROM pg_get_syscache_stats(A.pid)) S
+        ON (A.pid = S.pid);
+
 CREATE VIEW pg_user_mappings AS
     SELECT
         U.oid       AS umid,
@@ -1185,6 +1201,7 @@ GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor;
 GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor;
+GRANT EXECUTE ON FUNCTION pg_get_syscache_stats(int) TO pg_monitor;
 
 GRANT pg_read_all_settings TO pg_monitor;
 GRANT pg_read_all_stats TO pg_monitor;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 81c6499251..b15a3273ca 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -66,6 +66,7 @@
 #include "utils/ps_status.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 
@@ -124,6 +125,7 @@
 bool        pgstat_track_activities = false;
 bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
+int            pgstat_track_syscache_usage_interval = 0;
 int            pgstat_track_activity_query_size = 1024;
 
 /* ----------
@@ -236,6 +238,11 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
+/* bitmap symbols to specify which types of target files to remove */
+#define PGSTAT_REMFILE_DBSTAT    1        /* remove only database stats files */
+#define PGSTAT_REMFILE_SYSCACHE    2        /* remove only syscache stats files */
+#define PGSTAT_REMFILE_ALL        3        /* remove both types of files */
+
 /*
  * Info about current "snapshot" of stats file
  */
@@ -335,6 +342,7 @@ static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
 static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
 static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+static void pgstat_remove_syscache_statsfile(void);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
@@ -630,10 +638,13 @@ startup_failed:
 }
 
 /*
- * subroutine for pgstat_reset_all
+ * remove stats files
+ *
+ * Clean up stats files in the specified directory.  target is one of
+ * PGSTAT_REMFILE_DBSTAT/SYSCACHE/ALL and restricts which files are removed.
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+pgstat_reset_remove_files(const char *directory, int target)
 {
     DIR           *dir;
     struct dirent *entry;
@@ -644,25 +655,39 @@ pgstat_reset_remove_files(const char *directory)
     {
         int            nchars;
         Oid            tmp_oid;
+        int            filetype = 0;
 
         /*
          * Skip directory entries that don't match the file names we write.
          * See get_dbstat_filename for the database-specific pattern.
          */
         if (strncmp(entry->d_name, "global.", 7) == 0)
+        {
+            filetype = PGSTAT_REMFILE_DBSTAT;
             nchars = 7;
+        }
         else
         {
+            char head[2];
+
             nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
+            (void) sscanf(entry->d_name, "%c%c_%u.%n",
+                          head, head + 1, &tmp_oid, &nchars);
+
             /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
+            if (nchars < 3 || !isdigit(entry->d_name[3]))
                 continue;
+
+            if  (strncmp(head, "db", 2) == 0)
+                filetype = PGSTAT_REMFILE_DBSTAT;
+            else if (strncmp(head, "cc", 2) == 0)
+                filetype = PGSTAT_REMFILE_SYSCACHE;
         }
 
+        /* skip if this is not a target */
+        if ((filetype & target) == 0)
+            continue;
+
         if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
             strcmp(entry->d_name + nchars, "stat") != 0)
             continue;
@@ -683,8 +708,9 @@ pgstat_reset_remove_files(const char *directory)
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_ALL);
+    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY,
+                              PGSTAT_REMFILE_ALL);
 }
 
 #ifdef EXEC_BACKEND
@@ -2963,6 +2989,10 @@ pgstat_beshutdown_hook(int code, Datum arg)
     if (OidIsValid(MyDatabaseId))
         pgstat_report_stat(true);
 
+    /* clear syscache statistics files and temporary settings */
+    if (MyBackendId != InvalidBackendId)
+        pgstat_remove_syscache_statsfile();
+
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
      * before and after.  We use a volatile pointer here to ensure the
@@ -4287,6 +4317,9 @@ PgstatCollectorMain(int argc, char *argv[])
     pgStatRunningInCollector = true;
     pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
 
+    /* Remove left-over syscache stats files */
+    pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_SYSCACHE);
+
     /*
      * Loop to process messages until we get SIGQUIT or detect ungraceful
      * death of our parent postmaster.
@@ -6377,3 +6410,153 @@ pgstat_clip_activity(const char *raw_activity)
 
     return activity;
 }
+
+/*
+ * return the filename for a syscache stat file; filename is the output
+ * buffer, of length len.
+ */
+void
+pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid,
+                                 char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/cc_%u.%s",
+                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
+                       pgstat_stat_directory,
+                       backendid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
+}
+
+/* removes syscache stats files of this backend */
+static void
+pgstat_remove_syscache_statsfile(void)
+{
+    char    fname[MAXPGPATH];
+
+    pgstat_get_syscachestat_filename(false, false, MyBackendId,
+                                     fname, MAXPGPATH);
+    unlink(fname);        /* don't care about the result */
+}
+
+/*
+ * pgstat_write_syscache_stats() -
+ *        Write the syscache statistics files.
+ *
+ * If 'force' is false, this function skips writing the file when the interval
+ * has not yet elapsed and returns the remaining time in milliseconds. If
+ * 'force' is true, it writes the file regardless and resets the interval.
+ */
+long
+pgstat_write_syscache_stats(bool force)
+{
+    static TimestampTz last_report = 0;
+    TimestampTz now;
+    long elapsed;
+    long secs;
+    int     usecs;
+    int    cacheId;
+    FILE    *fpout;
+    char    statfile[MAXPGPATH];
+    char    tmpfile[MAXPGPATH];
+
+    /* Return if we don't want it */
+    if (!force && pgstat_track_syscache_usage_interval <= 0)
+    {
+        /* disabled. remove the statistics file if any */
+        if (last_report > 0)
+        {
+            last_report = 0;
+            pgstat_remove_syscache_statsfile();
+        }
+        return 0;
+    }
+
+    /* Check against the interval */
+    now = GetCurrentTransactionStopTimestamp();
+    TimestampDifference(last_report, now, &secs, &usecs);
+    elapsed = secs * 1000 + usecs / 1000;
+
+    if (!force && elapsed < pgstat_track_syscache_usage_interval)
+    {
+        /* not time yet; report the remaining time to the caller */
+        return pgstat_track_syscache_usage_interval - elapsed;
+    }
+
+    /* now update the stats */
+    last_report = now;
+
+    pgstat_get_syscachestat_filename(false, true,
+                                     MyBackendId, tmpfile, MAXPGPATH);
+    pgstat_get_syscachestat_filename(false, false,
+                                     MyBackendId, statfile, MAXPGPATH);
+
+    /*
+     * This function can be called from ProcessInterrupts(). Hold off
+     * interrupts to avoid recursive entry.
+     */
+    HOLD_INTERRUPTS();
+
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        /*
+         * Failure to write this file is not critical. Just skip it this time
+         * and tell the caller to wait for the next interval.
+         */
+        RESUME_INTERRUPTS();
+        return pgstat_track_syscache_usage_interval;
+    }
+
+    /* write out the stats of every catcache */
+    for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++)
+    {
+        SysCacheStats *stats;
+
+        stats = SysCacheGetStats(cacheId);
+        Assert (stats);
+
+        /* write error is checked later using ferror() */
+        fputc('T', fpout);
+        (void)fwrite(&cacheId, sizeof(int), 1, fpout);
+        (void)fwrite(&last_report, sizeof(TimestampTz), 1, fpout);
+        (void)fwrite(stats, sizeof(*stats), 1, fpout);
+    }
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write syscache statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close syscache statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename syscache statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+
+    RESUME_INTERRUPTS();
+    return 0;
+}
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index d9a54ed37f..39abb9fbab 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3159,6 +3159,12 @@ ProcessInterrupts(void)
 
     }
 
+    if (IdleSyscacheStatsUpdateTimeoutPending)
+    {
+        IdleSyscacheStatsUpdateTimeoutPending = false;
+        pgstat_write_syscache_stats(true);
+    }
+
     if (ParallelMessagePending)
         HandleParallelMessages();
 
@@ -3743,6 +3749,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_syscache_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4186,9 +4193,19 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long timeout;
+
                 ProcessCompletedNotifies();
                 pgstat_report_stat(false);
 
+                timeout = pgstat_write_syscache_stats(false);
+
+                if (timeout > 0)
+                {
+                    disable_idle_syscache_update_timeout = true;
+                    enable_timeout_after(IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT,
+                                         timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4231,6 +4248,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_syscache_update_timeout)
+        {
+            disable_timeout(IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_syscache_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 69f7265779..26f923a66b 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -14,6 +14,8 @@
  */
 #include "postgres.h"
 
+#include <sys/stat.h>
+
 #include "access/htup_details.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
@@ -28,6 +30,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/inet.h"
+#include "utils/syscache.h"
 #include "utils/timestamp.h"
 
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
@@ -1908,3 +1911,133 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     PG_RETURN_DATUM(HeapTupleGetDatum(
                                       heap_form_tuple(tupdesc, values, nulls)));
 }
+
+Datum
+pgstat_get_syscache_stats(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SYSCACHE_SIZE 9
+    int                    pid     = PG_GETARG_INT32(0);
+    ReturnSetInfo       *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+    TupleDesc            tupdesc;
+    Tuplestorestate    *tupstore;
+    MemoryContext        per_query_ctx;
+    MemoryContext        oldcontext;
+    PgBackendStatus       *beentry;
+    int                    beid;
+    char                fname[MAXPGPATH];
+    FILE                  *fpin;
+    char c;
+
+    if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("set-valued function called in context that cannot accept a set")));
+    if (!(rsinfo->allowedModes & SFRM_Materialize))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("materialize mode required, but it is not " \
+                        "allowed in this context")));
+
+    /* Build a tuple descriptor for our result type */
+    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+        elog(ERROR, "return type must be a row type");
+
+    per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+
+    oldcontext = MemoryContextSwitchTo(per_query_ctx);
+    tupstore = tuplestore_begin_heap(true, false, work_mem);
+    rsinfo->returnMode = SFRM_Materialize;
+    rsinfo->setResult = tupstore;
+    rsinfo->setDesc = tupdesc;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    /* find beentry for the given pid */
+    beentry = NULL;
+    for (beid = 1;
+         (beentry = pgstat_fetch_stat_beentry(beid)) &&
+             beentry->st_procpid != pid ;
+         beid++);
+
+    /*
+     * We silently return an empty result on failure or insufficient privileges.
+     */
+    if (!beentry ||
+        (!has_privs_of_role(GetUserId(), beentry->st_userid) &&
+         !is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS)))
+        goto no_data;
+
+    pgstat_get_syscachestat_filename(false, false, beid, fname, MAXPGPATH);
+
+    if ((fpin = AllocateFile(fname, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(WARNING,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            fname)));
+        /* also return empty on no statistics file */
+        goto no_data;
+    }
+
+    /* read the statistics file into tuplestore */
+    while ((c = fgetc(fpin)) == 'T')
+    {
+        TimestampTz last_update;
+        SysCacheStats stats;
+        int cacheid;
+        Datum values[PG_GET_SYSCACHE_SIZE];
+        bool nulls[PG_GET_SYSCACHE_SIZE] = {0};
+        Datum datums[SYSCACHE_STATS_NAGECLASSES * 2];
+        bool arrnulls[SYSCACHE_STATS_NAGECLASSES * 2] = {0};
+        int    dims[] = {SYSCACHE_STATS_NAGECLASSES, 2};
+        int lbs[] = {1, 1};
+        ArrayType *arr;
+        int i, j;
+
+        if (fread(&cacheid, sizeof(int), 1, fpin) != 1 ||
+            fread(&last_update, sizeof(TimestampTz), 1, fpin) != 1 ||
+            fread(&stats, 1, sizeof(stats), fpin) != sizeof(stats))
+        {
+            ereport(WARNING,
+                    (errmsg("corrupted syscache statistics file \"%s\"",
+                            fname)));
+            goto no_data;
+        }
+
+        i = 0;
+        values[i++] = ObjectIdGetDatum(stats.reloid);
+        values[i++] = ObjectIdGetDatum(stats.indoid);
+        values[i++] = Int64GetDatum(stats.size);
+        values[i++] = Int64GetDatum(stats.ntuples);
+        values[i++] = Int64GetDatum(stats.nsearches);
+        values[i++] = Int64GetDatum(stats.nhits);
+        values[i++] = Int64GetDatum(stats.nneg_hits);
+
+        for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++)
+        {
+            datums[j * 2] = Int32GetDatum((int32) stats.ageclasses[j]);
+            datums[j * 2 + 1] = Int32GetDatum((int32) stats.nclass_entries[j]);
+        }
+
+        arr = construct_md_array(datums, arrnulls, 2, dims, lbs,
+                              INT4OID, sizeof(int32), true, 'i');
+        values[i++] = PointerGetDatum(arr);
+
+        values[i++] = TimestampTzGetDatum(last_update);
+
+        Assert (i == PG_GET_SYSCACHE_SIZE);
+
+        tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+    }
+
+    /* check for the end of file. abandon the result if file is broken */
+    if (c != 'E' || fgetc(fpin) != EOF)
+        tuplestore_clear(tupstore);
+
+    FreeFile(fpin);
+
+no_data:
+    tuplestore_donestoring(tupstore);
+    return (Datum) 0;
+}
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index e0ecfe09d4..63c0ea3b17 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -85,6 +85,10 @@ static CatCacheHeader *CacheHdr = NULL;
 /* Clock for the last accessed time of a catcache entry. */
 TimestampTz    catcacheclock = 0;
 
+/* age classes for pruning */
+static double ageclass[SYSCACHE_STATS_NAGECLASSES]
+    = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -118,7 +122,7 @@ static CatCTup *CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp,
 
 static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos,
                  Datum *keys);
-static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
+static int  CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
                  Datum *srckeys, Datum *dstkeys);
 
 
@@ -500,6 +504,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
         CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys,
                          cache->cc_keyno, ct->keys);
 
+    cache->cc_memusage -= ct->size;
     pfree(ct);
 
     --cache->cc_ntup;
@@ -613,9 +618,7 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue)
             else
                 CatCacheRemoveCTup(cache, ct);
             CACHE_elog(DEBUG2, "CatCacheInvalidate: invalidated");
-#ifdef CATCACHE_STATS
             cache->cc_invals++;
-#endif
             /* could be multiple matches, so keep looking! */
         }
     }
@@ -691,9 +694,7 @@ ResetCatalogCache(CatCache *cache)
             }
             else
                 CatCacheRemoveCTup(cache, ct);
-#ifdef CATCACHE_STATS
             cache->cc_invals++;
-#endif
         }
     }
 }
@@ -833,7 +834,12 @@ InitCatCache(int id,
      */
     sz = sizeof(CatCache) + PG_CACHE_LINE_SIZE;
     cp = (CatCache *) CACHELINEALIGN(palloc0(sz));
-    cp->cc_bucket = palloc0(nbuckets * sizeof(dlist_head));
+    cp->cc_head_alloc_size = sz;
+    sz = nbuckets * sizeof(dlist_head);
+    cp->cc_bucket = palloc0(sz);
+
+    /* cc_head_alloc_size + consumed size for cc_bucket */
+    cp->cc_memusage = cp->cc_head_alloc_size + sz;
 
     /*
      * initialize the cache's relation information for the relation
@@ -1011,13 +1017,17 @@ RehashCatCache(CatCache *cp)
     dlist_head *newbucket;
     int            newnbuckets;
     int            i;
+    size_t        sz;
 
     elog(DEBUG1, "rehashing catalog cache id %d for %s; %d tups, %d buckets",
          cp->id, cp->cc_relname, cp->cc_ntup, cp->cc_nbuckets);
 
     /* Allocate a new, larger, hash table. */
     newnbuckets = cp->cc_nbuckets * 2;
-    newbucket = (dlist_head *) MemoryContextAllocZero(CacheMemoryContext, newnbuckets * sizeof(dlist_head));
+    sz = newnbuckets * sizeof(dlist_head);
+    newbucket = (dlist_head *) MemoryContextAllocZero(CacheMemoryContext, sz);
+
+    cp->cc_memusage = cp->cc_head_alloc_size + sz;
 
     /* Move all entries from old hash table to new. */
     for (i = 0; i < cp->cc_nbuckets; i++)
@@ -1031,6 +1041,7 @@ RehashCatCache(CatCache *cp)
 
             dlist_delete(iter.cur);
             dlist_push_head(&newbucket[hashIndex], &ct->cache_elem);
+            cp->cc_memusage += ct->size;
         }
     }
 
@@ -1369,9 +1380,7 @@ SearchCatCacheInternal(CatCache *cache,
     if (unlikely(cache->cc_tupdesc == NULL))
         CatalogCacheInitializeCache(cache);
 
-#ifdef CATCACHE_STATS
     cache->cc_searches++;
-#endif
 
     /* Initialize local parameter array */
     arguments[0] = v1;
@@ -1440,9 +1449,7 @@ SearchCatCacheInternal(CatCache *cache,
             CACHE_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d",
                        cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
             cache->cc_hits++;
-#endif
 
             return &ct->tuple;
         }
@@ -1451,9 +1458,7 @@ SearchCatCacheInternal(CatCache *cache,
             CACHE_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d",
                        cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
             cache->cc_neg_hits++;
-#endif
 
             return NULL;
         }
@@ -1581,9 +1586,7 @@ SearchCatCacheMiss(CatCache *cache,
     CACHE_elog(DEBUG2, "SearchCatCache(%s): put in bucket %d",
                cache->cc_relname, hashIndex);
 
-#ifdef CATCACHE_STATS
     cache->cc_newloads++;
-#endif
 
     return &ct->tuple;
 }
@@ -1694,9 +1697,7 @@ SearchCatCacheList(CatCache *cache,
 
     Assert(nkeys > 0 && nkeys < cache->cc_nkeys);
 
-#ifdef CATCACHE_STATS
     cache->cc_lsearches++;
-#endif
 
     /* Initialize local parameter array */
     arguments[0] = v1;
@@ -1753,9 +1754,7 @@ SearchCatCacheList(CatCache *cache,
         CACHE_elog(DEBUG2, "SearchCatCacheList(%s): found list",
                    cache->cc_relname);
 
-#ifdef CATCACHE_STATS
         cache->cc_lhits++;
-#endif
 
         return cl;
     }
@@ -1862,6 +1861,11 @@ SearchCatCacheList(CatCache *cache,
         /* Now we can build the CatCList entry. */
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
         nmembers = list_length(ctlist);
+
+        /*
+         * Don't waste a time by counting the list's memory usage, since it
+         * Don't waste time counting the list's memory usage, since the list
+         * is short-lived.
         cl = (CatCList *)
             palloc(offsetof(CatCList, members) + nmembers * sizeof(CatCTup *));
 
@@ -1972,6 +1976,7 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     CatCTup    *ct;
     HeapTuple    dtp;
     MemoryContext oldcxt;
+    int            tupsize;
 
     /* negative entries have no tuple associated */
     if (ntp)
@@ -1995,8 +2000,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
         /* Allocate memory for CatCTup and the cached tuple in one go */
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
 
-        ct = (CatCTup *) palloc(sizeof(CatCTup) +
-                                MAXIMUM_ALIGNOF + dtp->t_len);
+        tupsize = sizeof(CatCTup) +    MAXIMUM_ALIGNOF + dtp->t_len;
+        ct = (CatCTup *) palloc(tupsize);
         ct->tuple.t_len = dtp->t_len;
         ct->tuple.t_self = dtp->t_self;
         ct->tuple.t_tableOid = dtp->t_tableOid;
@@ -2029,14 +2034,16 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     {
         Assert(negative);
         oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
-        ct = (CatCTup *) palloc(sizeof(CatCTup));
+        tupsize = sizeof(CatCTup);
+        ct = (CatCTup *) palloc(tupsize);
 
         /*
          * Store keys - they'll point into separately allocated memory if not
          * by-value.
          */
-        CatCacheCopyKeys(cache->cc_tupdesc, cache->cc_nkeys, cache->cc_keyno,
-                         arguments, ct->keys);
+        tupsize +=
+            CatCacheCopyKeys(cache->cc_tupdesc, cache->cc_nkeys,
+                             cache->cc_keyno, arguments, ct->keys);
         MemoryContextSwitchTo(oldcxt);
     }
 
@@ -2060,7 +2067,10 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     cache->cc_ntup++;
     CacheHdr->ch_ntup++;
 
-    /* increase refcount so that the new entry survives pruning */
+    ct->size = tupsize;
+    cache->cc_memusage += ct->size;
+
+    /* increase refcount so that this survives pruning */
     ct->refcount++;
 
     /*
@@ -2103,13 +2113,14 @@ CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *keys)
 /*
  * Helper routine that copies the keys in the srckeys array into the dstkeys
  * one, guaranteeing that the datums are fully allocated in the current memory
- * context.
+ * context. Returns the size of memory allocated for the copied keys.
  */
-static void
+static int
 CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
                  Datum *srckeys, Datum *dstkeys)
 {
     int            i;
+    int            size = 0;
 
     /*
      * XXX: memory and lookup performance could possibly be improved by
@@ -2138,8 +2149,25 @@ CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
         dstkeys[i] = datumCopy(src,
                                att->attbyval,
                                att->attlen);
+
+        /* roughly estimate the memory allocated by datumCopy */
+        if (!att->attbyval)
+        {
+            if (att->attlen == -1)
+            {
+                struct varlena *vl = (struct varlena *) DatumGetPointer(src);
+                
+                if (VARATT_IS_EXTERNAL_EXPANDED(vl))
+                    size += EOH_get_flat_size(DatumGetEOHP(src));
+                else
+                    size += VARSIZE_ANY(vl);
+            }
+            else
+                size += datumGetSize(src, att->attbyval, att->attlen);
+        }
     }
 
+    return size;
 }
 
 /*
@@ -2263,3 +2291,66 @@ PrintCatCacheListLeakWarning(CatCList *list)
          list->my_cache->cc_relname, list->my_cache->id,
          list, list->refcount);
 }
+
+/*
+ * CatCacheGetStats - fill in SysCacheStats struct.
+ *
+ * This is a support routine for SysCacheGetStats and fills in most of the
+ * result. The age classification here uses the same criteria as
+ * CatCacheCleanupOldEntries().
+ */
+void
+CatCacheGetStats(CatCache *cache, SysCacheStats *stats)
+{
+    int    i, j;
+
+    Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0);
+
+    /* fill in the stats struct */
+    stats->size = cache->cc_memusage;
+    stats->ntuples = cache->cc_ntup;
+    stats->nsearches = cache->cc_searches;
+    stats->nhits = cache->cc_hits;
+    stats->nneg_hits = cache->cc_neg_hits;
+
+    /*
+     * catalog_cache_prune_min_age can be changed within a session, so fill
+     * this in every time
+     */
+    for (i = 0 ; i < SYSCACHE_STATS_NAGECLASSES ; i++)
+        stats->ageclasses[i] =
+            (int) (catalog_cache_prune_min_age * ageclass[i]);
+
+    /*
+     * The nth element of nclass_entries stores the number of cache entries
+     * that have gone unaccessed for the corresponding ageclass multiple of
+     * catalog_cache_prune_min_age.
+     */
+    memset(stats->nclass_entries, 0, sizeof(int) * SYSCACHE_STATS_NAGECLASSES);
+
+    /* Scan the whole hash */
+    for (i = 0; i < cache->cc_nbuckets; i++)
+    {
+        dlist_mutable_iter iter;
+
+        dlist_foreach_modify(iter, &cache->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long entry_age;
+            int us;
+
+            /*
+             * Calculate the duration from the last access to
+             * the "current" time. See CatCacheCleanupOldEntries for details.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+            j = 0;
+            while (j < SYSCACHE_STATS_NAGECLASSES - 1 &&
+                   entry_age > stats->ageclasses[j])
+                j++;
+
+            stats->nclass_entries[j]++;
+        }
+    }
+}
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index ac98c19155..7b38a06708 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -20,6 +20,9 @@
  */
 #include "postgres.h"
 
+#include <sys/stat.h>
+#include <unistd.h>
+
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "catalog/indexing.h"
@@ -1534,6 +1537,27 @@ RelationSupportsSysCache(Oid relid)
     return false;
 }
 
+/*
+ * SysCacheGetStats - returns stats of specified syscache
+ *
+ * This routine returns the address of its local static memory.
+ */
+SysCacheStats *
+SysCacheGetStats(int cacheId)
+{
+    static SysCacheStats stats;
+
+    Assert(cacheId >=0 && cacheId < SysCacheSize);
+
+    memset(&stats, 0, sizeof(stats));
+
+    stats.reloid = cacheinfo[cacheId].reloid;
+    stats.indoid = cacheinfo[cacheId].indoid;
+
+    CatCacheGetStats(SysCache[cacheId], &stats);
+
+    return &stats;
+}
 
 /*
  * OID comparator for pg_qsort
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 0e8b972a29..b7c647b5e0 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t CatcacheClockTimeoutPending = false;
+volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index eb17103595..f2f879b6d8 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
 static void CatcacheClockTimeoutHandler(void);
+static void IdleSyscacheStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
                         IdleInTransactionSessionTimeoutHandler);
         RegisterTimeout(CATCACHE_CLOCK_TIMEOUT,
                         CatcacheClockTimeoutHandler);
+        RegisterTimeout(IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT,
+                        IdleSyscacheStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1252,6 +1255,14 @@ CatcacheClockTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleSyscacheStatsUpdateTimeoutHandler(void)
+{
+    IdleSyscacheStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 0bdea0c383..5c8c9146d1 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3158,6 +3158,16 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"track_catalog_cache_usage_interval", PGC_SUSET, STATS_COLLECTOR,
+            gettext_noop("Sets the interval between syscache usage collection, in milliseconds. Zero disables syscache usage tracking."),
+            NULL
+        },
+        &pgstat_track_syscache_usage_interval,
+        0, 0, INT_MAX / 2,
+        NULL, NULL, NULL
+    },
+
     {
         {"gin_pending_list_limit", PGC_USERSET, CLIENT_CONN_STATEMENT,
             gettext_noop("Sets the maximum size of the pending list for GIN index."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e9e3acc903..4d39daced6 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -555,6 +555,7 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
+#track_catalog_cache_usage_interval = 0    # zero disables tracking
 #stats_temp_directory = 'pg_stat_tmp'
 
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index a4e173b484..1a67c4219f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -9689,6 +9689,15 @@
   proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
  proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}',
   prosrc => 'pg_get_replication_slots' },
+{ oid => '3425',
+  descr => 'syscache statistics',
+  proname => 'pg_get_syscache_stats', prorows => '100', proisstrict => 'f',
+  proretset => 't', provolatile => 'v', prorettype => 'record',
+  proargtypes => 'int4',
+  proallargtypes => '{int4,oid,oid,int8,int8,int8,int8,int8,_int4,timestamptz}',
+  proargmodes => '{i,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}',
+  prosrc => 'pgstat_get_syscache_stats' },
 { oid => '3786', descr => 'set up a logical replication slot',
   proname => 'pg_create_logical_replication_slot', provolatile => 'v',
   proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 33b800e80f..767c94a63c 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,7 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t CatcacheClockTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 88a75fb798..c90ee1a064 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1144,6 +1144,7 @@ extern bool pgstat_track_activities;
 extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
+extern int    pgstat_track_syscache_usage_interval;
 extern char *pgstat_stat_directory;
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
@@ -1228,6 +1229,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
+extern void pgstat_get_syscachestat_filename(bool permanent,
+                    bool tempname, int backendid, char *filename, int len);
 
 /* ----------
  * pgstat_report_wait_start() -
@@ -1363,5 +1366,6 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
+extern long pgstat_write_syscache_stats(bool force);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 5141f57bac..310aeaeab5 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -63,12 +63,13 @@ typedef struct catcache
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
     dlist_head    cc_lru_list;
+    int            cc_head_alloc_size; /* memory consumed to allocate this struct */
+    int            cc_memusage;    /* memory usage of this catcache (excluding
+                                 * header part) */
 
     /*
-     * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
-     * doesn't break ABI for other modules
+     * Statistics entries
      */
-#ifdef CATCACHE_STATS
     long        cc_searches;    /* total # searches against this cache */
     long        cc_hits;        /* # of matches against existing entry */
     long        cc_neg_hits;    /* # of matches against negative entry */
@@ -81,7 +82,6 @@ typedef struct catcache
     long        cc_invals;        /* # of entries invalidated from cache */
     long        cc_lsearches;    /* total # list-searches */
     long        cc_lhits;        /* # of matches against existing lists */
-#endif
 } CatCache;
 
 
@@ -124,6 +124,7 @@ typedef struct catctup
     int            naccess;        /* # of accesses to this entry, up to 2 */
     TimestampTz    lastaccess;        /* timestamp of the last usage */
     dlist_node    lru_node;        /* LRU node */
+    int            size;            /* palloc'ed size of this tuple */
 
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
@@ -267,4 +268,8 @@ extern void PrepareToInvalidateCacheTuple(Relation relation,
 extern void PrintCatCacheLeakWarning(HeapTuple tuple);
 extern void PrintCatCacheListLeakWarning(CatCList *list);
 
+/* defined in syscache.h */
+typedef struct syscachestats SysCacheStats;
+extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats);
+
 #endif                            /* CATCACHE_H */
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index 95ee48954e..71b399c902 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -112,6 +112,24 @@ enum SysCacheIdentifier
 #define SysCacheSize (USERMAPPINGUSERSERVER + 1)
 };
 
+#define SYSCACHE_STATS_NAGECLASSES 6
+/* Struct for catcache tracking information */
+typedef struct syscachestats
+{
+    Oid        reloid;            /* target relation */
+    Oid        indoid;            /* index */
+    size_t    size;            /* size of the catcache */
+    int        ntuples;        /* number of tuples residing in the catcache */
+    int        nsearches;        /* number of searches */
+    int        nhits;            /* number of cache hits */
+    int        nneg_hits;        /* number of negative cache hits */
+    /* age classes in seconds */
+    int        ageclasses[SYSCACHE_STATS_NAGECLASSES];
+    /* number of tuples falling into the corresponding age class */
+    int        nclass_entries[SYSCACHE_STATS_NAGECLASSES];
+} SysCacheStats;
+
+
 extern void InitCatalogCache(void);
 extern void InitCatalogCachePhase2(void);
 
@@ -164,6 +182,7 @@ extern void SysCacheInvalidate(int cacheId, uint32 hashValue);
 extern bool RelationInvalidatesSnapshotsOnly(Oid relid);
 extern bool RelationHasSysCache(Oid relid);
 extern bool RelationSupportsSysCache(Oid relid);
+extern SysCacheStats *SysCacheGetStats(int cacheId);
 
 /*
  * The use of the macros below rather than direct calls to the corresponding
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index b2d97b4f7b..0677978923 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -32,6 +32,7 @@ typedef enum TimeoutId
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
     CATCACHE_CLOCK_TIMEOUT,
+    IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 98f417cb57..cf404a3930 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1929,6 +1929,28 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
     pg_stat_all_tables.autoanalyze_count
    FROM pg_stat_all_tables
  WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
 
+pg_stat_syscache| SELECT s.pid,
+    (s.relid)::regclass AS relname,
+    (s.indid)::regclass AS cache_name,
+    s.size,
+    s.ntup AS ntuples,
+    s.searches,
+    s.hits,
+    s.neg_hits,
+    s.ageclass,
+    s.last_update
+   FROM (pg_stat_activity a
+     JOIN LATERAL ( SELECT a.pid,
+            pg_get_syscache_stats.relid,
+            pg_get_syscache_stats.indid,
+            pg_get_syscache_stats.size,
+            pg_get_syscache_stats.ntup,
+            pg_get_syscache_stats.searches,
+            pg_get_syscache_stats.hits,
+            pg_get_syscache_stats.neg_hits,
+            pg_get_syscache_stats.ageclass,
+            pg_get_syscache_stats.last_update
+           FROM pg_get_syscache_stats(a.pid) pg_get_syscache_stats(relid, indid, size, ntup, searches, hits, neg_hits, ageclass, last_update)) s ON ((a.pid = s.pid)));
 
 pg_stat_user_functions| SELECT p.oid AS funcid,
     n.nspname AS schemaname,
     p.proname AS funcname,
@@ -2360,7 +2382,7 @@ pg_settings|pg_settings_n|CREATE RULE pg_settings_n AS
     ON UPDATE TO pg_catalog.pg_settings DO INSTEAD NOTHING;
 pg_settings|pg_settings_u|CREATE RULE pg_settings_u AS
     ON UPDATE TO pg_catalog.pg_settings
-   WHERE (new.name = old.name) DO  SELECT set_config(old.name, new.setting, false) AS set_config;
+   WHERE (new.name = old.name) DO  SELECT set_config(old.name, new.setting, false, false) AS set_config;
 rtest_emp|rtest_emp_del|CREATE RULE rtest_emp_del AS
     ON DELETE TO public.rtest_emp DO  INSERT INTO rtest_emplog (ename, who, action, newsal, oldsal)
   VALUES (old.ename, CURRENT_USER, 'fired'::bpchar, '$0.00'::money, old.salary);
-- 
2.16.3
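
As a side note on how the ageclass histogram filled in by CatCacheGetStats
above is meant to be read: each boundary is catalog_cache_prune_min_age times a
fixed multiplier ({0.05, 0.1, 1.0, 2.0, 3.0}, plus a catch-all last class), and
an entry is counted in the first class whose boundary its idle time does not
exceed. A minimal standalone sketch of that bucketing (illustrative names only,
not the patch's code):

#include <stdio.h>

#define NAGECLASSES 6

/* multipliers of catalog_cache_prune_min_age; the last class is a catch-all */
static const double ageclass_mult[NAGECLASSES] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};

/*
 * Return the age-class index (0 .. NAGECLASSES-1) for an entry that has been
 * idle for entry_age seconds, given prune_min_age in seconds.
 */
static int
classify_entry_age(long entry_age, int prune_min_age)
{
    int     boundaries[NAGECLASSES];
    int     j = 0;

    for (int i = 0; i < NAGECLASSES; i++)
        boundaries[i] = (int) (prune_min_age * ageclass_mult[i]);

    while (j < NAGECLASSES - 1 && entry_age > boundaries[j])
        j++;

    return j;
}

int
main(void)
{
    int     prune_min_age = 300;            /* seconds */
    long    sample_ages[] = {3, 20, 200, 500, 700, 2000};
    int     histogram[NAGECLASSES] = {0};

    for (int i = 0; i < 6; i++)
        histogram[classify_entry_age(sample_ages[i], prune_min_age)]++;

    for (int j = 0; j < NAGECLASSES; j++)
        printf("class %d (boundary %3d s%s): %d entries\n",
               j, (int) (prune_min_age * ageclass_mult[j]),
               j == NAGECLASSES - 1 ? ", catch-all" : "", histogram[j]);
    return 0;
}

With the default catalog_cache_prune_min_age of 300s this yields boundaries of
15, 30, 300, 600 and 900 seconds plus the catch-all class.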


RE: Protect syscache from bloating with negative cache entries

От
"Ideriha, Takeshi"
Дата:
>From: Tsunakawa, Takayuki [mailto:tsunakawa.takay@jp.fujitsu.com]

>> [Size=800, iter=1,000,000]
>> Master |15.763
>> Patched|16.262 (+3%)
>>
>> [Size=32768, iter=1,000,000]
>> Master |61.3076
>> Patched|62.9566 (+2%)
>
>What's the unit, second or millisecond?
Millisecond.

>Why does the number of digits to the right of the decimal point differ?
>
>Is the measurement correct?  I'm wondering because the difference is larger in the
>latter case.  Isn't the accounting processing almost the same in both cases?
>* former: 16.262 - 15.763 = 0.499
>* latter: 62.956 - 61.307 = 1.649
>I think the overhead is sufficiently small.  It may get even smaller with a trivial tweak.
>
>You added the new member usedspace at the end of MemoryContextData.  The
>original size of MemoryContextData is 72 bytes, and Intel Xeon's cache line is 64 bytes.
>So, the new member will be on a separate cache line.  Try putting usedspace before
>the name member.

OK. I changed the order of the MemoryContextData members so that usedspace fits into one cache line.
I disabled the entire catcache eviction mechanism in the patched build and compared it with master
to check whether the overhead of the memory accounting is small enough.

The settings are almost the same as in my last email,
but last time the number of trials was 50, so I increased it and ran 5000 trials to
calculate the average figure (rounded off at the third decimal place).
 [Size=800, iter=1,000,000]
  Master |15.64 ms
  Patched|16.26 ms (+4%)
  The difference is  0.62ms

 [Size=32768, iter=1,000,000]
  Master |61.39 ms
  Patched|60.99 ms (-1%)
  
I guess there is around 2% noise,
but based on this experiment the overhead seems small.
There is still some overhead, but it can be hidden by other
costs such as malloc().

Does this result show that the hard-limit size option with memory accounting
does not hurt ordinary users who leave the hard-limit option disabled?
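
For illustration, here is a standalone sketch (this is not the real
MemoryContextData definition, just a simplified model with a hypothetical
usedspace member) of how member order decides whether the accounting field
shares the first 64-byte cache line with the hot members:

#include <stddef.h>
#include <stdio.h>

/*
 * Simplified model only: pointer-sized members standing in for the hot part
 * of a memory context header, plus a hypothetical accounting field named
 * usedspace.  This is not the real MemoryContextData.
 */
typedef struct ContextUsedspaceLast
{
    int         type;
    void       *methods;
    void       *parent;
    void       *firstchild;
    void       *prevchild;
    void       *nextchild;
    const char *name;
    const char *ident;
    void       *reset_cbs;
    size_t      usedspace;      /* appended at the end of the struct */
} ContextUsedspaceLast;

typedef struct ContextUsedspaceEarly
{
    int         type;
    size_t      usedspace;      /* moved ahead of the pointer members */
    void       *methods;
    void       *parent;
    void       *firstchild;
    void       *prevchild;
    void       *nextchild;
    const char *name;
    const char *ident;
    void       *reset_cbs;
} ContextUsedspaceEarly;

int
main(void)
{
    printf("usedspace at end:  offset %zu, struct size %zu\n",
           offsetof(ContextUsedspaceLast, usedspace),
           sizeof(ContextUsedspaceLast));
    printf("usedspace early:   offset %zu, struct size %zu\n",
           offsetof(ContextUsedspaceEarly, usedspace),
           sizeof(ContextUsedspaceEarly));
    printf("an offset of 64 or more means the field starts on a second\n"
           "64-byte cache line\n");
    return 0;
}

On a typical LP64 build the first layout reports an offset past 64 bytes, while
the second keeps usedspace well inside the first cache line, which is the
effect the reordering aims for.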

Regards,
Takeshi Ideriha


Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
Hello.

At Mon, 4 Mar 2019 03:03:51 +0000, "Ideriha, Takeshi" <ideriha.takeshi@jp.fujitsu.com> wrote in
<4E72940DA2BF16479384A86D54D0988A6F44564E@G01JPEXMBKW04>
> Does this result show that the hard-limit size option with memory accounting
> doesn't hurt ordinary users who disable the hard-limit size option?

Not sure, but 4% seems beyond the noise level. The planner mainly
requests smaller allocation sizes, especially for list
operations. If we implement it for the slab allocator, the
degradation would be more significant.

We *are* suffering from endless bloat of the system cache (and some
other things) and there is no way to deal with it. The soft-limit
feature actually eliminates the problem with no degradation and
even accelerates execution in some cases.

Infinite bloat is itself a problem, but if a process just needs a
larger yet finite amount of memory, adding memory or lowering
max_connections is enough.

What Andres and Robert suggested is that we need a more convincing
reason for the hard-limit feature than "someone wants it". The
degradation from the crude accounting stuff is not the primary
issue here, I think.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

От
Robert Haas
Дата:
On Fri, Mar 1, 2019 at 3:33 AM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > > It is artificial (or actually won't be repeatedly executed in a
> > > session) but anyway whatever can benefit from
> > > catalog_cache_memory_target would be a kind of extreme case.
> >
> > I agree.  So then let's not have it.
>
> Ah... Yeah! I see. Andres' concern was that crucial syscache
> entries might be blown away during a long idle time. If that
> happens, it's enough to just turn the feature off in almost all
> such cases.

+1.

> In the attached v18,
>    catalog_cache_memory_target is removed,
>    removed some leftover of removing the hard limit feature,
>    separated catcache clock update during a query into 0003.
>    attached 0004 (monitor part) in order just to see how it is working.
>
> v18-0001-Add-dlist_move_tail:
>   Just adds dlist_move_tail
>
> v18-0002-Remove-entries-that-haven-t-been-used-for-a-certain-:
>   Revised pruning feature.

OK, so this is getting simpler, but I'm wondering why we need
dlist_move_tail() at all.  It is a well-known fact that maintaining
LRU ordering is expensive and it seems to be unnecessary for our
purposes here.  Can't CatCacheCleanupOldEntries just use a single-bit
flag on the entry?  If the flag is set, clear it.  If the flag is
clear, drop the entry.  When an entry is used, set the flag.  Then,
entries will go away if they are not used between consecutive calls to
CatCacheCleanupOldEntries.  Sure, that might be slightly less accurate
in terms of which entries get thrown away, but I bet it makes no real
difference.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Protect syscache from bloating with negative cache entries

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> OK, so this is getting simpler, but I'm wondering why we need
> dlist_move_tail() at all.  It is a well-known fact that maintaining
> LRU ordering is expensive and it seems to be unnecessary for our
> purposes here.

Yeah ... LRU maintenance was another thing that used to be in the
catcache logic and was thrown out as far too expensive.  Your idea
of just using a clock sweep instead seems plausible.

            regards, tom lane


Re: Protect syscache from bloating with negative cache entries

От
Tomas Vondra
Дата:
On 3/6/19 9:17 PM, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> OK, so this is getting simpler, but I'm wondering why we need
>> dlist_move_tail() at all.  It is a well-known fact that maintaining
>> LRU ordering is expensive and it seems to be unnecessary for our
>> purposes here.
> 
> Yeah ... LRU maintenance was another thing that used to be in the
> catcache logic and was thrown out as far too expensive.  Your idea
> of just using a clock sweep instead seems plausible.
> 

I agree clock sweep might be sufficient, although the benchmarks done in
this thread so far do not suggest the LRU approach is very expensive.

A simple true/false flag, as proposed by Robert, would mean we can only
do the cleanup once per the catalog_cache_prune_min_age interval, so
with the default value (5 minutes) the entries might be between 5 and 10
minutes old. That's probably acceptable, although for higher values the
range gets wider and wider ...

Which part of the LRU approach is supposedly expensive? Updating the
lastaccess field or moving the entries to tail? I'd guess it's the
latter, so perhaps we can keep some sort of age field, update it less
frequently (once per minute?), and do the clock sweep?

BTW wasn't one of the cases this thread aimed to improve a session that
accesses a lot of objects in a short period of time? That balloons the
syscache, and while this patch evicts the entries from memory, we never
actually release the memory back (because AllocSet just moves it into
the freelists) and it's unlikely to get swapped out (because other
chunks on those memory pages are likely to be still used). I've proposed
to address that by recreating the context if it gets too bloated, and I
think Alvaro agreed with that. But I haven't seen any further discussion
about that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Protect syscache from bloating with negative cache entries

От
Robert Haas
Дата:
On Wed, Mar 6, 2019 at 6:18 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I agree clock sweep might be sufficient, although the benchmarks done in
> this thread so far do not suggest the LRU approach is very expensive.

I'm not sure how thoroughly it's been tested -- has someone
constructed a benchmark that does a lot of syscache lookups and
measured how much slower they get with this new code?

> A simple true/false flag, as proposed by Robert, would mean we can only
> do the cleanup once per the catalog_cache_prune_min_age interval, so
> with the default value (5 minutes) the entries might be between 5 and 10
> minutes old. That's probably acceptable, although for higher values the
> range gets wider and wider ...

That's true, but I don't know that it matters.  I'm not sure there's
much of a use case for raising this parameter to some larger value,
but even if there is, is it really worth complicating the mechanism to
make sure that we throw away entries in a more timely fashion?  That's
not going to be cost-free, either in terms of CPU cycles or in terms
of code complexity.

Again, I think our goal should be to add the least mechanism here that
solves the problem.  If we can show that a true/false flag makes poor
decisions about which entries to evict and a smarter algorithm does
better, then it's worth considering.  However, my bet is that it makes
no meaningful difference.

> Which part of the LRU approach is supposedly expensive? Updating the
> lastaccess field or moving the entries to tail? I'd guess it's the
> latter, so perhaps we can keep some sort of age field, update it less
> frequently (once per minute?), and do the clock sweep?

Move to tail (although lastaccess would be expensive too if it
involves an extra gettimeofday() call).  GCLOCK, like we use for
shared_buffers, is a common approximation of LRU which tends to be a
lot less expensive to implement.  We could do that here and it might
work well, but I think the question, again, is whether we really need
it.  I think our goal here should just be to jettison cache entries
that are clearly worthless.  It's expensive enough to reload cache
entries that any kind of aggressive eviction policy is probably a
loser, and if our goal is just to get rid of the stuff that's clearly
not being used, we don't need to be super-accurate about it.
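For comparison, a rough standalone sketch of a GCLOCK-style usage counter (illustrative names only, not the shared_buffers code):

#include <stdbool.h>

#define USAGE_CAP 3              /* cap so counters cannot grow without bound */

typedef struct ClockEntrySketch
{
    int         refcount;        /* pinned entries are skipped */
    int         usage_count;     /* bumped on access, decayed by the sweep */
    /* ... cached data ... */
} ClockEntrySketch;

/* Cache hit: bump the usage counter up to the cap. */
static void
on_access(ClockEntrySketch *e)
{
    if (e->usage_count < USAGE_CAP)
        e->usage_count++;
}

/*
 * Sweep visit: decay the counter; only unpinned entries whose counter has
 * already reached zero are reported as evictable.
 */
static bool
sweep_should_evict(ClockEntrySketch *e)
{
    if (e->refcount > 0)
        return false;
    if (e->usage_count > 0)
    {
        e->usage_count--;
        return false;
    }
    return true;
}

The naccess handling in the posted patches is essentially this shape, with a cap of 2 and the decay done inside CatCacheCleanupOldEntries.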

> BTW wasn't one of the cases this thread aimed to improve a session that
> accesses a lot of objects in a short period of time? That balloons the
> syscache, and while this patch evicts the entries from memory, we never
> actually release the memory back (because AllocSet just moves it into
> the freelists) and it's unlikely to get swapped out (because other
> chunks on those memory pages are likely to be still used). I've proposed
> to address that by recreating the context if it gets too bloated, and I
> think Alvaro agreed with that. But I haven't seen any further discussion
> about that.

That's an interesting point.  It seems reasonable to me to just throw
away everything and release all memory if the session has been idle
for a while, but if the session is busy doing stuff, discarding
everything in bulk like that is going to cause latency spikes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Protect syscache from bloating with negative cache entries

От
Tomas Vondra
Дата:

On 3/7/19 3:34 PM, Robert Haas wrote:
> On Wed, Mar 6, 2019 at 6:18 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I agree clock sweep might be sufficient, although the benchmarks done in
>> this thread so far do not suggest the LRU approach is very expensive.
> 
> I'm not sure how thoroughly it's been tested -- has someone
> constructed a benchmark that does a lot of syscache lookups and
> measured how much slower they get with this new code?
> 

What I've done on v13 (and I don't think the results would be that
different on the current patch, but I may rerun it if needed) is a test
that creates a large number of tables (up to 1M) and then accesses them
randomly. I don't know if it matches what you imagine, but see [1]

https://www.postgresql.org/message-id/74386116-0bc5-84f2-e614-0cff19aca2de%402ndquadrant.com

I don't think this shows any regression, but perhaps we should do a
microbenchmark isolating the syscache entirely?

>> A simple true/false flag, as proposed by Robert, would mean we can only
>> do the cleanup once per the catalog_cache_prune_min_age interval, so
>> with the default value (5 minutes) the entries might be between 5 and 10
>> minutes old. That's probably acceptable, although for higher values the
>> range gets wider and wider ...
> 
> That's true, but I don't know that it matters.  I'm not sure there's
> much of a use case for raising this parameter to some larger value,
> but even if there is, is it really worth complicating the mechanism to
> make sure that we throw away entries in a more timely fashion?  That's
> not going to be cost-free, either in terms of CPU cycles or in terms
> of code complexity.
> 

True, although it very much depends on how expensive it would be.

> Again, I think our goal should be to add the least mechanism here that
> solves the problem.  If we can show that a true/false flag makes poor
> decisions about which entries to evict and a smarter algorithm does
> better, then it's worth considering.  However, my bet is that it makes
> no meaningful difference.
> 

True.

>> Which part of the LRU approach is supposedly expensive? Updating the
>> lastaccess field or moving the entries to tail? I'd guess it's the
>> latter, so perhaps we can keep some sort of age field, update it less
>> frequently (once per minute?), and do the clock sweep?
> 
> Move to tail (although lastaccess would be expensive if too if it
> involves an extra gettimeofday() call).  GCLOCK, like we use for
> shared_buffers, is a common approximation of LRU which tends to be a
> lot less expensive to implement.  We could do that here and it might
> work well, but I think the question, again, is whether we really need
> it.  I think our goal here should just be to jettison cache entries
> that are clearly worthless.  It's expensive enough to reload cache
> entries that any kind of aggressive eviction policy is probably a
> loser, and if our goal is just to get rid of the stuff that's clearly
> not being used, we don't need to be super-accurate about it.
> 

True.

>> BTW wasn't one of the cases this thread aimed to improve a session that
>> accesses a lot of objects in a short period of time? That balloons the
>> syscache, and while this patch evicts the entries from memory, we never
>> actually release the memory back (because AllocSet just moves it into
>> the freelists) and it's unlikely to get swapped out (because other
>> chunks on those memory pages are likely to be still used). I've proposed
>> to address that by recreating the context if it gets too bloated, and I
>> think Alvaro agreed with that. But I haven't seen any further discussion
>> about that.
> 
> That's an interesting point.  It seems reasonable to me to just throw
> away everything and release all memory if the session has been idle
> for a while, but if the session is busy doing stuff, discarding
> everything in bulk like that is going to cause latency spikes.
> 

What I had in mind is more along these lines:

(a) track number of active syscache entries (increment when adding a new
one, decrement when evicting one)

(b) track peak number of active syscache entries

(c) after clock-sweep, if (peak > K*active) where K=2 or K=4 or so, do a
memory context swap, i.e. create a new context, copy active entries over
and destroy the old one

That would at least free() the memory. Of course, the syscache entries
may have different sizes, so tracking just numbers of entries is just an
approximation. But I think it'd be enough.
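To make (a)-(c) concrete, here is a rough sketch; the CacheSketch bookkeeping struct and the copy_live_entries() helper are hypothetical, and only the MemoryContext calls are existing PostgreSQL API:

#include "postgres.h"
#include "utils/memutils.h"

/* Hypothetical per-cache bookkeeping, not an existing struct. */
typedef struct CacheSketch
{
    MemoryContext context;          /* context holding all cached entries */
    int           active_entries;   /* (a) current number of live entries */
    int           peak_entries;     /* (b) high-water mark */
} CacheSketch;

/* Hypothetical helper: rebuild the surviving entries inside newcxt. */
extern void copy_live_entries(CacheSketch *cache, MemoryContext newcxt);

#define COMPACT_FACTOR 2            /* the K above */

/* (c) called after a clock-sweep / pruning pass */
static void
maybe_compact_cache(CacheSketch *cache)
{
    MemoryContext newcxt;

    if (cache->peak_entries <= COMPACT_FACTOR * cache->active_entries)
        return;

    newcxt = AllocSetContextCreate(CacheMemoryContext,
                                   "compacted cache entries",
                                   ALLOCSET_DEFAULT_SIZES);

    /* copy the live entries, then drop the bloated context entirely */
    copy_live_entries(cache, newcxt);
    MemoryContextDelete(cache->context);

    cache->context = newcxt;
    cache->peak_entries = cache->active_entries;
}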


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Protect syscache from bloating with negative cache entries

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Mar 6, 2019 at 6:18 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Which part of the LRU approach is supposedly expensive? Updating the
>> lastaccess field or moving the entries to tail? I'd guess it's the
>> latter, so perhaps we can keep some sort of age field, update it less
>> frequently (once per minute?), and do the clock sweep?

> Move to tail (although lastaccess would be expensive if too if it
> involves an extra gettimeofday() call).

As I recall, the big problem with the old LRU code was loss of
locality of access, in that in addition to the data associated with
hot syscache entries, you were necessarily also touching list link
fields associated with not-hot entries.  That's bad for the CPU cache.

A gettimeofday call (or any other kernel call) per syscache access
would be a complete disaster.

            regards, tom lane


Re: Protect syscache from bloating with negative cache entries

От
Robert Haas
Дата:
On Thu, Mar 7, 2019 at 9:49 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I don't think this shows any regression, but perhaps we should do a
> microbenchmark isolating the syscache entirely?

Well, if we need the LRU list, then yeah I think a microbenchmark
would be a good idea to make sure we really understand what the impact
of that is going to be.  But if we don't need it and can just remove
it then we don't.

> What I had in mind is more along these lines:
>
> (a) track number of active syscache entries (increment when adding a new
> one, decrement when evicting one)
>
> (b) track peak number of active syscache entries
>
> (c) after clock-sweep, if (peak > K*active) where K=2 or K=4 or so, do a
> memory context swap, i.e. create a new context, copy active entries over
> and destroy the old one
>
> That would at least free() the memory. Of course, the syscache entries
> may have different sizes, so tracking just numbers of entries is just an
> approximation. But I think it'd be enough.

Yeah, that could be done.  I'm not sure how expensive it would be, and
I'm also not sure how much more effective it would be than what's
currently proposed in terms of actually freeing memory.  If you free
enough dead syscache entries, you might manage to give some memory
back to the OS: after all, there may be some locality there.  And even
if you don't, you'll at least prevent further growth, which might be
good enough.

We could consider doing some version of what has been proposed here
and the thing you're proposing here could later be implemented on top
of that.  I mean, evicting entries at all is a prerequisite to
copy-and-compact.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Protect syscache from bloating with negative cache entries

От
Tomas Vondra
Дата:
On 3/7/19 4:01 PM, Robert Haas wrote:
> On Thu, Mar 7, 2019 at 9:49 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I don't think this shows any regression, but perhaps we should do a
>> microbenchmark isolating the syscache entirely?
> 
> Well, if we need the LRU list, then yeah I think a microbenchmark
> would be a good idea to make sure we really understand what the impact
> of that is going to be.  But if we don't need it and can just remove
> it then we don't.
> 
>> What I had in mind is more along these lines:
>>
>> (a) track number of active syscache entries (increment when adding a new
>> one, decrement when evicting one)
>>
>> (b) track peak number of active syscache entries
>>
>> (c) after clock-sweep, if (peak > K*active) where K=2 or K=4 or so, do a
>> memory context swap, i.e. create a new context, copy active entries over
>> and destroy the old one
>>
>> That would at least free() the memory. Of course, the syscache entries
>> may have different sizes, so tracking just numbers of entries is just an
>> approximation. But I think it'd be enough.
> 
> Yeah, that could be done.  I'm not sure how expensive it would be, and
> I'm also not sure how much more effective it would be than what's
> currently proposed in terms of actually freeing memory.  If you free
> enough dead syscache entries, you might manage to give some memory
> back to the OS: after all, there may be some locality there.  And even
> if you don't, you'll at least prevent further growth, which might be
> good enough.
> 

I have my doubts about that happening in practice. It might happen for
some workloads, but I think the locality is rather unpredictable.

> We could consider doing some version of what has been proposed here
> and the thing you're proposing here could later be implemented on top
> of that.  I mean, evicting entries at all is a prerequisite to
> copy-and-compact.
> 

Sure. I'm not saying the patch must do this to make it committable.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


RE: Protect syscache from bloating with negative cache entries

От
"Ideriha, Takeshi"
Дата:
>From: Robert Haas [mailto:robertmhaas@gmail.com]
>On Thu, Mar 7, 2019 at 9:49 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
>wrote:
>> I don't think this shows any regression, but perhaps we should do a
>> microbenchmark isolating the syscache entirely?
>
>Well, if we need the LRU list, then yeah I think a microbenchmark would be a good idea
>to make sure we really understand what the impact of that is going to be.  But if we
>don't need it and can just remove it then we don't.

Just to be sure, we introduced the LRU list in this thread to find the entries unused for longer than the threshold time
without scanning the whole hash table. If the hash table becomes large and there is no LRU list, the scan becomes slow.

Regards,
Takeshi Ideriha

RE: Protect syscache from bloating with negative cache entries

От
"Ideriha, Takeshi"
Дата:
>From: Vladimir Sitnikov [mailto:sitnikov.vladimir@gmail.com]
>
>Robert> This email thread is really short on clear demonstrations that X
>Robert> or Y is useful.
>
>It is useful when the whole database does **not** crash, isn't it?
>
>Case A (==current PostgeSQL mode): syscache grows, then OOMkiller chimes in, kills
>the database process, and it leads to the complete cluster failure (all other PG
>processes terminate themselves).
>
>Case B (==limit syscache by 10MiB or whatever as Tsunakawa, Takayuki
>asks):  a single ill-behaved process works a bit slower and/or consumes more CPU
>than the other ones. The whole DB is still alive.
>
>I'm quite sure "case B" is much better for the end users and for the database
>administrators.
>
>So, +1 to Tsunakawa, Takayuki, it would be so great if there was a way to limit the
>memory consumption of a single process (e.g. syscache, workmem, etc, etc).
>
>Robert> However, memory usage is quite unpredictable.  It depends on how
>Robert> many backends are active
>
>The number of backends can be limited by ensuring a proper limits at application
>connection pool level and/or pgbouncer and/or things like that.
>
>Robert>how many copies of work_mem and/or  maintenance_work_mem are in
>Robert>use
>
>There might be other patches to cap the total use of
>work_mem/maintenance_work_mem,
>
>Robert>I don't think we
>Robert> can say that just imposing a limit on the size of the system
>Robert>caches is  going to be enough to reliably prevent an out of
>Robert>memory condition
>
>The fewer possibilities there are for OOM the better. Quite often it is much better to fail
>a single SQL rather than kill all the DB processes.

Yeah, I agree. This limit would be useful for such extreme situations.

Regards,
Takeshi Ideriha

Re: Protect syscache from bloating with negative cache entries

От
Robert Haas
Дата:
On Thu, Mar 7, 2019 at 11:40 PM Ideriha, Takeshi
<ideriha.takeshi@jp.fujitsu.com> wrote:
> Just to be sure, we introduced the LRU list in this thread to find the entries unused for longer than the threshold time
> without scanning the whole hash table. If the hash table becomes large and there is no LRU list, the scan becomes slow.

Hmm.  So, it's a trade-off, right?  One option is to have an LRU list,
which imposes a small overhead on every syscache or catcache operation
to maintain the LRU ordering.  The other option is to have no LRU
list, which imposes a larger overhead every time we clean up the
syscaches.  My bias is toward thinking that the latter is better,
because:

1. Not everybody is going to use this feature, and

2. Syscache cleanup should be something that only happens every so
many minutes, and probably while the backend is otherwise idle,
whereas lookups can happen many times per millisecond.

However, perhaps someone will provide some evidence that casts a
different light on the situation.

I don't see much point in continuing to review this patch at this
point.  There's been no new version of the patch in 3 weeks, and there
is -- in my view at least -- a rather frustrating lack of evidence
that the complexity this patch introduces is actually beneficial.  No
matter how many people +1 the idea of making this more complicated, it
can't be justified unless you can provide a test result showing that
the additional complexity solves a problem that does not get solved
without that complexity.  And even then, who is going to commit a
patch that uses a design which Tom Lane says was tried before and
stunk?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
Hello. Sorry for being a bit late.

At Wed, 27 Mar 2019 17:30:37 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20190327.173037.40342566.horiguchi.kyotaro@lab.ntt.co.jp>
> > I don't see much point in continuing to review this patch at this
> > point.  There's been no new version of the patch in 3 weeks, and there
> > is -- in my view at least -- a rather frustrating lack of evidence
> > that the complexity this patch introduces is actually beneficial.  No
> > matter how many people +1 the idea of making this more complicated, it
> > can't be justified unless you can provide a test result showing that
> > the additional complexity solves a problem that does not get solved
> > without that complexity.  And even then, who is going to commit a
> > patch that uses a design which Tom Lane says was tried before and
> > stunk?
> 
> Hmm. Anyway it is hit by a recent commit. I'll post a rebased
> version and a version reverted to do a whole-scan. Then I'll take
> numbers as far as I can and will show the result tomorrow.

I took performance numbers for master and three versions of the
patch: LRU, full-scan, and modified full-scan. I noticed that a
useless scan can be skipped in the full-scan version, so I added
that last version.

I ran three artificial test cases. The database is created by
gen_tbl.pl. Numbers are the average of the fastest five runs out
of 15 successive runs.

Test cases are listed below.

1_0. About 3,000,000 negative entries are created in the pg_statistic
  cache by scanning that many distinct columns. It is 3000 tables
  * 1001 columns. Pruning scans happen several times during a run
  but no entries are removed. This emulates the bloating phase of
  the cache. catalog_cache_prune_min_age is the default (300s).
  (access_tbl1.pl)

1_1. Same as 1_0 except that catalog_cache_prune_min_age is 0,
  which turns pruning off.

2_0. Repeatedly access 1001 of the 3,000,000 entries 6000
  times. This emulates the stable cache case without pruning.
  catalog_cache_prune_min_age is the default (300s).
  (access_tbl2.pl)

2_1. Same as 2_0 except that catalog_cache_prune_min_age is 0,
  which turns pruning off.

3_0. Scan over the 3,000,000 entries twice with catalog_cache_prune_min_age
  set to 10s. A run takes about 18 seconds on my box, so a fair
  amount of old entries are removed. This emulates the stable case
  with continuous pruning. (access_tbl3.pl)

3_1. Same as 3_0 except that catalog_cache_prune_min_age is 0,
  which turns pruning off.


The result follows.

     | master |  LRU   |  Full  |Full-mod|
-----|--------+--------+--------+--------+
 1_0 | 17.287 | 17.370 | 17.255 | 16.623 |
 1_1 | 17.287 | 17.063 | 16.336 | 17.192 |
 2_0 | 15.695 | 18.769 | 18.563 | 15.527 |
 2_1 | 15.695 | 18.603 | 18.498 | 18.487 |
 3_0 | 26.576 | 33.817 | 34.384 | 34.971 |
 3_1 | 26.576 | 27.462 | 26.202 | 26.368 |

The results of 2_0 and 2_1 seem strange, but I show you the
numbers as they are for now.

- Full-scan seems to have the smallest impact when turned off.

- Full-scan-mod seems to perform best in total (assuming the
  Full-mod 2_0 number is a wrong value).

- LRU doesn't seem to outperform full scanning.

For your information, I measured how long pruning takes.

LRU        318318 out of 2097153 entries in 26ms:  0.08us/entry.
Full-scan  443443 out of 2097153 entries in 184ms: 0.4us/entry.

LRU is actually fast at removing entries, but the difference seems
to be canceled out by the complexity of LRU maintenance.

My conclusion is that we should go with the Full-scan or
Full-scan-mod version. I will conduct a further overnight test and
see which is better.

I attached the test script set. It is used in the following manner.

(start server)
# perl gen_tbl.pl | psql postgres
(stop server)
# sh run.sh 30 > log.txt   # 30 is repeat count
# perl process.pl
     | master |  LRU   |  Full  |Full-mod|
-----|--------+--------+--------+--------+
 1_0 | 16.711 | 17.647 | 16.767 | 17.256 |
...


The attached files are as follows.

LRU versions patches.
  LRU-0001-Add-dlist_move_tail.patch
  LRU-0002-Remove-entries-that-haven-t-been-used-for-a-certain-.patch

Fullscn version patch.
  FullScan-0001-Remove-entries-that-haven-t-been-used-for-a-certain-.patch

Fullscn-mod version patch.
  FullScan-mod-0001-Remove-entries-that-haven-t-been-used-for-a-certain-.patch

test scripts.
  test_script.tar.gz


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 1c397d118a65d6b76282cc904c43ecfe97ee5329 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 7 Feb 2019 14:56:07 +0900
Subject: [PATCH 1/2] Add dlist_move_tail

We have dlist_push_head/tail and dlist_move_head but not
dlist_move_tail. Add it.
---
 src/include/lib/ilist.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/src/include/lib/ilist.h b/src/include/lib/ilist.h
index b1a5974ee4..659ab1ac87 100644
--- a/src/include/lib/ilist.h
+++ b/src/include/lib/ilist.h
@@ -394,6 +394,25 @@ dlist_move_head(dlist_head *head, dlist_node *node)
     dlist_check(head);
 }
 
+/*
+ * Move element from its current position in the list to the tail position in
+ * the same list.
+ *
+ * Undefined behaviour if 'node' is not already part of the list.
+ */
+static inline void
+dlist_move_tail(dlist_head *head, dlist_node *node)
+{
+    /* fast path if it's already at the tail */
+    if (head->head.prev == node)
+        return;
+
+    dlist_delete(node);
+    dlist_push_tail(head, node);
+
+    dlist_check(head);
+}
+
 /*
  * Check whether 'node' has a following node.
  * Caution: unreliable if 'node' is not in the list.
-- 
2.16.3

From f7a132c2b4910908773c508ef356a07cc853fe79 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 1 Mar 2019 13:32:51 +0900
Subject: [PATCH 2/2] Remove entries that haven't been used for a certain time

Catcache entries happen to be left alone for several reasons. It is
not desirable that such useless entries eat up memory. Catcache
pruning feature removes entries that haven't been accessed for a
certain time before enlarging hash array.
---
 doc/src/sgml/config.sgml                      |  19 ++++
 src/backend/tcop/postgres.c                   |   2 +
 src/backend/utils/cache/catcache.c            | 122 +++++++++++++++++++++++++-
 src/backend/utils/misc/guc.c                  |  12 +++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/utils/catcache.h                  |  18 ++++
 6 files changed, 171 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d383de2512..4231235447 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1677,6 +1677,25 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-catalog-cache-prune-min-age" xreflabel="catalog_cache_prune_min_age">
+      <term><varname>catalog_cache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>catalog_cache_prune_min_age</varname> configuration
+       parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+         Specifies the minimum amount of unused time in seconds at which a
+         system catalog cache entry is removed. -1 indicates that this feature
+         is disabled at all. The value defaults to 300 seconds (<literal>5
+         minutes</literal>). The entries that are not used for the duration
+         can be removed to prevent catalog cache from bloating with useless
+         entries.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index f9ce3d8f22..acab473d34 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -71,6 +71,7 @@
 #include "tcop/pquery.h"
 #include "tcop/tcopprot.h"
 #include "tcop/utility.h"
+#include "utils/catcache.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -2575,6 +2576,7 @@ start_xact_command(void)
      * not desired, the timeout has to be disabled explicitly.
      */
     enable_statement_timeout();
+    SetCatCacheClock(GetCurrentStatementStartTimestamp());
 }
 
 static void
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index d05930bc4c..c8ee0c98fb 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -38,6 +38,7 @@
 #include "utils/rel.h"
 #include "utils/resowner_private.h"
 #include "utils/syscache.h"
+#include "utils/timeout.h"
 
 
  /* #define CACHEDEBUG */    /* turns DEBUG elogs on */
@@ -60,9 +61,24 @@
 #define CACHE_elog(...)
 #endif
 
+/*
+ * GUC variable to define the minimum age of entries that will be considered
+ * to be evicted in seconds. -1 to disable the feature.
+ */
+int catalog_cache_prune_min_age = 300;
+
+/*
+ * Minimum interval between two successive moves of a cache entry in LRU list,
+ * in microseconds.
+ */
+#define MIN_LRU_UPDATE_INTERVAL 100000    /* 100ms */
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Clock for the last accessed time of a catcache entry. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -473,6 +489,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
 
     /* delink from linked list */
     dlist_delete(&ct->cache_elem);
+    dlist_delete(&ct->lru_node);
 
     /*
      * Free keys when we're dealing with a negative entry, normal entries just
@@ -833,6 +850,7 @@ InitCatCache(int id,
     cp->cc_nkeys = nkeys;
     for (i = 0; i < nkeys; ++i)
         cp->cc_keyno[i] = key[i];
+    dlist_init(&cp->cc_lru_list);
 
     /*
      * new cache is initialized as far as we can go for now. print some
@@ -850,9 +868,83 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize catcache reference clock if haven't done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries happen to be left unused for a long time for several
+ * reasons. Remove such entries to prevent the catcache from bloating. It is
+ * based on an algorithm similar to buffer eviction. Entries that are accessed
+ * several times in a certain period live longer than those that have had less
+ * access in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int            nremoved = 0;
+    dlist_mutable_iter    iter;
+
+    /* Return immediately if disabled */
+    if (catalog_cache_prune_min_age == 0)
+        return false;
+
+    /* Scan over LRU to find entries to remove */
+    dlist_foreach_modify(iter, &cp->cc_lru_list)
+    {
+        CatCTup    *ct = dlist_container(CatCTup, lru_node, iter.cur);
+        long        entry_age;
+        int            us;
+
+        /* Don't remove referenced entries */
+        if (ct->refcount != 0 ||
+            (ct->c_list && ct->c_list->refcount != 0))
+            continue;
+
+        /*
+         * Calculate the duration from the time of the last access to
+         * the "current" time. catcacheclock is updated on a
+         * per-statement basis.
+         */
+        TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+        if (entry_age < catalog_cache_prune_min_age)
+        {
+            /*
+             * We don't have older entries, exit.  At least one removal
+             * prevents rehashing this time.
+             */
+            break;
+        }
+
+        /*
+         * Entries that have not been accessed since the last pruning are
+         * removed after that many seconds, and frequently accessed entries
+         * have their lives prolonged, up to three times that duration. We
+         * don't try to shrink the buckets since pruning effectively caps
+         * catcache expansion in the long term.
+         */
+        if (ct->naccess > 0)
+            ct->naccess--;
+        else
+        {
+            CatCacheRemoveCTup(cp, ct);
+            nremoved++;
+        }
+    }
+
+    if (nremoved > 0)
+        elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d",
+             cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved);
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -1264,6 +1356,20 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /* prolong life of this entry */
+        if (ct->naccess < 2)
+            ct->naccess++;
+
+        /*
+         * Don't update LRU too frequently. We need to maintain the LRU even
+         * if pruning is inactive since it can be turned on during a session.
+         */
+        if (catcacheclock - ct->lastaccess > MIN_LRU_UPDATE_INTERVAL)
+        {
+            ct->lastaccess = catcacheclock;
+            dlist_move_tail(&cache->cc_lru_list, &ct->lru_node);
+        }
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1888,19 +1994,29 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
+    dlist_push_tail(&cache->cc_lru_list, &ct->lru_node);
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
     cache->cc_ntup++;
     CacheHdr->ch_ntup++;
 
+    /* increase refcount so that the new entry survives pruning */
+    ct->refcount++;
+
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, try removing infrequently used
+     * entries to make room for the new entry. If that fails, enlarge the bucket
+     * array instead.  Quite arbitrarily, we try this when fill factor > 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
 
+    ct->refcount--;
+
     return ct;
 }
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index aa564d153a..e624c74bf9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -82,6 +82,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/memutils.h"
@@ -2202,6 +2203,17 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("System catalog cache entries that live unused for longer than this seconds are considered
forremoval."),
 
+            gettext_noop("The value of -1 turns off pruning."),
+            GUC_UNIT_S
+        },
+        &catalog_cache_prune_min_age,
+        300, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index cccb5f145a..fa117f0573 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -128,6 +128,7 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#catalog_cache_prune_min_age = 300s    # -1 disables pruning
 #max_stack_depth = 2MB            # min 100kB
 #shared_memory_type = mmap        # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 65d816a583..a21c53644a 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,7 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
+    dlist_head    cc_lru_list;
 
     /*
      * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,6 +121,9 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
+    int            naccess;        /* # of access to this entry, up to 2  */
+    TimestampTz    lastaccess;        /* timestamp of the last usage */
+    dlist_node    lru_node;        /* LRU node */
 
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
@@ -189,6 +194,19 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int catalog_cache_prune_min_age;
+
+/* source clock for access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/* SetCatCacheClock - set catcache timestamp source clock */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.16.3

From 0a6691078f8af5f35cca194137768af7d08fa1d8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 1 Mar 2019 13:32:51 +0900
Subject: [PATCH] Remove entries that haven't been used for a certain time

Catcache entries happen to be left alone for several reasons. It is
not desirable that such useless entries eat up memory. Catcache
pruning feature removes entries that haven't been accessed for a
certain time before enlarging hash array.
---
 doc/src/sgml/config.sgml                      |  19 +++++
 src/backend/tcop/postgres.c                   |   2 +
 src/backend/utils/cache/catcache.c            | 105 +++++++++++++++++++++++++-
 src/backend/utils/misc/guc.c                  |  12 +++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/utils/catcache.h                  |  16 ++++
 6 files changed, 152 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d383de2512..4231235447 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1677,6 +1677,25 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-catalog-cache-prune-min-age" xreflabel="catalog_cache_prune_min_age">
+      <term><varname>catalog_cache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>catalog_cache_prune_min_age</varname> configuration
+       parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+         Specifies the minimum amount of unused time in seconds after which a
+         system catalog cache entry is removed. -1 means that this feature
+         is entirely disabled. The value defaults to 300 seconds (<literal>5
+         minutes</literal>). Entries that are not used for this duration
+         can be removed to prevent the catalog cache from bloating with
+         useless entries.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index f9ce3d8f22..acab473d34 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -71,6 +71,7 @@
 #include "tcop/pquery.h"
 #include "tcop/tcopprot.h"
 #include "tcop/utility.h"
+#include "utils/catcache.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -2575,6 +2576,7 @@ start_xact_command(void)
      * not desired, the timeout has to be disabled explicitly.
      */
     enable_statement_timeout();
+    SetCatCacheClock(GetCurrentStatementStartTimestamp());
 }
 
 static void
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index d05930bc4c..52586bd415 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -38,6 +38,7 @@
 #include "utils/rel.h"
 #include "utils/resowner_private.h"
 #include "utils/syscache.h"
+#include "utils/timeout.h"
 
 
  /* #define CACHEDEBUG */    /* turns DEBUG elogs on */
@@ -60,9 +61,18 @@
 #define CACHE_elog(...)
 #endif
 
+/*
+ * GUC variable to define the minimum age of entries that will be considered
+ * to be evicted in seconds. -1 to disable the feature.
+ */
+int catalog_cache_prune_min_age = 300;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Clock for the last accessed time of a catcache entry. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -850,9 +860,83 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize catcache reference clock if haven't done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries happen to be left unused for a long time for several
+ * reasons. Remove such entries to prevent the catcache from bloating. It is
+ * based on an algorithm similar to buffer eviction. Entries that are accessed
+ * several times in a certain period live longer than those that have had less
+ * access in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int    nremoved = 0;
+    int i;
+
+    /* Return immediately if disabled */
+    if (catalog_cache_prune_min_age == 0)
+        return false;
+
+    /* Scan over the whole hash to find entries to remove */
+    for (i = 0 ; i < cp->cc_nbuckets ; i++)
+    {
+        dlist_mutable_iter    iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long        entry_age;
+            int            us;
+
+            /* Don't remove referenced entries */
+            if (ct->refcount != 0 ||
+                (ct->c_list && ct->c_list->refcount != 0))
+                continue;
+
+            /*
+             * Calculate the duration from the time of the last access to
+             * the "current" time. catcacheclock is updated on a
+             * per-statement basis and additionally updated periodically
+             * during a long-running query.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+            if (entry_age < catalog_cache_prune_min_age)
+                continue;
+
+            /*
+             * Entries that are not accessed after the last pruning are
+             * removed in that seconds, and their lives are prolonged
+             * according to how many times they are accessed up to three times
+             * of the duration. We don't try shrink buckets since pruning
+             * effectively caps catcache expansion in the long term.
+             */
+            if (ct->naccess > 0)
+                ct->naccess--;
+            else
+            {
+                CatCacheRemoveCTup(cp, ct);
+                nremoved++;
+            }
+        }
+    }
+
+    if (nremoved > 0)
+        elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d",
+             cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved);
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -1264,6 +1348,12 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /* prolong life of this entry */
+        if (ct->naccess < 2)
+            ct->naccess++;
+
+        ct->lastaccess = catcacheclock;
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1888,19 +1978,28 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
     cache->cc_ntup++;
     CacheHdr->ch_ntup++;
 
+    /* increase refcount so that the new entry survives pruning */
+    ct->refcount++;
+
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, try removing infrequently used
+     * entries to make room for the new entry. If that fails, enlarge the bucket
+     * array instead.  Quite arbitrarily, we try this when fill factor > 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
 
+    ct->refcount--;
+
     return ct;
 }
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index aa564d153a..e624c74bf9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -82,6 +82,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/memutils.h"
@@ -2202,6 +2203,17 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("System catalog cache entries that live unused for longer than this seconds are considered
forremoval."),
 
+            gettext_noop("The value of -1 turns off pruning."),
+            GUC_UNIT_S
+        },
+        &catalog_cache_prune_min_age,
+        300, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index cccb5f145a..fa117f0573 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -128,6 +128,7 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#catalog_cache_prune_min_age = 300s    # -1 disables pruning
 #max_stack_depth = 2MB            # min 100kB
 #shared_memory_type = mmap        # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 65d816a583..2134839ecf 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -119,6 +120,8 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
+    int            naccess;        /* # of access to this entry, up to 2  */
+    TimestampTz    lastaccess;        /* timestamp of the last usage */
 
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
@@ -189,6 +192,19 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int catalog_cache_prune_min_age;
+
+/* source clock for access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/* SetCatCacheClock - set catcache timestamp source clock */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.16.3

From f2379fb8070420ea0880cfa74439744ade41dc3f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 1 Mar 2019 13:32:51 +0900
Subject: [PATCH] Remove entries that haven't been used for a certain time

Catcache entries happen to be left alone for several reasons. It is
not desirable that such useless entries eat up memory. Catcache
pruning feature removes entries that haven't been accessed for a
certain time before enlarging hash array.
---
 doc/src/sgml/config.sgml                      |  19 ++++
 src/backend/tcop/postgres.c                   |   2 +
 src/backend/utils/cache/catcache.c            | 121 +++++++++++++++++++++++++-
 src/backend/utils/misc/guc.c                  |  12 +++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/utils/catcache.h                  |  17 ++++
 6 files changed, 169 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d383de2512..4231235447 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1677,6 +1677,25 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-catalog-cache-prune-min-age" xreflabel="catalog_cache_prune_min_age">
+      <term><varname>catalog_cache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>catalog_cache_prune_min_age</varname> configuration
+       parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+         Specifies the minimum amount of unused time in seconds after which a
+         system catalog cache entry is removed. -1 means that this feature
+         is entirely disabled. The value defaults to 300 seconds (<literal>5
+         minutes</literal>). Entries that are not used for this duration
+         can be removed to prevent the catalog cache from bloating with
+         useless entries.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index f9ce3d8f22..acab473d34 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -71,6 +71,7 @@
 #include "tcop/pquery.h"
 #include "tcop/tcopprot.h"
 #include "tcop/utility.h"
+#include "utils/catcache.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -2575,6 +2576,7 @@ start_xact_command(void)
      * not desired, the timeout has to be disabled explicitly.
      */
     enable_statement_timeout();
+    SetCatCacheClock(GetCurrentStatementStartTimestamp());
 }
 
 static void
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index d05930bc4c..c4582fe5a3 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -38,6 +38,7 @@
 #include "utils/rel.h"
 #include "utils/resowner_private.h"
 #include "utils/syscache.h"
+#include "utils/timeout.h"
 
 
  /* #define CACHEDEBUG */    /* turns DEBUG elogs on */
@@ -60,9 +61,18 @@
 #define CACHE_elog(...)
 #endif
 
+/*
+ * GUC variable to define the minimum age of entries that will be considered
+ * to be evicted in seconds. -1 to disable the feature.
+ */
+int catalog_cache_prune_min_age = 300;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Clock for the last accessed time of a catcache entry. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -850,9 +860,99 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize catcache reference clock if haven't done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries happen to be left unused for a long time for several
+ * reasons. Remove such entries to prevent the catcache from bloating. It is
+ * based on an algorithm similar to buffer eviction. Entries that are accessed
+ * several times in a certain period live longer than those that have had less
+ * access in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int        nremoved = 0;
+    int        i;
+    long    oldest_ts = catcacheclock;
+    long    age;
+    int        us;
+
+    /* Return immediately if disabled */
+    if (catalog_cache_prune_min_age == 0)
+        return false;
+
+    /* Don't scan the hash when we know we don't have prunable entries */
+    TimestampDifference(cp->cc_oldest_ts, catcacheclock, &age, &us);
+    if (age < catalog_cache_prune_min_age)
+        return false;
+
+    /* Scan over the whole hash to find entries to remove */
+    for (i = 0 ; i < cp->cc_nbuckets ; i++)
+    {
+        dlist_mutable_iter    iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+
+            /* Don't remove referenced entries */
+            if (ct->refcount == 0 &&
+                (ct->c_list == NULL || ct->c_list->refcount == 0))
+            {
+                /*
+                 * Calculate the duration from the last access to the
+                 * "current" time. catcacheclock is updated on a per-statement
+                 * basis and additionally updated periodically during a
+                 * long-running query.
+                 */
+                TimestampDifference(ct->lastaccess, catcacheclock, &age, &us);
+
+                if (age >= catalog_cache_prune_min_age)
+                {
+                    /*
+                     * Entries that have not been accessed since the last
+                     * pruning are removed this time; otherwise their lives
+                     * are prolonged, up to three times the duration,
+                     * according to how often they were accessed. We don't
+                     * try to shrink the buckets since pruning effectively
+                     * caps catcache expansion in the long term.
+                     */
+                    if (ct->naccess > 0)
+                        ct->naccess--;
+                    else
+                    {
+                        CatCacheRemoveCTup(cp, ct);
+                        nremoved++;
+
+                        /* don't update oldest_ts by removed entry */
+                        continue;
+                    }
+                }
+            }
+
+            /* update oldest timestamp if the entry remains alive */
+            if (ct->lastaccess < oldest_ts)
+                oldest_ts = ct->lastaccess;
+        }
+    }
+
+    cp->cc_oldest_ts = oldest_ts;
+
+    if (nremoved > 0)
+        elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d",
+             cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved);
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -1264,6 +1364,12 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /* prolong life of this entry */
+        if (ct->naccess < 2)
+            ct->naccess++;
+
+        ct->lastaccess = catcacheclock;
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1888,19 +1994,28 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
     cache->cc_ntup++;
     CacheHdr->ch_ntup++;
 
+    /* increase refcount so that the new entry survives pruning */
+    ct->refcount++;
+
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, try removing infrequently used
+     * entries to make room for the new entry. If that fails, enlarge the bucket
+     * array instead.  Quite arbitrarily, we try this when fill factor > 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
 
+    ct->refcount--;
+
     return ct;
 }
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index aa564d153a..e624c74bf9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -82,6 +82,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/memutils.h"
@@ -2202,6 +2203,17 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("System catalog cache entries that live unused for longer than this seconds are considered
forremoval."),
 
+            gettext_noop("The value of -1 turns off pruning."),
+            GUC_UNIT_S
+        },
+        &catalog_cache_prune_min_age,
+        300, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index cccb5f145a..fa117f0573 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -128,6 +128,7 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#catalog_cache_prune_min_age = 300s    # -1 disables pruning
 #max_stack_depth = 2MB            # min 100kB
 #shared_memory_type = mmap        # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 65d816a583..2f697d5ca4 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,7 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
+    long        cc_oldest_ts;
 
     /*
      * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,6 +121,8 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
+    int            naccess;        /* # of accesses to this entry, up to 2 */
+    TimestampTz    lastaccess;        /* timestamp of the last usage */
 
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
@@ -189,6 +193,19 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int catalog_cache_prune_min_age;
+
+/* source clock for access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/* SetCatCacheClock - set catcache timestamp source clock */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.16.3



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
At Fri, 29 Mar 2019 17:24:40 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20190329.172440.199616830.horiguchi.kyotaro@lab.ntt.co.jp>
> I ran three artificial test cases. The database is created by
> gen_tbl.pl. Numbers are the average of the fastest five runs in
> successive 15 runs.
> 
> Test cases are listed below.
> 
> 1_0. About 3,000,000 negative entries are created in pg_statistic
>   cache by scanning that many distinct columns. It is 3000 tables
>   * 1001 columns. Pruning scans happen several times while a run
>   but no entries are removed. This emulates the bloating phase of
>   cache. catalog_cache_prune_min_age is default (300s).
>   (access_tbl1.pl)
> 
> 1_1. Same as 1_0 except that catalog_cache_prune_min_age is 0,
>   which means it is turned off.
> 
> 2_0. Repeatedly access 1001 of the 3,000,000 entries 6000
>   times. This emulates the stable cache case without having
>   pruning. catalog_cache_prune_min_age is default (300s).
>  (access_tbl2.pl)
> 
> 2_1. Same as 2_0 except that catalog_cache_prune_min_age is 0,
>   which means it is turned off.
> 
> 3_0. Scan over the 3,000,000 entries twice with prune_age set
>   to 10s. A run takes about 18 seconds on my box, so a fair amount
>   of old entries are removed. This emulates the stable case with
>   continuous pruning. (access_tbl3.pl)
> 
> 3_1. Same as 3_0 except that catalog_cache_prune_min_age is 0,
>   which means it is turned off.
> 
> 
> The result follows.
> 
>      | master |  LRU   |  Full  |Full-mod|
> -----|--------+--------+--------+--------+
>  1_0 | 17.287 | 17.370 | 17.255 | 16.623 |
>  1_1 | 17.287 | 17.063 | 16.336 | 17.192 |
>  2_0 | 15.695 | 18.769 | 18.563 | 15.527 |
>  2_1 | 15.695 | 18.603 | 18.498 | 18.487 |
>  3_0 | 26.576 | 33.817 | 34.384 | 34.971 |
>  3_1 | 26.576 | 27.462 | 26.202 | 26.368 |
> 
> The results of 2_0 and 2_1 seem strange, but I'm showing the
> numbers as they are for now.
> 
> - Full-scan seems to have the smallest impact when turned off.
> 
> - Full-scan-mod seems to perform best in total (assuming the
>   Full-mod 2_0 number is a wrong value..)
> 
> - LRU doesn't seem to outperform full scanning.

I had another.. unstable..  result.

     | master |  LRU   |  Full  |Full-mod|
-----|--------+--------+--------+--------+
 1_0 | 16.312 | 16.540 | 16.482 | 16.348 |
 1_1 | 16.312 | 16.454 | 16.335 | 16.232 |
 2_0 | 16.710 | 16.954 | 17.873 | 17.345 |
 2_1 | 16.710 | 17.373 | 18.499 | 17.563 |
 3_0 | 25.010 | 33.031 | 33.452 | 33.937 |
 3_1 | 25.010 | 24.784 | 24.570 | 25.453 |


Normalizing against master's result and rounding to the nearest 1%,
it looks like this:

     | master |  LRU   |  Full  |Full-mod|  Test description
-----|--------+--------+--------+--------+-----------------------------------
 1_0 |   100  |   101  |   101  |   100  |   bloating. pruning enabled.
 1_1 |   100  |   101  |   100  |   100  |   bloating. pruning disabled.
 2_0 |   100  |   101  |   107  |   104  |   normal access. pruning enabled.
 2_1 |   100  |   104  |   111  |   105  |   normal access. pruning disabled.
 3_0 |   100  |   132  |   134  |   136  |   pruning continuously running.
 3_1 |   100  |    99  |    98  |   102  |   pruning disabled.

I'm not sure why 2_1 is slower than 2_0, but LRU has the least
impact if the numbers are right.

I will investigate the strange behavior using a profiler.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
At Mon, 01 Apr 2019 11:05:32 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20190401.110532.102998353.horiguchi.kyotaro@lab.ntt.co.jp>
> At Fri, 29 Mar 2019 17:24:40 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
> <20190329.172440.199616830.horiguchi.kyotaro@lab.ntt.co.jp>
> > I ran three artificial test cases. The database is created by
> > gen_tbl.pl. Numbers are the average of the fastest five runs in
> > successive 15 runs.
> > 
> > Test cases are listed below.
> > 
> > 1_0. About 3,000,000 negative entries are created in pg_statistic
> >   cache by scanning that many distinct columns. It is 3000 tables
> >   * 1001 columns. Pruning scans happen several times while a run
> >   but no entries are removed. This emulates the bloating phase of
> >   cache. catalog_cache_prune_min_age is default (300s).
> >   (access_tbl1.pl)
> > 
> > 1_1. Same as 1_0 except that catalog_cache_prune_min_age is 0,
> >   which means it is turned off.
> > 
> > 2_0. Repeatedly access 1001 of the 3,000,000 entries 6000
> >   times. This emulates the stable cache case without having
> >   pruning. catalog_cache_prune_min_age is default (300s).
> >  (access_tbl2.pl)
> > 
> > 2_1. Same as 2_0 except that catalog_cache_prune_min_age is 0,
> >   which means it is turned off.
> > 
> > 3_0. Scan over the 3,000,000 entries twice with prune_age set
> >   to 10s. A run takes about 18 seconds on my box, so a fair amount
> >   of old entries are removed. This emulates the stable case with
> >   continuous pruning. (access_tbl3.pl)
> > 
> > 3_1. Same as 3_0 except that catalog_cache_prune_min_age is 0,
> >   which means it is turned off.
..
> I had another.. unstable..  result.

dlist_move_head is used every time an entry is accessed. It moves
the accessed element to the top of its bucket, expecting that
subsequent accesses become faster - a kind of LRU maintenance. But
the mean length of a bucket is 2, so dlist_move_head costs more
than following one extra link. So I removed it in the pruning
patch.
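
(For illustration only, a simplified sketch of the hot path in question;
USE_MOVE_TO_FRONT below is a made-up marker for the line the patch drops,
and this is not a drop-in replacement for SearchCatCacheInternal.)

    dlist_iter    iter;

    dlist_foreach(iter, bucket)
    {
        CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);

        if (ct->dead || ct->hash_value != hashValue)
            continue;
        if (!CatalogCacheCompareTuple(cache, nkeys, ct->keys, arguments))
            continue;

#ifdef USE_MOVE_TO_FRONT
        /* the per-hit work being discussed: unlink plus relink at the head */
        dlist_move_head(bucket, &ct->cache_elem);
#endif
        break;                    /* cache hit */
    }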

I understand I cannot get rid of noise as far as I'm poking the
feature from a client via the communication and SQL layers.

The attached extension surgically exercises
SearchSysCache3(STATRELATTINH) in almost the same pattern as the
benchmarks taken last week.  I believe that gives far more
reliable numbers. But the numbers still fluctuate by up to about
10% every trial, and the difference among the methods is within
that fluctuation. I'm tired.. But this still looks somewhat wrong.

The ratio in the following table is the percentage relative to
master for the same test. master2 is a version with the
dlist_move_head call removed from master.

 binary  | test | count |   avg   | stddev | ratio
---------+------+-------+---------+--------+--------
 master  | 1_0  |     5 | 7841.42 |   6.91
 master  | 2_0  |     5 | 3810.10 |   8.51
 master  | 3_0  |     5 | 7826.17 |  11.98
 master  | 1_1  |     5 | 7905.73 |   5.69
 master  | 2_1  |     5 | 3827.15 |   5.55
 master  | 3_1  |     5 | 7822.67 |  13.75
---------+------+-------+---------+--------+--------
 master2 | 1_0  |     5 | 7538.05 |  16.65 |  96.13
 master2 | 2_0  |     5 | 3927.05 |  11.58 | 103.07
 master2 | 3_0  |     5 | 7455.47 |  12.03 |  95.26
 master2 | 1_1  |     5 | 7485.60 |   9.38 |  94.69
 master2 | 2_1  |     5 | 3870.81 |   5.54 | 101.14
 master2 | 3_1  |     5 | 7437.35 |   9.91 |  95.74
---------+------+-------+---------+--------+--------
 LRU     | 1_0  |     5 | 7633.57 |   9.00 |  97.35
 LRU     | 2_0  |     5 | 4062.43 |   5.90 | 106.62
 LRU     | 3_0  |     5 | 8340.51 |   6.12 | 106.57
 LRU     | 1_1  |     5 | 7645.87 |  13.29 |  96.71
 LRU     | 2_1  |     5 | 4026.60 |   7.56 | 105.21
 LRU     | 3_1  |     5 | 8400.10 |  19.07 | 107.38
---------+------+-------+---------+--------+--------
 Full    | 1_0  |     5 | 7481.61 |   6.70 |  95.41
 Full    | 2_0  |     5 | 4084.46 |  14.50 | 107.20
 Full    | 3_0  |     5 | 8166.23 |  14.80 | 104.35
 Full    | 1_1  |     5 | 7447.20 |  10.93 |  94.20
 Full    | 2_1  |     5 | 4016.88 |   8.53 | 104.96
 Full    | 3_1  |     5 | 8258.80 |   7.91 | 105.58
---------+------+-------+---------+--------+--------
 FullMod | 1_0  |     5 | 7291.80 |  14.03 |  92.99
 FullMod | 2_0  |     5 | 4006.36 |   7.64 | 105.15
 FullMod | 3_0  |     5 | 8143.60 |   9.26 | 104.06
 FullMod | 1_1  |     5 | 7270.66 |   6.24 |  91.97
 FullMod | 2_1  |     5 | 3996.20 |  13.00 | 104.42
 FullMod | 3_1  |     5 | 8012.55 |   7.09 | 102.43



So "Full (scan) Mod" wins again, or the diffence is under error.

I don't think this level of difference can be a reason to reject
this kind of resource-saving mechanism. The LRU version doesn't seem
particularly slow, but also doesn't seem particularly fast given its
complexity. The FullMod version doesn't look any different.

So it seems to me that the simplest "Full" version wins. The
attached is the rebased version. dlist_move_head(entry) is removed
in that patch, as mentioned above.

The third and fourth attachments are the set of scripts I used.

$ perl gen_tbl.pl | psql postgres
$ run.sh > log.txt

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 57c9dab7fff7b81890657594711bbfb47a3e0f0d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 1 Mar 2019 13:32:51 +0900
Subject: [PATCH 1/2] Remove entries that haven't been used for a certain time

Catcache entries can be left unused for long periods for several
reasons. It is not desirable that such useless entries eat up memory.
The catcache pruning feature removes entries that haven't been accessed
for a certain time before enlarging the hash array.
---
 doc/src/sgml/config.sgml                      |  19 ++++
 src/backend/tcop/postgres.c                   |   2 +
 src/backend/utils/cache/catcache.c            | 124 +++++++++++++++++++++++++-
 src/backend/utils/misc/guc.c                  |  12 +++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/utils/catcache.h                  |  18 ++++
 6 files changed, 172 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index bc1d0f7bfa..819b252029 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1677,6 +1677,25 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-catalog-cache-prune-min-age" xreflabel="catalog_cache_prune_min_age">
+      <term><varname>catalog_cache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>catalog_cache_prune_min_age</varname> configuration
+       parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+         Specifies the minimum amount of time in seconds that a system
+         catalog cache entry must remain unused before it can be removed.
+         -1 disables this feature entirely. The value defaults to 300
+         seconds (<literal>5 minutes</literal>). Entries that have not been
+         used for this duration can be removed to keep the catalog cache
+         from bloating with useless entries.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 44a59e1d4f..a0efac86bc 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -71,6 +71,7 @@
 #include "tcop/pquery.h"
 #include "tcop/tcopprot.h"
 #include "tcop/utility.h"
+#include "utils/catcache.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -2577,6 +2578,7 @@ start_xact_command(void)
      * not desired, the timeout has to be disabled explicitly.
      */
     enable_statement_timeout();
+    SetCatCacheClock(GetCurrentStatementStartTimestamp());
 }
 
 static void
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index d05930bc4c..e85f2b038c 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -38,6 +38,7 @@
 #include "utils/rel.h"
 #include "utils/resowner_private.h"
 #include "utils/syscache.h"
+#include "utils/timeout.h"
 
 
  /* #define CACHEDEBUG */    /* turns DEBUG elogs on */
@@ -60,9 +61,24 @@
 #define CACHE_elog(...)
 #endif
 
+/*
+ * GUC variable defining the minimum age, in seconds, at which entries are
+ * considered for eviction. -1 disables the feature.
+ */
+int catalog_cache_prune_min_age = 300;
+
+/*
+ * Minimum interval between two successive moves of a cache entry in LRU list,
+ * in microseconds.
+ */
+#define MIN_LRU_UPDATE_INTERVAL 100000    /* 100ms */
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Clock for the last accessed time of a catcache entry. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -473,6 +489,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
 
     /* delink from linked list */
     dlist_delete(&ct->cache_elem);
+    dlist_delete(&ct->lru_node);
 
     /*
      * Free keys when we're dealing with a negative entry, normal entries just
@@ -833,6 +850,7 @@ InitCatCache(int id,
     cp->cc_nkeys = nkeys;
     for (i = 0; i < nkeys; ++i)
         cp->cc_keyno[i] = key[i];
+    dlist_init(&cp->cc_lru_list);
 
     /*
      * new cache is initialized as far as we can go for now. print some
@@ -850,9 +868,83 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize catcache reference clock if haven't done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left unused for a long time for several reasons.
+ * Remove such entries to prevent the catcache from bloating. The approach is
+ * similar to the buffer eviction algorithm: entries that are accessed several
+ * times within a certain period live longer than those that have had fewer
+ * accesses in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int            nremoved = 0;
+    dlist_mutable_iter    iter;
+
+    /* Return immediately if disabled */
+    if (catalog_cache_prune_min_age == 0)
+        return false;
+
+    /* Scan over LRU to find entries to remove */
+    dlist_foreach_modify(iter, &cp->cc_lru_list)
+    {
+        CatCTup    *ct = dlist_container(CatCTup, lru_node, iter.cur);
+        long        entry_age;
+        int            us;
+
+        /* Don't remove referenced entries */
+        if (ct->refcount != 0 ||
+            (ct->c_list && ct->c_list->refcount != 0))
+            continue;
+
+        /*
+         * Calculate the duration from the last access to the "current"
+         * time. catcacheclock is updated on a per-statement
+         * basis.
+         */
+        TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+        if (entry_age < catalog_cache_prune_min_age)
+        {
+            /*
+             * We don't have older entries, exit.  At least one removal
+             * prevents rehashing this time.
+             */
+            break;
+        }
+
+        /*
+         * Entries that have not been accessed since the last pruning are
+         * removed this time; otherwise their lives are prolonged, up to
+         * three times the duration, according to how often they were
+         * accessed. We don't try to shrink the buckets since pruning
+         * effectively caps catcache expansion in the long term.
+         */
+        if (ct->naccess > 0)
+            ct->naccess--;
+        else
+        {
+            CatCacheRemoveCTup(cp, ct);
+            nremoved++;
+        }
+    }
+
+    if (nremoved > 0)
+        elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d",
+             cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved);
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -1262,7 +1354,21 @@ SearchCatCacheInternal(CatCache *cache,
          * most frequently accessed elements in any hashbucket will tend to be
          * near the front of the hashbucket's list.)
          */
-        dlist_move_head(bucket, &ct->cache_elem);
+        /* dlist_move_head(bucket, &ct->cache_elem);*/
+
+        /* prolong life of this entry */
+        if (ct->naccess < 2)
+            ct->naccess++;
+
+        /*
+         * Don't update the LRU too frequently. We need to maintain the LRU
+         * even if pruning is inactive since it can be turned on mid-session.
+         */
+        if (catcacheclock - ct->lastaccess > MIN_LRU_UPDATE_INTERVAL)
+        {
+            ct->lastaccess = catcacheclock;
+            dlist_move_tail(&cache->cc_lru_list, &ct->lru_node);
+        }
 
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
@@ -1888,19 +1994,29 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
+    dlist_push_tail(&cache->cc_lru_list, &ct->lru_node);
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
     cache->cc_ntup++;
     CacheHdr->ch_ntup++;
 
+    /* increase refcount so that the new entry survives pruning */
+    ct->refcount++;
+
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, try removing infrequently used
+     * entries to make room for the new entry. If that fails, enlarge the bucket
+     * array instead.  Quite arbitrarily, we try this when fill factor > 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
 
+    ct->refcount--;
+
     return ct;
 }
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1766e46037..e671d4428e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -82,6 +82,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/memutils.h"
@@ -2249,6 +2250,17 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("System catalog cache entries that live unused for longer than this seconds are considered
forremoval."),
 
+            gettext_noop("The value of -1 turns off pruning."),
+            GUC_UNIT_S
+        },
+        &catalog_cache_prune_min_age,
+        300, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index bbbeb4bb15..d88ec57382 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -128,6 +128,7 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#catalog_cache_prune_min_age = 300s    # -1 disables pruning
 #max_stack_depth = 2MB            # min 100kB
 #shared_memory_type = mmap        # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 65d816a583..a21c53644a 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,7 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
+    dlist_head    cc_lru_list;
 
     /*
      * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,6 +121,9 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
+    int            naccess;        /* # of accesses to this entry, up to 2 */
+    TimestampTz    lastaccess;        /* timestamp of the last usage */
+    dlist_node    lru_node;        /* LRU node */
 
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
@@ -189,6 +194,19 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int catalog_cache_prune_min_age;
+
+/* source clock for access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/* SetCatCacheClock - set catcache timestamp source clock */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.16.3

From ac4a9dc1bb822f9df36d453354b953d2b383545d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 4 Apr 2019 21:16:17 +0900
Subject: [PATCH 2/2] Benchmark extension for catcache pruning feature.

This extension surgically exercises CatCacheSearch() on STATRELATTINH
and returns the duration in milliseconds.
---
 contrib/catcachebench/Makefile               |  17 ++
 contrib/catcachebench/catcachebench--0.0.sql |   9 ++
 contrib/catcachebench/catcachebench.c        | 229 +++++++++++++++++++++++++++
 contrib/catcachebench/catcachebench.control  |   6 +
 4 files changed, 261 insertions(+)
 create mode 100644 contrib/catcachebench/Makefile
 create mode 100644 contrib/catcachebench/catcachebench--0.0.sql
 create mode 100644 contrib/catcachebench/catcachebench.c
 create mode 100644 contrib/catcachebench/catcachebench.control

diff --git a/contrib/catcachebench/Makefile b/contrib/catcachebench/Makefile
new file mode 100644
index 0000000000..0478818b25
--- /dev/null
+++ b/contrib/catcachebench/Makefile
@@ -0,0 +1,17 @@
+MODULE_big = catcachebench
+OBJS = catcachebench.o
+
+EXTENSION = catcachebench
+DATA = catcachebench--0.0.sql
+PGFILEDESC = "catcachebench - benchmark for catcache pruning feature"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/catcachebench
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/catcachebench/catcachebench--0.0.sql b/contrib/catcachebench/catcachebench--0.0.sql
new file mode 100644
index 0000000000..e091baaaa7
--- /dev/null
+++ b/contrib/catcachebench/catcachebench--0.0.sql
@@ -0,0 +1,9 @@
+/* contrib/catcachebench/catcachebench--0.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION catcachebench" to load this file. \quit
+
+CREATE FUNCTION catcachebench(IN type int)
+RETURNS double precision
+AS 'MODULE_PATHNAME', 'catcachebench'
+LANGUAGE C STRICT VOLATILE;
diff --git a/contrib/catcachebench/catcachebench.c b/contrib/catcachebench/catcachebench.c
new file mode 100644
index 0000000000..36d21d13c1
--- /dev/null
+++ b/contrib/catcachebench/catcachebench.c
@@ -0,0 +1,229 @@
+/*
+ * catcachebench: test code for cache pruning feature
+ */
+#include "postgres.h"
+#include "catalog/pg_type.h"
+#include "catalog/pg_statistic.h"
+#include "executor/spi.h"
+#include "utils/catcache.h"
+#include "utils/syscache.h"
+
+Oid        tableoids[10000];
+int        ntables = 0;
+int16    attnums[1000];
+int        natts = 0;
+
+PG_MODULE_MAGIC;
+
+double catcachebench1(void);
+double catcachebench2(void);
+double catcachebench3(void);
+void collectinfo(void);
+void catcachewarmup(void);
+
+PG_FUNCTION_INFO_V1(catcachebench);
+
+Datum
+catcachebench(PG_FUNCTION_ARGS)
+{
+    int        testtype = PG_GETARG_INT32(0);
+    double    ms;
+
+    collectinfo();
+
+    /* flush the catalog -- safe? don't mind. */
+    CatalogCacheFlushCatalog(StatisticRelationId);
+
+    switch (testtype)
+    {
+    case 0:
+        catcachewarmup(); /* prewarm of syscatalog */
+        PG_RETURN_NULL();
+    case 1:
+        ms = catcachebench1(); break;
+    case 2:
+        ms = catcachebench2(); break;
+    case 3:
+        ms = catcachebench3(); break;
+    default:
+        elog(ERROR, "Invalid test type: %d", testtype);
+    }
+
+    PG_RETURN_DATUM(Float8GetDatum(ms));
+}
+
+double
+catcachebench1(void)
+{
+    int t, a;
+    instr_time    start,
+                duration;
+
+    INSTR_TIME_SET_CURRENT(start);
+    for (t = 0 ; t < ntables ; t++)
+    {
+        for (a = 0 ; a < natts ; a++)
+        {
+            HeapTuple tup;
+
+            tup = SearchSysCache3(STATRELATTINH,
+                                  ObjectIdGetDatum(tableoids[t]),
+                                  Int16GetDatum(attnums[a]),
+                                  BoolGetDatum(false));
+            /* should be null, but.. */
+            if (HeapTupleIsValid(tup))
+                ReleaseSysCache(tup);
+        }
+    }
+    INSTR_TIME_SET_CURRENT(duration);
+    INSTR_TIME_SUBTRACT(duration, start);
+
+    return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+double
+catcachebench2(void)
+{
+    int t, a;
+    instr_time    start,
+                duration;
+
+    INSTR_TIME_SET_CURRENT(start);
+    for (t = 0 ; t < 60000 ; t++)
+    {
+        for (a = 0 ; a < natts ; a++)
+        {
+            HeapTuple tup;
+
+            tup = SearchSysCache3(STATRELATTINH,
+                                  ObjectIdGetDatum(tableoids[0]),
+                                  Int16GetDatum(attnums[a]),
+                                  BoolGetDatum(false));
+            /* should be null, but.. */
+            if (HeapTupleIsValid(tup))
+                ReleaseSysCache(tup);
+        }
+    }
+    INSTR_TIME_SET_CURRENT(duration);
+    INSTR_TIME_SUBTRACT(duration, start);
+
+    return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+double
+catcachebench3(void)
+{
+    int i, t, a;
+    instr_time    start,
+                duration;
+
+    INSTR_TIME_SET_CURRENT(start);
+    for (i = 0 ; i < 2 ; i++)
+    {
+        for (t = 0 ; t < ntables ; t++)
+        {
+            for (a = 0 ; a < natts ; a++)
+            {
+                HeapTuple tup;
+
+                tup = SearchSysCache3(STATRELATTINH,
+                                      ObjectIdGetDatum(tableoids[t]),
+                                      Int16GetDatum(attnums[a]),
+                                      BoolGetDatum(false));
+                /* should be null, but.. */
+                if (HeapTupleIsValid(tup))
+                    ReleaseSysCache(tup);
+            }
+        }
+    }
+    INSTR_TIME_SET_CURRENT(duration);
+    INSTR_TIME_SUBTRACT(duration, start);
+
+    return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+void
+catcachewarmup(void)
+{
+    int t, a;
+
+    /* load up catalog tables */
+    for (t = 0 ; t < ntables ; t++)
+    {
+        for (a = 0 ; a < natts ; a++)
+        {
+            HeapTuple tup;
+
+            tup = SearchSysCache3(STATRELATTINH,
+                                  ObjectIdGetDatum(tableoids[t]),
+                                  Int16GetDatum(attnums[a]),
+                                  BoolGetDatum(false));
+            /* should be null, but.. */
+            if (HeapTupleIsValid(tup))
+                ReleaseSysCache(tup);
+        }
+    }
+}
+
+void
+collectinfo(void)
+{
+    int ret;
+    Datum    values[10000];
+    bool    nulls[10000];
+    Oid        types0[] = {OIDOID};
+    int i;
+
+    ntables = 0;
+    natts = 0;
+
+    SPI_connect();
+    /* collect target tables */
+    ret = SPI_execute("select oid from pg_class where relnamespace = (select oid from pg_namespace where nspname =
\'test\')",
+                      true, 0);
+    if (ret != SPI_OK_SELECT)
+        elog(ERROR, "Failed 1");
+    if (SPI_processed == 0)
+        elog(ERROR, "no relation found in schema \"test\"");
+    if (SPI_processed > 10000)
+        elog(ERROR, "too many relation found in schema \"test\"");
+
+    for (i = 0 ; i < SPI_processed ; i++)
+    {
+        heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc,
+                          values, nulls);
+        if (nulls[0])
+            elog(ERROR, "Failed 2");
+
+        tableoids[ntables++] = DatumGetObjectId(values[0]);
+    }
+    SPI_finish();
+    elog(DEBUG1, "%d tables found", ntables);
+
+    values[0] = ObjectIdGetDatum(tableoids[0]);
+    nulls[0] = false;
+    SPI_connect();
+    ret = SPI_execute_with_args("select attnum from pg_attribute where attrelid = (select oid from pg_class where oid = $1)",
+                                1, types0, values, NULL, true, 0);
+    if (SPI_processed == 0)
+        elog(ERROR, "no attribute found in table %d", tableoids[0]);
+    if (SPI_processed > 10000)
+        elog(ERROR, "too many relation found in table %d", tableoids[0]);
+    
+    /* collect target attributes. assuming all tables have the same attnums */
+    for (i = 0 ; i < SPI_processed ; i++)
+    {
+        int16 attnum;
+
+        heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc,
+                          values, nulls);
+        if (nulls[0])
+            elog(ERROR, "Failed 3");
+        attnum = DatumGetInt16(values[0]);
+
+        if (attnum > 0)
+            attnums[natts++] = attnum;
+    }
+    SPI_finish();
+    elog(DEBUG1, "%d attributes found", natts);
+}
diff --git a/contrib/catcachebench/catcachebench.control b/contrib/catcachebench/catcachebench.control
new file mode 100644
index 0000000000..3fc9d2e420
--- /dev/null
+++ b/contrib/catcachebench/catcachebench.control
@@ -0,0 +1,6 @@
+# catcachebench
+
+comment = 'benchmark for catcache pruning'
+default_version = '0.0'
+module_pathname = '$libdir/catcachebench'
+relocatable = true
-- 
2.16.3

#! /usr/bin/perl
$collist = "";
foreach $i (0..1000) {
    $collist .= sprintf(", c%05d int", $i);
}
$collist = substr($collist, 2);

printf "drop schema if exists test cascade;\n";
printf "create schema test;\n";
foreach $i (0..2999) {
    printf "create table test.t%04d ($collist);\n", $i;
}
#!/bin/bash
LOOPS=5
BINROOT=/home/horiguti/bin
DATADIR=/home/horiguti/data/data_work_o2
PREC="numeric(10,2)"

killall postgres
sleep 3

run() {
    local BINARY=$1
    local PGCTL=$2/bin/pg_ctl

    if [ "$3" != "" ]; then
      local SETTING1="set catalog_cache_prune_min_age to \"$3\";"
      local SETTING2="set catalog_cache_prune_min_age to \"$4\";"
      local SETTING3="set catalog_cache_prune_min_age to \"$5\";"
    fi

    $PGCTL --pgdata=$DATADIR start
    psql postgres -e <<EOF
create extension if not exists catcachebench;
select catcachebench(0);

$SETTING1

select '${BINARY}' as binary, '1_0' as test, count(a), avg(a)::${PREC}, stddev(a)::${PREC} from (select catcachebench(1) from generate_series(1, ${LOOPS})) as a(a)
UNION ALL select '${BINARY}', '2_0' , count(a), avg(a)::${PREC}, stddev(a)::${PREC} from (select catcachebench(2) from generate_series(1,${LOOPS})) as a(a);

$SETTING2

select '${BINARY}' as binary, '3_0' as test, count(a), avg(a)::${PREC}, stddev(a)::${PREC} from (select catcachebench(3) from generate_series(1, ${LOOPS})) as a(a);

$SETTING3

select '${BINARY}' as binary, '1_1' as test, count(a), avg(a)::${PREC}, stddev(a)::${PREC} from (select catcachebench(1) from generate_series(1, ${LOOPS})) as a(a)
UNION ALL select '${BINARY}', '2_1' , count(a), avg(a)::${PREC}, stddev(a)::${PREC} from (select catcachebench(2) from generate_series(1,${LOOPS})) as a(a)
UNION ALL select '${BINARY}', '3_1' , count(a), avg(a)::${PREC}, stddev(a)::${PREC} from (select catcachebench(3) from generate_series(1,${LOOPS})) as a(a);
 

EOF
    $PGCTL --pgdata=$DATADIR stop
}

run "master" $BINROOT/pgsql_work_o2 "" "" ""
run "master2" $BINROOT/pgsql_mater_o2m "" "" ""
run "LRU" $BINROOT/pgsql_catexp8_1 "300s" "1s" "0"
run "Full" $BINROOT/pgsql_catexp8_2 "300s" "1s" "0"
run "FullMod" $BINROOT/pgsql_catexp8_3 "300s" "1s" "0"

Re: Protect syscache from bloating with negative cache entries

От
Robert Haas
Дата:
On Thu, Apr 4, 2019 at 8:53 AM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> So it seems to me that the simplest "Full" version wins. The
> attached is rebsaed version. dlist_move_head(entry) is removed as
> mentioned above in that patch.

1. I really don't think this patch has any business changing the
existing logic.  You can't just assume that the dlist_move_head()
operation is unimportant for performance.

2. This patch still seems to add a new LRU list that has to be
maintained.  That's fairly puzzling.  You seem to have concluded that
the version without the additional LRU wins, but then sent a new copy
of the version with the LRU.

3. I don't think adding an additional call to GetCurrentTimestamp() in
start_xact_command() is likely to be acceptable.  There has got to be
a way to set this up so that the maximum number of new
GetCurrentTimestamp() is limited to once per N seconds, vs. the
current implementation that could do it many many many times per
second.

4. The code in CatalogCacheCreateEntry seems clearly unacceptable.  In
a pathological case where CatCacheCleanupOldEntries removes exactly
one element per cycle, it could be called on every new catcache
allocation.

I think we need to punt this patch to next release.  We're not
converging on anything committable very fast.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
Thank you for the comment.

At Thu, 4 Apr 2019 15:44:35 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoZQx7pCcc=VO3WeDQNpco8h6MZN09KjcOMRRu_CrbeoSw@mail.gmail.com>
> On Thu, Apr 4, 2019 at 8:53 AM Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > So it seems to me that the simplest "Full" version wins. The
> > attached is rebsaed version. dlist_move_head(entry) is removed as
> > mentioned above in that patch.
> 
> 1. I really don't think this patch has any business changing the
> existing logic.  You can't just assume that the dlist_move_head()
> operation is unimportant for performance.

Ok, it doesn't show a significant performance gain, so I removed that.

> 2. This patch still seems to add a new LRU list that has to be
> maintained.  That's fairly puzzling.  You seem to have concluded that
> the version without the additional LRU wins, but the sent a new copy
> of the version with the LRU version.

Sorry, I attached the wrong one. The attached is the right one, which
doesn't add the new dlist.

> 3. I don't think adding an additional call to GetCurrentTimestamp() in
> start_xact_command() is likely to be acceptable.  There has got to be
> a way to set this up so that the maximum number of new
> GetCurrentTimestamp() is limited to once per N seconds, vs. the
> current implementation that could do it many many many times per
> second.

GetCurrentTimestamp() is called only once, very early in the
backend's life, in InitPostgres, not in start_xact_command. What
I do in that function is just copy stmtStartTimestamp, not call
GetCurrentTimestamp().
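
(For illustration, if an explicit once-per-N-seconds cap were still wanted
on top of that, a minimal sketch could look like the following;
MaybeUpdateCatCacheClock and the 1000 ms threshold are made up here and
are not part of the patch.)

/* hypothetical wrapper around the patch's SetCatCacheClock() call */
static TimestampTz last_catcacheclock_update = 0;

static void
MaybeUpdateCatCacheClock(void)
{
    /* reuse the statement start time; this adds no extra gettimeofday() */
    TimestampTz    ts = GetCurrentStatementStartTimestamp();

    if (!TimestampDifferenceExceeds(last_catcacheclock_update, ts, 1000))
        return;                    /* updated less than a second ago */

    last_catcacheclock_update = ts;
    SetCatCacheClock(ts);
}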

> 4. The code in CatalogCacheCreateEntry seems clearly unacceptable.  In
> a pathological case where CatCacheCleanupOldEntries removes exactly
> one element per cycle, it could be called on every new catcache
> allocation.

It could be a problem if just one entry were created over a
period longer than catalog_cache_prune_min_age plus the resize
interval, or if all candidate entries except one were actually in
use at the pruning moment. Is that realistic?
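
(If that pathological case still needs a defense, one hypothetical guard
would be to retry pruning only after the cache has grown by some amount
since the last attempt; cc_last_prune_ntup below would be a new CatCache
field and does not exist in the posted patch.)

    /* sketch of the test in CatalogCacheCreateEntry(), with a growth guard */
    if (cache->cc_ntup > cache->cc_nbuckets * 2)
    {
        bool        pruned = false;

        /* don't retry pruning until the cache grew by half a bucket array */
        if (cache->cc_ntup >=
            cache->cc_last_prune_ntup + cache->cc_nbuckets / 2)
        {
            pruned = CatCacheCleanupOldEntries(cache);
            cache->cc_last_prune_ntup = cache->cc_ntup;
        }

        if (!pruned)
            RehashCatCache(cache);
    }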

> I think we need to punt this patch to next release.  We're not
> converging on anything committable very fast.

Yeah, maybe you're right. This patch went silent for several
months multiple times, got comments and was modified to take them
in over more than two cycles, and finally got a death sentence
(not literally, actually a postponement) very close to the end of
this third cycle. I anticipate the same will continue in the next
cycle.

By the way, I found the reason for the wrong result in the
previous benchmark. The tests 3_0/1 need to update catcacheclock
in the middle of the loop. I'm going to fix that and rerun them.
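
(The fix would amount to something like the following inside
catcachebench3(); the 1000-table refresh interval is arbitrary and this
exact hunk is not part of the attached extension.)

    for (i = 0 ; i < 2 ; i++)
    {
        for (t = 0 ; t < ntables ; t++)
        {
            /* let entries age during the loop, as statement boundaries would */
            if (t % 1000 == 0)
                SetCatCacheClock(GetCurrentTimestamp());

            for (a = 0 ; a < natts ; a++)
            {
                HeapTuple tup;

                tup = SearchSysCache3(STATRELATTINH,
                                      ObjectIdGetDatum(tableoids[t]),
                                      Int16GetDatum(attnums[a]),
                                      BoolGetDatum(false));
                if (HeapTupleIsValid(tup))
                    ReleaseSysCache(tup);
            }
        }
    }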

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 596d6b018e1b7ddd5828298bfaba3ee405eb2604 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 1 Mar 2019 13:32:51 +0900
Subject: [PATCH] Remove entries that haven't been used for a certain time

Catcache entries can be left unused for long periods for several
reasons. It is not desirable that such useless entries eat up memory.
The catcache pruning feature removes entries that haven't been accessed
for a certain time before enlarging the hash array.
---
 doc/src/sgml/config.sgml                      |  19 +++++
 src/backend/tcop/postgres.c                   |   2 +
 src/backend/utils/cache/catcache.c            | 103 +++++++++++++++++++++++++-
 src/backend/utils/misc/guc.c                  |  12 +++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/utils/catcache.h                  |  16 ++++
 6 files changed, 150 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index bc1d0f7bfa..819b252029 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1677,6 +1677,25 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-catalog-cache-prune-min-age" xreflabel="catalog_cache_prune_min_age">
+      <term><varname>catalog_cache_prune_min_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>catalog_cache_prune_min_age</varname> configuration
+       parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+         Specifies the minimum amount of time in seconds that a system
+         catalog cache entry must remain unused before it can be removed.
+         -1 disables this feature entirely. The value defaults to 300
+         seconds (<literal>5 minutes</literal>). Entries that have not been
+         used for this duration can be removed to keep the catalog cache
+         from bloating with useless entries.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth">
       <term><varname>max_stack_depth</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 44a59e1d4f..a0efac86bc 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -71,6 +71,7 @@
 #include "tcop/pquery.h"
 #include "tcop/tcopprot.h"
 #include "tcop/utility.h"
+#include "utils/catcache.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -2577,6 +2578,7 @@ start_xact_command(void)
      * not desired, the timeout has to be disabled explicitly.
      */
     enable_statement_timeout();
+    SetCatCacheClock(GetCurrentStatementStartTimestamp());
 }
 
 static void
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index d05930bc4c..03c2d8524c 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -38,6 +38,7 @@
 #include "utils/rel.h"
 #include "utils/resowner_private.h"
 #include "utils/syscache.h"
+#include "utils/timeout.h"
 
 
  /* #define CACHEDEBUG */    /* turns DEBUG elogs on */
@@ -60,9 +61,18 @@
 #define CACHE_elog(...)
 #endif
 
+/*
+ * GUC variable defining the minimum age, in seconds, at which entries are
+ * considered for eviction. -1 disables the feature.
+ */
+int catalog_cache_prune_min_age = 300;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Clock for the last accessed time of a catcache entry. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -850,9 +860,83 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize catcache reference clock if haven't done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left unused for a long time for several reasons.
+ * Remove such entries to prevent the catcache from bloating. The approach is
+ * similar to the buffer eviction algorithm: entries that are accessed several
+ * times within a certain period live longer than those that have had fewer
+ * accesses in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int    nremoved = 0;
+    int i;
+
+    /* Return immediately if disabled */
+    if (catalog_cache_prune_min_age == 0)
+        return false;
+
+    /* Scan over the whole hash to find entries to remove */
+    for (i = 0 ; i < cp->cc_nbuckets ; i++)
+    {
+        dlist_mutable_iter    iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long        entry_age;
+            int            us;
+
+            /* Don't remove referenced entries */
+            if (ct->refcount != 0 ||
+                (ct->c_list && ct->c_list->refcount != 0))
+                continue;
+
+            /*
+             * Calculate the duration from the last access to the "current"
+             * time. catcacheclock is updated on a per-statement
+             * basis and additionally updated periodically during a
+             * long-running query.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+            if (entry_age < catalog_cache_prune_min_age)
+                continue;
+
+            /*
+             * Entries that have not been accessed since the last pruning are
+             * removed this time; otherwise their lives are prolonged, up to
+             * three times the duration, according to how often they were
+             * accessed. We don't try to shrink the buckets since pruning
+             * effectively caps catcache expansion in the long term.
+             */
+            if (ct->naccess > 0)
+                ct->naccess--;
+            else
+            {
+                CatCacheRemoveCTup(cp, ct);
+                nremoved++;
+            }
+        }
+    }
+
+    if (nremoved > 0)
+        elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d",
+             cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved);
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -1263,6 +1347,10 @@ SearchCatCacheInternal(CatCache *cache,
          * near the front of the hashbucket's list.)
          */
         dlist_move_head(bucket, &ct->cache_elem);
+        if (ct->naccess < 2)
+            ct->naccess++;
+
+        ct->lastaccess = catcacheclock;
 
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
@@ -1888,19 +1976,28 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
     cache->cc_ntup++;
     CacheHdr->ch_ntup++;
 
+    /* increase refcount so that the new entry survives pruning */
+    ct->refcount++;
+
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
+     * If the hash table has become too full, try removing infrequently used
+     * entries to make room for the new entry. If that fails, enlarge the bucket
+     * array instead.  Quite arbitrarily, we try this when fill factor > 2.
      */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
+    if (cache->cc_ntup > cache->cc_nbuckets * 2 &&
+        !CatCacheCleanupOldEntries(cache))
         RehashCatCache(cache);
 
+    ct->refcount--;
+
     return ct;
 }
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1766e46037..e671d4428e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -82,6 +82,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/memutils.h"
@@ -2249,6 +2250,17 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("System catalog cache entries that live unused for longer than this seconds are considered
forremoval."), 
+            gettext_noop("The value of -1 turns off pruning."),
+            GUC_UNIT_S
+        },
+        &catalog_cache_prune_min_age,
+        300, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index bbbeb4bb15..d88ec57382 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -128,6 +128,7 @@
 #work_mem = 4MB                # min 64kB
 #maintenance_work_mem = 64MB        # min 1MB
 #autovacuum_work_mem = -1        # min 1MB, or -1 to use maintenance_work_mem
+#catalog_cache_prune_min_age = 300s    # -1 disables pruning
 #max_stack_depth = 2MB            # min 100kB
 #shared_memory_type = mmap        # the default is the first option
                     # supported by the operating system:
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 65d816a583..2134839ecf 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -119,6 +120,8 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
+    int            naccess;        /* # of accesses to this entry, up to 2 */
+    TimestampTz    lastaccess;        /* timestamp of the last usage */
 
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
@@ -189,6 +192,19 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int catalog_cache_prune_min_age;
+
+/* source clock for access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/* SetCatCacheClock - set catcache timestamp source clock */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.16.3


Re: Protect syscache from bloating with negative cache entries

От
Kyotaro HORIGUCHI
Дата:
At Fri, 05 Apr 2019 09:44:07 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20190405.094407.151644324.horiguchi.kyotaro@lab.ntt.co.jp>
> By the way, I found the reason of the wrong result of the
> previous benchmark. The test 3_0/1 needs to update catcacheclock
> midst of the loop. I'm going to fix it and rerun it.

I found the cause. CatalogCacheFlushCatalog() doesn't shrink the
hash, so no resize happens once it is bloated. I needed another
version of the function that resets cc_bucket to its initial
size.
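
(For reference, the debug helper amounts to roughly the following;
ResetCatCacheBuckets is a name made up here and this exact code is not in
the attached diff.)

/* debug-only: shrink a just-flushed catcache back to its initial bucket count */
static void
ResetCatCacheBuckets(CatCache *cp, int initial_nbuckets)
{
    Assert(cp->cc_ntup == 0);    /* call only right after a full flush */

    pfree(cp->cc_bucket);
    cp->cc_nbuckets = initial_nbuckets;
    cp->cc_bucket = (dlist_head *)
        MemoryContextAllocZero(CacheMemoryContext,
                               cp->cc_nbuckets * sizeof(dlist_head));
}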

Using the new debug function, I got better numbers.

I focused on the performance when the feature is disabled. I
rechecked that by adding the patch part-by-part and identified
several causes of the degradation. I did the following:

- Moved SetCatCacheClock() to AtStart_Cache()
- Possibly improved the call site of CatCacheCleanupOldEntries().

As the result:

 binary | test | count |   avg   | stddev | ratio
--------+------+-------+---------+--------+-------
 master | 1_1  |     5 | 7104.90 |   4.40 | 
 master | 2_1  |     5 | 3759.26 |   4.20 | 
 master | 3_1  |     5 | 7954.05 |   2.15 | 
--------+------+-------+---------+--------+-------
 Full   | 1_1  |     5 | 7237.20 |   7.98 | 101.87
 Full   | 2_1  |     5 | 4050.98 |   8.42 | 107.76
 Full   | 3_1  |     5 | 8192.87 |   3.28 | 103.00

But it still fluctuates by around 5%.

If this level of degradation is still not acceptable, that means
nothing can be inserted into the existing code path, and the new
code path should be isolated from it by an indirect call.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index bd5024ef00..a9414c0c07 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1067,6 +1067,7 @@ static void
 AtStart_Cache(void)
 {
     AcceptInvalidationMessages();
+    SetCatCacheClock(GetCurrentStatementStartTimestamp());
 }
 
 /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 44a59e1d4f..4d849aeb4c 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -71,6 +71,7 @@
 #include "tcop/pquery.h"
 #include "tcop/tcopprot.h"
 #include "tcop/utility.h"
+#include "utils/catcache.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index d05930bc4c..91814f7210 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -60,9 +60,18 @@
 #define CACHE_elog(...)
 #endif
 
+/*
+ * GUC variable to define the minimum age of entries that will be considered
+ * to be evicted in seconds. -1 to disable the feature.
+ */
+int catalog_cache_prune_min_age = 0;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Clock for the last accessed time of a catcache entry. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1, Datum v2,
@@ -850,9 +859,83 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize catcache reference clock if haven't done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left unused for a long time for various reasons.
+ * Remove such entries to prevent the catcache from bloating. This is based
+ * on an algorithm similar to buffer eviction: entries that are accessed
+ * several times in a certain period live longer than those that have had
+ * less access in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int    nremoved = 0;
+    int i;
+
+    /* Return immediately if disabled */
+    if (catalog_cache_prune_min_age == 0)
+        return false;
+
+    /* Scan over the whole hash to find entries to remove */
+    for (i = 0 ; i < cp->cc_nbuckets ; i++)
+    {
+        dlist_mutable_iter    iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+            long        entry_age;
+            int            us;
+
+            /* Don't remove referenced entries */
+            if (ct->refcount != 0 ||
+                (ct->c_list && ct->c_list->refcount != 0))
+                continue;
+
+            /*
+             * Calculate the duration from the last access to the "current"
+             * time. catcacheclock is updated on a per-statement basis and
+             * additionally updated periodically during a long-running query.
+             */
+            TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
+
+            if (entry_age < catalog_cache_prune_min_age)
+                continue;
+
+            /*
+             * Entries that are not accessed after the last pruning are
+             * removed within that many seconds, and their lives are prolonged
+             * according to how many times they have been accessed, up to
+             * three times that duration. We don't try to shrink buckets since
+             * pruning effectively caps catcache expansion in the long term.
+             */
+            if (ct->naccess > 1)
+                ct->naccess--;
+            else
+            {
+                CatCacheRemoveCTup(cp, ct);
+                nremoved++;
+            }
+        }
+    }
+
+    if (nremoved > 0)
+        elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d",
+             cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved);
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -1263,6 +1346,10 @@ SearchCatCacheInternal(CatCache *cache,
          * near the front of the hashbucket's list.)
          */
         dlist_move_head(bucket, &ct->cache_elem);
+        ct->naccess++;
+        ct->naccess &= 3;
+
+        ct->lastaccess = catcacheclock;
 
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
@@ -1888,6 +1975,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
@@ -1895,11 +1984,25 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     CacheHdr->ch_ntup++;
 
     /*
-     * If the hash table has become too full, enlarge the buckets array. Quite
-     * arbitrarily, we enlarge when fill factor > 2.
-     */
-    if (cache->cc_ntup > cache->cc_nbuckets * 2)
-        RehashCatCache(cache);
+     * If the hash table has become too full, try removing infrequently used
+     * entries to make a room for the new entry. If failed, enlarge the bucket
+     * array instead.  Quite arbitrarily, we try this when fill factor > 2.
+      */
+    if (unlikely(cache->cc_ntup > cache->cc_nbuckets * 2))
+    {
+        bool rehash = true;
+
+        if (unlikely(catalog_cache_prune_min_age > 0))
+        {
+            /* increase refcount so that the new entry survives pruning */
+            ct->refcount++;
+            rehash = !CatCacheCleanupOldEntries(cache);
+            ct->refcount--;
+        }
+
+        if (likely(rehash))
+            RehashCatCache(cache);
+    } 
 
     return ct;
 }
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 65d816a583..871f51fe34 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -119,6 +120,8 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
+    int            naccess;        /* # of access to this entry, up to 2  */
+    TimestampTz    lastaccess;        /* timestamp of the last usage */
 
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
@@ -185,6 +188,18 @@ typedef struct catcacheheader
     int            ch_ntup;        /* # of tuples in all caches */
 } CatCacheHeader;
 
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int catalog_cache_prune_min_age;
+
+/* source clock for access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/* SetCatCacheClock - set catcache timestamp source clock */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
 
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;

RE: Protect syscache from bloating with negative cache entries

От
"Ideriha, Takeshi"
Дата:
>From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
>Does this result show that hard-limit size option with memory accounting doesn't harm
>to usual users who disable hard limit size option?

Hi, 

I've implemented a relation cache size limitation with an LRU list and built-in memory context size accounting.
I'll share some results along with a quick recap of the catcache work, so that we can resume the discussion if needed.
Relation cache bloat was also discussed in this thread, but right now that is pending and the catcache feature is not
settled; still, I believe a variety of information could be useful.

Regarding the catcache, Horiguchi-san's recent posts show pretty detailed stats, including a comparison of the LRU
overhead against a full scan of the hash table. According to those results the LRU overhead seems small, but for
simplicity this thread goes without LRU.
https://www.postgresql.org/message-id/20190404.215255.09756748.horiguchi.kyotaro%40lab.ntt.co.jp

When the catcache had a hard limit, there was built-in memory context size accounting machinery. I checked the
overhead of that memory accounting: repeating palloc and pfree of an 800-byte area many times was 4% slower, while
in the 32768-byte case there seems to be no overhead.
https://www.postgresql.org/message-id/4E72940DA2BF16479384A86D54D0988A6F44564E%40G01JPEXMBKW04 


Regarding the relcache hard limit (relation_cache_max_size), most of the architecture is similar to the catcache one
with an LRU list, except for the memory accounting.

Relcaches are managed by an LRU list. To prune the LRU cache, we need to know the overall relcache size, including
objects pointed to by a relcache entry such as 'index info'. So in this patch relcache objects are allocated under
RelCacheMemoryContext, which is a child of CacheMemoryContext, and objects pointed to by a relcache entry are
allocated under child contexts of RelCacheMemoryContext. With the built-in size accounting, if a memory context is
set to collect its "group (family) size", you can easily get the context size including its children.
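
To illustrate the context layout described above, here is a sketch
(illustrative only; CreateRelCacheMemoryContext is a name I made up for
this mail, and the group-size accounting API itself belongs to the
proposed patch and is not shown here):

#include "postgres.h"
#include "utils/memutils.h"

/* dedicated parent context for all relcache entries */
static MemoryContext RelCacheMemoryContext = NULL;

static void
CreateRelCacheMemoryContext(void)
{
    /* child of CacheMemoryContext, so the whole family can be measured */
    if (RelCacheMemoryContext == NULL)
        RelCacheMemoryContext =
            AllocSetContextCreate(CacheMemoryContext,
                                  "RelCacheMemoryContext",
                                  ALLOCSET_DEFAULT_SIZES);
}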
 

I ran two experiments:
A) One is pgbench using Tomas's script he posted while ago, which is randomly select 1 from many tables.
https://www.postgresql.org/message-id/4E72940DA2BF16479384A86D54D0988A6F426207%40G01JPEXMBKW04

B) The other is to check memory context account overhead using the same method.
https://www.postgresql.org/message-id/4E72940DA2BF16479384A86D54D0988A6F44564E%40G01JPEXMBKW04 

A) randomly select 1 from many tables
Results are the average of 5 runs each.

 number of tables               | 100         | 1000        | 10000
--------------------------------+-------------+-------------+-------------
 TPS (master)                   | 11105       | 10815       | 8915
 TPS (patch; limit feature off) | 11254 (+1%) | 11176 (+3%) | 9242 (+4%)
 TPS (patch; limit on with 1MB) | 11317 (+2%) | 10491 (-3%) | 7380 (-17%)

The results are noisy, but the overhead of the LRU list and memory accounting seems small when the relcache limit
feature is turned off.
When the limit feature is turned on, TPS drops by 17% once the limit is exceeded, which is no surprise.


B) Repeat palloc/pfree
"With group accounting" means that account test context and its child context with built-in accounting using
"palloc_bench_family()".
The other one is that using palloc_bench(). Please see palloc_bench.gz.

[Size=32768, iter=1,000,000]
 Master                    | 59.97 ms
 Master with group account | 59.57 ms
 patched                   | 67.23 ms
 patched with family       | 68.81 ms

The overhead seems large in this patch, so this area needs more inspection.


regards,
Takeshi Ideriha



Вложения

Re: Protect syscache from bloating with negative cache entries

От
Kyotaro Horiguchi
Дата:
Hello, 

my_gripe> But it still fluctuates by around 5%.
my_gripe> 
my_gripe> If this level of degradation is still not acceptable, that means
my_gripe> nothing can be inserted into the existing code path, and the new
my_gripe> code path should be isolated from it by an indirect call.

Finally, after some struggling, I think I managed to measure the
impact on performance precisely and reliably. Starting from
"make distclean" for every build, then removing everything in
$TARGET before installation, makes things stable enough. (I don't
think that's good, but I didn't investigate the cause.)

I measured time per call by directly calling SearchSysCache3() many
times. It showed that the patch causes around 0.1 microseconds of
degradation per call. (The function overall took about 6.9
microseconds on average.)

Next, I counted how many times SearchSysCache is called while
planning, as an example, a query on a partitioned table having 3000
columns and 1000 partitions.

  explain analyze select sum(c0000) from test.p;

The planner made 6020608 syscache calls while planning, and the
overall planning time was 8641 ms. (Execution time was 48 ms.)
6020608 times 0.1 us is about 602 ms of degradation, so roughly a 7%
degradation in planning time by that estimate. The degradation comes
from really only two successive instructions, an ADD and a
conditional move (CMOVE). That fact leads to the conclusion that the
existing code path, as is, doesn't have room for any additional code.


So I sought room for at least one branch and found it (on gcc
7.3.1/CentOS 7/x64). Interestingly, de-inlining
SearchCatCacheInternal gave a performance gain of about 3%, and
further inlining of CatalogCacheComputeHashValue() gave another gain
of about 3%. I could add a branch in SearchCatCacheInternal within
that gain.

I also tried indirect calls, but the degradation overwhelmed the
gain, so I chose branching rather than indirect calls. I didn't
investigate why that happens.


The following is the result. The binaries are built with the same
configuration using -O2.

binary means
  master      : master HEAD.
  patched_off : patched, but pruning disabled (catalog_cache_prune_min_age=-1).
  patched_on  : patched with pruning enabled.
                ("300s" for 1, "1s" for2, "0" for 3)

bench:
  1: corresponds to catcachebench(1); fetching STATRELATTINH 3000
     * 1000 times, generating new cache entries. (Massive cache
       creation.)
     Pruning doesn't happen while running this.

  2: catcachebench(2); 60000 times cache access on 1000
     STATRELATTINH entries. (Frequent cache reference)
     Pruning doesn't happen while running this.

  3: catcachebench(3); fetching 1000 (tables) * 3000 (columns)
     STATRELATTINH entries. The catcache clock advances at an
     interval of 100 (tables) * 3000 (columns) accesses, and
     pruning happens.

     While running catcachebench(3) once, pruning happens 28
     times; most of the time 202202 entries are removed, and the
     total number of entries was limited to 524289. (The system
     table has 3000 * 1001 = 3003000 tuples.)

iter: Number of iterations. Time (ms) and stddev are calculated over
     the iterations.


    binary   | bench | iter  |  time ms | stddev
-------------+-------+-------+----------+--------
 master      | 1     |    10 |  8150.30 |  12.96
 master      | 2     |    10 |  4002.88 |  16.18
 master      | 3     |    10 |  9065.06 |  11.46
-------------+-------+-------+----------+--------
 patched_off | 1     |    10 |  8090.95 |   9.95
 patched_off | 2     |    10 |  3984.67 |  12.33
 patched_off | 3     |    10 |  9050.46 |   4.64
-------------+-------+-------+----------+--------
 patched_on  | 1     |    10 |  8158.95 |   6.29
 patched_on  | 2     |    10 |  4023.72 |  10.41
 patched_on  | 3     |    10 | 16532.66 |  18.39

patched_off is slightly faster than master, and patched_on is
generally a bit slower. Even though patched_on/3 seems to take too
long, the extra time comes from increased catalog table access in
exchange for the memory saving. (That is, it is expected behavior.)
I ran it several times and most runs showed the same tendency.

As a side effect, once the branch is added, the shared syscache
being worked on in a neighbouring thread can be hooked in through it
without impact on the existing code path.


===
The benchmark script is used as the follows:

- create many (3000, for example) tables in the "test" schema. I
  created a partitioned table with 3000 children.

- The tables have many columns, 1000 for me.

- Run the following commands.

  =# select catcachebench(0);  -- warm up systables.
  =# set catalog_cache_prune_min_age = any; -- as required
  =# select catcachebench(n);  -- 3 >= n >= 1, the number of "bench" above.


  The above result is taken with the following query.

  =# select 'patched_on', '3', count(a), avg(a)::numeric(10,2), stddev(a)::numeric(10,2)
       from (select catcachebench(3) from generate_series(1, 10)) as a(a);
 

====
The attached patches are:

0001-Adjust-inlining-of-some-functions.patch:

 Changes inlining property of two functions,
 SearchCatCacheInternal and CatalogCacheComputeHashValue.

0002-Benchmark-extension-and-required-core-change.patch:

 Micro benchmark of SearchSysCache3() and core-side tweaks, which
 are outside this patch set from a functionality point of view. Works
 for 0001 but not for 0004 or later; 0003 adjusts for that.

0003-Adjust-catcachebench-for-later-patches.patch

 Adjustment of 0002, providing the benchmark for 0004, the body of
 this patch set. Breaks code consistency until 0004 is applied.

0004-Catcache-pruning-feature.patch

 The feature patch. It intentionally leaves the indentation of an
 existing code block in SearchCatCacheInternal unchanged to keep the
 patch smaller; the indentation is adjusted in the next patch, 0005.

0005-Adjust-indentation-of-SearchCatCacheInternal.patch

 Adjusts indentation of 0004.


0001+0004+0005 form the final shape of the patch set; 0002+0003 are
only for benchmarking.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 927ce9035e13240378c7c332610bdde9377c2d7b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 28 Jun 2019 16:29:52 +0900
Subject: [PATCH 1/5] Adjust inlining of some functions

The SearchCatCacheInternal code path is quite short and hot, so it
cannot afford additional cycles. However, changing the inline
attributes of SearchCatCacheInternal and CatalogCacheComputeHashValue
makes SearchCatCacheN faster by about 6%. This makes room for an
extra branch that can serve as the door to other implementations of
the catcache.
---
 src/backend/utils/cache/catcache.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 00def27881..8fc067ce31 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -63,10 +63,10 @@
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
-static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
-                                               int nkeys,
-                                               Datum v1, Datum v2,
-                                               Datum v3, Datum v4);
+static HeapTuple SearchCatCacheInternal(CatCache *cache,
+                                        int nkeys,
+                                        Datum v1, Datum v2,
+                                        Datum v3, Datum v4);
 
 static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache,
                                                 int nkeys,
@@ -75,8 +75,9 @@ static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache,
                                                 Datum v1, Datum v2,
                                                 Datum v3, Datum v4);
 
-static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys,
-                                           Datum v1, Datum v2, Datum v3, Datum v4);
+static inline uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys,
+                                                  Datum v1, Datum v2,
+                                                  Datum v3, Datum v4);
 static uint32 CatalogCacheComputeTupleHashValue(CatCache *cache, int nkeys,
                                                 HeapTuple tuple);
 static inline bool CatalogCacheCompareTuple(const CatCache *cache, int nkeys,
@@ -266,7 +267,7 @@ GetCCHashEqFuncs(Oid keytype, CCHashFN *hashfunc, RegProcedure *eqfunc, CCFastEq
  *
  * Compute the hash value associated with a given set of lookup keys
  */
-static uint32
+static inline uint32
 CatalogCacheComputeHashValue(CatCache *cache, int nkeys,
                              Datum v1, Datum v2, Datum v3, Datum v4)
 {
@@ -1194,7 +1195,7 @@ SearchCatCache4(CatCache *cache,
 /*
  * Work-horse for SearchCatCache/SearchCatCacheN.
  */
-static inline HeapTuple
+static HeapTuple
 SearchCatCacheInternal(CatCache *cache,
                        int nkeys,
                        Datum v1,
-- 
2.16.3

From f0f5833ddfd0aac934cc7b5ded93541810c486d3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 28 Jun 2019 17:03:07 +0900
Subject: [PATCH 2/5] Benchmark extension and required core change

Micro benchmark extension for SearchSysCache and required core-side
code.
---
 contrib/catcachebench/Makefile               |  17 ++
 contrib/catcachebench/catcachebench--0.0.sql |   9 +
 contrib/catcachebench/catcachebench.c        | 281 +++++++++++++++++++++++++++
 contrib/catcachebench/catcachebench.control  |   6 +
 src/backend/utils/cache/catcache.c           |  13 ++
 5 files changed, 326 insertions(+)
 create mode 100644 contrib/catcachebench/Makefile
 create mode 100644 contrib/catcachebench/catcachebench--0.0.sql
 create mode 100644 contrib/catcachebench/catcachebench.c
 create mode 100644 contrib/catcachebench/catcachebench.control

diff --git a/contrib/catcachebench/Makefile b/contrib/catcachebench/Makefile
new file mode 100644
index 0000000000..0478818b25
--- /dev/null
+++ b/contrib/catcachebench/Makefile
@@ -0,0 +1,17 @@
+MODULE_big = catcachebench
+OBJS = catcachebench.o
+
+EXTENSION = catcachebench
+DATA = catcachebench--0.0.sql
+PGFILEDESC = "catcachebench - benchmark for catcache pruning feature"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/catcachebench
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/catcachebench/catcachebench--0.0.sql b/contrib/catcachebench/catcachebench--0.0.sql
new file mode 100644
index 0000000000..e091baaaa7
--- /dev/null
+++ b/contrib/catcachebench/catcachebench--0.0.sql
@@ -0,0 +1,9 @@
+/* contrib/catcachebench/catcachebench--0.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION catcachebench" to load this file. \quit
+
+CREATE FUNCTION catcachebench(IN type int)
+RETURNS double precision
+AS 'MODULE_PATHNAME', 'catcachebench'
+LANGUAGE C STRICT VOLATILE;
diff --git a/contrib/catcachebench/catcachebench.c b/contrib/catcachebench/catcachebench.c
new file mode 100644
index 0000000000..0cebbbde4f
--- /dev/null
+++ b/contrib/catcachebench/catcachebench.c
@@ -0,0 +1,281 @@
+/*
+ * catcachebench: test code for cache pruning feature
+ */
+#include "postgres.h"
+#include "catalog/pg_type.h"
+#include "catalog/pg_statistic.h"
+#include "executor/spi.h"
+#include "libpq/pqsignal.h"
+#include "utils/catcache.h"
+#include "utils/syscache.h"
+#include "utils/timestamp.h"
+
+Oid        tableoids[10000];
+int        ntables = 0;
+int16    attnums[1000];
+int        natts = 0;
+
+PG_MODULE_MAGIC;
+
+double catcachebench1(void);
+double catcachebench2(void);
+double catcachebench3(void);
+void collectinfo(void);
+void catcachewarmup(void);
+
+PG_FUNCTION_INFO_V1(catcachebench);
+
+Datum
+catcachebench(PG_FUNCTION_ARGS)
+{
+    int        testtype = PG_GETARG_INT32(0);
+    double    ms;
+    extern bool _catcache_shrink_buckets;
+
+    collectinfo();
+
+    /* flush the catalog -- safe? don't mind. */
+    
+    _catcache_shrink_buckets = true;
+    CatalogCacheFlushCatalog(StatisticRelationId);
+    _catcache_shrink_buckets = false;
+
+    switch (testtype)
+    {
+    case 0:
+        catcachewarmup(); /* prewarm of syscatalog */
+        PG_RETURN_NULL();
+    case 1:
+        ms = catcachebench1(); break;
+    case 2:
+        ms = catcachebench2(); break;
+    case 3:
+        ms = catcachebench3(); break;
+    default:
+        elog(ERROR, "Invalid test type: %d", testtype);
+    }
+
+    PG_RETURN_DATUM(Float8GetDatum(ms));
+}
+
+/*
+ * fetch all attribute entries of all tables.
+ */
+double
+catcachebench1(void)
+{
+    int t, a;
+    instr_time    start,
+                duration;
+
+    PG_SETMASK(&BlockSig);
+    INSTR_TIME_SET_CURRENT(start);
+    for (t = 0 ; t < ntables ; t++)
+    {
+        for (a = 0 ; a < natts ; a++)
+        {
+            HeapTuple tup;
+
+            tup = SearchSysCache3(STATRELATTINH,
+                                  ObjectIdGetDatum(tableoids[t]),
+                                  Int16GetDatum(attnums[a]),
+                                  BoolGetDatum(false));
+            /* should be null, but.. */
+            if (HeapTupleIsValid(tup))
+                ReleaseSysCache(tup);
+        }
+    }
+    INSTR_TIME_SET_CURRENT(duration);
+    INSTR_TIME_SUBTRACT(duration, start);
+    PG_SETMASK(&UnBlockSig);
+
+    return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+/*
+ * fetch all attribute entries of a table 60000 times.
+ */
+double
+catcachebench2(void)
+{
+    const int clock_step = 100;
+    int t, a;
+    instr_time    start,
+                duration;
+
+    PG_SETMASK(&BlockSig);
+    INSTR_TIME_SET_CURRENT(start);
+    for (t = 0 ; t < 60000 ; t++)
+    {
+        int ct = clock_step;
+
+        /*
+         * catcacheclock is updated by the transaction timestamp, so it
+         * needs to be updated by other means for this test to work. Here
+         * I chose to update the clock every 100 table scans.
+         */
+        if (--ct < 0)
+        {
+            // We don't have it yet.
+            //SetCatCacheClock(GetCurrentTimestamp());
+            GetCurrentTimestamp();
+            ct = clock_step;
+        }
+        for (a = 0 ; a < natts ; a++)
+        {
+            HeapTuple tup;
+
+            tup = SearchSysCache3(STATRELATTINH,
+                                  ObjectIdGetDatum(tableoids[0]),
+                                  Int16GetDatum(attnums[a]),
+                                  BoolGetDatum(false));
+            /* should be null, but.. */
+            if (HeapTupleIsValid(tup))
+                ReleaseSysCache(tup);
+        }
+    }
+    INSTR_TIME_SET_CURRENT(duration);
+    INSTR_TIME_SUBTRACT(duration, start);
+    PG_SETMASK(&UnBlockSig);
+
+    return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+/*
+ * fetch all attribute entries of all tables twice, letting expiration
+ * happen.
+ */
+double
+catcachebench3(void)
+{
+    const int clock_step = 100;
+    int i, t, a;
+    instr_time    start,
+                duration;
+
+    PG_SETMASK(&BlockSig);
+    INSTR_TIME_SET_CURRENT(start);
+    for (i = 0 ; i < 2 ; i++)
+    {
+        int ct = clock_step;
+
+        for (t = 0 ; t < ntables ; t++)
+        {
+            /*
+             * catcacheclock is updated by the transaction timestamp, so it
+             * needs to be updated by other means for this test to work. Here
+             * I chose to update the clock every 100 table scans.
+             */
+            if (--ct < 0)
+            {
+                // We don't have it yet.
+                //SetCatCacheClock(GetCurrentTimestamp());
+                GetCurrentTimestamp();
+                ct = clock_step;
+            }
+            for (a = 0 ; a < natts ; a++)
+            {
+                HeapTuple tup;
+
+                tup = SearchSysCache3(STATRELATTINH,
+                                      ObjectIdGetDatum(tableoids[t]),
+                                      Int16GetDatum(attnums[a]),
+                                      BoolGetDatum(false));
+                /* should be null, but.. */
+                if (HeapTupleIsValid(tup))
+                    ReleaseSysCache(tup);
+            }
+        }
+    }
+    INSTR_TIME_SET_CURRENT(duration);
+    INSTR_TIME_SUBTRACT(duration, start);
+    PG_SETMASK(&UnBlockSig);
+
+    return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+void
+catcachewarmup(void)
+{
+    int t, a;
+
+    /* load up catalog tables */
+    for (t = 0 ; t < ntables ; t++)
+    {
+        for (a = 0 ; a < natts ; a++)
+        {
+            HeapTuple tup;
+
+            tup = SearchSysCache3(STATRELATTINH,
+                                  ObjectIdGetDatum(tableoids[t]),
+                                  Int16GetDatum(attnums[a]),
+                                  BoolGetDatum(false));
+            /* should be null, but.. */
+            if (HeapTupleIsValid(tup))
+                ReleaseSysCache(tup);
+        }
+    }
+}
+
+void
+collectinfo(void)
+{
+    int ret;
+    Datum    values[10000];
+    bool    nulls[10000];
+    Oid        types0[] = {OIDOID};
+    int i;
+
+    ntables = 0;
+    natts = 0;
+
+    SPI_connect();
+    /* collect target tables */
+    ret = SPI_execute("select oid from pg_class where relnamespace = (select oid from pg_namespace where nspname =
\'test\')",
+                      true, 0);
+    if (ret != SPI_OK_SELECT)
+        elog(ERROR, "Failed 1");
+    if (SPI_processed == 0)
+        elog(ERROR, "no relation found in schema \"test\"");
+    if (SPI_processed > 10000)
+        elog(ERROR, "too many relation found in schema \"test\"");
+
+    for (i = 0 ; i < SPI_processed ; i++)
+    {
+        heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc,
+                          values, nulls);
+        if (nulls[0])
+            elog(ERROR, "Failed 2");
+
+        tableoids[ntables++] = DatumGetObjectId(values[0]);
+    }
+    SPI_finish();
+    elog(DEBUG1, "%d tables found", ntables);
+
+    values[0] = ObjectIdGetDatum(tableoids[0]);
+    nulls[0] = false;
+    SPI_connect();
+    ret = SPI_execute_with_args("select attnum from pg_attribute where attrelid = (select oid from pg_class where oid = $1)",
+                                1, types0, values, NULL, true, 0);
+    if (SPI_processed == 0)
+        elog(ERROR, "no attribute found in table %d", tableoids[0]);
+    if (SPI_processed > 10000)
+        elog(ERROR, "too many relation found in table %d", tableoids[0]);
+    
+    /* collect target attributes. assuming all tables have the same attnums */
+    for (i = 0 ; i < SPI_processed ; i++)
+    {
+        int16 attnum;
+
+        heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc,
+                          values, nulls);
+        if (nulls[0])
+            elog(ERROR, "Failed 3");
+        attnum = DatumGetInt16(values[0]);
+
+        if (attnum > 0)
+            attnums[natts++] = attnum;
+    }
+    SPI_finish();
+    elog(DEBUG1, "%d attributes found", natts);
+}
diff --git a/contrib/catcachebench/catcachebench.control b/contrib/catcachebench/catcachebench.control
new file mode 100644
index 0000000000..3fc9d2e420
--- /dev/null
+++ b/contrib/catcachebench/catcachebench.control
@@ -0,0 +1,6 @@
+# catcachebench
+
+comment = 'benchmark for catcache pruning'
+default_version = '0.0'
+module_pathname = '$libdir/catcachebench'
+relocatable = true
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 8fc067ce31..98427b67cd 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -716,6 +716,9 @@ ResetCatalogCaches(void)
  *    rather than relying on the relcache to keep a tupdesc for us.  Of course
  *    this assumes the tupdesc of a cachable system table will not change...)
  */
+/* CODE FOR catcachebench: REMOVE ME AFTER USE */
+bool _catcache_shrink_buckets = false;
+/* END: CODE FOR catcachebench*/
 void
 CatalogCacheFlushCatalog(Oid catId)
 {
@@ -735,6 +738,16 @@ CatalogCacheFlushCatalog(Oid catId)
 
             /* Tell inval.c to call syscache callbacks for this cache */
             CallSyscacheCallbacks(cache->id, 0);
+
+            /* CODE FOR catcachebench: REMOVE ME AFTER USE */
+            if (_catcache_shrink_buckets)
+            {
+                cache->cc_nbuckets = 128;
+                pfree(cache->cc_bucket);
+                cache->cc_bucket = palloc0(128 * sizeof(dlist_head));
+                elog(LOG, "Catcache reset");
+            }
+            /* END: CODE FOR catcachebench*/
         }
     }
 
-- 
2.16.3

From fcd0273933f3f53979749ba2e315043f1a6f6f31 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 1 Jul 2019 15:08:11 +0900
Subject: [PATCH 3/5] Adjust catcachebench for later patches

Make the benchmark use SetCatCacheClock, which is being introduced by
the next patch. This temporarily breaks consistency until the next
patch is applied.
---
 contrib/catcachebench/catcachebench.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/contrib/catcachebench/catcachebench.c b/contrib/catcachebench/catcachebench.c
index 0cebbbde4f..63a7400463 100644
--- a/contrib/catcachebench/catcachebench.c
+++ b/contrib/catcachebench/catcachebench.c
@@ -116,9 +116,7 @@ catcachebench2(void)
          */
         if (--ct < 0)
         {
-            // We don't have it yet.
-            //SetCatCacheClock(GetCurrentTimestamp());
-            GetCurrentTimestamp();
+            SetCatCacheClock(GetCurrentTimestamp());
             ct = clock_step;
         }
         for (a = 0 ; a < natts ; a++)
@@ -168,9 +166,7 @@ catcachebench3(void)
              */
             if (--ct < 0)
             {
-                // We don't have it yet.
-                //SetCatCacheClock(GetCurrentTimestamp());
-                GetCurrentTimestamp();
+                SetCatCacheClock(GetCurrentTimestamp());
                 ct = clock_step;
             }
             for (a = 0 ; a < natts ; a++)
-- 
2.16.3

From 57129a5a001b7729f8888579bc11e47cf7192801 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 1 Jul 2019 11:31:54 +0900
Subject: [PATCH 4/5] Catcache pruning feature.

Currently we don't have a mechanism to limit the amount of memory
used by the syscache. Syscache bloat often causes a process to be
killed by the OOM killer, among other problems. This patch lets old
syscache entries be removed to eventually limit the amount of cache.

This patch intentionally leaves the indentation of an existing code
block in SearchCatCacheInternal unchanged to keep the patch smaller.
It is adjusted in the next patch.
---
 src/backend/utils/cache/catcache.c | 186 +++++++++++++++++++++++++++++++++++++
 src/backend/utils/misc/guc.c       |  12 +++
 src/include/utils/catcache.h       |  17 ++++
 3 files changed, 215 insertions(+)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 98427b67cd..b552ae960c 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -60,9 +60,18 @@
 #define CACHE_elog(...)
 #endif
 
+/*
+ * GUC variable to define the minimum age of entries that will be considered
+ * to be evicted in seconds. -1 to disable the feature.
+ */
+int catalog_cache_prune_min_age = -1;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Clock for the last accessed time of a catcache entry. */
+TimestampTz    catcacheclock = 0;
+
 static HeapTuple SearchCatCacheInternal(CatCache *cache,
                                         int nkeys,
                                         Datum v1, Datum v2,
@@ -864,9 +873,107 @@ InitCatCache(int id,
      */
     MemoryContextSwitchTo(oldcxt);
 
+    /* initialize catcache reference clock if haven't done yet */
+    if (catcacheclock == 0)
+        catcacheclock = GetCurrentTimestamp();
+
+    /*
+     * This cache doesn't contain a tuple older than the current time. Prevent
+     * the first pruning from happening too early.
+     */
+    cp->cc_oldest_ts = catcacheclock;
+
     return cp;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left unused for a long time for various reasons.
+ * Remove such entries to prevent the catcache from bloating. This is based
+ * on an algorithm similar to buffer eviction: entries that are accessed
+ * several times in a certain period live longer than those that have had
+ * less access in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int        nremoved = 0;
+    int        i;
+    long    oldest_ts = catcacheclock;
+    long    age;
+    int        us;
+
+    /* Return immediately if disabled */
+    if (catalog_cache_prune_min_age < 0)
+        return false;
+
+    /* Don't scan the hash when we know we don't have prunable entries */
+    TimestampDifference(cp->cc_oldest_ts, catcacheclock, &age, &us);
+    if (age < catalog_cache_prune_min_age)
+        return false;
+
+    /* Scan over the whole hash to find entries to remove */
+    for (i = 0 ; i < cp->cc_nbuckets ; i++)
+    {
+        dlist_mutable_iter    iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+
+            /* Don't remove referenced entries */
+            if (ct->refcount == 0 &&
+                (ct->c_list == NULL || ct->c_list->refcount == 0))
+            {
+                /*
+                 * Calculate the duration from the last access to the
+                 * "current" time. catcacheclock is updated on a per-statement
+                 * basis and additionally updated periodically during a
+                 * long-running query.
+                 */
+                TimestampDifference(ct->lastaccess, catcacheclock, &age, &us);
+
+                if (age > catalog_cache_prune_min_age)
+                {
+                    /*
+                     * Entries that are not accessed after the last pruning
+                     * are removed in that seconds, and their lives are
+                     * prolonged according to how many times they are accessed
+                     * up to three times of the duration. We don't try shrink
+                     * buckets since pruning effectively caps catcache
+                     * expansion in the long term.
+                     */
+                    if (ct->naccess > 2)
+                        ct->naccess = 1;
+                    else if (ct->naccess > 0)
+                        ct->naccess--;
+                    else
+                    {
+                        CatCacheRemoveCTup(cp, ct);
+                        nremoved++;
+
+                        /* don't update oldest_ts by removed entry */
+                        continue;
+                    }
+                }
+            }
+
+            /* update oldest timestamp if the entry remains alive */
+            if (ct->lastaccess < oldest_ts)
+                oldest_ts = ct->lastaccess;
+        }
+    }
+
+    cp->cc_oldest_ts = oldest_ts;
+
+    if (nremoved > 0)
+        elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d",
+             cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved);
+
+    return nremoved > 0;
+}
+
 /*
  * Enlarge a catcache, doubling the number of buckets.
  */
@@ -880,6 +987,10 @@ RehashCatCache(CatCache *cp)
     elog(DEBUG1, "rehashing catalog cache id %d for %s; %d tups, %d buckets",
          cp->id, cp->cc_relname, cp->cc_ntup, cp->cc_nbuckets);
 
+    /* try removing old entries before expanding the hash */
+    if (CatCacheCleanupOldEntries(cp))
+        return;
+
     /* Allocate a new, larger, hash table. */
     newnbuckets = cp->cc_nbuckets * 2;
     newbucket = (dlist_head *) MemoryContextAllocZero(CacheMemoryContext, newnbuckets * sizeof(dlist_head));
@@ -1257,6 +1368,14 @@ SearchCatCacheInternal(CatCache *cache,
      * dlist within the loop, because we don't continue the loop afterwards.
      */
     bucket = &cache->cc_bucket[hashIndex];
+
+    /*
+     * Even though this branch duplicates quite a bit of code, we want as
+     * few branches as possible here to stay fastest when pruning is
+     * disabled. Don't try to move this branch into the foreach to save lines.
+     */
+    if (likely(catalog_cache_prune_min_age < 0))
+    {
     dlist_foreach(iter, bucket)
     {
         ct = dlist_container(CatCTup, cache_elem, iter.cur);
@@ -1309,6 +1428,71 @@ SearchCatCacheInternal(CatCache *cache,
             return NULL;
         }
     }
+    }
+    else
+    {
+        /*
+         * We manage the age of each entry for pruning in this branch.
+         */
+        dlist_foreach(iter, bucket)
+        {
+            /* The following section is the same with the if() block */
+            ct = dlist_container(CatCTup, cache_elem, iter.cur);
+
+            if (ct->dead)
+                continue;
+
+            if (ct->hash_value != hashValue)
+                continue;
+
+            if (!CatalogCacheCompareTuple(cache, nkeys, ct->keys, arguments))
+                continue;
+
+            dlist_move_head(bucket, &ct->cache_elem);
+
+            /*
+             * Prolong the life of this entry. Since we want to run as few
+             * instructions as possible and want the branch to be stable for
+             * performance reasons, we don't put a strict cap on the
+             * counter. All numbers above 1 will be regarded as 2 in
+             * CatCacheCleanupOldEntries().
+             */
+            ct->naccess++;
+            if (unlikely(ct->naccess == 0))
+                ct->naccess = 2;
+            ct->lastaccess = catcacheclock;
+
+            /* Following part is also the same with if() block above */
+            if (!ct->negative)
+            {
+                ResourceOwnerEnlargeCatCacheRefs(CurrentResourceOwner);
+                ct->refcount++;
+                ResourceOwnerRememberCatCacheRef(CurrentResourceOwner,
+                                                 &ct->tuple);
+
+                CACHE_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d",
+                           cache->cc_relname, hashIndex);
+                
+#ifdef CATCACHE_STATS
+                cache->cc_hits++;
+#endif
+                
+
+                return &ct->tuple;
+            }
+            else
+            {
+                CACHE_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d",
+                           cache->cc_relname, hashIndex);
+
+#ifdef CATCACHE_STATS
+                cache->cc_neg_hits++;
+#endif
+
+                return NULL;
+            }
+        }
+    }
 
     return SearchCatCacheMiss(cache, nkeys, hashValue, hashIndex, v1, v2, v3, v4);
 }
@@ -1902,6 +2086,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 92c4fee8f8..c2a4caa44b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -82,6 +82,7 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/memutils.h"
@@ -2252,6 +2253,17 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("System catalog cache entries that live unused for longer than this seconds are considered
forremoval."),
 
+            gettext_noop("The value of -1 turns off pruning."),
+            GUC_UNIT_S
+        },
+        &catalog_cache_prune_min_age,
+        300, -1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index ff1fabaca1..ad962fb096 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,7 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
+    TimestampTz    cc_oldest_ts;    /* timestamp of the oldest tuple in the hash */
 
     /*
      * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,6 +121,8 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
+    unsigned int naccess;        /* # of access to this entry */
+    TimestampTz    lastaccess;        /* timestamp of the last usage */
 
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
@@ -189,6 +193,19 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int catalog_cache_prune_min_age;
+
+/* source clock for access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/* SetCatCacheClock - set catcache timestamp source clock */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.16.3

From 772d7fda030e8990fbd84ec44f07120d73682256 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 1 Jul 2019 14:11:08 +0900
Subject: [PATCH 5/5] Adjust indentation of SearchCatCacheInternal

The previous patch leaves the indentation of an existing code block
in SearchCatCacheInternal unchanged to keep the diff small. This
patch adjusts the indentation.
---
 src/backend/utils/cache/catcache.c | 82 +++++++++++++++++++-------------------
 1 file changed, 41 insertions(+), 41 deletions(-)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index b552ae960c..60d0fd28a8 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -1376,59 +1376,59 @@ SearchCatCacheInternal(CatCache *cache,
      */
     if (likely(catalog_cache_prune_min_age < 0))
     {
-    dlist_foreach(iter, bucket)
-    {
-        ct = dlist_container(CatCTup, cache_elem, iter.cur);
-
-        if (ct->dead)
-            continue;            /* ignore dead entries */
-
-        if (ct->hash_value != hashValue)
-            continue;            /* quickly skip entry if wrong hash val */
-
-        if (!CatalogCacheCompareTuple(cache, nkeys, ct->keys, arguments))
-            continue;
-
-        /*
-         * We found a match in the cache.  Move it to the front of the list
-         * for its hashbucket, in order to speed subsequent searches.  (The
-         * most frequently accessed elements in any hashbucket will tend to be
-         * near the front of the hashbucket's list.)
-         */
-        dlist_move_head(bucket, &ct->cache_elem);
-
-        /*
-         * If it's a positive entry, bump its refcount and return it. If it's
-         * negative, we can report failure to the caller.
-         */
-        if (!ct->negative)
+        dlist_foreach(iter, bucket)
         {
-            ResourceOwnerEnlargeCatCacheRefs(CurrentResourceOwner);
-            ct->refcount++;
-            ResourceOwnerRememberCatCacheRef(CurrentResourceOwner, &ct->tuple);
+            ct = dlist_container(CatCTup, cache_elem, iter.cur);
 
-            CACHE_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d",
-                       cache->cc_relname, hashIndex);
+            if (ct->dead)
+                continue;            /* ignore dead entries */
+
+            if (ct->hash_value != hashValue)
+                continue;            /* quickly skip entry if wrong hash val */
+
+            if (!CatalogCacheCompareTuple(cache, nkeys, ct->keys, arguments))
+                continue;
+
+            /*
+             * We found a match in the cache.  Move it to the front of the
+             * list for its hashbucket, in order to speed subsequent searches.
+             * (The most frequently accessed elements in any hashbucket will
+             * tend to be near the front of the hashbucket's list.)
+             */
+            dlist_move_head(bucket, &ct->cache_elem);
+
+            /*
+             * If it's a positive entry, bump its refcount and return it. If
+             * it's negative, we can report failure to the caller.
+             */
+            if (!ct->negative)
+            {
+                ResourceOwnerEnlargeCatCacheRefs(CurrentResourceOwner);
+                ct->refcount++;
+                ResourceOwnerRememberCatCacheRef(CurrentResourceOwner, &ct->tuple);
+
+                CACHE_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d",
+                           cache->cc_relname, hashIndex);
 
 #ifdef CATCACHE_STATS
-            cache->cc_hits++;
+                cache->cc_hits++;
 #endif
 
-            return &ct->tuple;
-        }
-        else
-        {
-            CACHE_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d",
-                       cache->cc_relname, hashIndex);
+                return &ct->tuple;
+            }
+            else
+            {
+                CACHE_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d",
+                           cache->cc_relname, hashIndex);
 
 #ifdef CATCACHE_STATS
-            cache->cc_neg_hits++;
+                cache->cc_neg_hits++;
 #endif
 
-            return NULL;
+                return NULL;
+            }
         }
     }
-    }
     else
     {
         /*
-- 
2.16.3


Re: Protect syscache from bloating with negative cache entries

От
Kyotaro Horiguchi
Дата:
I'd like to throw in some food for discussion on how much
SearchSysCacheN degrades depending on how we insert code into the
SearchSysCacheN code path.

I ran the attached run2.sh script, which runs catcachebench2(), which
asks SearchSysCache3() for cached entries (almost) 240000 times per
run.  Each output line is the mean of 3 runs, with stddev. Lines are
in "time" order and edited to fit here. "gen_tbl.pl | psql" creates a
database for the benchmark. catcachebench2() exercises the shortest
of the three paths in the attached benchmark program.

(pg_ctl start)
$ perl gen_tbl.pl | psql ...
(pg_ctl stop)


0. Baseline (0001-benchmark.patch, 0002-Base-change.patch)

At first, I made two binaries from literally the same source. For the
benchmark's sake the source is already modified a bit: specifically,
it has SetCatCacheClock, which is needed by the benchmark but is not
actually called in this benchmark.


              time(ms)|stddev(ms)
not patched | 7750.42 |  23.83   # 0.6% faster than 7775.23
not patched | 7864.73 |  43.21
not patched | 7866.80 | 106.47
not patched | 7952.06 |  63.14
master      | 7775.23 |  35.76
master      | 7870.42 | 120.31
master      | 7876.76 | 109.04
master      | 7963.04 |   9.49

So it seems to me that we cannot say anything about differences
below about 80 ms (about 1%) here.


1. Inserting a branch in SearchCatCacheInternal. (CatCache_Pattern_1.patch)

 This is the most straightforward way to add an alternative feature.

pattern 1 | 8459.73 |  28.15  # 9% (>> 1%) slower than 7757.58
pattern 1 | 8504.83 |  55.61
pattern 1 | 8541.81 |  41.56
pattern 1 | 8552.20 |  27.99
master    | 7757.58 |  22.65
master    | 7801.32 |  20.64
master    | 7839.57 |  25.28
master    | 7925.30 |  38.84

 It's so slow that it cannot be used.


2. Making SearchCatCacheInternal be an indirect function.
   (CatCache_Pattern_2.patch)

Next, I made the workhorse routine be called indirectly. The
"inline" on the function actually lets the compiler optimize the
SearchCatCacheN routines as described in the comment, but the effect
doesn't seem that large, at least in this case.

pattern 2 | 7976.22 |  46.12  (2.6% slower > 1%)
pattern 2 | 8103.03 |  51.57
pattern 2 | 8144.97 |  68.46
pattern 2 | 8353.10 |  34.89
master    | 7768.40 |  56.00
master    | 7772.02 |  29.05
master    | 7775.05 |  27.69
master    | 7830.82 |  13.78


3. Making SearchCatCacheN be indirect functions. (CatCache_Pattern_3.patch)

As far as gcc/linux/x86 goes, SearchSysCacheN is compiled into the
following instructions:

 0x0000000000866c20 <+0>:    movslq %edi,%rdi
 0x0000000000866c23 <+3>:    mov    0xd3da40(,%rdi,8),%rdi
 0x0000000000866c2b <+11>:    jmpq   0x856ee0 <SearchCatCache3>

If we make SearchCatCacheN indirect functions, as in the patch, just
one instruction changes:

 0x0000000000866c50 <+0>:    movslq %edi,%rdi
 0x0000000000866c53 <+3>:    mov    0xd3da60(,%rdi,8),%rdi
 0x0000000000866c5b <+11>:    jmpq   *0x4c0caf(%rip) # 0xd27910 <SearchCatCache3>
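
For readers skimming the patches, an illustrative sketch of what
"indirect SearchCatCacheN" means here (names and layout are mine, not
the posted CatCache_Pattern_3.patch; MySearchSysCache3 stands in for
the real SearchSysCache3 wrapper, and SysCache[] is the array the
benchmark patch makes visible):

#include "postgres.h"
#include "utils/catcache.h"

/* function-pointer type matching SearchCatCache3() */
typedef HeapTuple (*SearchCatCache3_fn) (CatCache *cache,
                                         Datum v1, Datum v2, Datum v3);

/* points at the regular implementation; could be swapped at runtime */
static SearchCatCache3_fn searchcatcache3_impl = SearchCatCache3;

extern CatCache *SysCache[];    /* made extern by the benchmark patch */

HeapTuple
MySearchSysCache3(int cacheId, Datum key1, Datum key2, Datum key3)
{
    /* the direct jmpq becomes a load plus an indirect jump, as above */
    return searchcatcache3_impl(SysCache[cacheId], key1, key2, key3);
}

Whether the slowdown measured below comes mostly from this extra
indirection or from the spectre_v2 mitigation mentioned further down
is left open here, as in the original mail.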

pattern 3 | 7836.26 |  48.66 (2% slower > 1%)
pattern 3 | 7963.74 |  67.88
pattern 3 | 7966.65 | 101.07
pattern 3 | 8214.57 |  71.93
master    | 7679.74 |  62.20
master    | 7756.14 |  77.19
master    | 7867.14 |  73.33
master    | 7893.97 |  47.67

I expected this to run in almost the same time. I'm not sure whether
this is a result of the spectre_v2 mitigation, but I show the status
of my environment below.


# uname -r
4.18.0-80.11.2.el8_0.x86_64
# cat /proc/cpuinfo
...
model name      : Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz
stepping        : 12
microcode       : 0xae
bugs            : spectre_v1 spectre_v2 spec_store_bypass mds
# cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: disabled, RSB filling


I am using CentOS 8 and haven't found a handy (or on-the-fly) way to
disable them.

Attached are:

0001-benchmark.patch    : catcache benchmark extension (and core side fix)
0002-Base-change.patch  : baseline change in this series of benchmark
CatCache_Pattern_1.patch: naive branching
CatCache_Pattern_2.patch: indirect SearchCatCacheInternal
CatCache_Pattern_3.patch: indirect SearchCatCacheN

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 245e88e1b43df74273fbaa1b22f4f64621ffe9d5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 14 Nov 2019 19:24:36 +0900
Subject: [PATCH 1/2] benchmark

---
 contrib/catcachebench/Makefile               |  17 +
 contrib/catcachebench/catcachebench--0.0.sql |  14 +
 contrib/catcachebench/catcachebench.c        | 330 +++++++++++++++++++
 contrib/catcachebench/catcachebench.control  |   6 +
 src/backend/utils/cache/catcache.c           |  33 ++
 src/backend/utils/cache/syscache.c           |   2 +-
 6 files changed, 401 insertions(+), 1 deletion(-)
 create mode 100644 contrib/catcachebench/Makefile
 create mode 100644 contrib/catcachebench/catcachebench--0.0.sql
 create mode 100644 contrib/catcachebench/catcachebench.c
 create mode 100644 contrib/catcachebench/catcachebench.control

diff --git a/contrib/catcachebench/Makefile b/contrib/catcachebench/Makefile
new file mode 100644
index 0000000000..0478818b25
--- /dev/null
+++ b/contrib/catcachebench/Makefile
@@ -0,0 +1,17 @@
+MODULE_big = catcachebench
+OBJS = catcachebench.o
+
+EXTENSION = catcachebench
+DATA = catcachebench--0.0.sql
+PGFILEDESC = "catcachebench - benchmark for catcache pruning feature"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/catcachebench
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/catcachebench/catcachebench--0.0.sql b/contrib/catcachebench/catcachebench--0.0.sql
new file mode 100644
index 0000000000..ea9cd62abb
--- /dev/null
+++ b/contrib/catcachebench/catcachebench--0.0.sql
@@ -0,0 +1,14 @@
+/* contrib/catcachebench/catcachebench--0.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION catcachebench" to load this file. \quit
+
+CREATE FUNCTION catcachebench(IN type int)
+RETURNS double precision
+AS 'MODULE_PATHNAME', 'catcachebench'
+LANGUAGE C STRICT VOLATILE;
+
+CREATE FUNCTION catcachereadstats(OUT catid int, OUT reloid oid, OUT searches bigint, OUT hits bigint, OUT neg_hits bigint)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'catcachereadstats'
+LANGUAGE C STRICT VOLATILE;
diff --git a/contrib/catcachebench/catcachebench.c b/contrib/catcachebench/catcachebench.c
new file mode 100644
index 0000000000..b5a4d794ed
--- /dev/null
+++ b/contrib/catcachebench/catcachebench.c
@@ -0,0 +1,330 @@
+/*
+ * catcachebench: test code for cache pruning feature
+ */
+/* #define CATCACHE_STATS */
+#include "postgres.h"
+#include "catalog/pg_type.h"
+#include "catalog/pg_statistic.h"
+#include "executor/spi.h"
+#include "funcapi.h"
+#include "libpq/pqsignal.h"
+#include "utils/catcache.h"
+#include "utils/syscache.h"
+#include "utils/timestamp.h"
+
+Oid        tableoids[10000];
+int        ntables = 0;
+int16    attnums[1000];
+int        natts = 0;
+
+PG_MODULE_MAGIC;
+
+double catcachebench1(void);
+double catcachebench2(void);
+double catcachebench3(void);
+void collectinfo(void);
+void catcachewarmup(void);
+
+PG_FUNCTION_INFO_V1(catcachebench);
+PG_FUNCTION_INFO_V1(catcachereadstats);
+
+extern void CatalogCacheFlushCatalog2(Oid catId);
+extern int64 catcache_called;
+extern CatCache *SysCache[];
+
+typedef struct catcachestatsstate
+{
+    TupleDesc tupd;
+    int          catId;
+} catcachestatsstate;
+
+Datum
+catcachereadstats(PG_FUNCTION_ARGS)
+{
+    catcachestatsstate *state_data = NULL;
+    FuncCallContext *fctx;
+
+    if (SRF_IS_FIRSTCALL())
+    {
+        TupleDesc    tupdesc;
+        MemoryContext mctx;
+
+        fctx = SRF_FIRSTCALL_INIT();
+        mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+        state_data = palloc(sizeof(catcachestatsstate));
+
+        /* Build a tuple descriptor for our result type */
+        if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+            elog(ERROR, "return type must be a row type");
+
+        state_data->tupd = tupdesc;
+        state_data->catId = 0;
+
+        fctx->user_fctx = state_data;
+
+        MemoryContextSwitchTo(mctx);
+    }
+
+    fctx = SRF_PERCALL_SETUP();
+    state_data = fctx->user_fctx;
+
+    if (state_data->catId < SysCacheSize)
+    {
+        Datum    values[5];
+        bool    nulls[5];
+        HeapTuple    resulttup;
+        Datum    result;
+        int        catId = state_data->catId++;
+
+        memset(nulls, 0, sizeof(nulls));
+        memset(values, 0, sizeof(values));
+        values[0] = Int16GetDatum(catId);
+        values[1] = ObjectIdGetDatum(SysCache[catId]->cc_reloid);
+#ifdef CATCACHE_STATS        
+        values[2] = Int64GetDatum(SysCache[catId]->cc_searches);
+        values[3] = Int64GetDatum(SysCache[catId]->cc_hits);
+        values[4] = Int64GetDatum(SysCache[catId]->cc_neg_hits);
+#endif
+        resulttup = heap_form_tuple(state_data->tupd, values, nulls);
+        result = HeapTupleGetDatum(resulttup);
+
+        SRF_RETURN_NEXT(fctx, result);
+    }
+
+    SRF_RETURN_DONE(fctx);
+}
+
+Datum
+catcachebench(PG_FUNCTION_ARGS)
+{
+    int        testtype = PG_GETARG_INT32(0);
+    double    ms;
+
+    collectinfo();
+
+    /* flush the catalog -- safe? don't mind. */
+    CatalogCacheFlushCatalog2(StatisticRelationId);
+
+    switch (testtype)
+    {
+    case 0:
+        catcachewarmup(); /* prewarm of syscatalog */
+        PG_RETURN_NULL();
+    case 1:
+        ms = catcachebench1(); break;
+    case 2:
+        ms = catcachebench2(); break;
+    case 3:
+        ms = catcachebench3(); break;
+    default:
+        elog(ERROR, "Invalid test type: %d", testtype);
+    }
+
+    PG_RETURN_DATUM(Float8GetDatum(ms));
+}
+
+/*
+ * fetch all attribute entries of all tables.
+ */
+double
+catcachebench1(void)
+{
+    int t, a;
+    instr_time    start,
+                duration;
+
+    PG_SETMASK(&BlockSig);
+    INSTR_TIME_SET_CURRENT(start);
+    for (t = 0 ; t < ntables ; t++)
+    {
+        for (a = 0 ; a < natts ; a++)
+        {
+            HeapTuple tup;
+
+            tup = SearchSysCache3(STATRELATTINH,
+                                  ObjectIdGetDatum(tableoids[t]),
+                                  Int16GetDatum(attnums[a]),
+                                  BoolGetDatum(false));
+            /* should be null, but.. */
+            if (HeapTupleIsValid(tup))
+                ReleaseSysCache(tup);
+        }
+    }
+    INSTR_TIME_SET_CURRENT(duration);
+    INSTR_TIME_SUBTRACT(duration, start);
+    PG_SETMASK(&UnBlockSig);
+
+    return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+/*
+ * fetch all attribute entries of a table many times.
+ */
+double
+catcachebench2(void)
+{
+    int t, a;
+    instr_time    start,
+                duration;
+
+    PG_SETMASK(&BlockSig);
+    INSTR_TIME_SET_CURRENT(start);
+    for (t = 0 ; t < 240000 ; t++)
+    {
+        for (a = 0 ; a < natts ; a++)
+        {
+            HeapTuple tup;
+
+            tup = SearchSysCache3(STATRELATTINH,
+                                  ObjectIdGetDatum(tableoids[0]),
+                                  Int16GetDatum(attnums[a]),
+                                  BoolGetDatum(false));
+            /* should be null, but.. */
+            if (HeapTupleIsValid(tup))
+                ReleaseSysCache(tup);
+        }
+    }
+    INSTR_TIME_SET_CURRENT(duration);
+    INSTR_TIME_SUBTRACT(duration, start);
+    PG_SETMASK(&UnBlockSig);
+
+    return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+/*
+ * fetch all attribute entries of all tables several times, letting
+ * expiration happen.
+ */
+double
+catcachebench3(void)
+{
+    const int clock_step = 1000;
+    int i, t, a;
+    instr_time    start,
+                duration;
+
+    PG_SETMASK(&BlockSig);
+    INSTR_TIME_SET_CURRENT(start);
+    for (i = 0 ; i < 4 ; i++)
+    {
+        int ct = clock_step;
+
+        for (t = 0 ; t < ntables ; t++)
+        {
+            /*
+             * catcacheclock is updated from the transaction timestamp, so it
+             * needs to be updated by other means for this test to work. Here I
+             * chose to update the clock every 1000 table scans.
+             */
+            if (--ct < 0)
+            {
+                SetCatCacheClock(GetCurrentTimestamp());
+                ct = clock_step;
+            }
+            for (a = 0 ; a < natts ; a++)
+            {
+                HeapTuple tup;
+
+                tup = SearchSysCache3(STATRELATTINH,
+                                      ObjectIdGetDatum(tableoids[t]),
+                                      Int16GetDatum(attnums[a]),
+                                      BoolGetDatum(false));
+                /* should be null, but.. */
+                if (HeapTupleIsValid(tup))
+                    ReleaseSysCache(tup);
+            }
+        }
+    }
+    INSTR_TIME_SET_CURRENT(duration);
+    INSTR_TIME_SUBTRACT(duration, start);
+    PG_SETMASK(&UnBlockSig);
+
+    return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+void
+catcachewarmup(void)
+{
+    int t, a;
+
+    /* load up catalog tables */
+    for (t = 0 ; t < ntables ; t++)
+    {
+        for (a = 0 ; a < natts ; a++)
+        {
+            HeapTuple tup;
+
+            tup = SearchSysCache3(STATRELATTINH,
+                                  ObjectIdGetDatum(tableoids[t]),
+                                  Int16GetDatum(attnums[a]),
+                                  BoolGetDatum(false));
+            /* should be null, but.. */
+            if (HeapTupleIsValid(tup))
+                ReleaseSysCache(tup);
+        }
+    }
+}
+
+void
+collectinfo(void)
+{
+    int ret;
+    Datum    values[10000];
+    bool    nulls[10000];
+    Oid        types0[] = {OIDOID};
+    int i;
+
+    ntables = 0;
+    natts = 0;
+
+    SPI_connect();
+    /* collect target tables */
+    ret = SPI_execute("select oid from pg_class where relnamespace = (select oid from pg_namespace where nspname = \'test\')",
+                      true, 0);
+    if (ret != SPI_OK_SELECT)
+        elog(ERROR, "Failed 1");
+    if (SPI_processed == 0)
+        elog(ERROR, "no relation found in schema \"test\"");
+    if (SPI_processed > 10000)
+        elog(ERROR, "too many relation found in schema \"test\"");
+
+    for (i = 0 ; i < SPI_processed ; i++)
+    {
+        heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc,
+                          values, nulls);
+        if (nulls[0])
+            elog(ERROR, "Failed 2");
+
+        tableoids[ntables++] = DatumGetObjectId(values[0]);
+    }
+    SPI_finish();
+    elog(DEBUG1, "%d tables found", ntables);
+
+    values[0] = ObjectIdGetDatum(tableoids[0]);
+    nulls[0] = false;
+    SPI_connect();
+    ret = SPI_execute_with_args("select attnum from pg_attribute where attrelid = (select oid from pg_class where oid = $1)",
+                                1, types0, values, NULL, true, 0);
+    if (SPI_processed == 0)
+        elog(ERROR, "no attribute found in table %d", tableoids[0]);
+    if (SPI_processed > 10000)
+        elog(ERROR, "too many attributes found in table %d", tableoids[0]);
+    
+    /* collect target attributes. assuming all tables have the same attnums */
+    for (i = 0 ; i < SPI_processed ; i++)
+    {
+        int16 attnum;
+
+        heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc,
+                          values, nulls);
+        if (nulls[0])
+            elog(ERROR, "Failed 3");
+        attnum = DatumGetInt16(values[0]);
+
+        if (attnum > 0)
+            attnums[natts++] = attnum;
+    }
+    SPI_finish();
+    elog(DEBUG1, "%d attributes found", natts);
+}
diff --git a/contrib/catcachebench/catcachebench.control b/contrib/catcachebench/catcachebench.control
new file mode 100644
index 0000000000..3fc9d2e420
--- /dev/null
+++ b/contrib/catcachebench/catcachebench.control
@@ -0,0 +1,6 @@
+# catcachebench
+
+comment = 'benchmark for catcache pruning'
+default_version = '0.0'
+module_pathname = '$libdir/catcachebench'
+relocatable = true
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index c3e7d94aa5..2dd8455052 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -740,6 +740,39 @@ CatalogCacheFlushCatalog(Oid catId)
     CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call");
 }
 
+
+/* FUNCTION FOR BENCHMARKING */
+void
+CatalogCacheFlushCatalog2(Oid catId)
+{
+    slist_iter    iter;
+
+    CACHE_elog(DEBUG2, "CatalogCacheFlushCatalog called for %u", catId);
+
+    slist_foreach(iter, &CacheHdr->ch_caches)
+    {
+        CatCache   *cache = slist_container(CatCache, cc_next, iter.cur);
+
+        /* Does this cache store tuples of the target catalog? */
+        if (cache->cc_reloid == catId)
+        {
+            /* Yes, so flush all its contents */
+            ResetCatalogCache(cache);
+
+            /* Tell inval.c to call syscache callbacks for this cache */
+            CallSyscacheCallbacks(cache->id, 0);
+
+            cache->cc_nbuckets = 128;
+            pfree(cache->cc_bucket);
+            cache->cc_bucket = palloc0(128 * sizeof(dlist_head));
+            elog(LOG, "Catcache reset");
+        }
+    }
+
+    CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call");
+}
+/* END: FUNCTION FOR BENCHMARKING */
+
 /*
  *        InitCatCache
  *
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index d69c0ff813..2e282a10b4 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -983,7 +983,7 @@ static const struct cachedesc cacheinfo[] = {
     }
 };
 
-static CatCache *SysCache[SysCacheSize];
+CatCache *SysCache[SysCacheSize];
 
 static bool CacheInitialized = false;
 
-- 
2.23.0

From eebffb678b2450fbf51395de8c52f4b53a9286d1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 14 Nov 2019 20:28:29 +0900
Subject: [PATCH 2/2] Base change.

---
 src/backend/utils/cache/catcache.c | 19 ++++++++++++++++++-
 src/backend/utils/misc/guc.c       | 13 +++++++++++++
 src/include/utils/catcache.h       | 17 +++++++++++++++++
 3 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 2dd8455052..2dbc2151b1 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -60,9 +60,18 @@
 #define CACHE_elog(...)
 #endif
 
+/*
+ * GUC variable defining the minimum age, in seconds, of entries that are
+ * considered for eviction. -1 disables the feature.
+ */
+int catalog_cache_prune_min_age = 300;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Clock for the last accessed time of a catcache entry. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                                                int nkeys,
                                                Datum v1, Datum v2,
@@ -99,6 +108,12 @@ static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos,
 static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
                              Datum *srckeys, Datum *dstkeys);
 
+/* GUC assign function */
+void
+assign_catalog_cache_prune_min_age(int newval, void *extra)
+{
+    catalog_cache_prune_min_age = newval;
+}
 
 /*
  *                    internal support functions
@@ -765,7 +780,9 @@ CatalogCacheFlushCatalog2(Oid catId)
             cache->cc_nbuckets = 128;
             pfree(cache->cc_bucket);
             cache->cc_bucket = palloc0(128 * sizeof(dlist_head));
-            elog(LOG, "Catcache reset");
+            ereport(DEBUG1,
+                    (errmsg("Catcache reset"),
+                     errhidestmt(true)));
         }
     }
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4b3769b8b0..39a18a8c7a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -82,6 +82,8 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
+#include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/guc_tables.h"
 #include "utils/memutils.h"
@@ -2257,6 +2259,17 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("System catalog cache entries that live unused for longer than this many seconds are considered for removal."),
+            gettext_noop("The value of -1 turns off pruning."),
+            GUC_UNIT_S
+        },
+        &catalog_cache_prune_min_age,
+        300, -1, INT_MAX,
+        NULL, assign_catalog_cache_prune_min_age, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index ff1fabaca1..8105f19bc4 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -189,6 +190,22 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int catalog_cache_prune_min_age;
+
+/* source clock for access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/* SetCatCacheClock - set catcache timestamp source clock */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
+
+extern void assign_catalog_cache_prune_min_age(int newval, void *extra);
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.23.0

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 2dbc2151b1..81ccc0b472 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -1273,6 +1273,12 @@ SearchCatCacheInternal(CatCache *cache,
 #ifdef CATCACHE_STATS
     cache->cc_searches++;
 #endif
+    /*  cannot be true, but compiler doesn't know */
+    if (catalog_cache_prune_min_age < -1)
+    {
+        return SearchCatCache(cache, v1, v2, v3, v4); /* Never executed */
+    }
+    
 
     /* Initialize local parameter array */
     arguments[0] = v1;
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 2dbc2151b1..48a8a14c7f 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -72,11 +72,16 @@ static CatCacheHeader *CacheHdr = NULL;
 /* Clock for the last accessed time of a catcache entry. */
 TimestampTz    catcacheclock = 0;
 
-static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
+static HeapTuple SearchCatCacheInternalb(CatCache *cache,
                                                int nkeys,
                                                Datum v1, Datum v2,
                                                Datum v3, Datum v4);
 
+static HeapTuple (*SearchCatCacheInternal)(CatCache *cache,
+                                               int nkeys,
+                                               Datum v1, Datum v2,
+                                               Datum v3, Datum v4) =
+    SearchCatCacheInternalb;
 static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache,
                                                 int nkeys,
                                                 uint32 hashValue,
@@ -1245,7 +1250,7 @@ SearchCatCache4(CatCache *cache,
  * Work-horse for SearchCatCache/SearchCatCacheN.
  */
 static inline HeapTuple
-SearchCatCacheInternal(CatCache *cache,
+SearchCatCacheInternalb(CatCache *cache,
                        int nkeys,
                        Datum v1,
                        Datum v2,
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 2dbc2151b1..e4ebd07397 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -84,6 +84,26 @@ static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache,
                                                 Datum v1, Datum v2,
                                                 Datum v3, Datum v4);
 
+static HeapTuple SearchCatCacheb(CatCache *cache,
+                                 Datum v1, Datum v2, Datum v3, Datum v4);
+HeapTuple (*SearchCatCache)(CatCache *cache,
+                            Datum v1, Datum v2, Datum v3, Datum v4) =
+    SearchCatCacheb;
+static HeapTuple SearchCatCache1b(CatCache *cache, Datum v1);
+HeapTuple (*SearchCatCache1)(CatCache *cache, Datum v1) = SearchCatCache1b;
+static HeapTuple SearchCatCache2b(CatCache *cache, Datum v1, Datum v2);
+HeapTuple (*SearchCatCache2)(CatCache *cache, Datum v1, Datum v2) =
+    SearchCatCache2b;
+static HeapTuple SearchCatCache3b(CatCache *cache,
+                                  Datum v1, Datum v2, Datum v3);
+HeapTuple (*SearchCatCache3)(CatCache *cache, Datum v1, Datum v2, Datum v3) =
+    SearchCatCache3b;
+static HeapTuple SearchCatCache4b(CatCache *cache,
+                                  Datum v1, Datum v2, Datum v3, Datum v4);
+HeapTuple (*SearchCatCache4)(CatCache *cache,
+                             Datum v1, Datum v2, Datum v3, Datum v4) =
+    SearchCatCache4b;
+
 static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys,
                                            Datum v1, Datum v2, Datum v3, Datum v4);
 static uint32 CatalogCacheComputeTupleHashValue(CatCache *cache, int nkeys,
@@ -1193,8 +1213,8 @@ IndexScanOK(CatCache *cache, ScanKey cur_skey)
  * the caller need not go to the trouble of converting it to a fully
  * null-padded NAME.
  */
-HeapTuple
-SearchCatCache(CatCache *cache,
+static HeapTuple
+SearchCatCacheb(CatCache *cache,
                Datum v1,
                Datum v2,
                Datum v3,
@@ -1210,32 +1230,32 @@ SearchCatCache(CatCache *cache,
  * bit faster than SearchCatCache().
  */
 
-HeapTuple
-SearchCatCache1(CatCache *cache,
+static HeapTuple
+SearchCatCache1b(CatCache *cache,
                 Datum v1)
 {
     return SearchCatCacheInternal(cache, 1, v1, 0, 0, 0);
 }
 
 
-HeapTuple
-SearchCatCache2(CatCache *cache,
+static HeapTuple
+SearchCatCache2b(CatCache *cache,
                 Datum v1, Datum v2)
 {
     return SearchCatCacheInternal(cache, 2, v1, v2, 0, 0);
 }
 
 
-HeapTuple
-SearchCatCache3(CatCache *cache,
+static HeapTuple
+SearchCatCache3b(CatCache *cache,
                 Datum v1, Datum v2, Datum v3)
 {
     return SearchCatCacheInternal(cache, 3, v1, v2, v3, 0);
 }
 
 
-HeapTuple
-SearchCatCache4(CatCache *cache,
+static HeapTuple
+SearchCatCache4b(CatCache *cache,
                 Datum v1, Datum v2, Datum v3, Datum v4)
 {
     return SearchCatCacheInternal(cache, 4, v1, v2, v3, v4);
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 8105f19bc4..f2e0d29bc8 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -213,15 +213,15 @@ extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
                               int nbuckets);
 extern void InitCatCachePhase2(CatCache *cache, bool touch_index);
 
-extern HeapTuple SearchCatCache(CatCache *cache,
+extern HeapTuple (*SearchCatCache)(CatCache *cache,
                                 Datum v1, Datum v2, Datum v3, Datum v4);
-extern HeapTuple SearchCatCache1(CatCache *cache,
+extern HeapTuple (*SearchCatCache1)(CatCache *cache,
                                  Datum v1);
-extern HeapTuple SearchCatCache2(CatCache *cache,
+extern HeapTuple (*SearchCatCache2)(CatCache *cache,
                                  Datum v1, Datum v2);
-extern HeapTuple SearchCatCache3(CatCache *cache,
+extern HeapTuple (*SearchCatCache3)(CatCache *cache,
                                  Datum v1, Datum v2, Datum v3);
-extern HeapTuple SearchCatCache4(CatCache *cache,
+extern HeapTuple (*SearchCatCache4)(CatCache *cache,
                                  Datum v1, Datum v2, Datum v3, Datum v4);
 extern void ReleaseCatCache(HeapTuple tuple);

#! /usr/bin/perl
$collist = "";
foreach $i (0..1000) {
    $collist .= sprintf(", c%05d int", $i);
}
$collist = substr($collist, 2);

printf "drop schema if exists test cascade;\n";
printf "create schema test;\n";
foreach $i (0..2999) {
    printf "create table test.t%04d ($collist);\n", $i;
}
#!/bin/bash
LOOPS=3
USES=1
BINROOT=/home/horiguti/bin
DATADIR=/home/horiguti/data/data_catexp
PREC="numeric(10,2)"

/usr/bin/killall postgres
/usr/bin/sleep 3

run() {
    local BINARY=$1
    local PGCTL=$2/bin/pg_ctl
    local PGSQL=$2/bin/postgres
    local PSQL=$2/bin/psql

    if [ "$3" != "" ]; then
      local SETTING1="set catalog_cache_prune_min_age to \"$3\";"
      local SETTING2="set catalog_cache_prune_min_age to \"$4\";"
      local SETTING3="set catalog_cache_prune_min_age to \"$5\";"
    fi

#    ($PGSQL -D $DATADIR 2>&1 > /dev/null)&
    ($PGSQL -D $DATADIR 2>&1 > /dev/null | /usr/bin/sed -e 's/^/# /')&
    /usr/bin/sleep 3
    ${PSQL} postgres <<EOF
create extension if not exists catcachebench;
select catcachebench(0);

$SETTING3

select * from generate_series(2, 2) test,
LATERAL 
  (select '${BINARY}' as version,
          '${USES}/' || (count(r) OVER())::text as n,
          r::${PREC},
          (stddev(r) OVER ())::${PREC}
   from (select catcachebench(test) as r
         from generate_series(1, ${LOOPS})) r
   order by r limit ${USES}) r

EOF
    $PGCTL --pgdata=$DATADIR stop 2>&1 > /dev/null | /usr/bin/sed -e 's/^/# /'

#    oreport > $BINARY_perf.txt
}

for i in $(seq 0 3); do
run "E_off" $BINROOT/pgsql_catexpe "-1" "-1" "-1"
#run "E_on" $BINROOT/pgsql_catexpe "300s" "1s" "0"
run "master" $BINROOT/pgsql_master_o2 "" "" ""
done


Re: Protect syscache from bloating with negative cache entries

От
Michael Paquier
Дата:
On Tue, Nov 19, 2019 at 07:48:10PM +0900, Kyotaro Horiguchi wrote:
> I'd like to throw in food for discussion on how much SearchSysCacheN
> suffers degradation from some choices on how we can insert a code into
> the SearchSysCacheN code path.

Please note that the patch has a warning, causing cfbot-san to
complain:
catcache.c:786:1: error: no previous prototype for
‘CatalogCacheFlushCatalog2’ [-Werror=missing-prototypes]
 CatalogCacheFlushCatalog2(Oid catId)
 ^
cc1: all warnings being treated as errors

So this should at least be fixed.  For now I have moved it to next CF,
waiting on author.
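
For reference, a minimal sketch of one way to silence this class of
warning: give the non-static function a prior prototype, either directly
in catcache.c above its definition or in a header that catcache.c
includes.  The function name below is a stand-in, not the actual patch
code.

/* sketch.c -- build with: cc -Wmissing-prototypes -Werror -c sketch.c */

/*
 * Without this declaration, the non-static definition below triggers
 * "no previous prototype for 'flush_catalog_demo'".
 */
extern void flush_catalog_demo(unsigned int catId);

void
flush_catalog_demo(unsigned int catId)
{
    (void) catId;               /* the real function resets the catcaches */
}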
--
Michael

Вложения

Re: Protect syscache from bloating with negative cache entries

От
Kyotaro Horiguchi
Дата:
This is a new, complete, workable patch after a long time of struggling
with benchmarking.

At Tue, 19 Nov 2019 19:48:10 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> I ran the run2.sh script attached, which runs catcachebench2(), which
> asks SearchSysCache3() for cached entries (almost) 240000 times per
> run.  The number of each output line is the mean of 3 times runs, and
> stddev. Lines are in "time" order and edited to fit here. "gen_tbl.pl
> | psql" creates a database for the benchmark. catcachebench2() runs
> the shortest path in the three in the attached benchmark program.
> 
> (pg_ctl start)
> $ perl gen_tbl.pl | psql ...
> (pg_ctl stop)

I wonder why I took the average of the times instead of choosing the
fastest one.  This benchmark is extremely CPU intensive, so the fastest
run reliably represents the performance.

I changed the benchmark so that it shows the time of the fastest run
(run4.sh). Based on the latest result, I used pattern 3
(SearchSysCacheN indirection, wrongly labeled as pattern 1 in the last
mail) in the latest version.

I took the fastest time among 3 iterations of 5 runs each for both the
master and patched -O2 binaries.

 version |   min   
---------+---------
 master  | 7986.65 
 patched | 7984.47   = 'indirect' below

I would say this version is not degraded by the indirect calls.  So I
applied the rest of the catcache expiration patch as the succeeding
parts. After that I got a somewhat strange but very stable result: just
adding the struct members accelerates the benchmark. The numbers are the
fastest time of 20 runs of the benchmark over 10 iterations.

              ms
master      7980.79   # the master with the benchmark extension (0001)
=====
base        7340.96   # add only struct members and a GUC variable. (0002)
indirect    7998.68   # call SearchCatCacheN indirectly (0003)
=====
expire-off  7422.30   # CatCache expiration (0004)
                      # (catalog_cache_prune_min_age = -1)
expire-on   7861.13   # CatCache expiration (catalog_cache_prune_min_age = 0)


The patch accelerates catcache searches for reasons I don't understand.
I'm not sure what makes the difference between about 8000 ms and about
7400 ms, though. Building all versions several times and rerunning the
benchmark gave results with the same tendency. I will stop this work at
this point and continue later. The following files are attached.

0001-catcache-benchmark-extension.patch:
  benchmark extension used for the benchmarking here.  The test tables
  are generated using the attached gentbl2.pl. (perl gentbl2.pl | psql)

0002-base_change.patch:
  Preliminarily adds some struct members and a GUC variable to see
  whether they cause any degradation.

0003-Make-CatCacheSearchN-indirect-functions.patch:
  Rewrite the SearchCatCacheN functions to be called indirectly.

0004-CatCache-expiration-feature.patch:
  Add CatCache expiration feature.

gentbl2.pl: A script that emits SQL statements to generate test tables.
run4.sh   : The test script I used for the benchmarking here.
build2.sh : A script I used to build the four types of binaries used here.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
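
Before the patches themselves, here is a minimal, self-contained sketch of
the expiration idea that 0002 and 0004 implement: each entry records its
last access against a shared clock, and entries whose age exceeds
catalog_cache_prune_min_age become pruning candidates.  The types and
values below are illustrative only; the real patches work on the catcache
hash buckets and also keep a per-entry access counter.

/* prune_sketch.c -- illustration only; build with: cc prune_sketch.c */
#include <stdio.h>
#include <time.h>

typedef struct DemoEntry
{
    int     key;
    time_t  lastaccess;         /* analogous to catctup->lastaccess */
} DemoEntry;

static time_t demo_clock;           /* stands in for catcacheclock */
static int    prune_min_age = 300;  /* seconds; -1 disables pruning */

static int
entry_is_prunable(const DemoEntry *e)
{
    if (prune_min_age < 0)
        return 0;                   /* feature disabled */
    return (demo_clock - e->lastaccess) > prune_min_age;
}

int
main(void)
{
    DemoEntry   entries[3];
    int         i;

    /*
     * The real clock is advanced at transaction start, or forcibly, as
     * catcachebench(3) does via SetCatCacheClock().
     */
    demo_clock = time(NULL);

    entries[0].key = 1; entries[0].lastaccess = demo_clock - 10;    /* keep  */
    entries[1].key = 2; entries[1].lastaccess = demo_clock - 1000;  /* prune */
    entries[2].key = 3; entries[2].lastaccess = demo_clock - 400;   /* prune */

    for (i = 0; i < 3; i++)
        printf("key %d: %s\n", entries[i].key,
               entry_is_prunable(&entries[i]) ? "prune" : "keep");
    return 0;
}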
From dacf4a2ac9eb49099e744ee24066b94e9f78aa61 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 14 Nov 2019 19:24:36 +0900
Subject: [PATCH 1/4] catcache benchmark extension

Provides the function catcachebench(bench_no int), which runs a
CPU-intensive benchmark on catcache search. The test tables are created
by a separately provided script.

catcachebench(0): prewarms the catcache with the provided test tables.
catcachebench(1): fetches all attribute stats of all tables.
    This benchmark loads a vast number of unique entries.
    Expiration doesn't work since it runs in a transaction.
catcachebench(2): fetches all attribute stats of one table many times.
    This benchmark repeatedly accesses already-loaded entries.
    Expiration doesn't work since it runs in a transaction.
catcachebench(3): fetches all attribute stats of all tables four times.
    Unlike the other modes, this exercises expiration by forcibly
    updating the reference clock variable every 1000 table scans.

At this point the variables needed for the expiration feature are not
added yet, so SetCatCacheClock is a dummy macro that just evaluates to
its parameter.
---
 contrib/catcachebench/Makefile               |  17 +
 contrib/catcachebench/catcachebench--0.0.sql |  14 +
 contrib/catcachebench/catcachebench.c        | 330 +++++++++++++++++++
 contrib/catcachebench/catcachebench.control  |   6 +
 src/backend/utils/cache/catcache.c           |  35 ++
 src/backend/utils/cache/syscache.c           |   2 +-
 src/include/utils/catcache.h                 |   3 +
 7 files changed, 406 insertions(+), 1 deletion(-)
 create mode 100644 contrib/catcachebench/Makefile
 create mode 100644 contrib/catcachebench/catcachebench--0.0.sql
 create mode 100644 contrib/catcachebench/catcachebench.c
 create mode 100644 contrib/catcachebench/catcachebench.control

diff --git a/contrib/catcachebench/Makefile b/contrib/catcachebench/Makefile
new file mode 100644
index 0000000000..0478818b25
--- /dev/null
+++ b/contrib/catcachebench/Makefile
@@ -0,0 +1,17 @@
+MODULE_big = catcachebench
+OBJS = catcachebench.o
+
+EXTENSION = catcachebench
+DATA = catcachebench--0.0.sql
+PGFILEDESC = "catcachebench - benchmark for catcache pruning feature"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/catcachebench
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/catcachebench/catcachebench--0.0.sql b/contrib/catcachebench/catcachebench--0.0.sql
new file mode 100644
index 0000000000..ea9cd62abb
--- /dev/null
+++ b/contrib/catcachebench/catcachebench--0.0.sql
@@ -0,0 +1,14 @@
+/* contrib/catcachebench/catcachebench--0.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION catcachebench" to load this file. \quit
+
+CREATE FUNCTION catcachebench(IN type int)
+RETURNS double precision
+AS 'MODULE_PATHNAME', 'catcachebench'
+LANGUAGE C STRICT VOLATILE;
+
+CREATE FUNCTION catcachereadstats(OUT catid int, OUT reloid oid, OUT searches bigint, OUT hits bigint, OUT neg_hits bigint)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'catcachereadstats'
+LANGUAGE C STRICT VOLATILE;
diff --git a/contrib/catcachebench/catcachebench.c b/contrib/catcachebench/catcachebench.c
new file mode 100644
index 0000000000..b6c2b8f577
--- /dev/null
+++ b/contrib/catcachebench/catcachebench.c
@@ -0,0 +1,330 @@
+/*
+ * catcachebench: test code for cache pruning feature
+ */
+/* #define CATCACHE_STATS */
+#include "postgres.h"
+#include "catalog/pg_type.h"
+#include "catalog/pg_statistic.h"
+#include "executor/spi.h"
+#include "funcapi.h"
+#include "libpq/pqsignal.h"
+#include "utils/catcache.h"
+#include "utils/syscache.h"
+#include "utils/timestamp.h"
+
+Oid        tableoids[10000];
+int        ntables = 0;
+int16    attnums[1000];
+int        natts = 0;
+
+PG_MODULE_MAGIC;
+
+double catcachebench1(void);
+double catcachebench2(void);
+double catcachebench3(void);
+void collectinfo(void);
+void catcachewarmup(void);
+
+PG_FUNCTION_INFO_V1(catcachebench);
+PG_FUNCTION_INFO_V1(catcachereadstats);
+
+extern void CatalogCacheFlushCatalog2(Oid catId);
+extern int64 catcache_called;
+extern CatCache *SysCache[];
+
+typedef struct catcachestatsstate
+{
+    TupleDesc tupd;
+    int          catId;
+} catcachestatsstate;
+
+Datum
+catcachereadstats(PG_FUNCTION_ARGS)
+{
+    catcachestatsstate *state_data = NULL;
+    FuncCallContext *fctx;
+
+    if (SRF_IS_FIRSTCALL())
+    {
+        TupleDesc    tupdesc;
+        MemoryContext mctx;
+
+        fctx = SRF_FIRSTCALL_INIT();
+        mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+        state_data = palloc(sizeof(catcachestatsstate));
+
+        /* Build a tuple descriptor for our result type */
+        if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+            elog(ERROR, "return type must be a row type");
+
+        state_data->tupd = tupdesc;
+        state_data->catId = 0;
+
+        fctx->user_fctx = state_data;
+
+        MemoryContextSwitchTo(mctx);
+    }
+
+    fctx = SRF_PERCALL_SETUP();
+    state_data = fctx->user_fctx;
+
+    if (state_data->catId < SysCacheSize)
+    {
+        Datum    values[5];
+        bool    nulls[5];
+        HeapTuple    resulttup;
+        Datum    result;
+        int        catId = state_data->catId++;
+
+        memset(nulls, 0, sizeof(nulls));
+        memset(values, 0, sizeof(values));
+        values[0] = Int16GetDatum(catId);
+        values[1] = ObjectIdGetDatum(SysCache[catId]->cc_reloid);
+#ifdef CATCACHE_STATS        
+        values[2] = Int64GetDatum(SysCache[catId]->cc_searches);
+        values[3] = Int64GetDatum(SysCache[catId]->cc_hits);
+        values[4] = Int64GetDatum(SysCache[catId]->cc_neg_hits);
+#endif
+        resulttup = heap_form_tuple(state_data->tupd, values, nulls);
+        result = HeapTupleGetDatum(resulttup);
+
+        SRF_RETURN_NEXT(fctx, result);
+    }
+
+    SRF_RETURN_DONE(fctx);
+}
+
+Datum
+catcachebench(PG_FUNCTION_ARGS)
+{
+    int        testtype = PG_GETARG_INT32(0);
+    double    ms;
+
+    collectinfo();
+
+    /* flush the catalog -- safe? don't mind. */
+    CatalogCacheFlushCatalog2(StatisticRelationId);
+
+    switch (testtype)
+    {
+    case 0:
+        catcachewarmup(); /* prewarm of syscatalog */
+        PG_RETURN_NULL();
+    case 1:
+        ms = catcachebench1(); break;
+    case 2:
+        ms = catcachebench2(); break;
+    case 3:
+        ms = catcachebench3(); break;
+    default:
+        elog(ERROR, "Invalid test type: %d", testtype);
+    }
+
+    PG_RETURN_DATUM(Float8GetDatum(ms));
+}
+
+/*
+ * fetch all attribute entries of all tables.
+ */
+double
+catcachebench1(void)
+{
+    int t, a;
+    instr_time    start,
+                duration;
+
+    PG_SETMASK(&BlockSig);
+    INSTR_TIME_SET_CURRENT(start);
+    for (t = 0 ; t < ntables ; t++)
+    {
+        for (a = 0 ; a < natts ; a++)
+        {
+            HeapTuple tup;
+
+            tup = SearchSysCache3(STATRELATTINH,
+                                  ObjectIdGetDatum(tableoids[t]),
+                                  Int16GetDatum(attnums[a]),
+                                  BoolGetDatum(false));
+            /* should be null, but.. */
+            if (HeapTupleIsValid(tup))
+                ReleaseSysCache(tup);
+        }
+    }
+    INSTR_TIME_SET_CURRENT(duration);
+    INSTR_TIME_SUBTRACT(duration, start);
+    PG_SETMASK(&UnBlockSig);
+
+    return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+/*
+ * fetch all attribute entries of a table many times.
+ */
+double
+catcachebench2(void)
+{
+    int t, a;
+    instr_time    start,
+                duration;
+
+    PG_SETMASK(&BlockSig);
+    INSTR_TIME_SET_CURRENT(start);
+    for (t = 0 ; t < 240000 ; t++)
+    {
+        for (a = 0 ; a < natts ; a++)
+        {
+            HeapTuple tup;
+
+            tup = SearchSysCache3(STATRELATTINH,
+                                  ObjectIdGetDatum(tableoids[0]),
+                                  Int16GetDatum(attnums[a]),
+                                  BoolGetDatum(false));
+            /* should be null, but.. */
+            if (HeapTupleIsValid(tup))
+                ReleaseSysCache(tup);
+        }
+    }
+    INSTR_TIME_SET_CURRENT(duration);
+    INSTR_TIME_SUBTRACT(duration, start);
+    PG_SETMASK(&UnBlockSig);
+
+    return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+/*
+ * fetch all attribute entries of all tables several times, letting
+ * expiration happen.
+ */
+double
+catcachebench3(void)
+{
+    const int clock_step = 1000;
+    int i, t, a;
+    instr_time    start,
+                duration;
+
+    PG_SETMASK(&BlockSig);
+    INSTR_TIME_SET_CURRENT(start);
+    for (i = 0 ; i < 4 ; i++)
+    {
+        int ct = clock_step;
+
+        for (t = 0 ; t < ntables ; t++)
+        {
+            /*
+             * catcacheclock is updated from the transaction timestamp, so it
+             * needs to be updated by other means for this test to work. Here I
+             * chose to update the clock every 1000 table scans.
+             */
+            if (--ct < 0)
+            {
+                SetCatCacheClock(GetCurrentTimestamp());
+                ct = clock_step;
+            }
+            for (a = 0 ; a < natts ; a++)
+            {
+                HeapTuple tup;
+
+                tup = SearchSysCache3(STATRELATTINH,
+                                      ObjectIdGetDatum(tableoids[t]),
+                                      Int16GetDatum(attnums[a]),
+                                      BoolGetDatum(false));
+                /* should be null, but.. */
+                if (HeapTupleIsValid(tup))
+                    ReleaseSysCache(tup);
+            }
+        }
+    }
+    INSTR_TIME_SET_CURRENT(duration);
+    INSTR_TIME_SUBTRACT(duration, start);
+    PG_SETMASK(&UnBlockSig);
+
+    return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+void
+catcachewarmup(void)
+{
+    int t, a;
+
+    /* load up catalog tables */
+    for (t = 0 ; t < ntables ; t++)
+    {
+        for (a = 0 ; a < natts ; a++)
+        {
+            HeapTuple tup;
+
+            tup = SearchSysCache3(STATRELATTINH,
+                                  ObjectIdGetDatum(tableoids[t]),
+                                  Int16GetDatum(attnums[a]),
+                                  BoolGetDatum(false));
+            /* should be null, but.. */
+            if (HeapTupleIsValid(tup))
+                ReleaseSysCache(tup);
+        }
+    }
+}
+
+void
+collectinfo(void)
+{
+    int ret;
+    Datum    values[10000];
+    bool    nulls[10000];
+    Oid        types0[] = {OIDOID};
+    int i;
+
+    ntables = 0;
+    natts = 0;
+
+    SPI_connect();
+    /* collect target tables */
+    ret = SPI_execute("select oid from pg_class where relnamespace = (select oid from pg_namespace where nspname = \'test\')",
+                      true, 0);
+    if (ret != SPI_OK_SELECT)
+        elog(ERROR, "Failed 1");
+    if (SPI_processed == 0)
+        elog(ERROR, "no relation found in schema \"test\"");
+    if (SPI_processed > 10000)
+        elog(ERROR, "too many relation found in schema \"test\"");
+
+    for (i = 0 ; i < SPI_processed ; i++)
+    {
+        heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc,
+                          values, nulls);
+        if (nulls[0])
+            elog(ERROR, "Failed 2");
+
+        tableoids[ntables++] = DatumGetObjectId(values[0]);
+    }
+    SPI_finish();
+    elog(DEBUG1, "%d tables found", ntables);
+
+    values[0] = ObjectIdGetDatum(tableoids[0]);
+    nulls[0] = false;
+    SPI_connect();
+    ret = SPI_execute_with_args("select attnum from pg_attribute where attrelid = (select oid from pg_class where oid = $1)",
+                                1, types0, values, NULL, true, 0);
+    if (SPI_processed == 0)
+        elog(ERROR, "no attribute found in table %d", tableoids[0]);
+    if (SPI_processed > 10000)
+        elog(ERROR, "too many attributes found in table %d", tableoids[0]);
+    
+    /* collect target attributes. assuming all tables have the same attnums */
+    for (i = 0 ; i < SPI_processed ; i++)
+    {
+        int16 attnum;
+
+        heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc,
+                          values, nulls);
+        if (nulls[0])
+            elog(ERROR, "Failed 3");
+        attnum = DatumGetInt16(values[0]);
+
+        if (attnum > 0)
+            attnums[natts++] = attnum;
+    }
+    SPI_finish();
+    elog(DEBUG1, "%d attributes found", natts);
+}
diff --git a/contrib/catcachebench/catcachebench.control b/contrib/catcachebench/catcachebench.control
new file mode 100644
index 0000000000..3fc9d2e420
--- /dev/null
+++ b/contrib/catcachebench/catcachebench.control
@@ -0,0 +1,6 @@
+# catcachebench
+
+comment = 'benchmark for catcache pruning'
+default_version = '0.0'
+module_pathname = '$libdir/catcachebench'
+relocatable = true
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 64776e3209..95a4e30d2b 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -740,6 +740,41 @@ CatalogCacheFlushCatalog(Oid catId)
     CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call");
 }
 
+
+/* FUNCTION FOR BENCHMARKING */
+void
+CatalogCacheFlushCatalog2(Oid catId)
+{
+    slist_iter    iter;
+
+    CACHE_elog(DEBUG2, "CatalogCacheFlushCatalog called for %u", catId);
+
+    slist_foreach(iter, &CacheHdr->ch_caches)
+    {
+        CatCache   *cache = slist_container(CatCache, cc_next, iter.cur);
+
+        /* Does this cache store tuples of the target catalog? */
+        if (cache->cc_reloid == catId)
+        {
+            /* Yes, so flush all its contents */
+            ResetCatalogCache(cache);
+
+            /* Tell inval.c to call syscache callbacks for this cache */
+            CallSyscacheCallbacks(cache->id, 0);
+
+            cache->cc_nbuckets = 128;
+            pfree(cache->cc_bucket);
+            cache->cc_bucket = palloc0(128 * sizeof(dlist_head));
+            ereport(DEBUG1,
+                    (errmsg("Catcache reset"),
+                     errhidestmt(true)));
+        }
+    }
+
+    CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call");
+}
+/* END: FUNCTION FOR BENCHMARKING */
+
 /*
  *        InitCatCache
  *
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index 53d9ddf159..1c79a85a8c 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -983,7 +983,7 @@ static const struct cachedesc cacheinfo[] = {
     }
 };
 
-static CatCache *SysCache[SysCacheSize];
+CatCache *SysCache[SysCacheSize];
 
 static bool CacheInitialized = false;
 
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index f4aa316604..ea9e75a1ae 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -228,4 +228,7 @@ extern void PrepareToInvalidateCacheTuple(Relation relation,
 extern void PrintCatCacheLeakWarning(HeapTuple tuple);
 extern void PrintCatCacheListLeakWarning(CatCList *list);
 
+/* tentative change to allow benchmark on master branch */
+#define SetCatCacheClock(ts) (ts)
+
 #endif                            /* CATCACHE_H */
-- 
2.23.0

From a18c8f531c685682b22d304efa8bfb31401cc3b0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 10 Jan 2020 15:02:26 +0900
Subject: [PATCH 2/4] base_change

Adds the struct members needed by the catcache expiration feature and a
GUC variable that controls the behavior of the feature, but no
substantial code is added yet. This also replaces SetCatCacheClock()
with the real definition.

If the mere existence of these variables causes any degradation,
benchmarking after this patch will show it.
---
 src/backend/utils/cache/catcache.c | 15 +++++++++++++++
 src/backend/utils/misc/guc.c       | 13 +++++++++++++
 src/include/utils/catcache.h       | 23 ++++++++++++++++++++---
 3 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 95a4e30d2b..d267e5ce6e 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -60,9 +60,18 @@
 #define CACHE_elog(...)
 #endif
 
+/*
+ * GUC variable defining the minimum age, in seconds, of entries that are
+ * considered for eviction. -1 disables the feature.
+ */
+int catalog_cache_prune_min_age = 300;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Clock for the last accessed time of a catcache entry. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                                                int nkeys,
                                                Datum v1, Datum v2,
@@ -99,6 +108,12 @@ static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos,
 static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
                              Datum *srckeys, Datum *dstkeys);
 
+/* GUC assign function */
+void
+assign_catalog_cache_prune_min_age(int newval, void *extra)
+{
+    catalog_cache_prune_min_age = newval;
+}
 
 /*
  *                    internal support functions
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 62285792ec..2f2b599f61 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -83,6 +83,8 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
+#include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/guc_tables.h"
 #include "utils/memutils.h"
@@ -2280,6 +2282,17 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("System catalog cache entries that live unused for longer than this many seconds are considered for removal."),
+            gettext_noop("The value of -1 turns off pruning."),
+            GUC_UNIT_S
+        },
+        &catalog_cache_prune_min_age,
+        300, -1, INT_MAX,
+        NULL, assign_catalog_cache_prune_min_age, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index ea9e75a1ae..3d3870f05a 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,7 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
+    TimestampTz    cc_oldest_ts;    /* timestamp of the oldest tuple in the hash */
 
     /*
      * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,6 +121,8 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
+    unsigned int naccess;        /* # of access to this entry */
+    TimestampTz    lastaccess;        /* timestamp of the last usage */
 
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
@@ -189,6 +193,22 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int catalog_cache_prune_min_age;
+
+/* source clock for access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/* SetCatCacheClock - set catcache timestamp source clock */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
+
+extern void assign_catalog_cache_prune_min_age(int newval, void *extra);
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
@@ -228,7 +248,4 @@ extern void PrepareToInvalidateCacheTuple(Relation relation,
 extern void PrintCatCacheLeakWarning(HeapTuple tuple);
 extern void PrintCatCacheListLeakWarning(CatCList *list);
 
-/* tentative change to allow benchmark on master branch */
-#define SetCatCacheClock(ts) (ts)
-
 #endif                            /* CATCACHE_H */
-- 
2.23.0

From 5327bfd024ba9e5313cca39db4a1e986c299ca16 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 9 Jan 2020 19:22:18 +0900
Subject: [PATCH 3/4] Make CatCacheSearchN indirect functions

Some experiments showed that the best way to add a new feature to the
current SearchCatCache path is to make the SearchCatCacheN functions
replaceable via indirect calls. This patch does that.

If the change in how the functions are called causes any degradation by
itself, benchmarking with this patch applied will show it.
---
 src/backend/utils/cache/catcache.c | 42 +++++++++++++++++++++++-------
 src/include/utils/catcache.h       | 40 ++++++++++++++++++++++++----
 2 files changed, 67 insertions(+), 15 deletions(-)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index d267e5ce6e..74c893ba4e 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -84,6 +84,15 @@ static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache,
                                                 Datum v1, Datum v2,
                                                 Datum v3, Datum v4);
 
+static HeapTuple SearchCatCacheb(CatCache *cache,
+                                 Datum v1, Datum v2, Datum v3, Datum v4);
+static HeapTuple SearchCatCache1b(CatCache *cache, Datum v1);
+static HeapTuple SearchCatCache2b(CatCache *cache, Datum v1, Datum v2);
+static HeapTuple SearchCatCache3b(CatCache *cache,
+                                  Datum v1, Datum v2, Datum v3);
+static HeapTuple SearchCatCache4b(CatCache *cache,
+                                  Datum v1, Datum v2, Datum v3, Datum v4);
+
 static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys,
                                            Datum v1, Datum v2, Datum v3, Datum v4);
 static uint32 CatalogCacheComputeTupleHashValue(CatCache *cache, int nkeys,
@@ -108,6 +117,16 @@ static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos,
 static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
                              Datum *srckeys, Datum *dstkeys);
 
+static SearchCatCacheFuncsType catcache_base = {
+    SearchCatCacheb,
+    SearchCatCache1b,
+    SearchCatCache2b,
+    SearchCatCache3b,
+    SearchCatCache4b
+};
+
+SearchCatCacheFuncsType *SearchCatCacheFuncs = NULL;
+
 /* GUC assign function */
 void
 assign_catalog_cache_prune_min_age(int newval, void *extra)
@@ -852,6 +871,9 @@ InitCatCache(int id,
         CacheHdr = (CatCacheHeader *) palloc(sizeof(CatCacheHeader));
         slist_init(&CacheHdr->ch_caches);
         CacheHdr->ch_ntup = 0;
+
+        SearchCatCacheFuncs = &catcache_base;
+
 #ifdef CATCACHE_STATS
         /* set up to dump stats at backend exit */
         on_proc_exit(CatCachePrintStats, 0);
@@ -1193,8 +1215,8 @@ IndexScanOK(CatCache *cache, ScanKey cur_skey)
  * the caller need not go to the trouble of converting it to a fully
  * null-padded NAME.
  */
-HeapTuple
-SearchCatCache(CatCache *cache,
+static HeapTuple
+SearchCatCacheb(CatCache *cache,
                Datum v1,
                Datum v2,
                Datum v3,
@@ -1210,32 +1232,32 @@ SearchCatCache(CatCache *cache,
  * bit faster than SearchCatCache().
  */
 
-HeapTuple
-SearchCatCache1(CatCache *cache,
+static HeapTuple
+SearchCatCache1b(CatCache *cache,
                 Datum v1)
 {
     return SearchCatCacheInternal(cache, 1, v1, 0, 0, 0);
 }
 
 
-HeapTuple
-SearchCatCache2(CatCache *cache,
+static HeapTuple
+SearchCatCache2b(CatCache *cache,
                 Datum v1, Datum v2)
 {
     return SearchCatCacheInternal(cache, 2, v1, v2, 0, 0);
 }
 
 
-HeapTuple
-SearchCatCache3(CatCache *cache,
+static HeapTuple
+SearchCatCache3b(CatCache *cache,
                 Datum v1, Datum v2, Datum v3)
 {
     return SearchCatCacheInternal(cache, 3, v1, v2, v3, 0);
 }
 
 
-HeapTuple
-SearchCatCache4(CatCache *cache,
+static HeapTuple
+SearchCatCache4b(CatCache *cache,
                 Datum v1, Datum v2, Datum v3, Datum v4)
 {
     return SearchCatCacheInternal(cache, 4, v1, v2, v3, v4);
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 3d3870f05a..f9e9889339 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -189,6 +189,36 @@ typedef struct catcacheheader
     int            ch_ntup;        /* # of tuples in all caches */
 } CatCacheHeader;
 
+typedef HeapTuple (*SearchCatCache_fn)(CatCache *cache,
+                                       Datum v1, Datum v2, Datum v3, Datum v4);
+typedef HeapTuple (*SearchCatCache1_fn)(CatCache *cache, Datum v1);
+typedef HeapTuple (*SearchCatCache2_fn)(CatCache *cache, Datum v1, Datum v2);
+typedef HeapTuple (*SearchCatCache3_fn)(CatCache *cache, Datum v1, Datum v2,
+                                        Datum v3);
+typedef HeapTuple (*SearchCatCache4_fn)(CatCache *cache,
+                                        Datum v1, Datum v2, Datum v3, Datum v4);
+
+typedef struct SearchCatCacheFuncsType
+{
+    SearchCatCache_fn    SearchCatCache;
+    SearchCatCache1_fn    SearchCatCache1;
+    SearchCatCache2_fn    SearchCatCache2;
+    SearchCatCache3_fn    SearchCatCache3;
+    SearchCatCache4_fn    SearchCatCache4;
+} SearchCatCacheFuncsType;
+
+extern PGDLLIMPORT SearchCatCacheFuncsType *SearchCatCacheFuncs;
+
+#define SearchCatCache(cache, v1, v2, v3, v4) \
+    SearchCatCacheFuncs->SearchCatCache(cache, v1, v2, v3, v4)
+#define SearchCatCache1(cache, v1) \
+    SearchCatCacheFuncs->SearchCatCache1(cache, v1)
+#define SearchCatCache2(cache, v1, v2) \
+    SearchCatCacheFuncs->SearchCatCache2(cache, v1, v2)
+#define SearchCatCache3(cache, v1, v2, v3) \
+    SearchCatCacheFuncs->SearchCatCache3(cache, v1, v2, v3)
+#define SearchCatCache4(cache, v1, v2, v3, v4) \
+    SearchCatCacheFuncs->SearchCatCache4(cache, v1, v2, v3, v4)
 
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
@@ -216,15 +246,15 @@ extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
                               int nbuckets);
 extern void InitCatCachePhase2(CatCache *cache, bool touch_index);
 
-extern HeapTuple SearchCatCache(CatCache *cache,
+extern HeapTuple (*SearchCatCache)(CatCache *cache,
                                 Datum v1, Datum v2, Datum v3, Datum v4);
-extern HeapTuple SearchCatCache1(CatCache *cache,
+extern HeapTuple (*SearchCatCache1)(CatCache *cache,
                                  Datum v1);
-extern HeapTuple SearchCatCache2(CatCache *cache,
+extern HeapTuple (*SearchCatCache2)(CatCache *cache,
                                  Datum v1, Datum v2);
-extern HeapTuple SearchCatCache3(CatCache *cache,
+extern HeapTuple (*SearchCatCache3)(CatCache *cache,
                                  Datum v1, Datum v2, Datum v3);
-extern HeapTuple SearchCatCache4(CatCache *cache,
+extern HeapTuple (*SearchCatCache4)(CatCache *cache,
                                  Datum v1, Datum v2, Datum v3, Datum v4);
 extern void ReleaseCatCache(HeapTuple tuple);
 
-- 
2.23.0

From bef3df3bb2a0c2340eadf267cdfaf8d40612cd0c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 10 Jan 2020 15:08:54 +0900
Subject: [PATCH 4/4] CatCache expiration feature.

This adds the expiration feature to the catcache mechanism.

The current catcache never removes an entry, so there are cases where
many hash entries occupy a large amount of memory without ever being
accessed again. This is a quite serious issue for long-running sessions.
The expiration feature addresses this case in exchange for some degree
of degradation when it is turned on.
---
 src/backend/utils/cache/catcache.c | 343 +++++++++++++++++++++++++++--
 1 file changed, 326 insertions(+), 17 deletions(-)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 74c893ba4e..35e1a07e57 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -72,10 +72,11 @@ static CatCacheHeader *CacheHdr = NULL;
 /* Clock for the last accessed time of a catcache entry. */
 TimestampTz    catcacheclock = 0;
 
-static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
-                                               int nkeys,
-                                               Datum v1, Datum v2,
-                                               Datum v3, Datum v4);
+/* basic catcache search functions */
+static inline HeapTuple SearchCatCacheInternalb(CatCache *cache,
+                                                int nkeys,
+                                                Datum v1, Datum v2,
+                                                Datum v3, Datum v4);
 
 static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache,
                                                 int nkeys,
@@ -93,6 +94,23 @@ static HeapTuple SearchCatCache3b(CatCache *cache,
 static HeapTuple SearchCatCache4b(CatCache *cache,
                                   Datum v1, Datum v2, Datum v3, Datum v4);
 
+/* catcache search functions with expiration feature */
+static inline HeapTuple SearchCatCacheInternale(CatCache *cache,
+                                                int nkeys,
+                                                Datum v1, Datum v2,
+                                                Datum v3, Datum v4);
+
+static HeapTuple SearchCatCachee(CatCache *cache,
+                                 Datum v1, Datum v2, Datum v3, Datum v4);
+static HeapTuple SearchCatCache1e(CatCache *cache, Datum v1);
+static HeapTuple SearchCatCache2e(CatCache *cache, Datum v1, Datum v2);
+static HeapTuple SearchCatCache3e(CatCache *cache,
+                                  Datum v1, Datum v2, Datum v3);
+static HeapTuple SearchCatCache4e(CatCache *cache,
+                                  Datum v1, Datum v2, Datum v3, Datum v4);
+
+static bool CatCacheCleanupOldEntries(CatCache *cp);
+
 static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys,
                                            Datum v1, Datum v2, Datum v3, Datum v4);
 static uint32 CatalogCacheComputeTupleHashValue(CatCache *cache, int nkeys,
@@ -125,13 +143,35 @@ static SearchCatCacheFuncsType catcache_base = {
     SearchCatCache4b
 };
 
+static SearchCatCacheFuncsType catcache_expire = {
+    SearchCatCachee,
+    SearchCatCache1e,
+    SearchCatCache2e,
+    SearchCatCache3e,
+    SearchCatCache4e
+};
+
 SearchCatCacheFuncsType *SearchCatCacheFuncs = NULL;
 
+/* set catcache function set according to guc variables */
+static void
+set_catcache_functions(void)
+{
+    if (catalog_cache_prune_min_age < 0)
+        SearchCatCacheFuncs = &catcache_base;
+    else
+        SearchCatCacheFuncs = &catcache_expire;
+}
+
+
 /* GUC assign function */
 void
 assign_catalog_cache_prune_min_age(int newval, void *extra)
 {
     catalog_cache_prune_min_age = newval;
+
+    /* choose corresponding function set */
+    set_catcache_functions();
 }
 
 /*
@@ -872,7 +912,7 @@ InitCatCache(int id,
         slist_init(&CacheHdr->ch_caches);
         CacheHdr->ch_ntup = 0;
 
-        SearchCatCacheFuncs = &catcache_base;
+        set_catcache_functions();
 
 #ifdef CATCACHE_STATS
         /* set up to dump stats at backend exit */
@@ -938,6 +978,10 @@ RehashCatCache(CatCache *cp)
     elog(DEBUG1, "rehashing catalog cache id %d for %s; %d tups, %d buckets",
          cp->id, cp->cc_relname, cp->cc_ntup, cp->cc_nbuckets);
 
+    /* try removing old entries before expanding hash */
+    if (CatCacheCleanupOldEntries(cp))
+        return;
+
     /* Allocate a new, larger, hash table. */
     newnbuckets = cp->cc_nbuckets * 2;
     newbucket = (dlist_head *) MemoryContextAllocZero(CacheMemoryContext, newnbuckets * sizeof(dlist_head));
@@ -1222,7 +1266,7 @@ SearchCatCacheb(CatCache *cache,
                Datum v3,
                Datum v4)
 {
-    return SearchCatCacheInternal(cache, cache->cc_nkeys, v1, v2, v3, v4);
+    return SearchCatCacheInternalb(cache, cache->cc_nkeys, v1, v2, v3, v4);
 }
 
 
@@ -1236,7 +1280,7 @@ static HeapTuple
 SearchCatCache1b(CatCache *cache,
                 Datum v1)
 {
-    return SearchCatCacheInternal(cache, 1, v1, 0, 0, 0);
+    return SearchCatCacheInternalb(cache, 1, v1, 0, 0, 0);
 }
 
 
@@ -1244,7 +1288,7 @@ static HeapTuple
 SearchCatCache2b(CatCache *cache,
                 Datum v1, Datum v2)
 {
-    return SearchCatCacheInternal(cache, 2, v1, v2, 0, 0);
+    return SearchCatCacheInternalb(cache, 2, v1, v2, 0, 0);
 }
 
 
@@ -1252,7 +1296,7 @@ static HeapTuple
 SearchCatCache3b(CatCache *cache,
                 Datum v1, Datum v2, Datum v3)
 {
-    return SearchCatCacheInternal(cache, 3, v1, v2, v3, 0);
+    return SearchCatCacheInternalb(cache, 3, v1, v2, v3, 0);
 }
 
 
@@ -1260,19 +1304,19 @@ static HeapTuple
 SearchCatCache4b(CatCache *cache,
                 Datum v1, Datum v2, Datum v3, Datum v4)
 {
-    return SearchCatCacheInternal(cache, 4, v1, v2, v3, v4);
+    return SearchCatCacheInternalb(cache, 4, v1, v2, v3, v4);
 }
 
 /*
- * Work-horse for SearchCatCache/SearchCatCacheN.
+ * Work-horse for SearchCatCacheb/SearchCatCacheNb.
  */
 static inline HeapTuple
-SearchCatCacheInternal(CatCache *cache,
-                       int nkeys,
-                       Datum v1,
-                       Datum v2,
-                       Datum v3,
-                       Datum v4)
+SearchCatCacheInternalb(CatCache *cache,
+                        int nkeys,
+                        Datum v1,
+                        Datum v2,
+                        Datum v3,
+                        Datum v4)
 {
     Datum        arguments[CATCACHE_MAXKEYS];
     uint32        hashValue;
@@ -1497,6 +1541,269 @@ SearchCatCacheMiss(CatCache *cache,
     return &ct->tuple;
 }
 
+/*
+ *    SearchCatCache with entry pruning
+ *
+ *  These functions work the same way as the SearchCatCacheNb() functions,
+ *  except that less-used entries are removed according to the
+ *  catalog_cache_prune_min_age setting.
+ */
+static HeapTuple
+SearchCatCachee(CatCache *cache,
+               Datum v1,
+               Datum v2,
+               Datum v3,
+               Datum v4)
+{
+    return SearchCatCacheInternale(cache, cache->cc_nkeys, v1, v2, v3, v4);
+}
+
+
+/*
+ * SearchCatCacheN() are SearchCatCache() versions for a specific number of
+ * arguments. The compiler can inline the body and unroll loops, making them a
+ * bit faster than SearchCatCache().
+ */
+
+static HeapTuple
+SearchCatCache1e(CatCache *cache,
+                Datum v1)
+{
+    return SearchCatCacheInternale(cache, 1, v1, 0, 0, 0);
+}
+
+
+static HeapTuple
+SearchCatCache2e(CatCache *cache,
+                Datum v1, Datum v2)
+{
+    return SearchCatCacheInternale(cache, 2, v1, v2, 0, 0);
+}
+
+
+static HeapTuple
+SearchCatCache3e(CatCache *cache,
+                Datum v1, Datum v2, Datum v3)
+{
+    return SearchCatCacheInternale(cache, 3, v1, v2, v3, 0);
+}
+
+
+static HeapTuple
+SearchCatCache4e(CatCache *cache,
+                Datum v1, Datum v2, Datum v3, Datum v4)
+{
+    return SearchCatCacheInternale(cache, 4, v1, v2, v3, v4);
+}
+
+/*
+ * Work-horse for SearchCatCachee/SearchCatCacheNe.
+ */
+static inline HeapTuple
+SearchCatCacheInternale(CatCache *cache,
+                        int nkeys,
+                        Datum v1,
+                        Datum v2,
+                        Datum v3,
+                        Datum v4)
+{
+    Datum        arguments[CATCACHE_MAXKEYS];
+    uint32        hashValue;
+    Index        hashIndex;
+    dlist_iter    iter;
+    dlist_head *bucket;
+    CatCTup    *ct;
+
+    /* Make sure we're in an xact, even if this ends up being a cache hit */
+    Assert(IsTransactionState());
+
+    Assert(cache->cc_nkeys == nkeys);
+
+    /*
+     * one-time startup overhead for each cache
+     */
+    if (unlikely(cache->cc_tupdesc == NULL))
+        CatalogCacheInitializeCache(cache);
+
+#ifdef CATCACHE_STATS
+    cache->cc_searches++;
+#endif
+
+    /* Initialize local parameter array */
+    arguments[0] = v1;
+    arguments[1] = v2;
+    arguments[2] = v3;
+    arguments[3] = v4;
+
+    /*
+     * find the hash bucket in which to look for the tuple
+     */
+    hashValue = CatalogCacheComputeHashValue(cache, nkeys, v1, v2, v3, v4);
+    hashIndex = HASH_INDEX(hashValue, cache->cc_nbuckets);
+
+    /*
+     * scan the hash bucket until we find a match or exhaust our tuples
+     *
+     * Note: it's okay to use dlist_foreach here, even though we modify the
+     * dlist within the loop, because we don't continue the loop afterwards.
+     */
+    bucket = &cache->cc_bucket[hashIndex];
+    dlist_foreach(iter, bucket)
+    {
+        ct = dlist_container(CatCTup, cache_elem, iter.cur);
+
+        if (ct->dead)
+            continue;            /* ignore dead entries */
+
+        if (ct->hash_value != hashValue)
+            continue;            /* quickly skip entry if wrong hash val */
+
+        if (!CatalogCacheCompareTuple(cache, nkeys, ct->keys, arguments))
+            continue;
+
+        /*
+         * We found a match in the cache.  Move it to the front of the list
+         * for its hashbucket, in order to speed subsequent searches.  (The
+         * most frequently accessed elements in any hashbucket will tend to be
+         * near the front of the hashbucket's list.)
+         */
+        dlist_move_head(bucket, &ct->cache_elem);
+
+        /*
+         * Prolong the life of this entry. Since we want to run as few
+         * instructions as possible and want the branch to be predictable for
+         * performance reasons, we don't put a strict cap on the counter. All
+         * values above 1 are regarded as 2 in CatCacheCleanupOldEntries().
+         */
+        ct->naccess++;
+        if (unlikely(ct->naccess == 0))
+            ct->naccess = 2;
+        ct->lastaccess = catcacheclock;
+
+        /*
+         * If it's a positive entry, bump its refcount and return it. If it's
+         * negative, we can report failure to the caller.
+         */
+        if (!ct->negative)
+        {
+            ResourceOwnerEnlargeCatCacheRefs(CurrentResourceOwner);
+            ct->refcount++;
+            ResourceOwnerRememberCatCacheRef(CurrentResourceOwner, &ct->tuple);
+
+            CACHE_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d",
+                       cache->cc_relname, hashIndex);
+
+#ifdef CATCACHE_STATS
+            cache->cc_hits++;
+#endif
+
+            return &ct->tuple;
+        }
+        else
+        {
+            CACHE_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d",
+                       cache->cc_relname, hashIndex);
+
+#ifdef CATCACHE_STATS
+            cache->cc_neg_hits++;
+#endif
+
+            return NULL;
+        }
+    }
+
+    return SearchCatCacheMiss(cache, nkeys, hashValue, hashIndex, v1, v2, v3, v4);
+}
+
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left unused for a long time for various reasons.
+ * Remove such entries to keep the catcache from bloating. The algorithm is
+ * similar to buffer eviction: entries that are accessed several times in a
+ * certain period live longer than those that have been accessed less often
+ * in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int        nremoved = 0;
+    int        i;
+    long    oldest_ts = catcacheclock;
+    long    age;
+    int        us;
+
+    /* Return immediately if disabled */
+    if (catalog_cache_prune_min_age < 0)
+        return false;
+
+    /* Don't scan the hash when we know we don't have prunable entries */
+    TimestampDifference(cp->cc_oldest_ts, catcacheclock, &age, &us);
+    if (age < catalog_cache_prune_min_age)
+        return false;
+
+    /* Scan over the whole hash to find entries to remove */
+    for (i = 0 ; i < cp->cc_nbuckets ; i++)
+    {
+        dlist_mutable_iter    iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+
+            /* Don't remove referenced entries */
+            if (ct->refcount == 0 &&
+                (ct->c_list == NULL || ct->c_list->refcount == 0))
+            {
+                /*
+                 * Calculate the duration from the last access to the
+                 * "current" time.  catcacheclock is updated on a
+                 * per-statement basis and additionally updated periodically
+                 * during a long-running query.
+                 */
+                TimestampDifference(ct->lastaccess, catcacheclock, &age, &us);
+
+                if (age > catalog_cache_prune_min_age)
+                {
+                    /*
+                     * Entries that have not been accessed since the last
+                     * pruning round are removed after that many seconds, and
+                     * their lives are prolonged according to how many times
+                     * they were accessed, up to three times that duration.
+                     * We don't try to shrink the buckets since pruning
+                     * effectively caps catcache expansion in the long term.
+                     */
+                    if (ct->naccess > 2)
+                        ct->naccess = 1;
+                    else if (ct->naccess > 0)
+                        ct->naccess--;
+                    else
+                    {
+                        CatCacheRemoveCTup(cp, ct);
+                        nremoved++;
+
+                        /* don't update oldest_ts by removed entry */
+                        continue;
+                    }
+                }
+            }
+
+            /* update oldest timestamp if the entry remains alive */
+            if (ct->lastaccess < oldest_ts)
+                oldest_ts = ct->lastaccess;
+        }
+    }
+
+    cp->cc_oldest_ts = oldest_ts;
+
+    if (nremoved > 0)
+        elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d",
+             cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved);
+
+    return nremoved > 0;
+}
+
+
 /*
  *    ReleaseCatCache
  *
@@ -1960,6 +2267,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
-- 
2.23.0

#! /usr/bin/perl
$collist = "";
foreach $i (0..1000) {
    $collist .= sprintf(", c%05d int", $i);
}
$collist = substr($collist, 2);

printf "drop schema if exists test cascade;\n";
printf "create schema test;\n";
printf "create table test.p ($collist) partition by list (c00000);\n";
foreach $i (0..2999) {
    printf "create table test.t%04d partition of test.p for values in (%d);\n", $i, $i;
}
#!/bin/bash
LOOPS=20
ITERATION=10
BINROOT=/home/horiguti/bin
DATADIR=/home/horiguti/data/data_catexpe
PREC="numeric(10,2)"

/usr/bin/killall postgres
/usr/bin/sleep 3

run() {
    local BINARY=$1
    local PGCTL=$2/bin/pg_ctl
    local PGSQL=$2/bin/postgres
    local PSQL=$2/bin/psql

    if [ "$3" != "" ]; then
      local SETTING1="set catalog_cache_prune_min_age to \"$3\";"
      local SETTING2="set catalog_cache_prune_min_age to \"$4\";"
      local SETTING3="set catalog_cache_prune_min_age to \"$5\";"
    fi

#    ($PGSQL -D $DATADIR 2>&1 > /dev/null)&
    ($PGSQL -D $DATADIR 2>&1 > /dev/null | /usr/bin/sed -e 's/^/# /')&
    /usr/bin/sleep 3
    ${PSQL} postgres <<EOF
create extension if not exists catcachebench;
select catcachebench(0);

$SETTING3

select * from generate_series(2, 2) test,
LATERAL 
  (select '${BINARY}' as version,
          count(r)::text || '/${LOOPS}' as n,
          min(r)::${PREC},
          stddev(r)::${PREC}
   from (select catcachebench(test) as r
            from generate_series(1, ${LOOPS})) r) r

EOF
    $PGCTL --pgdata=$DATADIR stop 2>&1 > /dev/null | /usr/bin/sed -e 's/^/# /'

#    oreport > $BINARY_perf.txt
}

for i in $(seq 0 ${ITERATION}); do
run "master" $BINROOT/pgsql_master_o2 "" "" ""
run "base" $BINROOT/pgsql_catexp-base "" "" ""
run "ind" $BINROOT/pgsql_catexp-ind "" "" ""
run "expire-off" $BINROOT/pgsql_catexpe "-1" "-1" "-1"
run "expire-on" $BINROOT/pgsql_catexpe "300s" "1s" "0"
done

#! /usr/bin/bash

BINROOT=/home/horiguti/bin/pgsql_

for i in master_o2 catexp-base catexp-ind catexpe; do rm -r /home/horiguti/bin/pgsql_$i/*; done
for i in master_o2 catexp-base catexp-ind catexpe; do ls -l /home/horiguti/bin/pgsql_$i/bin/postgres; done


function build () {
    echo $1
    make distclean
    git checkout $2
    git diff master..HEAD > diff_$1.txt
    ./configure --enable-debug --enable-tap-tests --enable-nls --with-openssl --with-libxml --with-llvm \
        --prefix=${BINROOT}$1 LLVM_CONFIG="/usr/bin/llvm-config"

    make -sj8 all
    make install
    cd contrib/catcachebench
    make clean
    make all
    make install
    cd ../..
}

build "master_o2" "795e92756cd1"
build "catexp-base" "b2ebc9b4f1c"
build "catexp-ind" "631a04026d"
build "catexpe" "025e5e8a98d"

for i in master_o2 catexp-base catexp-ind catexpe; do ls -l /home/horiguti/bin/pgsql_$i/bin/postgres; done

Re: Protect syscache from bloating with negative cache entries

От
Tomas Vondra
Дата:
Hello Kyotaro-san,

I see this patch is stuck in WoA since 2019/12/01, although there's a
new patch version from 2020/01/14. But the patch seems to no longer
apply, at least according to https://commitfest.cputube.org :-( So at
this point the status is actually correct.

Not sure about the appveyor build (it seems to be about jsonb_set_lax),
but on travis it fails like this:

   catcache.c:820:1: error: no previous prototype for ‘CatalogCacheFlushCatalog2’ [-Werror=missing-prototypes]

so I'll leave it in WoA for now.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Protect syscache from bloating with negative cache entries

От
Alvaro Herrera
Дата:
On 2020-Jan-21, Tomas Vondra wrote:

> Not sure about the appveyor build (it seems to be about jsonb_set_lax),
> but on travis it fails like this:
> 
>   catcache.c:820:1: error: no previous prototype for ‘CatalogCacheFlushCatalog2’ [-Werror=missing-prototypes]

Hmm ... travis is running -Werror?  That seems overly strict.  I think
we shouldn't punt a patch because of that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Protect syscache from bloating with negative cache entries

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2020-Jan-21, Tomas Vondra wrote:
>> Not sure about the appveyor build (it seems to be about jsonb_set_lax),

FWIW, I think I fixed jsonb_set_lax yesterday, so that problem should
be gone the next time the cfbot tries this.

>> but on travis it fails like this:
>> catcache.c:820:1: error: no previous prototype for ‘CatalogCacheFlushCatalog2’ [-Werror=missing-prototypes]

> Hmm ... travis is running -Werror?  That seems overly strict.  I think
> we shouldn't punt a patch because of that.

Why not?  We're not going to allow pushing a patch that throws warnings
on common compilers.  Or if that does happen, some committer is going
to have to spend time cleaning it up.  Better to clean it up sooner.

(There is, btw, at least one buildfarm animal using -Werror.)

            regards, tom lane



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro Horiguchi
Дата:
Hello.

At Tue, 21 Jan 2020 14:17:53 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > On 2020-Jan-21, Tomas Vondra wrote:
> >> Not sure about the appveyor build (it seems to be about jsonb_set_lax),
> 
> FWIW, I think I fixed jsonb_set_lax yesterday, so that problem should
> be gone the next time the cfbot tries this.
> 
> >> but on travis it fails like this:
> >> catcache.c:820:1: error: no previous prototype for ‘CatalogCacheFlushCatalog2’ [-Werror=missing-prototypes]
> 
> > Hmm ... travis is running -Werror?  That seems overly strict.  I think
> > we shouldn't punt a patch because of that.
> 
> Why not?  We're not going to allow pushing a patch that throws warnings
> on common compilers.  Or if that does happen, some committer is going
> to have to spend time cleaning it up.  Better to clean it up sooner.
> 
> (There is, btw, at least one buildfarm animal using -Werror.)

Mmm. The cause of the error is the tentative (or crude) benchmarking
function, provided as an extension, which is not actually a part of the
patch and was included for the reviewers' convenience.
However, I don't intend to make it work on Windows builds. If that is
regarded as a reason for the patch being punted, I'll repost a new version
without the benchmark soon.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Protect syscache from bloating with negative cache entries

От
Michael Paquier
Дата:
On Tue, Jan 21, 2020 at 02:17:53PM -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> Hmm ... travis is running -Werror?  That seems overly strict.  I think
>> we shouldn't punt a patch because of that.
>
> Why not?  We're not going to allow pushing a patch that throws warnings
> on common compilers.  Or if that does happen, some committer is going
> to have to spend time cleaning it up.  Better to clean it up sooner.
>
> (There is, btw, at least one buildfarm animal using -Werror.)

I agree that it is good to have in Mr Robot.  More early detection
means less follow-up cleanup.
--
Michael


Re: Protect syscache from bloating with negative cache entries

От
Kyotaro Horiguchi
Дата:
At Tue, 21 Jan 2020 17:29:47 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in 
> I see this patch is stuck in WoA since 2019/12/01, although there's a
> new patch version from 2020/01/14. But the patch seems to no longer
> apply, at least according to https://commitfest.cputube.org :-( So at
> this point the status is actually correct.
> 
> Not sure about the appveyor build (it seems to be about
> jsonb_set_lax),
> but on travis it fails like this:
> 
>   catcache.c:820:1: error: no previous prototype for
>   ‘CatalogCacheFlushCatalog2’ [-Werror=missing-prototypes]

I changed my mind and attached the benchmark patch as a .txt file,
expecting the checkers not to pick it up as part of the patchset.

I have been in precise-performance-measurement mode for a long time,
but I think that is settled now. I'd like to return to normal mode and
explain this patch.

=== Motive of the patch

The system cache is a mechanism that accelerates access to the system
catalogs. Basically, entries in a cache are removed via the invalidation
mechanism when the corresponding system catalog entry is removed. The system
cache also holds "negative" entries, which indicate that an object is
nonexistent and accelerate the response for such objects. But negative
entries never get a chance to be removed.

In a long-lived session that accepts a wide variety of queries on many
objects, the system cache accumulates entries for many objects that are
accessed only once or a few times. Suppose every object is accessed once
every, say, 30 minutes, and the queries don't need to run in a very short
time. Such cache entries are almost useless, yet they occupy a large amount
of memory.


=== Possible solutions

Many caching systems have an expiration mechanism, which removes "useless"
entries to keep the size under a certain limit.  The limit is typically
defined by memory usage or expiration time, enforced in a hard or soft way.
Since we don't implement detailed accounting of memory usage by the caches,
for performance reasons, we can use either coarse memory accounting or
expiration time.  This patch uses expiration time because it can be
determined on a rather clearer basis.


=== Pruning timing

The next point is when to prune cache entries. It is clearly not reasonable
to do it on every cache access, since pruning takes far longer than a cache
access.

The system cache is implemented on a hash table. When there's no room for a
new cache entry, the table doubles in size and rehashes all entries.  If
pruning frees some space for the new entry, rehashing can be avoided, so
this patch tries pruning just before enlarging the hash table.

A system cache could also be shrunk when less than half of its size is used,
but this patch doesn't do that.  We cannot predict whether a cache that has
just shrunk is going to be enlarged again right after, and I don't want to
make this patch that complex.
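
To make the hook concrete, here is a condensed sketch (not the verbatim
diff) of how pruning is wired into hash enlargement; RehashCatCache() and
CatCacheCleanupOldEntries() are the names used in the attached patch, and
the enlargement details are elided.

/*
 * Condensed sketch: pruning piggy-backs on hash enlargement, so it only
 * runs when the cache has actually kept growing since the last round.
 * (Simplified from RehashCatCache() in the attached patch.)
 */
static void
RehashCatCache(CatCache *cp)
{
    /* Try evicting stale entries first; if that freed space, skip growing. */
    if (CatCacheCleanupOldEntries(cp))
        return;

    /* ... otherwise double cc_nbuckets and rehash all entries as before ... */
}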


=== Performance

The pruning mechanism adds a few members to each cache entry and updates
them on every access, which has a cost.

The system cache is very lightweight machinery, so inserting even one extra
branch noticeably affects performance. So in this patch, the new code is
isolated from the existing code path using indirect calls. After trying
several call points that could be turned into indirect calls, I found that
SearchCatCache[1-4]() is the only point that doesn't affect performance.
(Please see upthread for details.)  That configuration also allows future
implementations of system caches, such as shared system caches.

The alternative SearchCatCache[1-4] functions get a bit slower because they
maintain the access timestamp and access counter.  In addition, pruning
takes a certain amount of extra time even when no entries end up being
pruned off.
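
The indirect-call arrangement boils down to a function-pointer table
selected once from the GUC. The following is a condensed sketch of the
relevant pieces from the attached patches (only one slot is spelled out;
the real struct has all five search functions).

typedef HeapTuple (*SearchCatCache1_fn) (CatCache *cache, Datum v1);

typedef struct SearchCatCacheFuncsType
{
    SearchCatCache1_fn    SearchCatCache1;
    /* ... the other four search functions likewise ... */
} SearchCatCacheFuncsType;

static SearchCatCacheFuncsType catcache_base   = { SearchCatCache1b /* , ... */ };
static SearchCatCacheFuncsType catcache_expire = { SearchCatCache1e /* , ... */ };

SearchCatCacheFuncsType *SearchCatCacheFuncs = NULL;

/* Select the function set according to the GUC, outside the hot path. */
static void
set_catcache_functions(void)
{
    if (catalog_cache_prune_min_age < 0)
        SearchCatCacheFuncs = &catcache_base;    /* expiration disabled */
    else
        SearchCatCacheFuncs = &catcache_expire;  /* expiration enabled */
}

/* Call sites are unchanged; they go through a macro in catcache.h. */
#define SearchCatCache1(cache, v1) \
    SearchCatCacheFuncs->SearchCatCache1(cache, v1)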


=== Pruning criteria

At the pruning time described above, every entry is examined against the
GUC variable catalog_cache_prune_min_age. The pruning mechanism uses a
clock-sweep-like scheme in which an entry lives longer if it has been
accessed. An entry whose access counter is zero is pruned once it is older
than catalog_cache_prune_min_age; otherwise the entry survives the pruning
round and its counter is decremented.

The timestamp used for all of this is "catcacheclock", which is updated at
every transaction start.
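
In code terms, the per-entry decision looks roughly like the following
(condensed from CatCacheCleanupOldEntries() in the attached patch; the
refcount checks and surrounding bookkeeping are omitted).

/* How long has this entry been idle, measured against catcacheclock? */
TimestampDifference(ct->lastaccess, catcacheclock, &age, &us);

if (age > catalog_cache_prune_min_age)
{
    if (ct->naccess > 2)
        ct->naccess = 1;      /* accessed many times: gets another grace round */
    else if (ct->naccess > 0)
        ct->naccess--;        /* accessed since the last round: survives, ages */
    else
    {
        CatCacheRemoveCTup(cp, ct);   /* counter exhausted: evict it */
        nremoved++;
    }
}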


=== Concise test

The attached test1.pl can be used to reproduce the syscache bloat caused by
negative entries. With $prune_age set to -1, pruning is turned off and you
can see the backend take more and more memory, without limit, as time goes
on. With it set to 10 or so, the memory size of the backend process stops
rising at a certain amount.


=== The patch

The following attached files are the patch set. They were separated for
benchmarking reasons, but that also seems to make the patch easier to read,
so I have left it that way.  I lost track of the correct version number
during the long benchmarking period, so I am restarting from v1 now.

- v1-0001-base_change.patch
  Adds new members to existing structs and catcacheclock-related code.

- v1-0002-Make-CatCacheSearchN-indirect-functions.patch
  Changes SearchCatCacheN functions to be called by indirect calls.

- v1-0003-CatCache-expiration-feature.patch
  The core code of the patch.

- catcache-benchmark-extension.patch.txt
  The benchmarking extension that was used for the measurements
  upthread. Just for information.

- test1.pl
  Test script to make syscache bloat.


The patchset doesn't contain documentation for the new GUC option yet. I
will add it later.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From fb80260907ac4ac0ff330806632f095484772fd1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 10 Jan 2020 15:02:26 +0900
Subject: [PATCH v1 1/3] base_change

Adds the struct members needed by the catcache expiration feature and a GUC
variable that controls the behavior of the feature, but no substantial
code is added yet.

If the existence of these variables alone can cause degradation,
benchmarking after this patch will show it.
---
 src/backend/access/transam/xact.c  |  3 +++
 src/backend/utils/cache/catcache.c | 15 +++++++++++++++
 src/backend/utils/misc/guc.c       | 13 +++++++++++++
 src/include/utils/catcache.h       | 20 ++++++++++++++++++++
 4 files changed, 51 insertions(+)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 017f03b6d8..1268a7fb80 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1067,6 +1067,9 @@ ForceSyncCommit(void)
 static void
 AtStart_Cache(void)
 {
+    if (xactStartTimestamp != 0)
+        SetCatCacheClock(xactStartTimestamp);
+
     AcceptInvalidationMessages();
 }
 
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 64776e3209..7248bd0d41 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -60,9 +60,18 @@
 #define CACHE_elog(...)
 #endif
 
+/*
+ * GUC variable to define the minimum age of entries that will be considered
+ * to be evicted in seconds. -1 to disable the feature.
+ */
+int catalog_cache_prune_min_age = 300;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Clock for the last accessed time of a catcache entry. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                                                int nkeys,
                                                Datum v1, Datum v2,
@@ -99,6 +108,12 @@ static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos,
 static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
                              Datum *srckeys, Datum *dstkeys);
 
+/* GUC assign function */
+void
+assign_catalog_cache_prune_min_age(int newval, void *extra)
+{
+    catalog_cache_prune_min_age = newval;
+}
 
 /*
  *                    internal support functions
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e44f71e991..3029e44d7a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -83,6 +83,8 @@
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
+#include "utils/guc_tables.h"
 #include "utils/float.h"
 #include "utils/guc_tables.h"
 #include "utils/memutils.h"
@@ -2293,6 +2295,17 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("System catalog cache entries that live unused for longer than this number of seconds are considered for removal."),
+            gettext_noop("The value of -1 turns off pruning."),
+            GUC_UNIT_S
+        },
+        &catalog_cache_prune_min_age,
+        300, -1, INT_MAX,
+        NULL, assign_catalog_cache_prune_min_age, NULL
+    },
+
     /*
      * We use the hopefully-safely-small value of 100kB as the compiled-in
      * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index f4aa316604..3d3870f05a 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,7 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
+    TimestampTz    cc_oldest_ts;    /* timestamp of the oldest tuple in the hash */
 
     /*
      * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,6 +121,8 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
+    unsigned int naccess;        /* # of access to this entry */
+    TimestampTz    lastaccess;        /* timestamp of the last usage */
 
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
@@ -189,6 +193,22 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int catalog_cache_prune_min_age;
+
+/* source clock for access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/* SetCatCacheClock - set catcache timestamp source clock */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
+
+extern void assign_catalog_cache_prune_min_age(int newval, void *extra);
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.23.0

From 2b4449372acfbdf728a79f43dec0a0109c30228d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 9 Jan 2020 19:22:18 +0900
Subject: [PATCH v1 2/3] Make CatCacheSearchN indirect functions

Some experiments showed that the best way to add a new feature to the
current CatCacheSearch path is to make the SearchCatCacheN functions
replaceable via indirect calls. This patch does that.

If the change in how the functions are called causes degradation by
itself, benchmarking with this patch applied will show it.
---
 src/backend/utils/cache/catcache.c | 42 +++++++++++++++++++++++-------
 src/include/utils/catcache.h       | 40 ++++++++++++++++++++++++----
 2 files changed, 67 insertions(+), 15 deletions(-)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 7248bd0d41..a4e3676a89 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -84,6 +84,15 @@ static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache,
                                                 Datum v1, Datum v2,
                                                 Datum v3, Datum v4);
 
+static HeapTuple SearchCatCacheb(CatCache *cache,
+                                 Datum v1, Datum v2, Datum v3, Datum v4);
+static HeapTuple SearchCatCache1b(CatCache *cache, Datum v1);
+static HeapTuple SearchCatCache2b(CatCache *cache, Datum v1, Datum v2);
+static HeapTuple SearchCatCache3b(CatCache *cache,
+                                  Datum v1, Datum v2, Datum v3);
+static HeapTuple SearchCatCache4b(CatCache *cache,
+                                  Datum v1, Datum v2, Datum v3, Datum v4);
+
 static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys,
                                            Datum v1, Datum v2, Datum v3, Datum v4);
 static uint32 CatalogCacheComputeTupleHashValue(CatCache *cache, int nkeys,
@@ -108,6 +117,16 @@ static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos,
 static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
                              Datum *srckeys, Datum *dstkeys);
 
+static SearchCatCacheFuncsType catcache_base = {
+    SearchCatCacheb,
+    SearchCatCache1b,
+    SearchCatCache2b,
+    SearchCatCache3b,
+    SearchCatCache4b
+};
+
+SearchCatCacheFuncsType *SearchCatCacheFuncs = NULL;
+
 /* GUC assign function */
 void
 assign_catalog_cache_prune_min_age(int newval, void *extra)
@@ -817,6 +836,9 @@ InitCatCache(int id,
         CacheHdr = (CatCacheHeader *) palloc(sizeof(CatCacheHeader));
         slist_init(&CacheHdr->ch_caches);
         CacheHdr->ch_ntup = 0;
+
+        SearchCatCacheFuncs = &catcache_base;
+
 #ifdef CATCACHE_STATS
         /* set up to dump stats at backend exit */
         on_proc_exit(CatCachePrintStats, 0);
@@ -1158,8 +1180,8 @@ IndexScanOK(CatCache *cache, ScanKey cur_skey)
  * the caller need not go to the trouble of converting it to a fully
  * null-padded NAME.
  */
-HeapTuple
-SearchCatCache(CatCache *cache,
+static HeapTuple
+SearchCatCacheb(CatCache *cache,
                Datum v1,
                Datum v2,
                Datum v3,
@@ -1175,32 +1197,32 @@ SearchCatCache(CatCache *cache,
  * bit faster than SearchCatCache().
  */
 
-HeapTuple
-SearchCatCache1(CatCache *cache,
+static HeapTuple
+SearchCatCache1b(CatCache *cache,
                 Datum v1)
 {
     return SearchCatCacheInternal(cache, 1, v1, 0, 0, 0);
 }
 
 
-HeapTuple
-SearchCatCache2(CatCache *cache,
+static HeapTuple
+SearchCatCache2b(CatCache *cache,
                 Datum v1, Datum v2)
 {
     return SearchCatCacheInternal(cache, 2, v1, v2, 0, 0);
 }
 
 
-HeapTuple
-SearchCatCache3(CatCache *cache,
+static HeapTuple
+SearchCatCache3b(CatCache *cache,
                 Datum v1, Datum v2, Datum v3)
 {
     return SearchCatCacheInternal(cache, 3, v1, v2, v3, 0);
 }
 
 
-HeapTuple
-SearchCatCache4(CatCache *cache,
+static HeapTuple
+SearchCatCache4b(CatCache *cache,
                 Datum v1, Datum v2, Datum v3, Datum v4)
 {
     return SearchCatCacheInternal(cache, 4, v1, v2, v3, v4);
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 3d3870f05a..f9e9889339 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -189,6 +189,36 @@ typedef struct catcacheheader
     int            ch_ntup;        /* # of tuples in all caches */
 } CatCacheHeader;
 
+typedef HeapTuple (*SearchCatCache_fn)(CatCache *cache,
+                                       Datum v1, Datum v2, Datum v3, Datum v4);
+typedef HeapTuple (*SearchCatCache1_fn)(CatCache *cache, Datum v1);
+typedef HeapTuple (*SearchCatCache2_fn)(CatCache *cache, Datum v1, Datum v2);
+typedef HeapTuple (*SearchCatCache3_fn)(CatCache *cache, Datum v1, Datum v2,
+                                        Datum v3);
+typedef HeapTuple (*SearchCatCache4_fn)(CatCache *cache,
+                                        Datum v1, Datum v2, Datum v3, Datum v4);
+
+typedef struct SearchCatCacheFuncsType
+{
+    SearchCatCache_fn    SearchCatCache;
+    SearchCatCache1_fn    SearchCatCache1;
+    SearchCatCache2_fn    SearchCatCache2;
+    SearchCatCache3_fn    SearchCatCache3;
+    SearchCatCache4_fn    SearchCatCache4;
+} SearchCatCacheFuncsType;
+
+extern PGDLLIMPORT SearchCatCacheFuncsType *SearchCatCacheFuncs;
+
+#define SearchCatCache(cache, v1, v2, v3, v4) \
+    SearchCatCacheFuncs->SearchCatCache(cache, v1, v2, v3, v4)
+#define SearchCatCache1(cache, v1) \
+    SearchCatCacheFuncs->SearchCatCache1(cache, v1)
+#define SearchCatCache2(cache, v1, v2) \
+    SearchCatCacheFuncs->SearchCatCache2(cache, v1, v2)
+#define SearchCatCache3(cache, v1, v2, v3) \
+    SearchCatCacheFuncs->SearchCatCache3(cache, v1, v2, v3)
+#define SearchCatCache4(cache, v1, v2, v3, v4) \
+    SearchCatCacheFuncs->SearchCatCache4(cache, v1, v2, v3, v4)
 
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
@@ -216,15 +246,15 @@ extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
                               int nbuckets);
 extern void InitCatCachePhase2(CatCache *cache, bool touch_index);
 
-extern HeapTuple SearchCatCache(CatCache *cache,
+extern HeapTuple (*SearchCatCache)(CatCache *cache,
                                 Datum v1, Datum v2, Datum v3, Datum v4);
-extern HeapTuple SearchCatCache1(CatCache *cache,
+extern HeapTuple (*SearchCatCache1)(CatCache *cache,
                                  Datum v1);
-extern HeapTuple SearchCatCache2(CatCache *cache,
+extern HeapTuple (*SearchCatCache2)(CatCache *cache,
                                  Datum v1, Datum v2);
-extern HeapTuple SearchCatCache3(CatCache *cache,
+extern HeapTuple (*SearchCatCache3)(CatCache *cache,
                                  Datum v1, Datum v2, Datum v3);
-extern HeapTuple SearchCatCache4(CatCache *cache,
+extern HeapTuple (*SearchCatCache4)(CatCache *cache,
                                  Datum v1, Datum v2, Datum v3, Datum v4);
 extern void ReleaseCatCache(HeapTuple tuple);
 
-- 
2.23.0

From e348af29c3cae63212dcb3d982e419e53bc86517 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 10 Jan 2020 15:08:54 +0900
Subject: [PATCH v1 3/3] CatCache expiration feature.

This adds the catcache expiration feature to the catcache mechanism.

The current catcache never removes an entry, and there are cases where many
hash entries occupy a large amount of memory without ever being accessed
again. This can be a quite serious issue for long-running sessions.  The
expiration feature keeps process memory usage below a certain amount, in
exchange for some degradation when it is turned on.
---
 src/backend/utils/cache/catcache.c | 343 +++++++++++++++++++++++++++--
 1 file changed, 326 insertions(+), 17 deletions(-)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index a4e3676a89..29bc980d8e 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -72,10 +72,11 @@ static CatCacheHeader *CacheHdr = NULL;
 /* Clock for the last accessed time of a catcache entry. */
 TimestampTz    catcacheclock = 0;
 
-static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
-                                               int nkeys,
-                                               Datum v1, Datum v2,
-                                               Datum v3, Datum v4);
+/* basic catcache search functions */
+static inline HeapTuple SearchCatCacheInternalb(CatCache *cache,
+                                                int nkeys,
+                                                Datum v1, Datum v2,
+                                                Datum v3, Datum v4);
 
 static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache,
                                                 int nkeys,
@@ -93,6 +94,23 @@ static HeapTuple SearchCatCache3b(CatCache *cache,
 static HeapTuple SearchCatCache4b(CatCache *cache,
                                   Datum v1, Datum v2, Datum v3, Datum v4);
 
+/* catcache search functions with expiration feature */
+static inline HeapTuple SearchCatCacheInternale(CatCache *cache,
+                                                int nkeys,
+                                                Datum v1, Datum v2,
+                                                Datum v3, Datum v4);
+
+static HeapTuple SearchCatCachee(CatCache *cache,
+                                 Datum v1, Datum v2, Datum v3, Datum v4);
+static HeapTuple SearchCatCache1e(CatCache *cache, Datum v1);
+static HeapTuple SearchCatCache2e(CatCache *cache, Datum v1, Datum v2);
+static HeapTuple SearchCatCache3e(CatCache *cache,
+                                  Datum v1, Datum v2, Datum v3);
+static HeapTuple SearchCatCache4e(CatCache *cache,
+                                  Datum v1, Datum v2, Datum v3, Datum v4);
+
+static bool CatCacheCleanupOldEntries(CatCache *cp);
+
 static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys,
                                            Datum v1, Datum v2, Datum v3, Datum v4);
 static uint32 CatalogCacheComputeTupleHashValue(CatCache *cache, int nkeys,
@@ -125,13 +143,35 @@ static SearchCatCacheFuncsType catcache_base = {
     SearchCatCache4b
 };
 
+static SearchCatCacheFuncsType catcache_expire = {
+    SearchCatCachee,
+    SearchCatCache1e,
+    SearchCatCache2e,
+    SearchCatCache3e,
+    SearchCatCache4e
+};
+
 SearchCatCacheFuncsType *SearchCatCacheFuncs = NULL;
 
+/* set catcache function set according to guc variables */
+static void
+set_catcache_functions(void)
+{
+    if (catalog_cache_prune_min_age < 0)
+        SearchCatCacheFuncs = &catcache_base;
+    else
+        SearchCatCacheFuncs = &catcache_expire;
+}
+
+
 /* GUC assign function */
 void
 assign_catalog_cache_prune_min_age(int newval, void *extra)
 {
     catalog_cache_prune_min_age = newval;
+
+    /* choose corresponding function set */
+    set_catcache_functions();
 }
 
 /*
@@ -837,7 +877,7 @@ InitCatCache(int id,
         slist_init(&CacheHdr->ch_caches);
         CacheHdr->ch_ntup = 0;
 
-        SearchCatCacheFuncs = &catcache_base;
+        set_catcache_functions();
 
 #ifdef CATCACHE_STATS
         /* set up to dump stats at backend exit */
@@ -900,6 +940,10 @@ RehashCatCache(CatCache *cp)
     int            newnbuckets;
     int            i;
 
+    /* try removing old entries before expanding hash */
+    if (CatCacheCleanupOldEntries(cp))
+        return;
+
     elog(DEBUG1, "rehashing catalog cache id %d for %s; %d tups, %d buckets",
          cp->id, cp->cc_relname, cp->cc_ntup, cp->cc_nbuckets);
 
@@ -1187,7 +1231,7 @@ SearchCatCacheb(CatCache *cache,
                Datum v3,
                Datum v4)
 {
-    return SearchCatCacheInternal(cache, cache->cc_nkeys, v1, v2, v3, v4);
+    return SearchCatCacheInternalb(cache, cache->cc_nkeys, v1, v2, v3, v4);
 }
 
 
@@ -1201,7 +1245,7 @@ static HeapTuple
 SearchCatCache1b(CatCache *cache,
                 Datum v1)
 {
-    return SearchCatCacheInternal(cache, 1, v1, 0, 0, 0);
+    return SearchCatCacheInternalb(cache, 1, v1, 0, 0, 0);
 }
 
 
@@ -1209,7 +1253,7 @@ static HeapTuple
 SearchCatCache2b(CatCache *cache,
                 Datum v1, Datum v2)
 {
-    return SearchCatCacheInternal(cache, 2, v1, v2, 0, 0);
+    return SearchCatCacheInternalb(cache, 2, v1, v2, 0, 0);
 }
 
 
@@ -1217,7 +1261,7 @@ static HeapTuple
 SearchCatCache3b(CatCache *cache,
                 Datum v1, Datum v2, Datum v3)
 {
-    return SearchCatCacheInternal(cache, 3, v1, v2, v3, 0);
+    return SearchCatCacheInternalb(cache, 3, v1, v2, v3, 0);
 }
 
 
@@ -1225,19 +1269,19 @@ static HeapTuple
 SearchCatCache4b(CatCache *cache,
                 Datum v1, Datum v2, Datum v3, Datum v4)
 {
-    return SearchCatCacheInternal(cache, 4, v1, v2, v3, v4);
+    return SearchCatCacheInternalb(cache, 4, v1, v2, v3, v4);
 }
 
 /*
- * Work-horse for SearchCatCache/SearchCatCacheN.
+ * Work-horse for SearchCatCacheb/SearchCatCacheNb.
  */
 static inline HeapTuple
-SearchCatCacheInternal(CatCache *cache,
-                       int nkeys,
-                       Datum v1,
-                       Datum v2,
-                       Datum v3,
-                       Datum v4)
+SearchCatCacheInternalb(CatCache *cache,
+                        int nkeys,
+                        Datum v1,
+                        Datum v2,
+                        Datum v3,
+                        Datum v4)
 {
     Datum        arguments[CATCACHE_MAXKEYS];
     uint32        hashValue;
@@ -1462,6 +1506,269 @@ SearchCatCacheMiss(CatCache *cache,
     return &ct->tuple;
 }
 
+/*
+ *    SearchCatCache with entry pruning
+ *
+ *  These functions work the same way as the SearchCatCacheNb() functions,
+ *  except that less-used entries are removed according to the
+ *  catalog_cache_prune_min_age setting.
+ */
+static HeapTuple
+SearchCatCachee(CatCache *cache,
+               Datum v1,
+               Datum v2,
+               Datum v3,
+               Datum v4)
+{
+    return SearchCatCacheInternale(cache, cache->cc_nkeys, v1, v2, v3, v4);
+}
+
+
+/*
+ * SearchCatCacheN() are SearchCatCache() versions for a specific number of
+ * arguments. The compiler can inline the body and unroll loops, making them a
+ * bit faster than SearchCatCache().
+ */
+
+static HeapTuple
+SearchCatCache1e(CatCache *cache,
+                Datum v1)
+{
+    return SearchCatCacheInternale(cache, 1, v1, 0, 0, 0);
+}
+
+
+static HeapTuple
+SearchCatCache2e(CatCache *cache,
+                Datum v1, Datum v2)
+{
+    return SearchCatCacheInternale(cache, 2, v1, v2, 0, 0);
+}
+
+
+static HeapTuple
+SearchCatCache3e(CatCache *cache,
+                Datum v1, Datum v2, Datum v3)
+{
+    return SearchCatCacheInternale(cache, 3, v1, v2, v3, 0);
+}
+
+
+static HeapTuple
+SearchCatCache4e(CatCache *cache,
+                Datum v1, Datum v2, Datum v3, Datum v4)
+{
+    return SearchCatCacheInternale(cache, 4, v1, v2, v3, v4);
+}
+
+/*
+ * Work-horse for SearchCatCachee/SearchCatCacheNe.
+ */
+static inline HeapTuple
+SearchCatCacheInternale(CatCache *cache,
+                        int nkeys,
+                        Datum v1,
+                        Datum v2,
+                        Datum v3,
+                        Datum v4)
+{
+    Datum        arguments[CATCACHE_MAXKEYS];
+    uint32        hashValue;
+    Index        hashIndex;
+    dlist_iter    iter;
+    dlist_head *bucket;
+    CatCTup    *ct;
+
+    /* Make sure we're in an xact, even if this ends up being a cache hit */
+    Assert(IsTransactionState());
+
+    Assert(cache->cc_nkeys == nkeys);
+
+    /*
+     * one-time startup overhead for each cache
+     */
+    if (unlikely(cache->cc_tupdesc == NULL))
+        CatalogCacheInitializeCache(cache);
+
+#ifdef CATCACHE_STATS
+    cache->cc_searches++;
+#endif
+
+    /* Initialize local parameter array */
+    arguments[0] = v1;
+    arguments[1] = v2;
+    arguments[2] = v3;
+    arguments[3] = v4;
+
+    /*
+     * find the hash bucket in which to look for the tuple
+     */
+    hashValue = CatalogCacheComputeHashValue(cache, nkeys, v1, v2, v3, v4);
+    hashIndex = HASH_INDEX(hashValue, cache->cc_nbuckets);
+
+    /*
+     * scan the hash bucket until we find a match or exhaust our tuples
+     *
+     * Note: it's okay to use dlist_foreach here, even though we modify the
+     * dlist within the loop, because we don't continue the loop afterwards.
+     */
+    bucket = &cache->cc_bucket[hashIndex];
+    dlist_foreach(iter, bucket)
+    {
+        ct = dlist_container(CatCTup, cache_elem, iter.cur);
+
+        if (ct->dead)
+            continue;            /* ignore dead entries */
+
+        if (ct->hash_value != hashValue)
+            continue;            /* quickly skip entry if wrong hash val */
+
+        if (!CatalogCacheCompareTuple(cache, nkeys, ct->keys, arguments))
+            continue;
+
+        /*
+         * We found a match in the cache.  Move it to the front of the list
+         * for its hashbucket, in order to speed subsequent searches.  (The
+         * most frequently accessed elements in any hashbucket will tend to be
+         * near the front of the hashbucket's list.)
+         */
+        dlist_move_head(bucket, &ct->cache_elem);
+
+        /*
+         * Prolong the life of this entry. Since we want to run as few
+         * instructions as possible and want the branch to be predictable for
+         * performance reasons, we don't put a strict cap on the counter. All
+         * values above 1 are regarded as 2 in CatCacheCleanupOldEntries().
+         */
+        ct->naccess++;
+        if (unlikely(ct->naccess == 0))
+            ct->naccess = 2;
+        ct->lastaccess = catcacheclock;
+
+        /*
+         * If it's a positive entry, bump its refcount and return it. If it's
+         * negative, we can report failure to the caller.
+         */
+        if (!ct->negative)
+        {
+            ResourceOwnerEnlargeCatCacheRefs(CurrentResourceOwner);
+            ct->refcount++;
+            ResourceOwnerRememberCatCacheRef(CurrentResourceOwner, &ct->tuple);
+
+            CACHE_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d",
+                       cache->cc_relname, hashIndex);
+
+#ifdef CATCACHE_STATS
+            cache->cc_hits++;
+#endif
+
+            return &ct->tuple;
+        }
+        else
+        {
+            CACHE_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d",
+                       cache->cc_relname, hashIndex);
+
+#ifdef CATCACHE_STATS
+            cache->cc_neg_hits++;
+#endif
+
+            return NULL;
+        }
+    }
+
+    return SearchCatCacheMiss(cache, nkeys, hashValue, hashIndex, v1, v2, v3, v4);
+}
+
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left unused for a long time for various reasons.
+ * Remove such entries to keep the catcache from bloating. The algorithm is
+ * similar to buffer eviction: entries that are accessed several times in a
+ * certain period live longer than those that have been accessed less often
+ * in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int        nremoved = 0;
+    int        i;
+    long    oldest_ts = catcacheclock;
+    long    age;
+    int        us;
+
+    /* Return immediately if disabled */
+    if (catalog_cache_prune_min_age < 0)
+        return false;
+
+    /* Don't scan the hash when we know we don't have prunable entries */
+    TimestampDifference(cp->cc_oldest_ts, catcacheclock, &age, &us);
+    if (age < catalog_cache_prune_min_age)
+        return false;
+
+    /* Scan over the whole hash to find entries to remove */
+    for (i = 0 ; i < cp->cc_nbuckets ; i++)
+    {
+        dlist_mutable_iter    iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+
+            /* Don't remove referenced entries */
+            if (ct->refcount == 0 &&
+                (ct->c_list == NULL || ct->c_list->refcount == 0))
+            {
+                /*
+                 * Calculate the duration from the last access to the
+                 * "current" time.  catcacheclock is updated on a
+                 * per-statement basis and additionally updated periodically
+                 * during a long-running query.
+                 */
+                TimestampDifference(ct->lastaccess, catcacheclock, &age, &us);
+
+                if (age > catalog_cache_prune_min_age)
+                {
+                    /*
+                     * Entries that have not been accessed since the last
+                     * pruning round are removed after that many seconds, and
+                     * their lives are prolonged according to how many times
+                     * they were accessed, up to three times that duration.
+                     * We don't try to shrink the buckets since pruning
+                     * effectively caps catcache expansion in the long term.
+                     */
+                    if (ct->naccess > 2)
+                        ct->naccess = 1;
+                    else if (ct->naccess > 0)
+                        ct->naccess--;
+                    else
+                    {
+                        CatCacheRemoveCTup(cp, ct);
+                        nremoved++;
+
+                        /* don't update oldest_ts by removed entry */
+                        continue;
+                    }
+                }
+            }
+
+            /* update oldest timestamp if the entry remains alive */
+            if (ct->lastaccess < oldest_ts)
+                oldest_ts = ct->lastaccess;
+        }
+    }
+
+    cp->cc_oldest_ts = oldest_ts;
+
+    if (nremoved > 0)
+        elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d",
+             cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved);
+
+    return nremoved > 0;
+}
+
+
 /*
  *    ReleaseCatCache
  *
@@ -1925,6 +2232,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
-- 
2.23.0

From 7a793b13803d5defd6d5154a075d1d4cb6826103 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 14 Nov 2019 19:24:36 +0900
Subject: [PATCH] catcache benchmark extension

Provides the function catcachebench(bench_no int), which runs a
CPU-intensive benchmark on catcache searches. The test tables are created
by a separately provided script.

catcachebench(0): prewarms the catcache with the provided test tables.
catcachebench(1): fetches all attribute stats of all tables.
    This benchmark loads a vast number of unique entries.
    Expiration doesn't work since it runs within a transaction.
catcachebench(2): fetches all attribute stats of one table many times.
    This benchmark repeatedly accesses already-loaded entries.
    Expiration doesn't work since it runs within a transaction.
catcachebench(3): fetches all attribute stats of all tables four times.
    Unlike the other modes, this exercises expiration by forcibly
    updating the reference clock variable every 1000 entries.

At this point, the variables needed for the expiration feature are not
added yet, so SetCatCacheClock is a dummy macro that simply expands to its
parameter.
---
 contrib/catcachebench/Makefile               |  17 +
 contrib/catcachebench/catcachebench--0.0.sql |  14 +
 contrib/catcachebench/catcachebench.c        | 330 +++++++++++++++++++
 contrib/catcachebench/catcachebench.control  |   6 +
 src/backend/utils/cache/catcache.c           |  35 ++
 src/backend/utils/cache/syscache.c           |   2 +-
 src/include/utils/catcache.h                 |   3 +
 7 files changed, 406 insertions(+), 1 deletion(-)
 create mode 100644 contrib/catcachebench/Makefile
 create mode 100644 contrib/catcachebench/catcachebench--0.0.sql
 create mode 100644 contrib/catcachebench/catcachebench.c
 create mode 100644 contrib/catcachebench/catcachebench.control

diff --git a/contrib/catcachebench/Makefile b/contrib/catcachebench/Makefile
new file mode 100644
index 0000000000..0478818b25
--- /dev/null
+++ b/contrib/catcachebench/Makefile
@@ -0,0 +1,17 @@
+MODULE_big = catcachebench
+OBJS = catcachebench.o
+
+EXTENSION = catcachebench
+DATA = catcachebench--0.0.sql
+PGFILEDESC = "catcachebench - benchmark for catcache pruning feature"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/catcachebench
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/catcachebench/catcachebench--0.0.sql b/contrib/catcachebench/catcachebench--0.0.sql
new file mode 100644
index 0000000000..ea9cd62abb
--- /dev/null
+++ b/contrib/catcachebench/catcachebench--0.0.sql
@@ -0,0 +1,14 @@
+/* contrib/catcachebench/catcachebench--0.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION catcachebench" to load this file. \quit
+
+CREATE FUNCTION catcachebench(IN type int)
+RETURNS double precision
+AS 'MODULE_PATHNAME', 'catcachebench'
+LANGUAGE C STRICT VOLATILE;
+
+CREATE FUNCTION catcachereadstats(OUT catid int, OUT reloid oid, OUT searches bigint, OUT hits bigint, OUT neg_hits bigint)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'catcachereadstats'
+LANGUAGE C STRICT VOLATILE;
diff --git a/contrib/catcachebench/catcachebench.c b/contrib/catcachebench/catcachebench.c
new file mode 100644
index 0000000000..b6c2b8f577
--- /dev/null
+++ b/contrib/catcachebench/catcachebench.c
@@ -0,0 +1,330 @@
+/*
+ * catcachebench: test code for cache pruning feature
+ */
+/* #define CATCACHE_STATS */
+#include "postgres.h"
+#include "catalog/pg_type.h"
+#include "catalog/pg_statistic.h"
+#include "executor/spi.h"
+#include "funcapi.h"
+#include "libpq/pqsignal.h"
+#include "utils/catcache.h"
+#include "utils/syscache.h"
+#include "utils/timestamp.h"
+
+Oid        tableoids[10000];
+int        ntables = 0;
+int16    attnums[1000];
+int        natts = 0;
+
+PG_MODULE_MAGIC;
+
+double catcachebench1(void);
+double catcachebench2(void);
+double catcachebench3(void);
+void collectinfo(void);
+void catcachewarmup(void);
+
+PG_FUNCTION_INFO_V1(catcachebench);
+PG_FUNCTION_INFO_V1(catcachereadstats);
+
+extern void CatalogCacheFlushCatalog2(Oid catId);
+extern int64 catcache_called;
+extern CatCache *SysCache[];
+
+typedef struct catcachestatsstate
+{
+    TupleDesc tupd;
+    int          catId;
+} catcachestatsstate;
+
+Datum
+catcachereadstats(PG_FUNCTION_ARGS)
+{
+    catcachestatsstate *state_data = NULL;
+    FuncCallContext *fctx;
+
+    if (SRF_IS_FIRSTCALL())
+    {
+        TupleDesc    tupdesc;
+        MemoryContext mctx;
+
+        fctx = SRF_FIRSTCALL_INIT();
+        mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+        state_data = palloc(sizeof(catcachestatsstate));
+
+        /* Build a tuple descriptor for our result type */
+        if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+            elog(ERROR, "return type must be a row type");
+
+        state_data->tupd = tupdesc;
+        state_data->catId = 0;
+
+        fctx->user_fctx = state_data;
+
+        MemoryContextSwitchTo(mctx);
+    }
+
+    fctx = SRF_PERCALL_SETUP();
+    state_data = fctx->user_fctx;
+
+    if (state_data->catId < SysCacheSize)
+    {
+        Datum    values[5];
+        bool    nulls[5];
+        HeapTuple    resulttup;
+        Datum    result;
+        int        catId = state_data->catId++;
+
+        memset(nulls, 0, sizeof(nulls));
+        memset(values, 0, sizeof(values));
+        values[0] = Int16GetDatum(catId);
+        values[1] = ObjectIdGetDatum(SysCache[catId]->cc_reloid);
+#ifdef CATCACHE_STATS        
+        values[2] = Int64GetDatum(SysCache[catId]->cc_searches);
+        values[3] = Int64GetDatum(SysCache[catId]->cc_hits);
+        values[4] = Int64GetDatum(SysCache[catId]->cc_neg_hits);
+#endif
+        resulttup = heap_form_tuple(state_data->tupd, values, nulls);
+        result = HeapTupleGetDatum(resulttup);
+
+        SRF_RETURN_NEXT(fctx, result);
+    }
+
+    SRF_RETURN_DONE(fctx);
+}
+
+Datum
+catcachebench(PG_FUNCTION_ARGS)
+{
+    int        testtype = PG_GETARG_INT32(0);
+    double    ms;
+
+    collectinfo();
+
+    /* flush the catalog -- safe? don't mind. */
+    CatalogCacheFlushCatalog2(StatisticRelationId);
+
+    switch (testtype)
+    {
+    case 0:
+        catcachewarmup(); /* prewarm of syscatalog */
+        PG_RETURN_NULL();
+    case 1:
+        ms = catcachebench1(); break;
+    case 2:
+        ms = catcachebench2(); break;
+    case 3:
+        ms = catcachebench3(); break;
+    default:
+        elog(ERROR, "Invalid test type: %d", testtype);
+    }
+
+    PG_RETURN_DATUM(Float8GetDatum(ms));
+}
+
+/*
+ * fetch all attribute entries of all tables.
+ */
+double
+catcachebench1(void)
+{
+    int t, a;
+    instr_time    start,
+                duration;
+
+    PG_SETMASK(&BlockSig);
+    INSTR_TIME_SET_CURRENT(start);
+    for (t = 0 ; t < ntables ; t++)
+    {
+        for (a = 0 ; a < natts ; a++)
+        {
+            HeapTuple tup;
+
+            tup = SearchSysCache3(STATRELATTINH,
+                                  ObjectIdGetDatum(tableoids[t]),
+                                  Int16GetDatum(attnums[a]),
+                                  BoolGetDatum(false));
+            /* should be null, but.. */
+            if (HeapTupleIsValid(tup))
+                ReleaseSysCache(tup);
+        }
+    }
+    INSTR_TIME_SET_CURRENT(duration);
+    INSTR_TIME_SUBTRACT(duration, start);
+    PG_SETMASK(&UnBlockSig);
+
+    return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+/*
+ * fetch all attribute entries of a table many times.
+ */
+double
+catcachebench2(void)
+{
+    int t, a;
+    instr_time    start,
+                duration;
+
+    PG_SETMASK(&BlockSig);
+    INSTR_TIME_SET_CURRENT(start);
+    for (t = 0 ; t < 240000 ; t++)
+    {
+        for (a = 0 ; a < natts ; a++)
+        {
+            HeapTuple tup;
+
+            tup = SearchSysCache3(STATRELATTINH,
+                                  ObjectIdGetDatum(tableoids[0]),
+                                  Int16GetDatum(attnums[a]),
+                                  BoolGetDatum(false));
+            /* should be null, but.. */
+            if (HeapTupleIsValid(tup))
+                ReleaseSysCache(tup);
+        }
+    }
+    INSTR_TIME_SET_CURRENT(duration);
+    INSTR_TIME_SUBTRACT(duration, start);
+    PG_SETMASK(&UnBlockSig);
+
+    return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+/*
+ * fetch all attribute entries of all tables several times while letting
+ * expiration happen.
+ */
+double
+catcachebench3(void)
+{
+    const int clock_step = 1000;
+    int i, t, a;
+    instr_time    start,
+                duration;
+
+    PG_SETMASK(&BlockSig);
+    INSTR_TIME_SET_CURRENT(start);
+    for (i = 0 ; i < 4 ; i++)
+    {
+        int ct = clock_step;
+
+        for (t = 0 ; t < ntables ; t++)
+        {
+            /*
+             * catcacheclock is normally updated from the transaction timestamp,
+             * so it needs to be updated by other means for this test to work.
+             * Here I chose to update the clock every 1000 table scans.
+             */
+            if (--ct < 0)
+            {
+                SetCatCacheClock(GetCurrentTimestamp());
+                ct = clock_step;
+            }
+            for (a = 0 ; a < natts ; a++)
+            {
+                HeapTuple tup;
+
+                tup = SearchSysCache3(STATRELATTINH,
+                                      ObjectIdGetDatum(tableoids[t]),
+                                      Int16GetDatum(attnums[a]),
+                                      BoolGetDatum(false));
+                /* should be null, but.. */
+                if (HeapTupleIsValid(tup))
+                    ReleaseSysCache(tup);
+            }
+        }
+    }
+    INSTR_TIME_SET_CURRENT(duration);
+    INSTR_TIME_SUBTRACT(duration, start);
+    PG_SETMASK(&UnBlockSig);
+
+    return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+void
+catcachewarmup(void)
+{
+    int t, a;
+
+    /* load up catalog tables */
+    for (t = 0 ; t < ntables ; t++)
+    {
+        for (a = 0 ; a < natts ; a++)
+        {
+            HeapTuple tup;
+
+            tup = SearchSysCache3(STATRELATTINH,
+                                  ObjectIdGetDatum(tableoids[t]),
+                                  Int16GetDatum(attnums[a]),
+                                  BoolGetDatum(false));
+            /* should be null, but.. */
+            if (HeapTupleIsValid(tup))
+                ReleaseSysCache(tup);
+        }
+    }
+}
+
+void
+collectinfo(void)
+{
+    int ret;
+    Datum    values[10000];
+    bool    nulls[10000];
+    Oid        types0[] = {OIDOID};
+    int i;
+
+    ntables = 0;
+    natts = 0;
+
+    SPI_connect();
+    /* collect target tables */
+    ret = SPI_execute("select oid from pg_class where relnamespace = (select oid from pg_namespace where nspname =
\'test\')",
+                      true, 0);
+    if (ret != SPI_OK_SELECT)
+        elog(ERROR, "Failed 1");
+    if (SPI_processed == 0)
+        elog(ERROR, "no relation found in schema \"test\"");
+    if (SPI_processed > 10000)
+        elog(ERROR, "too many relation found in schema \"test\"");
+
+    for (i = 0 ; i < SPI_processed ; i++)
+    {
+        heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc,
+                          values, nulls);
+        if (nulls[0])
+            elog(ERROR, "Failed 2");
+
+        tableoids[ntables++] = DatumGetObjectId(values[0]);
+    }
+    SPI_finish();
+    elog(DEBUG1, "%d tables found", ntables);
+
+    values[0] = ObjectIdGetDatum(tableoids[0]);
+    nulls[0] = false;
+    SPI_connect();
+    ret = SPI_execute_with_args("select attnum from pg_attribute where attrelid = (select oid from pg_class where oid
=$1)",
 
+                                1, types0, values, NULL, true, 0);
+    if (SPI_processed == 0)
+        elog(ERROR, "no attribute found in table %d", tableoids[0]);
+    if (SPI_processed > 10000)
+        elog(ERROR, "too many relation found in table %d", tableoids[0]);
+    
+    /* collect target attributes. assuming all tables have the same attnums */
+    for (i = 0 ; i < SPI_processed ; i++)
+    {
+        int16 attnum;
+
+        heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc,
+                          values, nulls);
+        if (nulls[0])
+            elog(ERROR, "Failed 3");
+        attnum = DatumGetInt16(values[0]);
+
+        if (attnum > 0)
+            attnums[natts++] = attnum;
+    }
+    SPI_finish();
+    elog(DEBUG1, "%d attributes found", natts);
+}
diff --git a/contrib/catcachebench/catcachebench.control b/contrib/catcachebench/catcachebench.control
new file mode 100644
index 0000000000..3fc9d2e420
--- /dev/null
+++ b/contrib/catcachebench/catcachebench.control
@@ -0,0 +1,6 @@
+# catcachebench
+
+comment = 'benchmark for catcache pruning'
+default_version = '0.0'
+module_pathname = '$libdir/catcachebench'
+relocatable = true
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 0c68c04caa..35e1a07e57 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -814,6 +814,41 @@ CatalogCacheFlushCatalog(Oid catId)
     CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call");
 }
 
+
+/* FUNCTION FOR BENCHMARKING */
+void
+CatalogCacheFlushCatalog2(Oid catId)
+{
+    slist_iter    iter;
+
+    CACHE_elog(DEBUG2, "CatalogCacheFlushCatalog called for %u", catId);
+
+    slist_foreach(iter, &CacheHdr->ch_caches)
+    {
+        CatCache   *cache = slist_container(CatCache, cc_next, iter.cur);
+
+        /* Does this cache store tuples of the target catalog? */
+        if (cache->cc_reloid == catId)
+        {
+            /* Yes, so flush all its contents */
+            ResetCatalogCache(cache);
+
+            /* Tell inval.c to call syscache callbacks for this cache */
+            CallSyscacheCallbacks(cache->id, 0);
+
+            cache->cc_nbuckets = 128;
+            pfree(cache->cc_bucket);
+            cache->cc_bucket = palloc0(128 * sizeof(dlist_head));
+            ereport(DEBUG1,
+                    (errmsg("Catcache reset"),
+                     errhidestmt(true)));
+        }
+    }
+
+    CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call");
+}
+/* END: FUNCTION FOR BENCHMARKING */
+
 /*
  *        InitCatCache
  *
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index 53d9ddf159..1c79a85a8c 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -983,7 +983,7 @@ static const struct cachedesc cacheinfo[] = {
     }
 };
 
-static CatCache *SysCache[SysCacheSize];
+CatCache *SysCache[SysCacheSize];
 
 static bool CacheInitialized = false;
 
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index f9e9889339..dc0ad1a268 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -278,4 +278,7 @@ extern void PrepareToInvalidateCacheTuple(Relation relation,
 extern void PrintCatCacheLeakWarning(HeapTuple tuple);
 extern void PrintCatCacheListLeakWarning(CatCList *list);
 
+/* tentative change to allow benchmark on master branch */
+#define SetCatCacheClock(ts) (ts)
+
 #endif                            /* CATCACHE_H */
-- 
2.23.0

#! /usr/bin/perl
use Expect;

$prune_age = -1;
$warmup_secs = 10;
$interval = 10;

my $exp = Expect->spawn("psql", "postgres")
    or die "cannot execute psql: $!\n";

$exp->log_stdout(0);

#print $exp "set track_catalog_cache_usage_interval to 1000;\n";
#$exp->expect(10, "postgres=#");
print $exp "set catalog_cache_prune_min_age to '${prune_age}s';\n";
$exp->expect(10, "postgres=#");

$starttime = time();
$count = 0;
$mean = 0;
$nexttime = $starttime + $warmup_secs;
$firsttime = 1;


while (1) {
    print $exp "begin; create temp table t1 (a int, b int, c int, d int, e int, f int, g int, h int, i int, j int) on
commitdrop; insert into t1 values (1, 2, 3, 4, 5, 6, 7, 8, 9, 10); select * from t1; commit;\n";
 
    $exp->expect(10, "postgres=#");
    $count++;

    if (time() > $nexttime) {
        if ($firsttime) {
            $count = 0;
            $firsttime = 0;
        }
        elsif ($mean == 0) {
            $mean = $count;
        }
        else
        {
            $mean = $mean * 0.9 + $count * 0.1;
        }

        printf STDERR "%6d : %9.2f\n", $count, $mean if ($mean > 0);

        $count = 0;
        $nexttime += $interval;
    }
}

Re: Protect syscache from bloating with negative cache entries

От
Heikki Linnakangas
Дата:
On 19/11/2019 12:48, Kyotaro Horiguchi wrote:
> 1. Inserting a branch in SearchCatCacheInternal. (CatCache_Pattern_1.patch)
> 
>   This is the most straightforward way to add an alternative feature.
> 
> pattern 1 | 8459.73 |  28.15  # 9% (>> 1%) slower than 7757.58
> pattern 1 | 8504.83 |  55.61
> pattern 1 | 8541.81 |  41.56
> pattern 1 | 8552.20 |  27.99
> master    | 7757.58 |  22.65
> master    | 7801.32 |  20.64
> master    | 7839.57 |  25.28
> master    | 7925.30 |  38.84
> 
>   It's so slow that it cannot be used.

This is very surprising. A branch that's never taken ought to be 
predicted by the CPU's branch-predictor, and be very cheap.

Do we actually need a branch there? If I understand correctly, the point 
is to bump up a usage counter on the catcache entry. You could increment 
the counter unconditionally, even if the feature is not used, and avoid 
the branch that way.

Another thought is to bump up the usage counter in ReleaseCatCache(), 
and only when the refcount reaches zero. That might be somewhat cheaper, 
if it's a common pattern to acquire additional leases on an entry that's 
already referenced.

Yet another thought is to replace 'refcount' with an 'acquirecount' and 
'releasecount'. In SearchCatCacheInternal(), increment acquirecount, and 
in ReleaseCatCache, increment releasecount. When they are equal, the 
entry is not in use. Now you have a counter that gets incremented on 
every access, with the same number of CPU instructions in the hot paths 
as we have today.
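
For illustration, here is a rough, self-contained sketch of that last
idea; the struct and helper names are hypothetical, not from the patch,
and only the two counters are shown:

#include <stdbool.h>
#include <stdint.h>

/* hypothetical stand-in for the relevant CatCTup fields */
typedef struct SketchCatCTup
{
    uint32_t    acquirecount;   /* bumped on every cache hit */
    uint32_t    releasecount;   /* bumped on every release */
} SketchCatCTup;

/* hot path (SearchCatCacheInternal): a single unconditional increment */
static inline void
sketch_on_hit(SketchCatCTup *ct)
{
    ct->acquirecount++;
}

/* release path (ReleaseCatCache): the matching increment */
static inline void
sketch_on_release(SketchCatCTup *ct)
{
    ct->releasecount++;
}

/* the entry is unreferenced exactly when the two counters agree */
static inline bool
sketch_entry_in_use(const SketchCatCTup *ct)
{
    return ct->acquirecount != ct->releasecount;
}

The difference of the two counters plays the role of today's refcount,
while acquirecount alone doubles as the usage counter for eviction
decisions.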

Or maybe there are some other ways we could micro-optimize 
SearchCatCacheInternal(), to buy back the slowdown that this feature 
would add? For example, you could remove the "if (cl->dead) continue;" 
check, if dead entries were kept out of the hash buckets. Or maybe the 
catctup struct could be made slightly smaller somehow, so that it would 
fit more comfortably in a single cache line.

My point is that I don't think we want to complicate the code much for 
this. All the indirection stuff seems over-engineered for this. Let's 
find a way to keep it simple.

- Heikki



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro Horiguchi
Дата:
Thank you for the comment!

First off, I thought I had managed to eliminate the degradation
observed in the previous versions, but a significant degradation (1.1%
slower) is still seen in one case.

Anyway, before sending the new patch, let me just answer the
comments.

At Thu, 5 Nov 2020 11:09:09 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in 
> On 19/11/2019 12:48, Kyotaro Horiguchi wrote:
> > 1. Inserting a branch in
> > SearchCatCacheInternal. (CatCache_Pattern_1.patch)
> >   This is the most straightforward way to add an alternative feature.
> > pattern 1 | 8459.73 |  28.15  # 9% (>> 1%) slower than 7757.58
> > pattern 1 | 8504.83 |  55.61
> > pattern 1 | 8541.81 |  41.56
> > pattern 1 | 8552.20 |  27.99
> > master    | 7757.58 |  22.65
> > master    | 7801.32 |  20.64
> > master    | 7839.57 |  25.28
> > master    | 7925.30 |  38.84
> >   It's so slow that it cannot be used.
> 

> This is very surprising. A branch that's never taken ought to be
> predicted by the CPU's branch-predictor, and be very cheap.

(A) original test patch

I naively thought that the code path was too short to absorb the
degradation from a few additional instructions.  Actually I measured
performance again with the same patch set on the current master and
got more or less the same result.

master 8195.58ms, patched 8817.40 ms: +10.75%

However, I noticed that the additional call was a recursive call, and
the jmp inserted for the recursive call seems to take significant
time. After avoiding the recursive call, the difference dropped to
+0.96% (master 8268.71ms : patched 8348.30ms)

Just the two instructions below are inserted in this case, which looks
reasonable.

  8720ff <+31>:    cmpl   $0xffffffff,0x4ba942(%rip)        # 0xd2ca48 <catalog_cache_prune_min_age>
  872106 <+38>:    jl     0x872240 <SearchCatCache1+352> (call to a function)


(C) inserting bare counter-update code without a branch

> Do we actually need a branch there? If I understand correctly, the
> point is to bump up a usage counter on the catcache entry. You could
> increment the counter unconditionally, even if the feature is not
> used, and avoid the branch that way.

That change causes 4.9% degradation, which is worse than having a
branch.

master 8364.54ms, patched 8666.86ms (+4.9%)

The additional instructions follow.

+ 8721ab <+203>:    mov    0x30(%rbx),%eax  # %eax = ct->naccess
+ 8721ae <+206>:    mov    $0x2,%edx
+ 8721b3 <+211>:    add    $0x1,%eax        # %eax++
+ 8721b6 <+214>:    cmove  %edx,%eax        # if %eax == 0 then %eax = 2
<original code>
+ 8721bf <+223>:    mov    %eax,0x30(%rbx)  # ct->naccess = %eax
+ 8721c2 <+226>:    mov    0x4cfe9f(%rip),%rax        # 0xd42068 <catcacheclock>
+ 8721c9 <+233>:    mov    %rax,0x38(%rbx)  # ct->lastaccess = %rax
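
In C terms, the inserted instructions correspond to roughly the
following branch-free update (ct->naccess, ct->lastaccess and
catcacheclock as in the patch; the wrap-around guard is what the
compiler turns into the cmove). A compilable sketch with stand-in
types:

#include <stdint.h>

typedef int64_t TimestampTz;        /* stand-in for the real typedef */

typedef struct SketchEntry
{
    unsigned int naccess;
    TimestampTz  lastaccess;
} SketchEntry;

/* branch-free access bump matching the instruction sequence above */
static inline void
sketch_bump_access(SketchEntry *ct, TimestampTz clock)
{
    ct->naccess++;
    if (ct->naccess == 0)       /* wrap-around guard, emitted as cmove above */
        ct->naccess = 2;
    ct->lastaccess = clock;
}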


(D) naively branching then updating, again.

Come to think of it, I measured the same thing with a branch again,
specifically the following. (It showed significant degradation before,
as far as I remember.)

  dlist_move_head(bucket, &ct->cache_elem);

+ if (catalog_cache_prune_min_age < -1)  # never be true
+ {
+    (counter update)
+ }

And I had effectively the same numbers from both master and patched.

master 8066.93ms, patched 8052.37ms (-0.18%)

The above branch inserts the same two instructions as (B), just in a
different place, but the result differs for a reason that is unclear to me.

+  8721bb <+203>:    cmpl   $0xffffffff,0x4bb886(%rip)   # <catalog_cache_prune_min_age>
+  8721c2 <+210>:    jl     0x872208 <SearchCatCache1+280>

I'm not sure why, but the patched version beats master by a small
margin.  Anyway, this new result suggests that the compiler might have
gotten smarter than before?


(E) bumping up in ReleaseCatCache() (won't work)

> Another thought is to bump up the usage counter in ReleaseCatCache(),
> and only when the refcount reaches zero. That might be somewhat
> cheaper, if it's a common pattern to acquire additional leases on an
> entry that's already referenced.
> 
> Yet another thought is to replace 'refcount' with an 'acquirecount'
> and 'releasecount'. In SearchCatCacheInternal(), increment
> acquirecount, and in ReleaseCatCache, increment releasecount. When
> they are equal, the entry is not in use. Now you have a counter that
> gets incremented on every access, with the same number of CPU
> instructions in the hot paths as we have today.

These don't work for negative caches, since the corresponding tuples
are never released.


(F) removing less-significant code.

> Or maybe there are some other ways we could micro-optimize
> SearchCatCacheInternal(), to buy back the slowdown that this feature

Yeah, I thought of that in the beginning. (I removed dlist_move_head()
at the time.)  But the most difficult aspect of this approach is that
I cannot tell whether such a modification causes degradation elsewhere
or not.

> would add? For example, you could remove the "if (cl->dead) continue;"
> check, if dead entries were kept out of the hash buckets. Or maybe the
> catctup struct could be made slightly smaller somehow, so that it
> would fit more comfortably in a single cache line.

As a trial, I removed that code and added the ct->naccess code.

master 8187.44ms, patched 8266.74ms (+1.0%)

So the removal decreased the degradation by about 3.9% of the total
time.

> My point is that I don't think we want to complicate the code much for
> this. All the indirection stuff seems over-engineered for this. Let's
> find a way to keep it simple.

Yes, agreed from the bottom of my heart. I aspire to find a simple way
to avoid degradation.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro Horiguchi
Дата:
me> First off, I thought that I managed to eliminate the degradation
me> observed on the previous versions, but significant degradation (1.1%
me> slower) is still seen in one case.

While benchmarking with many patterns, I noticed that calling
CatCacheCleanupOldEntries() slows down catcache search significantly
even when the function does almost nothing.  Oddly enough, the
degradation gets larger if I remove the counter-updating code from
SearchCatCacheInternal. It seems that RehashCatCache is called far more
frequently than I thought, and CatCacheCleanupOldEntries was suffering
from the branch penalty.

The degradation vanished once a likely() was attached to the condition.
On the contrary, the patched version is now consistently slightly
faster than master.
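
To make that concrete, a minimal sketch of the shape of the guard;
likely() is shown with its usual GCC definition from c.h, and the real
code sits at the top of CatCacheCleanupOldEntries() in the attached
patch:

#include <stdbool.h>

#define likely(x)   __builtin_expect((x) != 0, 1)

static int  catalog_cache_prune_min_age = -1;   /* GUC; -1 disables pruning */

/*
 * Called from RehashCatCache() before every hash expansion, so the
 * disabled case must stay essentially free; the likely() keeps the
 * pruning branch predicted not-taken.
 */
static bool
sketch_cleanup_old_entries(void)
{
    if (likely(catalog_cache_prune_min_age < 0))
        return false;

    /* ... scan the buckets and prune stale entries here ... */
    return true;                /* the real code returns nremoved > 0 */
}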

For now, I measured the patch with the three access patterns that
catcachebench was designed for.

         master      patched-off         patched-on(300s)
test 1   3898.18ms   3896.11ms (-0.1%)   3889.44ms (-  0.2%)
test 2   8013.37ms   8098.51ms (+1.1%)   8640.63ms (+  7.8%)
test 3   6146.95ms   6147.91ms (+0.0%)  15466   ms (+152  %)

master     : This patch is not applied.
patched-off: This patch is applied and catalog_cache_prune_min_age = -1
patched-on : This patch is applied and catalog_cache_prune_min_age = 0

test 1: Creates many negative entries in STATRELATTINH
        (expiration doesn't happen)
test 2: Repeatedly fetches several negative entries many times.
test 3: test 1, with expiration happening.

The results look far better, but test 2 still shows a small
degradation... I'll continue investigating it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 9516267f0e2943cf955cbbfe5133c13c36288ee6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 6 Nov 2020 17:27:18 +0900
Subject: [PATCH v4] CatCache expiration feature

---
 src/backend/access/transam/xact.c  |   3 +
 src/backend/utils/cache/catcache.c | 125 +++++++++++++++++++++++++++++
 src/backend/utils/misc/guc.c       |  12 +++
 src/include/utils/catcache.h       |  20 +++++
 4 files changed, 160 insertions(+)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index af6afcebb1..a246fcc4c0 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1086,6 +1086,9 @@ static void
 AtStart_Cache(void)
 {
     AcceptInvalidationMessages();
+
+    if (xactStartTimestamp != 0)
+        SetCatCacheClock(xactStartTimestamp);
 }
 
 /*
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 3613ae5f44..f63224bfd5 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -38,6 +38,7 @@
 #include "utils/rel.h"
 #include "utils/resowner_private.h"
 #include "utils/syscache.h"
+#include "utils/timestamp.h"
 
 
  /* #define CACHEDEBUG */    /* turns DEBUG elogs on */
@@ -60,9 +61,18 @@
 #define CACHE_elog(...)
 #endif
 
+/*
+ * GUC variable defining the minimum age, in seconds, at which entries are
+ * considered for eviction. -1 disables the feature.
+ */
+int catalog_cache_prune_min_age = -1;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Clock for the last accessed time of a catcache entry. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                                                int nkeys,
                                                Datum v1, Datum v2,
@@ -74,6 +84,7 @@ static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache,
                                                 Index hashIndex,
                                                 Datum v1, Datum v2,
                                                 Datum v3, Datum v4);
+static bool CatCacheCleanupOldEntries(CatCache *cp);
 
 static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys,
                                            Datum v1, Datum v2, Datum v3, Datum v4);
@@ -99,6 +110,12 @@ static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos,
 static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
                              Datum *srckeys, Datum *dstkeys);
 
+/* GUC assign function */
+void
+assign_catalog_cache_prune_min_age(int newval, void *extra)
+{
+    catalog_cache_prune_min_age = newval;
+}
 
 /*
  *                    internal support functions
@@ -863,6 +880,10 @@ RehashCatCache(CatCache *cp)
     int            newnbuckets;
     int            i;
 
+    /* try removing old entries before expanding hash */
+    if (CatCacheCleanupOldEntries(cp))
+        return;
+
     elog(DEBUG1, "rehashing catalog cache id %d for %s; %d tups, %d buckets",
          cp->id, cp->cc_relname, cp->cc_ntup, cp->cc_nbuckets);
 
@@ -1264,6 +1285,20 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /*
+         * Prolong the life of this entry. Since we want to run as few
+         * instructions as possible and want the branch to be stable for
+         * performance reasons, we don't put a strict cap on the counter.
+         * All values above 1 are regarded as 2 in CatCacheCleanupOldEntries().
+         */
+        if (unlikely(catalog_cache_prune_min_age >= 0))
+        {
+            ct->naccess++;
+            if (unlikely(ct->naccess == 0))
+                ct->naccess = 2;
+            ct->lastaccess = catcacheclock;
+        }
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1425,6 +1460,94 @@ SearchCatCacheMiss(CatCache *cache,
     return &ct->tuple;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left unused for a long time for several reasons.
+ * Remove such entries to prevent the catcache from bloating. The algorithm is
+ * similar to buffer eviction: entries that are accessed several times in a
+ * certain period live longer than those that are accessed less often in the
+ * same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int        nremoved = 0;
+    int        i;
+    long    oldest_ts = catcacheclock;
+    long    age;
+    int        us;
+
+    /* Return immediately if disabled */
+    if (likely(catalog_cache_prune_min_age < 0))
+        return false;
+
+    /* Don't scan the hash when we know we don't have prunable entries */
+    TimestampDifference(cp->cc_oldest_ts, catcacheclock, &age, &us);
+    if (age < catalog_cache_prune_min_age)
+        return false;
+
+    /* Scan over the whole hash to find entries to remove */
+    for (i = 0 ; i < cp->cc_nbuckets ; i++)
+    {
+        dlist_mutable_iter    iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+
+            /* Don't remove referenced entries */
+            if (ct->refcount == 0 &&
+                (ct->c_list == NULL || ct->c_list->refcount == 0))
+            {
+                /*
+                 * Calculate the duration from the last access to the
+                 * "current" time. catcacheclock is updated on a per-statement
+                 * basis and additionally updated periodically during a
+                 * long-running query.
+                 */
+                TimestampDifference(ct->lastaccess, catcacheclock, &age, &us);
+
+                if (age > catalog_cache_prune_min_age)
+                {
+                    /*
+                     * Entries that have not been accessed since the last
+                     * pruning are removed in this pass, while the lives of
+                     * other entries are prolonged according to how many times
+                     * they have been accessed, up to three times the duration.
+                     * We don't try to shrink the buckets since pruning
+                     * effectively caps catcache expansion in the long term.
+                     */
+                    if (ct->naccess > 2)
+                        ct->naccess = 1;
+                    else if (ct->naccess > 0)
+                        ct->naccess--;
+                    else
+                    {
+                        CatCacheRemoveCTup(cp, ct);
+                        nremoved++;
+
+                        /* don't update oldest_ts by removed entry */
+                        continue;
+                    }
+                }
+            }
+
+            /* update oldest timestamp if the entry remains alive */
+            if (ct->lastaccess < oldest_ts)
+                oldest_ts = ct->lastaccess;
+        }
+    }
+
+    cp->cc_oldest_ts = oldest_ts;
+
+    if (nremoved > 0)
+        elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d",
+             cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved);
+
+    return nremoved > 0;
+}
+
 /*
  *    ReleaseCatCache
  *
@@ -1888,6 +2011,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 0;
+    ct->lastaccess = catcacheclock;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a62d64eaa4..ca897cab2e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -88,6 +88,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/float.h"
 #include "utils/guc_tables.h"
 #include "utils/memutils.h"
@@ -3399,6 +3400,17 @@ static struct config_int ConfigureNamesInt[] =
         check_huge_page_size, NULL, NULL
     },
 
+    {
+        {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("System catalog cache entries that are living unused more than this seconds are considered
forremoval."),
 
+            gettext_noop("The value of -1 turns off pruning."),
+            GUC_UNIT_S
+        },
+        &catalog_cache_prune_min_age,
+        -1, -1, INT_MAX,
+        NULL, assign_catalog_cache_prune_min_age, NULL
+    },
+
     /* End-of-list marker */
     {
         {NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index f4aa316604..a11736f767 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,7 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
+    TimestampTz    cc_oldest_ts;    /* timestamp of the oldest tuple in the hash */
 
     /*
      * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,6 +121,8 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
+    unsigned int naccess;        /* # of access to this entry */
+    TimestampTz    lastaccess;        /* timestamp of the last usage */
 
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
@@ -189,6 +193,22 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int catalog_cache_prune_min_age;
+
+/* source clock for access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/* SetCatCacheClock - set catcache timestamp source clock */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
+extern void assign_catalog_cache_prune_min_age(int newval, void *extra);
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.18.4


Re: Protect syscache from bloating with negative cache entries

От
Heikki Linnakangas
Дата:
On 06/11/2020 10:24, Kyotaro Horiguchi wrote:
> Thank you for the comment!
> 
> First off, I thought that I managed to eliminate the degradation
> observed on the previous versions, but significant degradation (1.1%
> slower) is still seen in one case.

One thing to keep in mind with micro-benchmarks like this is that even 
completely unrelated code changes can change the layout of the code in 
memory, which in turn can affect CPU caching effects in surprising ways. 
If you're lucky, you can see 1-5% differences just by adding a function 
that's never called, for example, if it happens to move other code in 
memory so that some hot codepath or struct gets split across CPU cache 
lines. It can be infuriating when benchmarking.

> At Thu, 5 Nov 2020 11:09:09 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
> (A) original test patch
> 
> I naively thought that the code path is too short to bury the
> degradation of additional a few instructions.  Actually I measured
> performance again with the same patch set on the current master and
> had the more or less the same result.
> 
> master 8195.58ms, patched 8817.40 ms: +10.75%
> 
> However, I noticed that the additional call was a recursive call and a
> jmp inserted for the recursive call seems taking significant
> time. After avoiding the recursive call, the difference reduced to
> +0.96% (master 8268.71ms : patched 8348.30ms)
> 
> Just two instructions below are inserted in this case, which looks
> reasonable.
> 
>    8720ff <+31>:    cmpl   $0xffffffff,0x4ba942(%rip)        # 0xd2ca48 <catalog_cache_prune_min_age>
>    872106 <+38>:    jl     0x872240 <SearchCatCache1+352> (call to a function)

That's interesting. I think a 1% degradation would be acceptable.

I think we'd like to enable this feature by default though, so the 
performance when it's enabled is also very important.

> (C) inserting bare counter-update code without a branch
> 
>> Do we actually need a branch there? If I understand correctly, the
>> point is to bump up a usage counter on the catcache entry. You could
>> increment the counter unconditionally, even if the feature is not
>> used, and avoid the branch that way.
> 
> That change causes 4.9% degradation, which is worse than having a
> branch.
> 
> master 8364.54ms, patched 8666.86ms (+4.9%)
> 
> The additional instructions follow.
> 
> + 8721ab <+203>:    mov    0x30(%rbx),%eax  # %eax = ct->naccess
> + 8721ae <+206>:    mov    $0x2,%edx
> + 8721b3 <+211>:    add    $0x1,%eax        # %eax++
> + 8721b6 <+214>:    cmove  %edx,%eax        # if %eax == 0 then %eax = 2
> <original code>
> + 8721bf <+223>:    mov    %eax,0x30(%rbx)  # ct->naccess = %eax
> + 8721c2 <+226>:    mov    0x4cfe9f(%rip),%rax        # 0xd42068 <catcacheclock>
> + 8721c9 <+233>:    mov    %rax,0x38(%rbx)  # ct->lastaccess = %rax

Do you need the "ntaccess == 2" test? You could always increment the 
counter, and in the code that uses ntaccess to decide what to evict, 
treat all values >= 2 the same.

Need to handle integer overflow somehow. Or maybe not: integer overflow 
is so infrequent that even if a hot syscache entry gets evicted 
prematurely because its ntaccess count wrapped around to 0, it will 
happen so rarely that it won't make any difference in practice.
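
A minimal sketch of that approach, with the naccess field as in the
patch, hypothetical helper names, and overflow simply tolerated as
described:

#include <stdbool.h>

typedef struct SketchCTup
{
    unsigned int naccess;       /* free-running counter, overflow tolerated */
} SketchCTup;

/* hot path: no test at all, just the increment */
static inline void
sketch_note_access(SketchCTup *ct)
{
    ct->naccess++;
}

/*
 * pruning path: clamp so that every value >= 2 is treated alike, spend
 * one "life" per pruning pass, and report eviction once none are left
 */
static inline bool
sketch_spend_life(SketchCTup *ct)
{
    unsigned int n = ct->naccess;

    if (n > 2)
        n = 2;
    if (n > 0)
        n--;
    ct->naccess = n;
    return n == 0;
}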

- Heikki



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro Horiguchi
Дата:
At Fri, 6 Nov 2020 10:42:15 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in 
> On 06/11/2020 10:24, Kyotaro Horiguchi wrote:
> > Thank you for the comment!
> > First off, I thought that I managed to eliminate the degradation
> > observed on the previous versions, but significant degradation (1.1%
> > slower) is still seen in on case.
> 
> One thing to keep in mind with micro-benchmarks like this is that even
> completely unrelated code changes can change the layout of the code in
> memory, which in turn can affect CPU caching affects in surprising
> ways. If you're lucky, you can see 1-5% differences just by adding a
> function that's never called, for example, if it happens to move other
> code in memory so that a some hot codepath or struct gets split across
> CPU cache lines. It can be infuriating when benchmarking.

True.  I sometimes had to make distclean to stabilize such benchmarks..

> > At Thu, 5 Nov 2020 11:09:09 +0200, Heikki Linnakangas
> > <hlinnaka@iki.fi> wrote in
> > (A) original test patch
> > I naively thought that the code path is too short to bury the
> > degradation of additional a few instructions.  Actually I measured
> > performance again with the same patch set on the current master and
> > had the more or less the same result.
> > master 8195.58ms, patched 8817.40 ms: +10.75%
> > However, I noticed that the additional call was a recursive call and a
> > jmp inserted for the recursive call seems taking significant
> > time. After avoiding the recursive call, the difference reduced to
> > +0.96% (master 8268.71ms : patched 8348.30ms)
> > Just two instructions below are inserted in this case, which looks
> > reasonable.
> >    8720ff <+31>: cmpl $0xffffffff,0x4ba942(%rip) # 0xd2ca48
> >    <catalog_cache_prune_min_age>
> >    872106 <+38>: jl 0x872240 <SearchCatCache1+352> (call to a function)
> 
> That's interesting. I think a 1% degradation would be acceptable.
> 
> I think we'd like to enable this feature by default though, so the
> performance when it's enabled is also very important.
> 
> > (C) inserting bare counter-update code without a branch
> > 
> >> Do we actually need a branch there? If I understand correctly, the
> >> point is to bump up a usage counter on the catcache entry. You could
> >> increment the counter unconditionally, even if the feature is not
> >> used, and avoid the branch that way.
> > That change causes 4.9% degradation, which is worse than having a
> > branch.
> > master 8364.54ms, patched 8666.86ms (+4.9%)
> > The additional instructions follow.
> > + 8721ab <+203>:    mov    0x30(%rbx),%eax  # %eax = ct->naccess
> > + 8721ae <+206>:    mov    $0x2,%edx
> > + 8721b3 <+211>:    add    $0x1,%eax        # %eax++
> > + 8721b6 <+214>: cmove %edx,%eax # if %eax == 0 then %eax = 2
> > <original code>
> > + 8721bf <+223>:    mov    %eax,0x30(%rbx)  # ct->naccess = %eax
> > + 8721c2 <+226>: mov 0x4cfe9f(%rip),%rax # 0xd42068 <catcacheclock>
> > + 8721c9 <+233>:    mov    %rax,0x38(%rbx)  # ct->lastaccess = %rax
> 
> Do you need the "ntaccess == 2" test? You could always increment the
> counter, and in the code that uses ntaccess to decide what to evict,
> treat all values >= 2 the same.
> 
> Need to handle integer overflow somehow. Or maybe not: integer
> overflow is so infrequent that even if a hot syscache entry gets
> evicted prematurely because its ntaccess count wrapped around to 0, it
> will happen so rarely that it won't make any difference in practice.

Agreed. OK, I had prioritized completely avoiding degradation on the
normal path, but relaxing that restriction to 1% or so makes the code
far simpler and makes the expiration path significantly faster.

Now the branch for the counter increment is removed.  For the similar
branches on the counter-decrement side in CatCacheCleanupOldEntries(),
Min() is compiled into cmovbe and a branch was removed.





Re: Protect syscache from bloating with negative cache entries

От
Kyotaro Horiguchi
Дата:
At Mon, 09 Nov 2020 11:13:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> Now the branch for counter-increment is removed.  For similar
> branches for counter-decrement side in CatCacheCleanupOldEntries(),
> Min() is compiled into cmovbe and a branch was removed.

Mmm. Sorry, I sent this by mistake. Please ignore it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro Horiguchi
Дата:
At Fri, 6 Nov 2020 10:42:15 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in 
> Do you need the "ntaccess == 2" test? You could always increment the
> counter, and in the code that uses ntaccess to decide what to evict,
> treat all values >= 2 the same.
> 
> Need to handle integer overflow somehow. Or maybe not: integer
> overflow is so infrequent that even if a hot syscache entry gets
> evicted prematurely because its ntaccess count wrapped around to 0, it
> will happen so rarely that it won't make any difference in practice.

That relaxation simplifies the code significantly, but a degradation
of about 5% still exists.

(SearchCatCacheInternal())
 +     ct->naccess++;
!+     ct->lastaccess = catcacheclock;

If I remove the second line above, the degradation disappears
(-0.7%). However, I can't find the corresponding numbers in the perf
output. The sum of the numbers for the removed instructions is only
(0.02 + 0.28 = 0.3%).  I understand that the overall degradation
doesn't always show up in instruction-level profiling, but I'm stuck
here, anyway.

     % samples
master  p2    patched    (p2 = patched without "ct->lastaccess = catcacheclock")
=============================================================================
 0.47 | 0.27 |  0.17 |       mov   %rbx,0x8(%rbp)
      |      |       |     SearchCatCacheInternal():
      |      |       |     ct->naccess++;
      |      |       |     ct->lastaccess = catcacheclock;
----- |----- |  0.02 |10f:   mov   catcacheclock,%rax
      |      |       |     ct->naccess++;
----- | 0.96 |  1.00 |       addl  $0x1,0x14(%rbx)
      |      |       |     return NULL;
----- | 0.11 |  0.16 |       xor   %ebp,%ebp
      |      |       |     if (!ct->negative)
 0.27 | 0.30 |  0.03 |       cmpb  $0x0,0x21(%rbx)
      |      |       |     ct->lastaccess = catcacheclock;
----- | ---- |  0.28 |       mov   %rax,0x18(%rbx)
      |      |       |     if (!ct->negative)
 0.34 | 0.08 |  0.59 |     ↓ jne   149




For your information, the same table for a bit wider range follows.

     % samples
master  p2    patched    (p2 = patched without "ct->lastaccess = catcacheclock")
=============================================================================
      |      |       |     dlist_foreach(iter, bucket)
 6.91 | 7.06 |  5.89 |       mov   0x8(%rbp),%rbx
 0.78 | 0.73 |  0.81 |       test  %rbx,%rbx
      |      |       |     ↓ je    160
      |      |       |       cmp   %rbx,%rbp
 0.46 | 0.52 |  0.39 |     ↓ jne   9d
      |      |       |     ↓ jmpq  160
      |      |       |       nop
 5.68 | 5.54 |  6.03 | 90:   mov   0x8(%rbx),%rbx
 1.44 | 1.42 |  1.43 |       cmp   %rbx,%rbp
      |      |       |     ↓ je    160
      |      |       |     {
      |      |       |     ct = dlist_container(CatCTup, cache_elem, iter.cur);
      |      |       |
      |      |       |     if (ct->dead)
30.36 |30.97 | 31.48 | 9d:   cmpb  $0x0,0x20(%rbx)
 2.63 | 2.60 |  2.69 |     ↑ jne   90
      |      |       |     continue;                       /* ignore dead entries */
      |      |       |
      |      |       |     if (ct->hash_value != hashValue)
 1.41 | 1.37 |  1.35 |       cmp   -0x24(%rbx),%edx
 3.19 | 2.97 |  2.87 |     ↑ jne   90
 7.17 | 5.53 |  6.89 |       mov   %r13,%rsi
 0.02 | 0.04 |  0.04 |       xor   %r12d,%r12d
 3.00 | 2.98 |  2.95 |     ↓ jmp   b5
 0.15 | 0.61 |  0.20 | b0:   mov   0x10(%rsp,%r12,1),%rsi
 6.58 | 5.04 |  5.95 | b5:   mov   %ecx,0xc(%rsp)
      |      |       |     CatalogCacheCompareTuple():
      |      |       |     if (!(cc_fastequal[i]) (cachekeys[i], searchkeys[i]))
 1.51 | 0.92 |  1.66 |       mov   -0x20(%rbx,%r12,1),%rdi
 0.54 | 1.64 |  0.58 |       mov   %edx,0x8(%rsp)
 3.78 | 3.11 |  3.86 |     → callq *0x38(%r14,%r12,1)
 0.43 | 2.30 |  0.34 |       mov   0x8(%rsp),%edx
 0.20 | 0.94 |  0.25 |       mov   0xc(%rsp),%ecx
 0.44 | 0.41 |  0.44 |       test  %al,%al
      |      |       |     ↑ je    90
      |      |       |     for (i = 0; i < nkeys; i++)
 2.28 | 1.07 |  2.26 |       add   $0x8,%r12
 0.08 | 0.23 |  0.07 |       cmp   $0x18,%r12
 0.11 | 0.64 |  0.10 |     ↑ jne   b0
      |      |       |     dlist_move_head():
      |      |       |     */
      |      |       |     static inline void
      |      |       |     dlist_move_head(dlist_head *head, dlist_node *node)
      |      |       |     {
      |      |       |     /* fast path if it's already at the head */
      |      |       |     if (head->head.next == node)
 0.08 | 0.61 |  0.04 |       cmp   0x8(%rbp),%rbx
 0.02 | 0.10 |  0.00 |     ↓ je    10f
      |      |       |     return;
      |      |       |
      |      |       |     dlist_delete(node);
 0.01 | 0.20 |  0.06 |       mov   0x8(%rbx),%rax
      |      |       |     dlist_delete():
      |      |       |     node->prev->next = node->next;
 0.75 | 0.13 |  0.72 |       mov   (%rbx),%rdx
 2.89 | 3.42 |  2.22 |       mov   %rax,0x8(%rdx)
      |      |       |     node->next->prev = node->prev;
 0.01 | 0.09 |  0.00 |       mov   (%rbx),%rdx
 0.04 | 0.62 |  0.58 |       mov   %rdx,(%rax)
      |      |       |     dlist_push_head():
      |      |       |     if (head->head.next == NULL)    /* convert NULL header to circular */
 0.31 | 0.08 |  0.28 |       mov   0x8(%rbp),%rax
 0.55 | 0.44 |  0.28 |       test  %rax,%rax
      |      |       |     ↓ je    180
      |      |       |     node->next = head->head.next;
 0.00 | 0.08 |  0.06 |101:   mov   %rax,0x8(%rbx)
      |      |       |     node->prev = &head->head;
 0.17 | 0.73 |  0.37 |       mov   %rbp,(%rbx)
      |      |       |     node->next->prev = node;
 0.34 | 0.08 |  1.13 |       mov   %rbx,(%rax)
      |      |       |     head->head.next = node;
 0.47 | 0.27 |  0.17 |       mov   %rbx,0x8(%rbp)
      |      |       |     SearchCatCacheInternal():
      |      |       |     ct->naccess++;
      |      |       |     ct->lastaccess = catcacheclock;
----- |----- |  0.02 |10f:   mov   catcacheclock,%rax
      |      |       |     ct->naccess++;
----- | 0.96 |  1.00 |       addl  $0x1,0x14(%rbx)
      |      |       |     return NULL;
----- | 0.11 |  0.16 |       xor   %ebp,%ebp
      |      |       |     if (!ct->negative)
 0.27 | 0.30 |  0.03 |       cmpb  $0x0,0x21(%rbx)
      |      |       |     ct->lastaccess = catcacheclock;
----- | ---- |  0.28 |       mov   %rax,0x18(%rbx)
      |      |       |     if (!ct->negative)
 0.34 | 0.08 |  0.59 |     ↓ jne   149

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 498a55ff07f19646ca09034dfdc4c68459a74855 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 6 Nov 2020 17:27:18 +0900
Subject: [PATCH v5] CatCache expiration feature

---
 src/backend/access/transam/xact.c  |   3 +
 src/backend/utils/cache/catcache.c | 118 +++++++++++++++++++++++++++++
 src/backend/utils/misc/guc.c       |  12 +++
 src/include/utils/catcache.h       |  20 +++++
 4 files changed, 153 insertions(+)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index af6afcebb1..a246fcc4c0 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1086,6 +1086,9 @@ static void
 AtStart_Cache(void)
 {
     AcceptInvalidationMessages();
+
+    if (xactStartTimestamp != 0)
+        SetCatCacheClock(xactStartTimestamp);
 }
 
 /*
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 3613ae5f44..b457fed7ab 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -38,6 +38,7 @@
 #include "utils/rel.h"
 #include "utils/resowner_private.h"
 #include "utils/syscache.h"
+#include "utils/timestamp.h"
 
 
  /* #define CACHEDEBUG */    /* turns DEBUG elogs on */
@@ -60,9 +61,18 @@
 #define CACHE_elog(...)
 #endif
 
+/*
+ * GUC variable defining the minimum age, in seconds, at which entries are
+ * considered for eviction. -1 disables the feature.
+ */
+int catalog_cache_prune_min_age = -1;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Clock for the last accessed time of a catcache entry. */
+TimestampTz    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                                                int nkeys,
                                                Datum v1, Datum v2,
@@ -74,6 +84,7 @@ static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache,
                                                 Index hashIndex,
                                                 Datum v1, Datum v2,
                                                 Datum v3, Datum v4);
+static bool CatCacheCleanupOldEntries(CatCache *cp);
 
 static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys,
                                            Datum v1, Datum v2, Datum v3, Datum v4);
@@ -99,6 +110,12 @@ static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos,
 static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
                              Datum *srckeys, Datum *dstkeys);
 
+/* GUC assign function */
+void
+assign_catalog_cache_prune_min_age(int newval, void *extra)
+{
+    catalog_cache_prune_min_age = newval;
+}
 
 /*
  *                    internal support functions
@@ -863,6 +880,10 @@ RehashCatCache(CatCache *cp)
     int            newnbuckets;
     int            i;
 
+    /* try removing old entries before expanding hash */
+    if (CatCacheCleanupOldEntries(cp))
+        return;
+
     elog(DEBUG1, "rehashing catalog cache id %d for %s; %d tups, %d buckets",
          cp->id, cp->cc_relname, cp->cc_ntup, cp->cc_nbuckets);
 
@@ -1264,6 +1285,16 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /*
+         * Prolong the life of this entry. Since we want to run as few
+         * instructions as possible and want the branch to be stable for
+         * performance reasons, we don't care about wrap-around or possible
+         * false negatives for old entries. The window is quite narrow and
+         * the counter doesn't get very large while expiration is active.
+         */
+        ct->naccess++;
+        ct->lastaccess = catcacheclock;
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1425,6 +1456,91 @@ SearchCatCacheMiss(CatCache *cache,
     return &ct->tuple;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left unused for a long time for several reasons.
+ * Remove such entries to prevent the catcache from bloating. The algorithm is
+ * similar to buffer eviction: entries that are accessed several times in a
+ * certain period live longer than those that are accessed less often in the
+ * same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int        nremoved = 0;
+    int        i;
+    long    oldest_ts = catcacheclock;
+    long    age;
+    int        us;
+
+    /* Return immediately if disabled */
+    if (likely(catalog_cache_prune_min_age < 0))
+        return false;
+
+    /* Don't scan the hash when we know we don't have prunable entries */
+    TimestampDifference(cp->cc_oldest_ts, catcacheclock, &age, &us);
+    if (age < catalog_cache_prune_min_age)
+        return false;
+
+    /* Scan over the whole hash to find entries to remove */
+    for (i = 0 ; i < cp->cc_nbuckets ; i++)
+    {
+        dlist_mutable_iter    iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+
+            /* Don't remove referenced entries */
+            if (ct->refcount == 0 &&
+                (ct->c_list == NULL || ct->c_list->refcount == 0))
+            {
+                /*
+                 * Calculate the duration from the last access to the
+                 * "current" time. catcacheclock is updated on a per-statement
+                 * basis and additionally updated periodically during a
+                 * long-running query.
+                 */
+                TimestampDifference(ct->lastaccess, catcacheclock, &age, &us);
+
+                if (age > catalog_cache_prune_min_age)
+                {
+                    /*
+                     * Entries that have not been accessed since the last
+                     * pruning are removed in this pass, while the lives of
+                     * other entries are prolonged according to how many times
+                     * they have been accessed, up to three times the duration.
+                     * We don't try to shrink the buckets since pruning
+                     * effectively caps catcache expansion in the long term.
+                     */
+                    ct->naccess = Min(2, ct->naccess);
+                    if (--ct->naccess == 0)
+                    {
+                        CatCacheRemoveCTup(cp, ct);
+                        nremoved++;
+
+                        /* don't update oldest_ts by removed entry */
+                        continue;
+                    }
+                }
+            }
+
+            /* update oldest timestamp if the entry remains alive */
+            if (ct->lastaccess < oldest_ts)
+                oldest_ts = ct->lastaccess;
+        }
+    }
+
+    cp->cc_oldest_ts = oldest_ts;
+
+    if (nremoved > 0)
+        elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d",
+             cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved);
+
+    return nremoved > 0;
+}
+
 /*
  *    ReleaseCatCache
  *
@@ -1888,6 +2004,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->naccess = 1;
+    ct->lastaccess = catcacheclock;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index bb34630e8e..95213853aa 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -88,6 +88,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/float.h"
 #include "utils/guc_tables.h"
 #include "utils/memutils.h"
@@ -3399,6 +3400,17 @@ static struct config_int ConfigureNamesInt[] =
         check_huge_page_size, NULL, NULL
     },
 
+    {
+        {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("System catalog cache entries that are living unused more than this seconds are considered
forremoval."),
 
+            gettext_noop("The value of -1 turns off pruning."),
+            GUC_UNIT_S
+        },
+        &catalog_cache_prune_min_age,
+        -1, -1, INT_MAX,
+        NULL, assign_catalog_cache_prune_min_age, NULL
+    },
+
     /* End-of-list marker */
     {
         {NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index f4aa316604..a11736f767 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,7 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
+    TimestampTz    cc_oldest_ts;    /* timestamp of the oldest tuple in the hash */
 
     /*
      * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,6 +121,8 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
+    unsigned int naccess;        /* # of access to this entry */
+    TimestampTz    lastaccess;        /* timestamp of the last usage */
 
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
@@ -189,6 +193,22 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int catalog_cache_prune_min_age;
+
+/* source clock for access timestamp of catcache entries */
+extern TimestampTz catcacheclock;
+
+/* SetCatCacheClock - set catcache timestamp source clodk */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = ts;
+}
+
+extern void assign_catalog_cache_prune_min_age(int newval, void *extra);
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.18.4


Re: Protect syscache from bloating with negative cache entries

От
Heikki Linnakangas
Дата:
On 09/11/2020 11:34, Kyotaro Horiguchi wrote:
> At Fri, 6 Nov 2020 10:42:15 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
>> Do you need the "ntaccess == 2" test? You could always increment the
>> counter, and in the code that uses ntaccess to decide what to evict,
>> treat all values >= 2 the same.
>>
>> Need to handle integer overflow somehow. Or maybe not: integer
>> overflow is so infrequent that even if a hot syscache entry gets
>> evicted prematurely because its ntaccess count wrapped around to 0, it
>> will happen so rarely that it won't make any difference in practice.
> 
> That relaxing simplifies the code significantly, but a significant
> degradation by about 5% still exists.
> 
> (SearchCatCacheInternal())
>   +     ct->naccess++;
> !+     ct->lastaccess = catcacheclock;
> 
> If I removed the second line above, the degradation disappears
> (-0.7%).

0.7% degradation is probably acceptable.

> However, I don't find the corresponding numbers in the output
> of perf. The sum of the numbers for the removed instructions is (0.02
> + 0.28 = 0.3%).  I don't think the degradation as the whole doesn't
> always reflect to the instruction level profiling, but I'm stuck here,
> anyway.

Hmm. Some kind of cache miss effect, perhaps? offsetof(CatCTup, tuple) 
is exactly 64 bytes currently, so any fields that you add after 'tuple' 
will go on a different cache line. Maybe it would help if you just move 
the new fields before 'tuple'.

Making CatCTup smaller might help. Some ideas/observations:

- The 'ct_magic' field is only used for assertion checks. Could remove it.

- 4 Datums (32 bytes) are allocated for the keys, even though most 
catcaches have fewer key columns.

- In the current syscaches, keys[2] and keys[3] are only used to store 
32-bit oids or some other smaller fields. Allocating a full 64-bit Datum 
for them wastes memory.

- You could move the dead flag at the end of the struct or remove it 
altogether, with the change I mentioned earlier to not keep dead items 
in the buckets

- You could steal a few bit for dead/negative flags from some other 
field. Use special values for tuple.t_len for them or something.

With some of these tricks, you could shrink CatCTup so that the new 
lastaccess and naccess fields would fit in the same cacheline.
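As an illustration only (not in any of the posted patches), once such a
rearrangement is done the goal could be pinned down with a compile-time
check, assuming StaticAssertDecl and offsetof as provided by c.h:

/*
 * Hypothetical guard near the CatCTup definition in catcache.h: after
 * dropping ct_magic and moving lastaccess (and naccess, if it is kept)
 * ahead of 'tuple', everything a bucket scan touches should still fit
 * in one 64-byte cache line.
 */
StaticAssertDecl(offsetof(CatCTup, tuple) <= 64,
                 "hot part of CatCTup no longer fits in one cache line");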

That said, I think this is good enough performance-wise as it is. So if 
we want to improve performance in general, that can be a separate patch.

- Heikki



Re: Protect syscache from bloating with negative cache entries

От
Robert Haas
Дата:
On Tue, Nov 17, 2020 at 10:46 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> 0.7% degradation is probably acceptable.

I haven't looked at this patch in a while and I'm pleased with the way
it seems to have been redesigned. It seems relatively simple and
unlikely to cause big headaches. I would say that 0.7% is probably not
acceptable on a general workload, but it seems fine on a benchmark
that is specifically designed to be a worst-case for this patch, which
I gather is what's happening here. I think it would be nice if we
could enable this feature by default. Does it cause a measurable
regression on realistic workloads when enabled? I bet a default of 5
or 10 minutes would help many users.

One idea for improving things might be to move the "return
immediately" tests in CatCacheCleanupOldEntries() to the caller, and
only call this function if they indicate that there is some purpose.
This would avoid the function call overhead when nothing can be done.
Perhaps the two tests could be combined into one and simplified. Like,
suppose the code looks (roughly) like this:

if (catcacheclock >= time_at_which_we_can_prune)
    CatCacheCleanupOldEntries(...);

To make it that simple, we want catcacheclock and
time_at_which_we_can_prune to be stored as bare uint64 quantities so
we don't need TimestampDifference(). And we want
time_at_which_we_can_prune to be set to PG_UINT64_MAX when the feature
is disabled. But those both seem like pretty achievable things... and
it seems like the result would probably be faster than what you have
now.
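As a sketch of that shape (hypothetical wrapper, with the threshold
expressed per catcache via its oldest entry; not the posted code):

/*
 * prune_min_age_us would be recomputed in the GUC assign hook and pinned
 * to PG_UINT64_MAX while the feature is disabled, so the comparison below
 * effectively never fires and disabled mode costs little more than one
 * predicted branch.  Assumes the catcache.c context (catcacheclock,
 * cc_oldest_ts, CatCacheCleanupOldEntries).
 */
static inline void
MaybePruneCatCache(CatCache *cache)
{
    if (catcacheclock - cache->cc_oldest_ts >= prune_min_age_us)
        (void) CatCacheCleanupOldEntries(cache);
}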

+ * per-statement basis and additionaly udpated periodically

two words spelled wrong

+void
+assign_catalog_cache_prune_min_age(int newval, void *extra)
+{
+ catalog_cache_prune_min_age = newval;
+}

hmm, do we need this?

+ /*
+ * Entries that are not accessed after the last pruning
+ * are removed in that seconds, and their lives are
+ * prolonged according to how many times they are accessed
+ * up to three times of the duration. We don't try shrink
+ * buckets since pruning effectively caps catcache
+ * expansion in the long term.
+ */
+ ct->naccess = Min(2, ct->naccess);

The code doesn't match the comment, it seems, because the limit here
is 2, not 3. I wonder if this does anything anyway. My intuition is
that when a catcache entry gets accessed at all it's probably likely
to get accessed a bunch of times. If there are any meaningful
thresholds here I'd expect us to be trying to distinguish things like
1000+ accesses vs. 100-1000 vs. 10-100 vs. 1-10. Or maybe we don't
need to distinguish at all and can just have a single mark bit rather
than a counter.
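A single mark bit would turn this into a plain second-chance scheme.  A
sketch under that assumption (the recently_used field and the helper are
hypothetical, not from the posted patches):

/* Assumes catcache.c context; SearchCatCacheInternal() would set
 * ct->recently_used = true on every hit. */
static inline bool
CatCTupShouldEvict(CatCTup *ct, uint64 prune_threshold)
{
    if (ct->recently_used)
    {
        /* marked since the last scan: clear the bit and keep the entry */
        ct->recently_used = false;
        return false;
    }
    /* unmarked: evict only if it is also older than the threshold */
    return ct->lastaccess < prune_threshold;
}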

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro Horiguchi
Дата:
At Tue, 17 Nov 2020 17:46:25 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in 
> On 09/11/2020 11:34, Kyotaro Horiguchi wrote:
> > At Fri, 6 Nov 2020 10:42:15 +0200, Heikki Linnakangas <hlinnaka@iki.fi>
> > wrote in
> >> Do you need the "ntaccess == 2" test? You could always increment the
> >> counter, and in the code that uses ntaccess to decide what to evict,
> >> treat all values >= 2 the same.
> >>
> >> Need to handle integer overflow somehow. Or maybe not: integer
> >> overflow is so infrequent that even if a hot syscache entry gets
> >> evicted prematurely because its ntaccess count wrapped around to 0, it
> >> will happen so rarely that it won't make any difference in practice.
> > That relaxing simplifies the code significantly, but a significant
> > degradation by about 5% still exists.
> > (SearchCatCacheInternal())
> >   +     ct->naccess++;
> > !+     ct->lastaccess = catcacheclock;
> > If I removed the second line above, the degradation disappears
> > (-0.7%).
> 
> 0.7% degradation is probably acceptable.

Sorry for the confusion "-0.7% degradation" meant "+0.7% gain".

> > However, I don't find the corresponding numbers in the output
> > of perf. The sum of the numbers for the removed instructions is (0.02
> > + 0.28 = 0.3%).  I don't think the degradation as the whole doesn't
> > always reflect to the instruction level profiling, but I'm stuck here,
> > anyway.
> 
> Hmm. Some kind of cache miss effect, perhaps? offsetof(CatCTup, tuple) is

Shouldn't it be seen in the perf result?

> exactly 64 bytes currently, so any fields that you add after 'tuple' will go
> on a different cache line. Maybe it would help if you just move the new fields
> before 'tuple'.
> 
> Making CatCTup smaller might help. Some ideas/observations:
> 
> - The 'ct_magic' field is only used for assertion checks. Could remove it.

Ok, removed.

> - 4 Datums (32 bytes) are allocated for the keys, even though most catcaches
> - have fewer key columns.
> - In the current syscaches, keys[2] and keys[3] are only used to store 32-bit
> - oids or some other smaller fields. Allocating a full 64-bit Datum for them
> - wastes memory.

It seems to be the last resort.

> - You could move the dead flag at the end of the struct or remove it
> - altogether, with the change I mentioned earlier to not keep dead items in
> - the buckets

This seems most promising, so I did this.  One annoyance is that we need
to know whether a catcache tuple has been invalidated or not to judge
whether to remove it.  I used CatCTup.cache_elem.prev to signal that in
the next version.
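A minimal sketch of that convention (the helper itself is hypothetical
and not in the posted patch): once an invalidated entry has been delinked
from its bucket, a NULL cache_elem.prev takes over the role of the old
dead flag.

/* Assumes catcache.c context: the invalidation/pruning code sets
 * cache_elem.prev to NULL when it delinks an entry, so a NULL prev
 * identifies an entry that is no longer reachable from its bucket. */
static inline bool
CatCTupIsDelinked(const CatCTup *ct)
{
    return ct->cache_elem.prev == NULL;
}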

> - You could steal a few bit for dead/negative flags from some other field. Use
> - special values for tuple.t_len for them or something.

I stole the MSB of refcount for the negative flag, but the bit-masking
operations seem to make the function slower.  Benchmark 2 gets slower by
around +2% in total.

> With some of these tricks, you could shrink CatCTup so that the new lastaccess
> and naccess fields would fit in the same cacheline.
> 
> That said, I think this is good enough performance-wise as it is. So if we
> want to improve performance in general, that can be a separate patch.

Removing CatCTup.dead increased the performance of catcache search
significantly, but catcache entry creation gets slower for unclear
reasons.

(Continues to a reply to Robert's comment)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro Horiguchi
Дата:
Thank you for the comments.

At Tue, 17 Nov 2020 16:22:54 -0500, Robert Haas <robertmhaas@gmail.com> wrote in 
> On Tue, Nov 17, 2020 at 10:46 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > 0.7% degradation is probably acceptable.
> 
> I haven't looked at this patch in a while and I'm pleased with the way
> it seems to have been redesigned. It seems relatively simple and
> unlikely to cause big headaches. I would say that 0.7% is probably not
> acceptable on a general workload, but it seems fine on a benchmark

Sorry for the confusing notation; "-0.7% degradation" meant +0.7%
*gain*, which I think is within the margin of error.  However, the next
patch makes the catcache apparently *faster*, so the difference doesn't
matter.


> that is specifically designed to be a worst-case for this patch, which
> I gather is what's happening here. I think it would be nice if we
> could enable this feature by default. Does it cause a measurable
> regression on realistic workloads when enabled? I bet a default of 5
> or 10 minutes would help many users.
> 
> One idea for improving things might be to move the "return
> immediately" tests in CatCacheCleanupOldEntries() to the caller, and
> only call this function if they indicate that there is some purpose.
> This would avoid the function call overhead when nothing can be done.
> Perhaps the two tests could be combined into one and simplified. Like,
> suppose the code looks (roughly) like this:
> 
> if (catcacheclock >= time_at_which_we_can_prune)
>     CatCacheCleanupOldEntries(...);

The compiler removes the call (or inlines the function), but of course we
can write it that way, and it shows the condition for calling the
function more clearly.  The codelet above forgets to consider the result
of CatCacheCleanupOldEntries() itself: the function returns false when
all "old" entries have already been invalidated or explicitly removed,
and we need to expand the hash in that case.

> To make it that simple, we want catcacheclock and
> time_at_which_we_can_prune to be stored as bare uint64 quantities so
> we don't need TimestampDifference(). And we want
> time_at_which_we_can_prune to be set to PG_UINT64_MAX when the feature
> is disabled. But those both seem like pretty achievable things... and
> it seems like the result would probably be faster than what you have
> now.

The time_at_which_we_can_prune is not global but catcache-local, and it
needs to change whenever catalog_cache_prune_min_age is changed.

So the next version does the following:

-    if (CatCacheCleanupOldEntries(cp))
+    if (catcacheclock - cp->cc_oldest_ts > prune_min_age_us &&
+        CatCacheCleanupOldEntries(cp))

On the other hand, CatCacheCleanupOldEntries() can calculate the
time_at_which_we_can_prune once at the beginning of the function.  That
makes the condition in the loop simpler.

-        TimestampDifference(ct->lastaccess, catcacheclock, &age, &us);
-
-        if (age > catalog_cache_prune_min_age)
+        if (ct->lastaccess < prune_threshold)
        {

> + * per-statement basis and additionaly udpated periodically
> 
> two words spelled wrong

Ugg. Fixed.  Checked all spellings and found another misspelling.

> +void
> +assign_catalog_cache_prune_min_age(int newval, void *extra)
> +{
> + catalog_cache_prune_min_age = newval;
> +}
> 
> hmm, do we need this?

*That* is actually useless, but the function is kept, and now it
maintains the internal version of the GUC parameter (uint64
prune_min_age_us).

> + /*
> + * Entries that are not accessed after the last pruning
> + * are removed in that seconds, and their lives are
> + * prolonged according to how many times they are accessed
> + * up to three times of the duration. We don't try shrink
> + * buckets since pruning effectively caps catcache
> + * expansion in the long term.
> + */
> + ct->naccess = Min(2, ct->naccess);
> 
> The code doesn't match the comment, it seems, because the limit here
> is 2, not 3. I wonder if this does anything anyway. My intuition is
> that when a catcache entry gets accessed at all it's probably likely
> to get accessed a bunch of times. If there are any meaningful
> thresholds here I'd expect us to be trying to distinguish things like
> 1000+ accesses vs. 100-1000 vs. 10-100 vs. 1-10. Or maybe we don't
> need to distinguish at all and can just have a single mark bit rather
> than a counter.

Agreed.  Since I don't see a clear criterion for the threshold of the
counter, I removed naccess and the related lines.

I did the following changes in the attached.

1. Removed naccess and related lines.

2. Moved the precheck condition out of CatCacheCleanupOldEntries() to
  RehashCatCache().

3. Use uint64 direct comparison instead of TimestampDifference().

4. Removed CatCTup.dead flag.

Performance measurement on the attached patches showed better results
for searching but maybe worse results for cache entry creation.  Each
time is the mean of 10 runs.

# Catcache (negative) entry creation
           :  time(ms) (% to master)
master     :  3965.61    (100.0)
patched-off:  4040.93    (101.9)
patched-on :  4032.22    (101.7)

# Searching negative cache entries
master     :  8173.46    (100.0)
patched-off:  7983.43    ( 97.7)
patched-on :  8049.88    ( 98.5)

# Creation, searching and expiration
master     :  6393.23    (100.0)
patched-off:  6527.94    (102.1)
patched-on : 15880.01    (248.4)


That is, catcache searching gets faster by 2-3% but creation gets slower
by about 2%.  If I moved the precheck condition (change 2 above) further
up into CatalogCacheCreateEntry(), that degradation was reduced to 0.6%.

# Catcache (negative) entry creation
master      :  3967.45   (100.0)
patched-off :  3990.43   (100.6)
patched-on  :  4108.96   (103.6)

# Searching negative cache entries
master      :  8106.53   (100.0)
patched-off :  8036.61   ( 99.1)
patched-on  :  8058.18   ( 99.4)

# Creation, searching and expiration
master      :  6395.00   (100.0)
patched-off :  6416.57   (100.3)
patched-on  : 15830.91   (247.6)

It doesn't get smaller even if I revert the changed lines in
CatalogCacheCreateEntry().


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 990514a853ad92b2d929cc026724194831ef8793 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 18 Nov 2020 16:54:31 +0900
Subject: [PATCH v5 1/3] CatCache expiration feature

---
 src/backend/access/transam/xact.c  |  3 ++
 src/backend/utils/cache/catcache.c | 87 +++++++++++++++++++++++++++++-
 src/backend/utils/misc/guc.c       | 12 +++++
 src/include/utils/catcache.h       | 19 +++++++
 4 files changed, 120 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 03c553e7ea..4a2a90ce0c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1086,6 +1086,9 @@ static void
 AtStart_Cache(void)
 {
     AcceptInvalidationMessages();
+
+    if (xactStartTimestamp != 0)
+        SetCatCacheClock(xactStartTimestamp);
 }
 
 /*
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 3613ae5f44..1ebcc7dcd3 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -38,6 +38,7 @@
 #include "utils/rel.h"
 #include "utils/resowner_private.h"
 #include "utils/syscache.h"
+#include "utils/timestamp.h"
 
 
  /* #define CACHEDEBUG */    /* turns DEBUG elogs on */
@@ -60,9 +61,19 @@
 #define CACHE_elog(...)
 #endif
 
+/*
+ * GUC variable to define the minimum age of entries that will be considered
+ * to be evicted in seconds. -1 to disable the feature.
+ */
+int catalog_cache_prune_min_age = -1;
+uint64    prune_min_age_us;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Clock for the last accessed time of a catcache entry. */
+uint64    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                                                int nkeys,
                                                Datum v1, Datum v2,
@@ -74,6 +85,7 @@ static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache,
                                                 Index hashIndex,
                                                 Datum v1, Datum v2,
                                                 Datum v3, Datum v4);
+static bool CatCacheCleanupOldEntries(CatCache *cp);
 
 static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys,
                                            Datum v1, Datum v2, Datum v3, Datum v4);
@@ -99,6 +111,15 @@ static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos,
 static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
                              Datum *srckeys, Datum *dstkeys);
 
+/* GUC assign function */
+void
+assign_catalog_cache_prune_min_age(int newval, void *extra)
+{
+    if (newval < 0)
+        prune_min_age_us = UINT64_MAX;
+    else
+        prune_min_age_us = ((uint64) newval) * USECS_PER_SEC;
+}
 
 /*
  *                    internal support functions
@@ -1264,6 +1285,9 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /* Record the last access timestamp */
+        ct->lastaccess = catcacheclock;
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1425,6 +1449,61 @@ SearchCatCacheMiss(CatCache *cache,
     return &ct->tuple;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries happen to be left unused for a long time for several
+ * reasons. Remove such entries to prevent catcache from bloating. It is based
+ * on the similar algorithm with buffer eviction. Entries that are accessed
+ * several times in a certain period live longer than those that have had less
+ * access in the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int        nremoved = 0;
+    int        i;
+    long    oldest_ts = catcacheclock;
+    uint64    prune_threshold = catcacheclock - prune_min_age_us;
+
+    /* Scan over the whole hash to find entries to remove */
+    for (i = 0 ; i < cp->cc_nbuckets ; i++)
+    {
+        dlist_mutable_iter    iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+
+            /* Don't remove referenced entries */
+            if (ct->refcount == 0 &&
+                (ct->c_list == NULL || ct->c_list->refcount == 0))
+            {
+                if (ct->lastaccess < prune_threshold)
+                {
+                    CatCacheRemoveCTup(cp, ct);
+                    nremoved++;
+
+                    /* don't let the removed entry update oldest_ts */
+                    continue;
+                }
+            }
+
+            /* update the oldest timestamp if the entry remains alive */
+            if (ct->lastaccess < oldest_ts)
+                oldest_ts = ct->lastaccess;
+        }
+    }
+
+    cp->cc_oldest_ts = oldest_ts;
+
+    if (nremoved > 0)
+        elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d",
+             cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved);
+
+    return nremoved > 0;
+}
+
 /*
  *    ReleaseCatCache
  *
@@ -1888,6 +1967,7 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->lastaccess = catcacheclock;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
@@ -1899,7 +1979,12 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
      * arbitrarily, we enlarge when fill factor > 2.
      */
     if (cache->cc_ntup > cache->cc_nbuckets * 2)
-        RehashCatCache(cache);
+    {
+        /* try removing old entries before expanding hash */
+        if (catcacheclock - cache->cc_oldest_ts < prune_min_age_us ||
+            !CatCacheCleanupOldEntries(cache))
+            RehashCatCache(cache);
+    }
 
     return ct;
 }
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index bb34630e8e..95213853aa 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -88,6 +88,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/float.h"
 #include "utils/guc_tables.h"
 #include "utils/memutils.h"
@@ -3399,6 +3400,17 @@ static struct config_int ConfigureNamesInt[] =
         check_huge_page_size, NULL, NULL
     },
 
+    {
+        {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("System catalog cache entries that are living unused more than this seconds are considered
forremoval."),
 
+            gettext_noop("The value of -1 turns off pruning."),
+            GUC_UNIT_S
+        },
+        &catalog_cache_prune_min_age,
+        -1, -1, INT_MAX,
+        NULL, assign_catalog_cache_prune_min_age, NULL
+    },
+
     /* End-of-list marker */
     {
         {NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index f4aa316604..81587c3fe6 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,7 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
+    uint64        cc_oldest_ts;    /* timestamp (us) of the oldest tuple */
 
     /*
      * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,6 +121,7 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
+    uint64        lastaccess;        /* timestamp in us of the last usage */
 
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
@@ -189,6 +192,22 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int catalog_cache_prune_min_age;
+
+/* source clock for access timestamp of catcache entries */
+extern uint64 catcacheclock;
+
+/* SetCatCacheClock - set catcache timestamp source clock */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = (uint64) ts;
+}
+
+extern void assign_catalog_cache_prune_min_age(int newval, void *extra);
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.18.4

From 2bc7eb221768ee8484fb65db48fa16f6e2c4b347 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 18 Nov 2020 16:57:05 +0900
Subject: [PATCH v5 2/3] Remove "dead" flag from catcache tuple

---
 src/backend/utils/cache/catcache.c | 43 +++++++++++++-----------------
 src/include/utils/catcache.h       | 10 -------
 2 files changed, 18 insertions(+), 35 deletions(-)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 1ebcc7dcd3..3e6c4720dc 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -480,6 +480,13 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
     Assert(ct->refcount == 0);
     Assert(ct->my_cache == cache);
 
+    /* delink from linked list if not yet */
+    if (ct->cache_elem.prev)
+    {
+        dlist_delete(&ct->cache_elem);
+        ct->cache_elem.prev = NULL;
+    }
+
     if (ct->c_list)
     {
         /*
@@ -487,14 +494,10 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
          * which will recurse back to me, and the recursive call will do the
          * work.  Set the "dead" flag to make sure it does recurse.
          */
-        ct->dead = true;
         CatCacheRemoveCList(cache, ct->c_list);
         return;                    /* nothing left to do */
     }
 
-    /* delink from linked list */
-    dlist_delete(&ct->cache_elem);
-
     /*
      * Free keys when we're dealing with a negative entry, normal entries just
      * point into tuple, allocated together with the CatCTup.
@@ -534,7 +537,7 @@ CatCacheRemoveCList(CatCache *cache, CatCList *cl)
         /* if the member is dead and now has no references, remove it */
         if (
 #ifndef CATCACHE_FORCE_RELEASE
-            ct->dead &&
+            ct->cache_elem.prev == NULL &&
 #endif
             ct->refcount == 0)
             CatCacheRemoveCTup(cache, ct);
@@ -609,7 +612,9 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue)
             if (ct->refcount > 0 ||
                 (ct->c_list && ct->c_list->refcount > 0))
             {
-                ct->dead = true;
+                dlist_delete(&ct->cache_elem);
+                ct->cache_elem.prev = NULL;
+
                 /* list, if any, was marked dead above */
                 Assert(ct->c_list == NULL || ct->c_list->dead);
             }
@@ -688,7 +693,8 @@ ResetCatalogCache(CatCache *cache)
             if (ct->refcount > 0 ||
                 (ct->c_list && ct->c_list->refcount > 0))
             {
-                ct->dead = true;
+                dlist_delete(&ct->cache_elem);
+                ct->cache_elem.prev = NULL;
                 /* list, if any, was marked dead above */
                 Assert(ct->c_list == NULL || ct->c_list->dead);
             }
@@ -1268,9 +1274,6 @@ SearchCatCacheInternal(CatCache *cache,
     {
         ct = dlist_container(CatCTup, cache_elem, iter.cur);
 
-        if (ct->dead)
-            continue;            /* ignore dead entries */
-
         if (ct->hash_value != hashValue)
             continue;            /* quickly skip entry if wrong hash val */
 
@@ -1522,7 +1525,6 @@ ReleaseCatCache(HeapTuple tuple)
                                   offsetof(CatCTup, tuple));
 
     /* Safety checks to ensure we were handed a cache entry */
-    Assert(ct->ct_magic == CT_MAGIC);
     Assert(ct->refcount > 0);
 
     ct->refcount--;
@@ -1530,7 +1532,7 @@ ReleaseCatCache(HeapTuple tuple)
 
     if (
 #ifndef CATCACHE_FORCE_RELEASE
-        ct->dead &&
+        ct->cache_elem.prev == NULL &&
 #endif
         ct->refcount == 0 &&
         (ct->c_list == NULL || ct->c_list->refcount == 0))
@@ -1737,8 +1739,8 @@ SearchCatCacheList(CatCache *cache,
             {
                 ct = dlist_container(CatCTup, cache_elem, iter.cur);
 
-                if (ct->dead || ct->negative)
-                    continue;    /* ignore dead and negative entries */
+                if (ct->negative)
+                    continue;    /* ignore negative entries */
 
                 if (ct->hash_value != hashValue)
                     continue;    /* quickly skip entry if wrong hash val */
@@ -1799,14 +1801,13 @@ SearchCatCacheList(CatCache *cache,
     {
         foreach(ctlist_item, ctlist)
         {
+            Assert (ct->cache_elem.prev != NULL);
+
             ct = (CatCTup *) lfirst(ctlist_item);
             Assert(ct->c_list == NULL);
             Assert(ct->refcount > 0);
             ct->refcount--;
             if (
-#ifndef CATCACHE_FORCE_RELEASE
-                ct->dead &&
-#endif
                 ct->refcount == 0 &&
                 (ct->c_list == NULL || ct->c_list->refcount == 0))
                 CatCacheRemoveCTup(cache, ct);
@@ -1834,9 +1835,6 @@ SearchCatCacheList(CatCache *cache,
         /* release the temporary refcount on the member */
         Assert(ct->refcount > 0);
         ct->refcount--;
-        /* mark list dead if any members already dead */
-        if (ct->dead)
-            cl->dead = true;
     }
     Assert(i == nmembers);
 
@@ -1960,11 +1958,9 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
      * Finish initializing the CatCTup header, and add it to the cache's
      * linked list and counts.
      */
-    ct->ct_magic = CT_MAGIC;
     ct->my_cache = cache;
     ct->c_list = NULL;
     ct->refcount = 0;            /* for the moment */
-    ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
     ct->lastaccess = catcacheclock;
@@ -2158,9 +2154,6 @@ PrintCatCacheLeakWarning(HeapTuple tuple)
     CatCTup    *ct = (CatCTup *) (((char *) tuple) -
                                   offsetof(CatCTup, tuple));
 
-    /* Safety check to ensure we were handed a cache entry */
-    Assert(ct->ct_magic == CT_MAGIC);
-
     elog(WARNING, "cache reference leak: cache %s (%d), tuple %u/%u has count %d",
          ct->my_cache->cc_relname, ct->my_cache->id,
          ItemPointerGetBlockNumber(&(tuple->t_self)),
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 81587c3fe6..36940f4e3b 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -87,9 +87,6 @@ typedef struct catcache
 
 typedef struct catctup
 {
-    int            ct_magic;        /* for identifying CatCTup entries */
-#define CT_MAGIC   0x57261502
-
     uint32        hash_value;        /* hash value for this tuple's keys */
 
     /*
@@ -106,19 +103,12 @@ typedef struct catctup
     dlist_node    cache_elem;        /* list member of per-bucket list */
 
     /*
-     * A tuple marked "dead" must not be returned by subsequent searches.
-     * However, it won't be physically deleted from the cache until its
-     * refcount goes to zero.  (If it's a member of a CatCList, the list's
-     * refcount must go to zero, too; also, remember to mark the list dead at
-     * the same time the tuple is marked.)
-     *
      * A negative cache entry is an assertion that there is no tuple matching
      * a particular key.  This is just as useful as a normal entry so far as
      * avoiding catalog searches is concerned.  Management of positive and
      * negative entries is identical.
      */
     int            refcount;        /* number of active references */
-    bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
     uint64        lastaccess;        /* timestamp in us of the last usage */
-- 
2.18.4

From 61806132ea82e90997e38a6cb0fa0d74bc3c4c2b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 18 Nov 2020 16:56:41 +0900
Subject: [PATCH v5 3/3] catcachebench

---
 contrib/catcachebench/Makefile               |  17 +
 contrib/catcachebench/catcachebench--0.0.sql |  14 +
 contrib/catcachebench/catcachebench.c        | 330 +++++++++++++++++++
 contrib/catcachebench/catcachebench.control  |   6 +
 src/backend/utils/cache/catcache.c           |  33 ++
 src/backend/utils/cache/syscache.c           |   2 +-
 6 files changed, 401 insertions(+), 1 deletion(-)
 create mode 100644 contrib/catcachebench/Makefile
 create mode 100644 contrib/catcachebench/catcachebench--0.0.sql
 create mode 100644 contrib/catcachebench/catcachebench.c
 create mode 100644 contrib/catcachebench/catcachebench.control

diff --git a/contrib/catcachebench/Makefile b/contrib/catcachebench/Makefile
new file mode 100644
index 0000000000..0478818b25
--- /dev/null
+++ b/contrib/catcachebench/Makefile
@@ -0,0 +1,17 @@
+MODULE_big = catcachebench
+OBJS = catcachebench.o
+
+EXTENSION = catcachebench
+DATA = catcachebench--0.0.sql
+PGFILEDESC = "catcachebench - benchmark for catcache pruning feature"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/catcachebench
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/catcachebench/catcachebench--0.0.sql b/contrib/catcachebench/catcachebench--0.0.sql
new file mode 100644
index 0000000000..ea9cd62abb
--- /dev/null
+++ b/contrib/catcachebench/catcachebench--0.0.sql
@@ -0,0 +1,14 @@
+/* contrib/catcachebench/catcachebench--0.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION catcachebench" to load this file. \quit
+
+CREATE FUNCTION catcachebench(IN type int)
+RETURNS double precision
+AS 'MODULE_PATHNAME', 'catcachebench'
+LANGUAGE C STRICT VOLATILE;
+
+CREATE FUNCTION catcachereadstats(OUT catid int, OUT reloid oid, OUT searches bigint, OUT hits bigint, OUT neg_hits bigint)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'catcachereadstats'
+LANGUAGE C STRICT VOLATILE;
diff --git a/contrib/catcachebench/catcachebench.c b/contrib/catcachebench/catcachebench.c
new file mode 100644
index 0000000000..b5a4d794ed
--- /dev/null
+++ b/contrib/catcachebench/catcachebench.c
@@ -0,0 +1,330 @@
+/*
+ * catcachebench: test code for cache pruning feature
+ */
+/* #define CATCACHE_STATS */
+#include "postgres.h"
+#include "catalog/pg_type.h"
+#include "catalog/pg_statistic.h"
+#include "executor/spi.h"
+#include "funcapi.h"
+#include "libpq/pqsignal.h"
+#include "utils/catcache.h"
+#include "utils/syscache.h"
+#include "utils/timestamp.h"
+
+Oid        tableoids[10000];
+int        ntables = 0;
+int16    attnums[1000];
+int        natts = 0;
+
+PG_MODULE_MAGIC;
+
+double catcachebench1(void);
+double catcachebench2(void);
+double catcachebench3(void);
+void collectinfo(void);
+void catcachewarmup(void);
+
+PG_FUNCTION_INFO_V1(catcachebench);
+PG_FUNCTION_INFO_V1(catcachereadstats);
+
+extern void CatalogCacheFlushCatalog2(Oid catId);
+extern int64 catcache_called;
+extern CatCache *SysCache[];
+
+typedef struct catcachestatsstate
+{
+    TupleDesc tupd;
+    int          catId;
+} catcachestatsstate;
+
+Datum
+catcachereadstats(PG_FUNCTION_ARGS)
+{
+    catcachestatsstate *state_data = NULL;
+    FuncCallContext *fctx;
+
+    if (SRF_IS_FIRSTCALL())
+    {
+        TupleDesc    tupdesc;
+        MemoryContext mctx;
+
+        fctx = SRF_FIRSTCALL_INIT();
+        mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+        state_data = palloc(sizeof(catcachestatsstate));
+
+        /* Build a tuple descriptor for our result type */
+        if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+            elog(ERROR, "return type must be a row type");
+
+        state_data->tupd = tupdesc;
+        state_data->catId = 0;
+
+        fctx->user_fctx = state_data;
+
+        MemoryContextSwitchTo(mctx);
+    }
+
+    fctx = SRF_PERCALL_SETUP();
+    state_data = fctx->user_fctx;
+
+    if (state_data->catId < SysCacheSize)
+    {
+        Datum    values[5];
+        bool    nulls[5];
+        HeapTuple    resulttup;
+        Datum    result;
+        int        catId = state_data->catId++;
+
+        memset(nulls, 0, sizeof(nulls));
+        memset(values, 0, sizeof(values));
+        values[0] = Int16GetDatum(catId);
+        values[1] = ObjectIdGetDatum(SysCache[catId]->cc_reloid);
+#ifdef CATCACHE_STATS        
+        values[2] = Int64GetDatum(SysCache[catId]->cc_searches);
+        values[3] = Int64GetDatum(SysCache[catId]->cc_hits);
+        values[4] = Int64GetDatum(SysCache[catId]->cc_neg_hits);
+#endif
+        resulttup = heap_form_tuple(state_data->tupd, values, nulls);
+        result = HeapTupleGetDatum(resulttup);
+
+        SRF_RETURN_NEXT(fctx, result);
+    }
+
+    SRF_RETURN_DONE(fctx);
+}
+
+Datum
+catcachebench(PG_FUNCTION_ARGS)
+{
+    int        testtype = PG_GETARG_INT32(0);
+    double    ms;
+
+    collectinfo();
+
+    /* flush the catalog -- safe? don't mind. */
+    CatalogCacheFlushCatalog2(StatisticRelationId);
+
+    switch (testtype)
+    {
+    case 0:
+        catcachewarmup(); /* prewarm of syscatalog */
+        PG_RETURN_NULL();
+    case 1:
+        ms = catcachebench1(); break;
+    case 2:
+        ms = catcachebench2(); break;
+    case 3:
+        ms = catcachebench3(); break;
+    default:
+        elog(ERROR, "Invalid test type: %d", testtype);
+    }
+
+    PG_RETURN_DATUM(Float8GetDatum(ms));
+}
+
+/*
+ * fetch all attribute entries of all tables.
+ */
+double
+catcachebench1(void)
+{
+    int t, a;
+    instr_time    start,
+                duration;
+
+    PG_SETMASK(&BlockSig);
+    INSTR_TIME_SET_CURRENT(start);
+    for (t = 0 ; t < ntables ; t++)
+    {
+        for (a = 0 ; a < natts ; a++)
+        {
+            HeapTuple tup;
+
+            tup = SearchSysCache3(STATRELATTINH,
+                                  ObjectIdGetDatum(tableoids[t]),
+                                  Int16GetDatum(attnums[a]),
+                                  BoolGetDatum(false));
+            /* should be null, but.. */
+            if (HeapTupleIsValid(tup))
+                ReleaseSysCache(tup);
+        }
+    }
+    INSTR_TIME_SET_CURRENT(duration);
+    INSTR_TIME_SUBTRACT(duration, start);
+    PG_SETMASK(&UnBlockSig);
+
+    return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+/*
+ * fetch all attribute entries of a table 6000 times.
+ */
+double
+catcachebench2(void)
+{
+    int t, a;
+    instr_time    start,
+                duration;
+
+    PG_SETMASK(&BlockSig);
+    INSTR_TIME_SET_CURRENT(start);
+    for (t = 0 ; t < 240000 ; t++)
+    {
+        for (a = 0 ; a < natts ; a++)
+        {
+            HeapTuple tup;
+
+            tup = SearchSysCache3(STATRELATTINH,
+                                  ObjectIdGetDatum(tableoids[0]),
+                                  Int16GetDatum(attnums[a]),
+                                  BoolGetDatum(false));
+            /* should be null, but.. */
+            if (HeapTupleIsValid(tup))
+                ReleaseSysCache(tup);
+        }
+    }
+    INSTR_TIME_SET_CURRENT(duration);
+    INSTR_TIME_SUBTRACT(duration, start);
+    PG_SETMASK(&UnBlockSig);
+
+    return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+/*
+ * fetch all attribute entries of all tables repeatedly, letting
+ * expiration happen.
+ */
+double
+catcachebench3(void)
+{
+    const int clock_step = 1000;
+    int i, t, a;
+    instr_time    start,
+                duration;
+
+    PG_SETMASK(&BlockSig);
+    INSTR_TIME_SET_CURRENT(start);
+    for (i = 0 ; i < 4 ; i++)
+    {
+        int ct = clock_step;
+
+        for (t = 0 ; t < ntables ; t++)
+        {
+            /*
+             * catcacheclock is updated from the transaction timestamp, so it
+             * needs to be updated by other means for this test to work. Here I
+             * chose to update the clock every 1000 table scans.
+             */
+            if (--ct < 0)
+            {
+                SetCatCacheClock(GetCurrentTimestamp());
+                ct = clock_step;
+            }
+            for (a = 0 ; a < natts ; a++)
+            {
+                HeapTuple tup;
+
+                tup = SearchSysCache3(STATRELATTINH,
+                                      ObjectIdGetDatum(tableoids[t]),
+                                      Int16GetDatum(attnums[a]),
+                                      BoolGetDatum(false));
+                /* should be null, but.. */
+                if (HeapTupleIsValid(tup))
+                    ReleaseSysCache(tup);
+            }
+        }
+    }
+    INSTR_TIME_SET_CURRENT(duration);
+    INSTR_TIME_SUBTRACT(duration, start);
+    PG_SETMASK(&UnBlockSig);
+
+    return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+void
+catcachewarmup(void)
+{
+    int t, a;
+
+    /* load up catalog tables */
+    for (t = 0 ; t < ntables ; t++)
+    {
+        for (a = 0 ; a < natts ; a++)
+        {
+            HeapTuple tup;
+
+            tup = SearchSysCache3(STATRELATTINH,
+                                  ObjectIdGetDatum(tableoids[t]),
+                                  Int16GetDatum(attnums[a]),
+                                  BoolGetDatum(false));
+            /* should be null, but.. */
+            if (HeapTupleIsValid(tup))
+                ReleaseSysCache(tup);
+        }
+    }
+}
+
+void
+collectinfo(void)
+{
+    int ret;
+    Datum    values[10000];
+    bool    nulls[10000];
+    Oid        types0[] = {OIDOID};
+    int i;
+
+    ntables = 0;
+    natts = 0;
+
+    SPI_connect();
+    /* collect target tables */
+    ret = SPI_execute("select oid from pg_class where relnamespace = (select oid from pg_namespace where nspname =
\'test\')",
+                      true, 0);
+    if (ret != SPI_OK_SELECT)
+        elog(ERROR, "Failed 1");
+    if (SPI_processed == 0)
+        elog(ERROR, "no relation found in schema \"test\"");
+    if (SPI_processed > 10000)
+        elog(ERROR, "too many relation found in schema \"test\"");
+
+    for (i = 0 ; i < SPI_processed ; i++)
+    {
+        heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc,
+                          values, nulls);
+        if (nulls[0])
+            elog(ERROR, "Failed 2");
+
+        tableoids[ntables++] = DatumGetObjectId(values[0]);
+    }
+    SPI_finish();
+    elog(DEBUG1, "%d tables found", ntables);
+
+    values[0] = ObjectIdGetDatum(tableoids[0]);
+    nulls[0] = false;
+    SPI_connect();
+    ret = SPI_execute_with_args("select attnum from pg_attribute where attrelid = (select oid from pg_class where oid
=$1)",
 
+                                1, types0, values, NULL, true, 0);
+    if (SPI_processed == 0)
+        elog(ERROR, "no attribute found in table %d", tableoids[0]);
+    if (SPI_processed > 10000)
+        elog(ERROR, "too many relation found in table %d", tableoids[0]);
+    
+    /* collect target attributes. assuming all tables have the same attnums */
+    for (i = 0 ; i < SPI_processed ; i++)
+    {
+        int16 attnum;
+
+        heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc,
+                          values, nulls);
+        if (nulls[0])
+            elog(ERROR, "Failed 3");
+        attnum = DatumGetInt16(values[0]);
+
+        if (attnum > 0)
+            attnums[natts++] = attnum;
+    }
+    SPI_finish();
+    elog(DEBUG1, "%d attributes found", natts);
+}
diff --git a/contrib/catcachebench/catcachebench.control b/contrib/catcachebench/catcachebench.control
new file mode 100644
index 0000000000..3fc9d2e420
--- /dev/null
+++ b/contrib/catcachebench/catcachebench.control
@@ -0,0 +1,6 @@
+# catcachebench
+
+comment = 'benchmark for catcache pruning'
+default_version = '0.0'
+module_pathname = '$libdir/catcachebench'
+relocatable = true
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 3e6c4720dc..294d906416 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -767,6 +767,39 @@ CatalogCacheFlushCatalog(Oid catId)
     CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call");
 }
 
+
+/* FUNCTION FOR BENCHMARKING */
+void
+CatalogCacheFlushCatalog2(Oid catId)
+{
+    slist_iter    iter;
+
+    CACHE_elog(DEBUG2, "CatalogCacheFlushCatalog called for %u", catId);
+
+    slist_foreach(iter, &CacheHdr->ch_caches)
+    {
+        CatCache   *cache = slist_container(CatCache, cc_next, iter.cur);
+
+        /* Does this cache store tuples of the target catalog? */
+        if (cache->cc_reloid == catId)
+        {
+            /* Yes, so flush all its contents */
+            ResetCatalogCache(cache);
+
+            /* Tell inval.c to call syscache callbacks for this cache */
+            CallSyscacheCallbacks(cache->id, 0);
+
+            cache->cc_nbuckets = 128;
+            pfree(cache->cc_bucket);
+            cache->cc_bucket = palloc0(128 * sizeof(dlist_head));
+            elog(LOG, "Catcache reset");
+        }
+    }
+
+    CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call");
+}
+/* END: FUNCTION FOR BENCHMARKING */
+
 /*
  *        InitCatCache
  *
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index 809b27a038..e83b3f66d1 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -982,7 +982,7 @@ static const struct cachedesc cacheinfo[] = {
     }
 };
 
-static CatCache *SysCache[SysCacheSize];
+CatCache *SysCache[SysCacheSize];
 
 static bool CacheInitialized = false;
 
-- 
2.18.4


Re: Protect syscache from bloating with negative cache entries

От
Andres Freund
Дата:
Hi,

On 2020-11-19 14:25:36 +0900, Kyotaro Horiguchi wrote:
> # Creation, searching and expiration
> master     :  6393.23    (100.0)
> patched-off:  6527.94    (102.1)
> patched-on : 15880.01    (248.4)

What's the deal with this massive increase here?

Greetings,

Andres Freund



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro Horiguchi
Дата:
At Wed, 18 Nov 2020 21:42:02 -0800, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> On 2020-11-19 14:25:36 +0900, Kyotaro Horiguchi wrote:
> > # Creation, searching and expiration
> > master     :  6393.23    (100.0)
> > patched-off:  6527.94    (102.1)
> > patched-on : 15880.01    (248.4)
> 
> What's the deal with this massive increase here?

CatCacheRemoveCTup(). If I replace the call to that function in the
cleanup function with dlist_delete(), the result changes as follows:

master      :  6372.04   (100.0) (2)
patched-off :  6464.97   (101.5) (2)
patched-on  :  5354.42   ( 84.0) (2)

We could boost the expiration if we reuse the "deleted" entry at the
next entry creation.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro Horiguchi
Дата:
At Thu, 19 Nov 2020 15:23:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Wed, 18 Nov 2020 21:42:02 -0800, Andres Freund <andres@anarazel.de> wrote in 
> > Hi,
> > 
> > On 2020-11-19 14:25:36 +0900, Kyotaro Horiguchi wrote:
> > > # Creation, searching and expiration
> > > master     :  6393.23    (100.0)
> > > patched-off:  6527.94    (102.1)
> > > patched-on : 15880.01    (248.4)
> > 
> > What's the deal with this massive increase here?
> 
> CatCacheRemoveCTup(). If I replace the call to that function in the
> cleanup function with dlist_delete(), the result changes as follows:
> 
> master      :  6372.04   (100.0) (2)
> patched-off :  6464.97   (101.5) (2)
> patched-on  :  5354.42   ( 84.0) (2)
> 
> We could boost the expiration if we reuse the "deleted" entry at the
> next entry creation.

That result should be bogus. It forgot to update cc_ntup..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro Horiguchi
Дата:
Ah. It was obvious from the first.

Sorry for the sloppy diagnosis.

At Fri, 20 Nov 2020 16:08:40 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Thu, 19 Nov 2020 15:23:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > At Wed, 18 Nov 2020 21:42:02 -0800, Andres Freund <andres@anarazel.de> wrote in 
> > > Hi,
> > > 
> > > On 2020-11-19 14:25:36 +0900, Kyotaro Horiguchi wrote:
> > > > # Creation, searching and expiration
> > > > master     :  6393.23    (100.0)
> > > > patched-off:  6527.94    (102.1)
> > > > patched-on : 15880.01    (248.4)
> > > 
> > > What's the deal with this massive increase here?

catalog_cache_prune_min_age was set to 0 at the time, so almost all
catcache entries are dropped at rehashing time. Most of the difference
should be the time spent searching the system catalog again.


2020-11-20 16:25:25.988  LOG:  database system is ready to accept connections
2020-11-20 16:26:48.504  LOG:  Catcache reset
2020-11-20 16:26:48.504  LOG:  pruning catalog cache id=58 for pg_statistic: removed 0 / 257: 0.001500 ms
2020-11-20 16:26:48.504  LOG:  rehashed catalog cache id 58 for pg_statistic; 257 tups, 256 buckets, 0.020748 ms
2020-11-20 16:26:48.505  LOG:  pruning catalog cache id=58 for pg_statistic: removed 0 / 513: 0.003221 ms
2020-11-20 16:26:48.505  LOG:  rehashed catalog cache id 58 for pg_statistic; 513 tups, 512 buckets, 0.006962 ms
2020-11-20 16:26:48.505  LOG:  pruning catalog cache id=58 for pg_statistic: removed 0 / 1025: 0.006744 ms
2020-11-20 16:26:48.505  LOG:  rehashed catalog cache id 58 for pg_statistic; 1025 tups, 1024 buckets, 0.009580 ms
2020-11-20 16:26:48.507  LOG:  pruning catalog cache id=58 for pg_statistic: removed 0 / 2049: 0.015683 ms
2020-11-20 16:26:48.507  LOG:  rehashed catalog cache id 58 for pg_statistic; 2049 tups, 2048 buckets, 0.041008 ms
2020-11-20 16:26:48.509  LOG:  pruning catalog cache id=58 for pg_statistic: removed 0 / 4097: 0.042438 ms
2020-11-20 16:26:48.509  LOG:  rehashed catalog cache id 58 for pg_statistic; 4097 tups, 4096 buckets, 0.077379 ms
2020-11-20 16:26:48.515  LOG:  pruning catalog cache id=58 for pg_statistic: removed 0 / 8193: 0.123798 ms
2020-11-20 16:26:48.515  LOG:  rehashed catalog cache id 58 for pg_statistic; 8193 tups, 8192 buckets, 0.198505 ms
2020-11-20 16:26:48.525  LOG:  pruning catalog cache id=58 for pg_statistic: removed 0 / 16385: 0.180831 ms
2020-11-20 16:26:48.526  LOG:  rehashed catalog cache id 58 for pg_statistic; 16385 tups, 16384 buckets, 0.361109 ms
2020-11-20 16:26:48.546  LOG:  pruning catalog cache id=58 for pg_statistic: removed 0 / 32769: 0.717899 ms
2020-11-20 16:26:48.547  LOG:  rehashed catalog cache id 58 for pg_statistic; 32769 tups, 32768 buckets, 1.443587 ms
2020-11-20 16:26:48.588  LOG:  pruning catalog cache id=58 for pg_statistic: removed 0 / 65537: 1.204804 ms
2020-11-20 16:26:48.591  LOG:  rehashed catalog cache id 58 for pg_statistic; 65537 tups, 65536 buckets, 3.069916 ms
2020-11-20 16:26:48.674  LOG:  pruning catalog cache id=58 for pg_statistic: removed 0 / 131073: 2.707709 ms
2020-11-20 16:26:48.681  LOG:  rehashed catalog cache id 58 for pg_statistic; 131073 tups, 131072 buckets, 7.127622 ms
2020-11-20 16:26:48.848  LOG:  pruning catalog cache id=58 for pg_statistic: removed 0 / 262145: 5.895630 ms
2020-11-20 16:26:48.862  LOG:  rehashed catalog cache id 58 for pg_statistic; 262145 tups, 262144 buckets, 13.433610 ms
2020-11-20 16:26:49.195  LOG:  pruning catalog cache id=58 for pg_statistic: removed 0 / 524289: 12.302632 ms
2020-11-20 16:26:49.223  LOG:  rehashed catalog cache id 58 for pg_statistic; 524289 tups, 524288 buckets, 27.710900 ms
2020-11-20 16:26:49.937  LOG:  pruning catalog cache id=58 for pg_statistic: removed 1001000 / 1048577: 66.062629 ms
2020-11-20 16:26:51.195  LOG:  pruning catalog cache id=58 for pg_statistic: removed 1002001 / 1048577: 65.533468 ms
2020-11-20 16:26:52.413  LOG:  pruning catalog cache id=58 for pg_statistic: removed 0 / 1048577: 25.623740 ms
2020-11-20 16:26:52.468  LOG:  rehashed catalog cache id 58 for pg_statistic; 1048577 tups, 1048576 buckets, 54.314825 ms
2020-11-20 16:26:53.898  LOG:  pruning catalog cache id=58 for pg_statistic: removed 2000999 / 2097153: 134.530582 ms
2020-11-20 16:26:56.404  LOG:  pruning catalog cache id=58 for pg_statistic: removed 1002001 / 2097153: 111.634597 ms
2020-11-20 16:26:57.779  LOG:  pruning catalog cache id=58 for pg_statistic: removed 2000999 / 2097153: 134.628430 ms
2020-11-20 16:27:00.389  LOG:  pruning catalog cache id=58 for pg_statistic: removed 1002001 / 2097153: 147.221688 ms
2020-11-20 16:27:01.851  LOG:  pruning catalog cache id=58 for pg_statistic: removed 2000999 / 2097153: 177.610820 ms

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Protect syscache from bloating with negative cache entries

От
Kyotaro Horiguchi
Дата:
Hello.

The commit 4656e3d668 (debug_invalidate_system_caches_always)
conflicted with this patch. Rebased.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From ec069488fd2675369530f3f967f02a7b683f0a7f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 18 Nov 2020 16:54:31 +0900
Subject: [PATCH v6 1/3] CatCache expiration feature

---
 src/backend/access/transam/xact.c  |  3 ++
 src/backend/utils/cache/catcache.c | 87 +++++++++++++++++++++++++++++-
 src/backend/utils/misc/guc.c       | 12 +++++
 src/include/utils/catcache.h       | 19 +++++++
 4 files changed, 120 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a2068e3fd4..86888d2409 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1086,6 +1086,9 @@ static void
 AtStart_Cache(void)
 {
     AcceptInvalidationMessages();
+
+    if (xactStartTimestamp != 0)
+        SetCatCacheClock(xactStartTimestamp);
 }
 
 /*
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index fa2b49c676..644d92dd9a 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -38,6 +38,7 @@
 #include "utils/rel.h"
 #include "utils/resowner_private.h"
 #include "utils/syscache.h"
+#include "utils/timestamp.h"
 
 
  /* #define CACHEDEBUG */    /* turns DEBUG elogs on */
@@ -60,9 +61,19 @@
 #define CACHE_elog(...)
 #endif
 
+/*
+ * GUC variable to define the minimum age of entries that will be considered
+ * to be evicted in seconds. -1 to disable the feature.
+ */
+int catalog_cache_prune_min_age = -1;
+uint64    prune_min_age_us;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Clock for the last accessed time of a catcache entry. */
+uint64    catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
                                                int nkeys,
                                                Datum v1, Datum v2,
@@ -74,6 +85,7 @@ static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache,
                                                 Index hashIndex,
                                                 Datum v1, Datum v2,
                                                 Datum v3, Datum v4);
+static bool CatCacheCleanupOldEntries(CatCache *cp);
 
 static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys,
                                            Datum v1, Datum v2, Datum v3, Datum v4);
@@ -99,6 +111,15 @@ static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos,
 static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
                              Datum *srckeys, Datum *dstkeys);
 
+/* GUC assign function */
+void
+assign_catalog_cache_prune_min_age(int newval, void *extra)
+{
+    if (newval < 0)
+        prune_min_age_us = UINT64_MAX;
+    else
+        prune_min_age_us = ((uint64) newval) * USECS_PER_SEC;
+}
 
 /*
  *                    internal support functions
@@ -1264,6 +1285,9 @@ SearchCatCacheInternal(CatCache *cache,
          */
         dlist_move_head(bucket, &ct->cache_elem);
 
+        /* Record the last access timestamp */
+        ct->lastaccess = catcacheclock;
+
         /*
          * If it's a positive entry, bump its refcount and return it. If it's
          * negative, we can report failure to the caller.
@@ -1425,6 +1449,61 @@ SearchCatCacheMiss(CatCache *cache,
     return &ct->tuple;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left unused for a long time for several reasons.
+ * Remove such entries to keep the catcache from bloating. The approach is
+ * similar to buffer eviction: entries that are accessed several times within
+ * a certain period live longer than those accessed less often over the same
+ * duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+    int        nremoved = 0;
+    int        i;
+    long    oldest_ts = catcacheclock;
+    uint64    prune_threshold = catcacheclock - prune_min_age_us;
+
+    /* Scan over the whole hash to find entries to remove */
+    for (i = 0 ; i < cp->cc_nbuckets ; i++)
+    {
+        dlist_mutable_iter    iter;
+
+        dlist_foreach_modify(iter, &cp->cc_bucket[i])
+        {
+            CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+
+            /* Don't remove referenced entries */
+            if (ct->refcount == 0 &&
+                (ct->c_list == NULL || ct->c_list->refcount == 0))
+            {
+                if (ct->lastaccess < prune_threshold)
+                {
+                    CatCacheRemoveCTup(cp, ct);
+                    nremoved++;
+
+                    /* don't let the removed entry update oldest_ts */
+                    continue;
+                }
+            }
+
+            /* update the oldest timestamp if the entry remains alive */
+            if (ct->lastaccess < oldest_ts)
+                oldest_ts = ct->lastaccess;
+        }
+    }
+
+    cp->cc_oldest_ts = oldest_ts;
+
+    if (nremoved > 0)
+        elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d",
+             cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved);
+
+    return nremoved > 0;
+}
+
 /*
  *    ReleaseCatCache
  *
@@ -1888,6 +1967,7 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
     ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
+    ct->lastaccess = catcacheclock;
 
     dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
 
@@ -1899,7 +1979,12 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
      * arbitrarily, we enlarge when fill factor > 2.
      */
     if (cache->cc_ntup > cache->cc_nbuckets * 2)
-        RehashCatCache(cache);
+    {
+        /* try removing old entries before expanding hash */
+        if (catcacheclock - cache->cc_oldest_ts < prune_min_age_us ||
+            !CatCacheCleanupOldEntries(cache))
+            RehashCatCache(cache);
+    }
 
     return ct;
 }
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 17579eeaca..255e9fa73d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -88,6 +88,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
+#include "utils/catcache.h"
 #include "utils/float.h"
 #include "utils/guc_tables.h"
 #include "utils/memutils.h"
@@ -3445,6 +3446,17 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM,
+            gettext_noop("System catalog cache entries that are living unused more than this seconds are considered for removal."),
+            gettext_noop("The value of -1 turns off pruning."),
+            GUC_UNIT_S
+        },
+        &catalog_cache_prune_min_age,
+        -1, -1, INT_MAX,
+        NULL, assign_catalog_cache_prune_min_age, NULL
+    },
+
     /* End-of-list marker */
     {
         {NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index ddc2762eb3..291e857e38 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@
 
 #include "access/htup.h"
 #include "access/skey.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "utils/relcache.h"
 
@@ -61,6 +62,7 @@ typedef struct catcache
     slist_node    cc_next;        /* list link */
     ScanKeyData cc_skey[CATCACHE_MAXKEYS];    /* precomputed key info for heap
                                              * scans */
+    uint64        cc_oldest_ts;    /* timestamp (us) of the oldest tuple */
 
     /*
      * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
@@ -119,6 +121,7 @@ typedef struct catctup
     bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
+    uint64        lastaccess;        /* timestamp in us of the last usage */
 
     /*
      * The tuple may also be a member of at most one CatCList.  (If a single
@@ -189,6 +192,22 @@ typedef struct catcacheheader
 /* this extern duplicates utils/memutils.h... */
 extern PGDLLIMPORT MemoryContext CacheMemoryContext;
 
+
+/* for guc.c, not PGDLLIMPORT'ed */
+extern int catalog_cache_prune_min_age;
+
+/* source clock for access timestamp of catcache entries */
+extern uint64 catcacheclock;
+
+/* SetCatCacheClock - set catcache timestamp source clock */
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = (uint64) ts;
+}
+
+extern void assign_catalog_cache_prune_min_age(int newval, void *extra);
+
 extern void CreateCacheMemoryContext(void);
 
 extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
-- 
2.27.0
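
An aside for readers skimming the diff above: the expiration arithmetic this
patch introduces can be sketched in isolation as follows. This is an
illustrative, standalone C program, not patch code; min_age_to_usecs,
should_try_cleanup and entry_expired are made-up helper names that mirror
assign_catalog_cache_prune_min_age(), the pre-rehash check in
CatalogCacheCreateEntry(), and the per-entry test in
CatCacheCleanupOldEntries(), respectively.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define USECS_PER_SEC UINT64_C(1000000)

/* mirrors assign_catalog_cache_prune_min_age(): -1 disables pruning */
static uint64_t
min_age_to_usecs(int catalog_cache_prune_min_age)
{
    if (catalog_cache_prune_min_age < 0)
        return UINT64_MAX;      /* no entry ever becomes old enough */
    return (uint64_t) catalog_cache_prune_min_age * USECS_PER_SEC;
}

/*
 * The cleanup pass is attempted instead of rehashing only when even the
 * oldest entry in the cache has exceeded the minimum age.
 */
static bool
should_try_cleanup(uint64_t now, uint64_t oldest_ts, uint64_t min_age_us)
{
    return now - oldest_ts >= min_age_us;
}

/* an unpinned entry is pruned once its last access predates now - min_age */
static bool
entry_expired(uint64_t lastaccess, uint64_t now, uint64_t min_age_us)
{
    return lastaccess < now - min_age_us;
}

int
main(void)
{
    uint64_t min_age = min_age_to_usecs(300);   /* 300 s */
    uint64_t now = 1000 * USECS_PER_SEC;        /* arbitrary "current" clock */

    printf("try cleanup: %d\n",
           should_try_cleanup(now, 500 * USECS_PER_SEC, min_age));  /* 1 */
    printf("expired (last access 600 s): %d\n",
           entry_expired(600 * USECS_PER_SEC, now, min_age));       /* 1 */
    printf("expired (last access 900 s): %d\n",
           entry_expired(900 * USECS_PER_SEC, now, min_age));       /* 0 */
    return 0;
}

With catalog_cache_prune_min_age = -1 the minimum age maps to UINT64_MAX, so
should_try_cleanup() practically never returns true and the cache is simply
rehashed, matching what the patched CatalogCacheCreateEntry() does.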

From 95b39756890b7f53b99e20180ad1a62b450ef237 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 18 Nov 2020 16:57:05 +0900
Subject: [PATCH v6 2/3] Remove "dead" flag from catcache tuple

---
 src/backend/utils/cache/catcache.c | 43 +++++++++++++-----------------
 src/include/utils/catcache.h       | 10 -------
 2 files changed, 18 insertions(+), 35 deletions(-)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 644d92dd9a..611b65168d 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -480,6 +480,13 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
     Assert(ct->refcount == 0);
     Assert(ct->my_cache == cache);
 
+    /* delink from the bucket list if not already delinked */
+    if (ct->cache_elem.prev)
+    {
+        dlist_delete(&ct->cache_elem);
+        ct->cache_elem.prev = NULL;
+    }
+
     if (ct->c_list)
     {
         /*
@@ -487,14 +494,10 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
          * which will recurse back to me, and the recursive call will do the
          * work.  Set the "dead" flag to make sure it does recurse.
          */
-        ct->dead = true;
         CatCacheRemoveCList(cache, ct->c_list);
         return;                    /* nothing left to do */
     }
 
-    /* delink from linked list */
-    dlist_delete(&ct->cache_elem);
-
     /*
      * Free keys when we're dealing with a negative entry, normal entries just
      * point into tuple, allocated together with the CatCTup.
@@ -534,7 +537,7 @@ CatCacheRemoveCList(CatCache *cache, CatCList *cl)
         /* if the member is dead and now has no references, remove it */
         if (
 #ifndef CATCACHE_FORCE_RELEASE
-            ct->dead &&
+            ct->cache_elem.prev == NULL &&
 #endif
             ct->refcount == 0)
             CatCacheRemoveCTup(cache, ct);
@@ -609,7 +612,9 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue)
             if (ct->refcount > 0 ||
                 (ct->c_list && ct->c_list->refcount > 0))
             {
-                ct->dead = true;
+                dlist_delete(&ct->cache_elem);
+                ct->cache_elem.prev = NULL;
+
                 /* list, if any, was marked dead above */
                 Assert(ct->c_list == NULL || ct->c_list->dead);
             }
@@ -688,7 +693,8 @@ ResetCatalogCache(CatCache *cache)
             if (ct->refcount > 0 ||
                 (ct->c_list && ct->c_list->refcount > 0))
             {
-                ct->dead = true;
+                dlist_delete(&ct->cache_elem);
+                ct->cache_elem.prev = NULL;
                 /* list, if any, was marked dead above */
                 Assert(ct->c_list == NULL || ct->c_list->dead);
             }
@@ -1268,9 +1274,6 @@ SearchCatCacheInternal(CatCache *cache,
     {
         ct = dlist_container(CatCTup, cache_elem, iter.cur);
 
-        if (ct->dead)
-            continue;            /* ignore dead entries */
-
         if (ct->hash_value != hashValue)
             continue;            /* quickly skip entry if wrong hash val */
 
@@ -1522,7 +1525,6 @@ ReleaseCatCache(HeapTuple tuple)
                                   offsetof(CatCTup, tuple));
 
     /* Safety checks to ensure we were handed a cache entry */
-    Assert(ct->ct_magic == CT_MAGIC);
     Assert(ct->refcount > 0);
 
     ct->refcount--;
@@ -1530,7 +1532,7 @@ ReleaseCatCache(HeapTuple tuple)
 
     if (
 #ifndef CATCACHE_FORCE_RELEASE
-        ct->dead &&
+        ct->cache_elem.prev == NULL &&
 #endif
         ct->refcount == 0 &&
         (ct->c_list == NULL || ct->c_list->refcount == 0))
@@ -1737,8 +1739,8 @@ SearchCatCacheList(CatCache *cache,
             {
                 ct = dlist_container(CatCTup, cache_elem, iter.cur);
 
-                if (ct->dead || ct->negative)
-                    continue;    /* ignore dead and negative entries */
+                if (ct->negative)
+                    continue;    /* ignore negative entries */
 
                 if (ct->hash_value != hashValue)
                     continue;    /* quickly skip entry if wrong hash val */
@@ -1799,14 +1801,13 @@ SearchCatCacheList(CatCache *cache,
     {
         foreach(ctlist_item, ctlist)
         {
             ct = (CatCTup *) lfirst(ctlist_item);
+
+            Assert(ct->cache_elem.prev != NULL);
             Assert(ct->c_list == NULL);
             Assert(ct->refcount > 0);
             ct->refcount--;
             if (
-#ifndef CATCACHE_FORCE_RELEASE
-                ct->dead &&
-#endif
                 ct->refcount == 0 &&
                 (ct->c_list == NULL || ct->c_list->refcount == 0))
                 CatCacheRemoveCTup(cache, ct);
@@ -1834,9 +1835,6 @@ SearchCatCacheList(CatCache *cache,
         /* release the temporary refcount on the member */
         Assert(ct->refcount > 0);
         ct->refcount--;
-        /* mark list dead if any members already dead */
-        if (ct->dead)
-            cl->dead = true;
     }
     Assert(i == nmembers);
 
@@ -1960,11 +1958,9 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
      * Finish initializing the CatCTup header, and add it to the cache's
      * linked list and counts.
      */
-    ct->ct_magic = CT_MAGIC;
     ct->my_cache = cache;
     ct->c_list = NULL;
     ct->refcount = 0;            /* for the moment */
-    ct->dead = false;
     ct->negative = negative;
     ct->hash_value = hashValue;
     ct->lastaccess = catcacheclock;
@@ -2158,9 +2154,6 @@ PrintCatCacheLeakWarning(HeapTuple tuple)
     CatCTup    *ct = (CatCTup *) (((char *) tuple) -
                                   offsetof(CatCTup, tuple));
 
-    /* Safety check to ensure we were handed a cache entry */
-    Assert(ct->ct_magic == CT_MAGIC);
-
     elog(WARNING, "cache reference leak: cache %s (%d), tuple %u/%u has count %d",
          ct->my_cache->cc_relname, ct->my_cache->id,
          ItemPointerGetBlockNumber(&(tuple->t_self)),
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 291e857e38..53b0bf31eb 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -87,9 +87,6 @@ typedef struct catcache
 
 typedef struct catctup
 {
-    int            ct_magic;        /* for identifying CatCTup entries */
-#define CT_MAGIC   0x57261502
-
     uint32        hash_value;        /* hash value for this tuple's keys */
 
     /*
@@ -106,19 +103,12 @@ typedef struct catctup
     dlist_node    cache_elem;        /* list member of per-bucket list */
 
     /*
-     * A tuple marked "dead" must not be returned by subsequent searches.
-     * However, it won't be physically deleted from the cache until its
-     * refcount goes to zero.  (If it's a member of a CatCList, the list's
-     * refcount must go to zero, too; also, remember to mark the list dead at
-     * the same time the tuple is marked.)
-     *
      * A negative cache entry is an assertion that there is no tuple matching
      * a particular key.  This is just as useful as a normal entry so far as
      * avoiding catalog searches is concerned.  Management of positive and
      * negative entries is identical.
      */
     int            refcount;        /* number of active references */
-    bool        dead;            /* dead but not yet removed? */
     bool        negative;        /* negative cache entry? */
     HeapTupleData tuple;        /* tuple management header */
     uint64        lastaccess;        /* timestamp in us of the last usage */
-- 
2.27.0
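
An aside on the core idea of this patch: the separate "dead" flag is replaced
by using cache_elem.prev == NULL as the "already delinked from its bucket"
marker. A toy standalone sketch of that pattern follows (node, entry,
invalidate and removable are made-up names; this is not the PostgreSQL dlist
API, just an illustration of the marker).

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* toy doubly linked node standing in for the dlist_node cache_elem */
typedef struct node
{
    struct node *prev;
    struct node *next;
} node;

typedef struct entry
{
    node    cache_elem;         /* bucket list link */
    int     refcount;           /* pins held by callers */
} entry;

/*
 * Delink the entry from its bucket; prev == NULL afterwards plays the role
 * the "dead" flag used to play.  A still-pinned entry stays allocated and
 * is freed by whoever drops the last pin.
 */
static void
invalidate(entry *e)
{
    if (e->cache_elem.prev != NULL)
    {
        e->cache_elem.prev->next = e->cache_elem.next;
        if (e->cache_elem.next != NULL)
            e->cache_elem.next->prev = e->cache_elem.prev;
        e->cache_elem.prev = NULL;   /* "dead": unreachable from the bucket */
    }
}

static bool
removable(const entry *e)
{
    /* equivalent of the old test: dead (now: delinked) and unreferenced */
    return e->cache_elem.prev == NULL && e->refcount == 0;
}

int
main(void)
{
    node    head = {NULL, NULL};
    entry   e = {{&head, NULL}, 1};

    head.next = &e.cache_elem;

    invalidate(&e);                              /* delinked while pinned */
    printf("removable: %d\n", removable(&e));    /* 0: still referenced */

    e.refcount--;                                /* last pin released */
    printf("removable: %d\n", removable(&e));    /* 1 */
    return 0;
}

In other words, an entry invalidated while still referenced needs no extra
flag: being delinked is the marker, and ReleaseCatCache() finishes the removal
once the last reference is dropped.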

From e706934b35f6d6df20c09532d3c53a520cd704cc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 18 Nov 2020 16:56:41 +0900
Subject: [PATCH v6 3/3] catcachebench

---
 contrib/catcachebench/Makefile               |  17 +
 contrib/catcachebench/catcachebench--0.0.sql |  14 +
 contrib/catcachebench/catcachebench.c        | 330 +++++++++++++++++++
 contrib/catcachebench/catcachebench.control  |   6 +
 src/backend/utils/cache/catcache.c           |  33 ++
 src/backend/utils/cache/syscache.c           |   2 +-
 6 files changed, 401 insertions(+), 1 deletion(-)
 create mode 100644 contrib/catcachebench/Makefile
 create mode 100644 contrib/catcachebench/catcachebench--0.0.sql
 create mode 100644 contrib/catcachebench/catcachebench.c
 create mode 100644 contrib/catcachebench/catcachebench.control

diff --git a/contrib/catcachebench/Makefile b/contrib/catcachebench/Makefile
new file mode 100644
index 0000000000..0478818b25
--- /dev/null
+++ b/contrib/catcachebench/Makefile
@@ -0,0 +1,17 @@
+MODULE_big = catcachebench
+OBJS = catcachebench.o
+
+EXTENSION = catcachebench
+DATA = catcachebench--0.0.sql
+PGFILEDESC = "catcachebench - benchmark for catcache pruning feature"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/catcachebench
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/catcachebench/catcachebench--0.0.sql b/contrib/catcachebench/catcachebench--0.0.sql
new file mode 100644
index 0000000000..ea9cd62abb
--- /dev/null
+++ b/contrib/catcachebench/catcachebench--0.0.sql
@@ -0,0 +1,14 @@
+/* contrib/catcachebench/catcachebench--0.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION catcachebench" to load this file. \quit
+
+CREATE FUNCTION catcachebench(IN type int)
+RETURNS double precision
+AS 'MODULE_PATHNAME', 'catcachebench'
+LANGUAGE C STRICT VOLATILE;
+
+CREATE FUNCTION catcachereadstats(OUT catid int, OUT reloid oid, OUT searches bigint, OUT hits bigint, OUT neg_hits bigint)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'catcachereadstats'
+LANGUAGE C STRICT VOLATILE;
diff --git a/contrib/catcachebench/catcachebench.c b/contrib/catcachebench/catcachebench.c
new file mode 100644
index 0000000000..b5a4d794ed
--- /dev/null
+++ b/contrib/catcachebench/catcachebench.c
@@ -0,0 +1,330 @@
+/*
+ * catcachebench: test code for cache pruning feature
+ */
+/* #define CATCACHE_STATS */
+#include "postgres.h"
+#include "catalog/pg_type.h"
+#include "catalog/pg_statistic.h"
+#include "executor/spi.h"
+#include "funcapi.h"
+#include "libpq/pqsignal.h"
+#include "utils/catcache.h"
+#include "utils/syscache.h"
+#include "utils/timestamp.h"
+
+Oid        tableoids[10000];
+int        ntables = 0;
+int16    attnums[1000];
+int        natts = 0;
+
+PG_MODULE_MAGIC;
+
+double catcachebench1(void);
+double catcachebench2(void);
+double catcachebench3(void);
+void collectinfo(void);
+void catcachewarmup(void);
+
+PG_FUNCTION_INFO_V1(catcachebench);
+PG_FUNCTION_INFO_V1(catcachereadstats);
+
+extern void CatalogCacheFlushCatalog2(Oid catId);
+extern int64 catcache_called;
+extern CatCache *SysCache[];
+
+typedef struct catcachestatsstate
+{
+    TupleDesc tupd;
+    int          catId;
+} catcachestatsstate;
+
+Datum
+catcachereadstats(PG_FUNCTION_ARGS)
+{
+    catcachestatsstate *state_data = NULL;
+    FuncCallContext *fctx;
+
+    if (SRF_IS_FIRSTCALL())
+    {
+        TupleDesc    tupdesc;
+        MemoryContext mctx;
+
+        fctx = SRF_FIRSTCALL_INIT();
+        mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+        state_data = palloc(sizeof(catcachestatsstate));
+
+        /* Build a tuple descriptor for our result type */
+        if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+            elog(ERROR, "return type must be a row type");
+
+        state_data->tupd = tupdesc;
+        state_data->catId = 0;
+
+        fctx->user_fctx = state_data;
+
+        MemoryContextSwitchTo(mctx);
+    }
+
+    fctx = SRF_PERCALL_SETUP();
+    state_data = fctx->user_fctx;
+
+    if (state_data->catId < SysCacheSize)
+    {
+        Datum    values[5];
+        bool    nulls[5];
+        HeapTuple    resulttup;
+        Datum    result;
+        int        catId = state_data->catId++;
+
+        memset(nulls, 0, sizeof(nulls));
+        memset(values, 0, sizeof(values));
+        values[0] = Int16GetDatum(catId);
+        values[1] = ObjectIdGetDatum(SysCache[catId]->cc_reloid);
+#ifdef CATCACHE_STATS        
+        values[2] = Int64GetDatum(SysCache[catId]->cc_searches);
+        values[3] = Int64GetDatum(SysCache[catId]->cc_hits);
+        values[4] = Int64GetDatum(SysCache[catId]->cc_neg_hits);
+#endif
+        resulttup = heap_form_tuple(state_data->tupd, values, nulls);
+        result = HeapTupleGetDatum(resulttup);
+
+        SRF_RETURN_NEXT(fctx, result);
+    }
+
+    SRF_RETURN_DONE(fctx);
+}
+
+Datum
+catcachebench(PG_FUNCTION_ARGS)
+{
+    int        testtype = PG_GETARG_INT32(0);
+    double    ms;
+
+    collectinfo();
+
+    /* flush the catalog -- safe? don't mind. */
+    CatalogCacheFlushCatalog2(StatisticRelationId);
+
+    switch (testtype)
+    {
+    case 0:
+        catcachewarmup(); /* prewarm of syscatalog */
+        PG_RETURN_NULL();
+    case 1:
+        ms = catcachebench1(); break;
+    case 2:
+        ms = catcachebench2(); break;
+    case 3:
+        ms = catcachebench3(); break;
+    default:
+        elog(ERROR, "Invalid test type: %d", testtype);
+    }
+
+    PG_RETURN_DATUM(Float8GetDatum(ms));
+}
+
+/*
+ * fetch all attribute entries of all tables.
+ */
+double
+catcachebench1(void)
+{
+    int t, a;
+    instr_time    start,
+                duration;
+
+    PG_SETMASK(&BlockSig);
+    INSTR_TIME_SET_CURRENT(start);
+    for (t = 0 ; t < ntables ; t++)
+    {
+        for (a = 0 ; a < natts ; a++)
+        {
+            HeapTuple tup;
+
+            tup = SearchSysCache3(STATRELATTINH,
+                                  ObjectIdGetDatum(tableoids[t]),
+                                  Int16GetDatum(attnums[a]),
+                                  BoolGetDatum(false));
+            /* should be null, but.. */
+            if (HeapTupleIsValid(tup))
+                ReleaseSysCache(tup);
+        }
+    }
+    INSTR_TIME_SET_CURRENT(duration);
+    INSTR_TIME_SUBTRACT(duration, start);
+    PG_SETMASK(&UnBlockSig);
+
+    return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+/*
+ * fetch all attribute entries of a table 240000 times.
+ */
+double
+catcachebench2(void)
+{
+    int t, a;
+    instr_time    start,
+                duration;
+
+    PG_SETMASK(&BlockSig);
+    INSTR_TIME_SET_CURRENT(start);
+    for (t = 0 ; t < 240000 ; t++)
+    {
+        for (a = 0 ; a < natts ; a++)
+        {
+            HeapTuple tup;
+
+            tup = SearchSysCache3(STATRELATTINH,
+                                  ObjectIdGetDatum(tableoids[0]),
+                                  Int16GetDatum(attnums[a]),
+                                  BoolGetDatum(false));
+            /* should be null, but.. */
+            if (HeapTupleIsValid(tup))
+                ReleaseSysCache(tup);
+        }
+    }
+    INSTR_TIME_SET_CURRENT(duration);
+    INSTR_TIME_SUBTRACT(duration, start);
+    PG_SETMASK(&UnBlockSig);
+
+    return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+/*
+ * fetch all attribute entries of all tables several times while letting
+ * expiration happen.
+ */
+double
+catcachebench3(void)
+{
+    const int clock_step = 1000;
+    int i, t, a;
+    instr_time    start,
+                duration;
+
+    PG_SETMASK(&BlockSig);
+    INSTR_TIME_SET_CURRENT(start);
+    for (i = 0 ; i < 4 ; i++)
+    {
+        int ct = clock_step;
+
+        for (t = 0 ; t < ntables ; t++)
+        {
+            /*
+             * catcacheclock is normally updated from the transaction timestamp,
+             * so it needs to be advanced by other means for this test to work.
+             * Here the clock is updated every 1000 table scans.
+             */
+            if (--ct < 0)
+            {
+                SetCatCacheClock(GetCurrentTimestamp());
+                ct = clock_step;
+            }
+            for (a = 0 ; a < natts ; a++)
+            {
+                HeapTuple tup;
+
+                tup = SearchSysCache3(STATRELATTINH,
+                                      ObjectIdGetDatum(tableoids[t]),
+                                      Int16GetDatum(attnums[a]),
+                                      BoolGetDatum(false));
+                /* should be null, but.. */
+                if (HeapTupleIsValid(tup))
+                    ReleaseSysCache(tup);
+            }
+        }
+    }
+    INSTR_TIME_SET_CURRENT(duration);
+    INSTR_TIME_SUBTRACT(duration, start);
+    PG_SETMASK(&UnBlockSig);
+
+    return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+void
+catcachewarmup(void)
+{
+    int t, a;
+
+    /* load up catalog tables */
+    for (t = 0 ; t < ntables ; t++)
+    {
+        for (a = 0 ; a < natts ; a++)
+        {
+            HeapTuple tup;
+
+            tup = SearchSysCache3(STATRELATTINH,
+                                  ObjectIdGetDatum(tableoids[t]),
+                                  Int16GetDatum(attnums[a]),
+                                  BoolGetDatum(false));
+            /* should be null, but.. */
+            if (HeapTupleIsValid(tup))
+                ReleaseSysCache(tup);
+        }
+    }
+}
+
+void
+collectinfo(void)
+{
+    int ret;
+    Datum    values[10000];
+    bool    nulls[10000];
+    Oid        types0[] = {OIDOID};
+    int i;
+
+    ntables = 0;
+    natts = 0;
+
+    SPI_connect();
+    /* collect target tables */
+    ret = SPI_execute("select oid from pg_class where relnamespace = (select oid from pg_namespace where nspname = \'test\')",
+                      true, 0);
+    if (ret != SPI_OK_SELECT)
+        elog(ERROR, "Failed 1");
+    if (SPI_processed == 0)
+        elog(ERROR, "no relation found in schema \"test\"");
+    if (SPI_processed > 10000)
+        elog(ERROR, "too many relation found in schema \"test\"");
+
+    for (i = 0 ; i < SPI_processed ; i++)
+    {
+        heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc,
+                          values, nulls);
+        if (nulls[0])
+            elog(ERROR, "Failed 2");
+
+        tableoids[ntables++] = DatumGetObjectId(values[0]);
+    }
+    SPI_finish();
+    elog(DEBUG1, "%d tables found", ntables);
+
+    values[0] = ObjectIdGetDatum(tableoids[0]);
+    nulls[0] = false;
+    SPI_connect();
+    ret = SPI_execute_with_args("select attnum from pg_attribute where attrelid = (select oid from pg_class where oid = $1)",
+                                1, types0, values, NULL, true, 0);
+    if (SPI_processed == 0)
+        elog(ERROR, "no attribute found in table %d", tableoids[0]);
+    if (SPI_processed > 10000)
+        elog(ERROR, "too many relation found in table %d", tableoids[0]);
+    
+    /* collect target attributes. assuming all tables have the same attnums */
+    for (i = 0 ; i < SPI_processed ; i++)
+    {
+        int16 attnum;
+
+        heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc,
+                          values, nulls);
+        if (nulls[0])
+            elog(ERROR, "Failed 3");
+        attnum = DatumGetInt16(values[0]);
+
+        if (attnum > 0)
+            attnums[natts++] = attnum;
+    }
+    SPI_finish();
+    elog(DEBUG1, "%d attributes found", natts);
+}
diff --git a/contrib/catcachebench/catcachebench.control b/contrib/catcachebench/catcachebench.control
new file mode 100644
index 0000000000..3fc9d2e420
--- /dev/null
+++ b/contrib/catcachebench/catcachebench.control
@@ -0,0 +1,6 @@
+# catcachebench
+
+comment = 'benchmark for catcache pruning'
+default_version = '0.0'
+module_pathname = '$libdir/catcachebench'
+relocatable = true
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 611b65168d..aabea861ce 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -767,6 +767,39 @@ CatalogCacheFlushCatalog(Oid catId)
     CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call");
 }
 
+
+/* FUNCTION FOR BENCHMARKING */
+void
+CatalogCacheFlushCatalog2(Oid catId)
+{
+    slist_iter    iter;
+
+    CACHE_elog(DEBUG2, "CatalogCacheFlushCatalog called for %u", catId);
+
+    slist_foreach(iter, &CacheHdr->ch_caches)
+    {
+        CatCache   *cache = slist_container(CatCache, cc_next, iter.cur);
+
+        /* Does this cache store tuples of the target catalog? */
+        if (cache->cc_reloid == catId)
+        {
+            /* Yes, so flush all its contents */
+            ResetCatalogCache(cache);
+
+            /* Tell inval.c to call syscache callbacks for this cache */
+            CallSyscacheCallbacks(cache->id, 0);
+
+            cache->cc_nbuckets = 128;
+            pfree(cache->cc_bucket);
+            cache->cc_bucket = palloc0(128 * sizeof(dlist_head));
+            elog(LOG, "Catcache reset");
+        }
+    }
+
+    CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call");
+}
+/* END: FUNCTION FOR BENCHMARKING */
+
 /*
  *        InitCatCache
  *
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index e4dc4ee34e..b60416ec63 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -994,7 +994,7 @@ static const struct cachedesc cacheinfo[] = {
     }
 };
 
-static CatCache *SysCache[SysCacheSize];
+CatCache *SysCache[SysCacheSize];
 
 static bool CacheInitialized = false;
 
-- 
2.27.0


Re: Protect syscache from bloating with negative cache entries

From
Heikki Linnakangas
Date:
Hi,

On 19/11/2020 07:25, Kyotaro Horiguchi wrote:
> Performance measurement on the attached showed better result about
> searching but maybe worse for cache entry creation.  Each time number
> is the mean of 10 runs.
> 
> # Cacache (negative) entry creation
>             :  time(ms) (% to master)
> master     :  3965.61    (100.0)
> patched-off:  4040.93    (101.9)
> patched-on :  4032.22    (101.7)
> 
> # Searching negative cache entries
> master     :  8173.46    (100.0)
> patched-off:  7983.43    ( 97.7)
> patched-on :  8049.88    ( 98.5)
> 
> # Creation, searching and expiration
> master     :  6393.23    (100.0)
> patched-off:  6527.94    (102.1)
> patched-on : 15880.01    (248.4)
> 
> 
> That is, catcache searching gets faster by 2-3% but creation gets
> slower by about 2%. If I moved the condition of 2 further up to
> CatalogCacheCreateEntry(), that degradation reduced to 0.6%.
> 
> # Cacache (negative) entry creation
> master      :  3967.45   (100.0)
> patched-off :  3990.43   (100.6)
> patched-on  :  4108.96   (103.6)
> 
> # Searching negative cache entries
> master      :  8106.53   (100.0)
> patched-off :  8036.61   ( 99.1)
> patched-on  :  8058.18   ( 99.4)
> 
> # Creation, searching and expiration
> master      :  6395.00   (100.0)
> patched-off :  6416.57   (100.3)
> patched-on  : 15830.91   (247.6)

Can you share the exact script or steps to reproduce these numbers? I 
presume these are from the catcachebench extension, but I can't figure 
out which scenario above corresponds to which catcachebench test. Also, 
catcachebench seems to depend on a bunch of tables being created in 
schema called "test"; what tables did you use for the above numbers?

- Heikki



Re: Protect syscache from bloating with negative cache entries

From
Kyotaro Horiguchi
Date:
At Tue, 26 Jan 2021 11:43:21 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in 
> Hi,
> 
> On 19/11/2020 07:25, Kyotaro Horiguchi wrote:
> > Performance measurement on the attached showed better result about
> > searching but maybe worse for cache entry creation.  Each time number
> > is the mean of 10 runs.
> > # Cacache (negative) entry creation
> >             :  time(ms) (% to master)
> > master     :  3965.61    (100.0)
> > patched-off:  4040.93    (101.9)
> > patched-on :  4032.22    (101.7)
> > # Searching negative cache entries
> > master     :  8173.46    (100.0)
> > patched-off:  7983.43    ( 97.7)
> > patched-on :  8049.88    ( 98.5)
> > # Creation, searching and expiration
> > master     :  6393.23    (100.0)
> > patched-off:  6527.94    (102.1)
> > patched-on : 15880.01    (248.4)
> > That is, catcache searching gets faster by 2-3% but creation gets
> > slower by about 2%. If I moved the condition of 2 further up to
> > CatalogCacheCreateEntry(), that degradation reduced to 0.6%.
> > # Cacache (negative) entry creation
> > master      :  3967.45   (100.0)
> > patched-off :  3990.43   (100.6)
> > patched-on  :  4108.96   (103.6)
> > # Searching negative cache entries
> > master      :  8106.53   (100.0)
> > patched-off :  8036.61   ( 99.1)
> > patched-on  :  8058.18   ( 99.4)
> > # Creation, searching and expiration
> > master      :  6395.00   (100.0)
> > patched-off :  6416.57   (100.3)
> > patched-on  : 15830.91   (247.6)
> 
> Can you share the exact script or steps to reproduce these numbers? I
> presume these are from the catcachebench extension, but I can't figure
> out which scenario above corresponds to which catcachebench
> test. Also, catcachebench seems to depend on a bunch of tables being
> created in schema called "test"; what tables did you use for the above
> numbers?

Use gen_tbl.pl to generate the tables, then run2.sh to run the
benchmark, and sumlog.pl to summarize the result of run2.sh.

$ ./gen_tbl.pl | psql postgres
$ ./run2.sh | tee rawresult.txt | ./sumlog.pl

(I found a bug in a benchmark-aid function
(CatalogCacheFlushCatalog2); I will repost an updated version soon.)

A brief explanation follows, since the scripts are somewhat rough.

run2.sh:
  LOOPS    : # of execution of catcachebench() in a single run
  USES     : Take the average of this number of fastest executions in a
             single run.
  BINROOT  : Common parent directory of target binaries.
  DATADIR  : Data directory. (shared by all binaries)
  PREC     : FP format for time and percentage in a result.
  TESTS    : comma-separated numbers given to catcachebench.
  
 The "run" function spec

   run "binary-label" <binary-path> <A> <B> <C>
     where A, B and C are values for catalog_cache_prune_min_age. ""
     means no setting (used for the master binary). Currently only C
     takes effect, but all three must be non-empty strings for the
     setting to be applied.

 The result output is:

   test |   version   |  n  |    r     | stddev  
  ------+-------------+-----+----------+---------
      1 | patched-off | 1/3 | 14211.96 |  261.19

   test   : test number passed to catcachebench()
   version: binary label given to the run function
   n      : USES / LOOPS
   r      : average time of catcachebench() in milliseconds
   stddev : standard deviation of the measured times

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

#! /usr/bin/perl
$collist = "";
foreach $i (0..1000) {
    $collist .= sprintf(", c%05d int", $i);
}
$collist = substr($collist, 2);

printf "drop schema if exists test cascade;\n";
printf "create schema test;\n";
foreach $i (0..2999) {
    printf "create table test.t%04d ($collist);\n", $i;
}
#!/bin/bash
LOOPS=3
USES=1
TESTS=1,2,3
BINROOT=/home/horiguti/bin
DATADIR=/home/horiguti/data/data_work
PREC="numeric(10,2)"

/usr/bin/killall postgres
/usr/bin/sleep 3

run() {
    local BINARY=$1
    local PGCTL=$2/bin/pg_ctl
    local PGSQL=$2/bin/postgres
    local PSQL=$2/bin/psql

    if [ "$3" != "" ]; then
      local SETTING1="set catalog_cache_prune_min_age to \"$3\";"
      local SETTING2="set catalog_cache_prune_min_age to \"$4\";"
      local SETTING3="set catalog_cache_prune_min_age to \"$5\";"
    fi

#    ($PGSQL -D $DATADIR 2>&1 > /dev/null)&
    ($PGSQL -D $DATADIR 2>&1 > /dev/null | /usr/bin/sed -e 's/^/# /')&
    /usr/bin/sleep 3
    ${PSQL} postgres <<EOF
create extension if not exists catcachebench;
select catcachebench(0);

$SETTING3

select distinct * from unnest(ARRAY[${TESTS}]) as test,
LATERAL 
  (select '${BINARY}' as version,
          '${USES}/' || (count(r) OVER())::text as n,
          (avg(r) OVER ())::${PREC},
          (stddev(r) OVER ())::${PREC}
   from (select catcachebench(test) as r
         from generate_series(1, ${LOOPS})) r
   order by r limit ${USES}) r

EOF
    $PGCTL --pgdata=$DATADIR stop 2>&1 > /dev/null | /usr/bin/sed -e 's/^/# /'

#    oreport > $BINARY_perf.txt
}

for i in $(seq 0 2); do
run "patched-off" $BINROOT/pgsql_catexp "-1" "-1" "-1"
run "patched-on" $BINROOT/pgsql_catexp "0" "0" "0"
run "master" $BINROOT/pgsql_master_o2 "" "" ""
done

#! /usr/bin/perl

while (<STDIN>) {
#    if (/^\s+([0-9])\s*\|\s*(\w+)\s*\|\s*(\S+)\s*\|\s*([\d.]+)\s*\|\s*(\w+)\s*$/) {
    if (/^\s+([0-9])\s*\|\s*(\S+)\s*\|\s*(\S+)\s*\|\s*([\d.]+)\s*\|\s*([\d.]+)\s*$/) {
        $test = $1;
        $bin = $2;
        $time = $4;
        if (defined $sum{$test}{$bin}) {
            $sum{$test}{$bin} += $time;
            $num{$test}{$bin}++;
        } else {
            $sum{$test}{$bin} = 0;
            $num{$test}{$bin} = 0;
        }
    }
}

foreach $t (sort {$a cmp $b} keys %sum) {
    $master{$t} = $sum{$t}{master} / $num{$t}{master};
}

foreach $t (sort {$a cmp $b} keys %sum) {
    foreach $b (sort {$a cmp $b} keys %{$sum{$t}}) {
        $mean = $sum{$t}{$b} / $num{$t}{$b};
        $ratio = 100.0 * $mean / $master{$t};
        printf("%-13s : %8.2f   (%5.1f) (%d)\n", "$t:$b", $mean, $ratio, $num{$t}{$b});
    }
}
diff --git a/contrib/catcachebench/Makefile b/contrib/catcachebench/Makefile
new file mode 100644
index 0000000000..0478818b25
--- /dev/null
+++ b/contrib/catcachebench/Makefile
@@ -0,0 +1,17 @@
+MODULE_big = catcachebench
+OBJS = catcachebench.o
+
+EXTENSION = catcachebench
+DATA = catcachebench--0.0.sql
+PGFILEDESC = "catcachebench - benchmark for catcache pruning feature"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/catcachebench
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/catcachebench/catcachebench--0.0.sql b/contrib/catcachebench/catcachebench--0.0.sql
new file mode 100644
index 0000000000..ea9cd62abb
--- /dev/null
+++ b/contrib/catcachebench/catcachebench--0.0.sql
@@ -0,0 +1,14 @@
+/* contrib/catcachebench/catcachebench--0.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION catcachebench" to load this file. \quit
+
+CREATE FUNCTION catcachebench(IN type int)
+RETURNS double precision
+AS 'MODULE_PATHNAME', 'catcachebench'
+LANGUAGE C STRICT VOLATILE;
+
+CREATE FUNCTION catcachereadstats(OUT catid int, OUT reloid oid, OUT searches bigint, OUT hits bigint, OUT neg_hits bigint)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'catcachereadstats'
+LANGUAGE C STRICT VOLATILE;
diff --git a/contrib/catcachebench/catcachebench.c b/contrib/catcachebench/catcachebench.c
new file mode 100644
index 0000000000..f93d60e721
--- /dev/null
+++ b/contrib/catcachebench/catcachebench.c
@@ -0,0 +1,338 @@
+/*
+ * catcachebench: test code for cache pruning feature
+ */
+/* #define CATCACHE_STATS */
+#include "postgres.h"
+#include "catalog/pg_type.h"
+#include "catalog/pg_statistic.h"
+#include "executor/spi.h"
+#include "funcapi.h"
+#include "libpq/pqsignal.h"
+#include "utils/catcache.h"
+#include "utils/syscache.h"
+#include "utils/timestamp.h"
+
+Oid        tableoids[10000];
+int        ntables = 0;
+int16    attnums[1000];
+int        natts = 0;
+
+PG_MODULE_MAGIC;
+
+double catcachebench1(void);
+double catcachebench2(void);
+double catcachebench3(void);
+void collectinfo(void);
+void catcachewarmup(void);
+
+PG_FUNCTION_INFO_V1(catcachebench);
+PG_FUNCTION_INFO_V1(catcachereadstats);
+
+extern void CatalogCacheFlushCatalog2(Oid catId);
+extern int64 catcache_called;
+extern CatCache *SysCache[];
+
+typedef struct catcachestatsstate
+{
+    TupleDesc tupd;
+    int          catId;
+} catcachestatsstate;
+
+Datum
+catcachereadstats(PG_FUNCTION_ARGS)
+{
+    catcachestatsstate *state_data = NULL;
+    FuncCallContext *fctx;
+
+    if (SRF_IS_FIRSTCALL())
+    {
+        TupleDesc    tupdesc;
+        MemoryContext mctx;
+
+        fctx = SRF_FIRSTCALL_INIT();
+        mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+        state_data = palloc(sizeof(catcachestatsstate));
+
+        /* Build a tuple descriptor for our result type */
+        if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+            elog(ERROR, "return type must be a row type");
+
+        state_data->tupd = tupdesc;
+        state_data->catId = 0;
+
+        fctx->user_fctx = state_data;
+
+        MemoryContextSwitchTo(mctx);
+    }
+
+    fctx = SRF_PERCALL_SETUP();
+    state_data = fctx->user_fctx;
+
+    if (state_data->catId < SysCacheSize)
+    {
+        Datum    values[5];
+        bool    nulls[5];
+        HeapTuple    resulttup;
+        Datum    result;
+        int        catId = state_data->catId++;
+
+        memset(nulls, 0, sizeof(nulls));
+        memset(values, 0, sizeof(values));
+        values[0] = Int16GetDatum(catId);
+        values[1] = ObjectIdGetDatum(SysCache[catId]->cc_reloid);
+#ifdef CATCACHE_STATS        
+        values[2] = Int64GetDatum(SysCache[catId]->cc_searches);
+        values[3] = Int64GetDatum(SysCache[catId]->cc_hits);
+        values[4] = Int64GetDatum(SysCache[catId]->cc_neg_hits);
+#endif
+        resulttup = heap_form_tuple(state_data->tupd, values, nulls);
+        result = HeapTupleGetDatum(resulttup);
+
+        SRF_RETURN_NEXT(fctx, result);
+    }
+
+    SRF_RETURN_DONE(fctx);
+}
+
+Datum
+catcachebench(PG_FUNCTION_ARGS)
+{
+    int        testtype = PG_GETARG_INT32(0);
+    double    ms;
+
+    collectinfo();
+
+    /* flush the catalog -- safe? don't mind. */
+    CatalogCacheFlushCatalog2(StatisticRelationId);
+
+    switch (testtype)
+    {
+    case 0:
+        catcachewarmup(); /* prewarm of syscatalog */
+        PG_RETURN_NULL();
+    case 1:
+        ms = catcachebench1(); break;
+    case 2:
+        ms = catcachebench2(); break;
+    case 3:
+        ms = catcachebench3(); break;
+    default:
+        elog(ERROR, "Invalid test type: %d", testtype);
+    }
+
+    PG_RETURN_DATUM(Float8GetDatum(ms));
+}
+
+/*
+ * fetch all attribute entries of all tables.
+ */
+double
+catcachebench1(void)
+{
+    int t, a;
+    instr_time    start,
+                duration;
+
+    PG_SETMASK(&BlockSig);
+    INSTR_TIME_SET_CURRENT(start);
+    for (t = 0 ; t < ntables ; t++)
+    {
+        for (a = 0 ; a < natts ; a++)
+        {
+            HeapTuple tup;
+
+            tup = SearchSysCache3(STATRELATTINH,
+                                  ObjectIdGetDatum(tableoids[t]),
+                                  Int16GetDatum(attnums[a]),
+                                  BoolGetDatum(false));
+            /* should be null, but.. */
+            if (HeapTupleIsValid(tup))
+                ReleaseSysCache(tup);
+        }
+    }
+    INSTR_TIME_SET_CURRENT(duration);
+    INSTR_TIME_SUBTRACT(duration, start);
+    PG_SETMASK(&UnBlockSig);
+
+    return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+/*
+ * fetch all attribute entries of a table 240000 times.
+ */
+double
+catcachebench2(void)
+{
+    int t, a;
+    instr_time    start,
+                duration;
+
+    PG_SETMASK(&BlockSig);
+    INSTR_TIME_SET_CURRENT(start);
+    for (t = 0 ; t < 240000 ; t++)
+    {
+        for (a = 0 ; a < natts ; a++)
+        {
+            HeapTuple tup;
+
+            tup = SearchSysCache3(STATRELATTINH,
+                                  ObjectIdGetDatum(tableoids[0]),
+                                  Int16GetDatum(attnums[a]),
+                                  BoolGetDatum(false));
+            /* should be null, but.. */
+            if (HeapTupleIsValid(tup))
+                ReleaseSysCache(tup);
+        }
+    }
+    INSTR_TIME_SET_CURRENT(duration);
+    INSTR_TIME_SUBTRACT(duration, start);
+    PG_SETMASK(&UnBlockSig);
+
+    return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+/* SetCatCacheClock - set catcache timestamp source clock */
+uint64 catcacheclock;
+static inline void
+SetCatCacheClock(TimestampTz ts)
+{
+    catcacheclock = (uint64) ts;
+}
+
+/*
+ * fetch all attribute entries of all tables several times while letting
+ * expiration happen.
+ */
+double
+catcachebench3(void)
+{
+    const int clock_step = 1000;
+    int i, t, a;
+    instr_time    start,
+                duration;
+
+    PG_SETMASK(&BlockSig);
+    INSTR_TIME_SET_CURRENT(start);
+    for (i = 0 ; i < 4 ; i++)
+    {
+        int ct = clock_step;
+
+        for (t = 0 ; t < ntables ; t++)
+        {
+            /*
+             * catcacheclock is normally updated from the transaction timestamp,
+             * so it needs to be advanced by other means for this test to work.
+             * Here the clock is updated every 1000 table scans.
+             */
+            if (--ct < 0)
+            {
+                SetCatCacheClock(GetCurrentTimestamp());
+                ct = clock_step;
+            }
+            for (a = 0 ; a < natts ; a++)
+            {
+                HeapTuple tup;
+
+                tup = SearchSysCache3(STATRELATTINH,
+                                      ObjectIdGetDatum(tableoids[t]),
+                                      Int16GetDatum(attnums[a]),
+                                      BoolGetDatum(false));
+                /* should be null, but.. */
+                if (HeapTupleIsValid(tup))
+                    ReleaseSysCache(tup);
+            }
+        }
+    }
+    INSTR_TIME_SET_CURRENT(duration);
+    INSTR_TIME_SUBTRACT(duration, start);
+    PG_SETMASK(&UnBlockSig);
+
+    return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+void
+catcachewarmup(void)
+{
+    int t, a;
+
+    /* load up catalog tables */
+    for (t = 0 ; t < ntables ; t++)
+    {
+        for (a = 0 ; a < natts ; a++)
+        {
+            HeapTuple tup;
+
+            tup = SearchSysCache3(STATRELATTINH,
+                                  ObjectIdGetDatum(tableoids[t]),
+                                  Int16GetDatum(attnums[a]),
+                                  BoolGetDatum(false));
+            /* should be null, but.. */
+            if (HeapTupleIsValid(tup))
+                ReleaseSysCache(tup);
+        }
+    }
+}
+
+void
+collectinfo(void)
+{
+    int ret;
+    Datum    values[10000];
+    bool    nulls[10000];
+    Oid        types0[] = {OIDOID};
+    int i;
+
+    ntables = 0;
+    natts = 0;
+
+    SPI_connect();
+    /* collect target tables */
+    ret = SPI_execute("select oid from pg_class where relnamespace = (select oid from pg_namespace where nspname = \'test\')",
+                      true, 0);
+    if (ret != SPI_OK_SELECT)
+        elog(ERROR, "Failed 1");
+    if (SPI_processed == 0)
+        elog(ERROR, "no relation found in schema \"test\"");
+    if (SPI_processed > 10000)
+        elog(ERROR, "too many relation found in schema \"test\"");
+
+    for (i = 0 ; i < SPI_processed ; i++)
+    {
+        heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc,
+                          values, nulls);
+        if (nulls[0])
+            elog(ERROR, "Failed 2");
+
+        tableoids[ntables++] = DatumGetObjectId(values[0]);
+    }
+    SPI_finish();
+    elog(DEBUG1, "%d tables found", ntables);
+
+    values[0] = ObjectIdGetDatum(tableoids[0]);
+    nulls[0] = false;
+    SPI_connect();
+    ret = SPI_execute_with_args("select attnum from pg_attribute where attrelid = (select oid from pg_class where oid = $1)",
+                                1, types0, values, NULL, true, 0);
+    if (SPI_processed == 0)
+        elog(ERROR, "no attribute found in table %d", tableoids[0]);
+    if (SPI_processed > 10000)
+        elog(ERROR, "too many relation found in table %d", tableoids[0]);
+    
+    /* collect target attributes. assuming all tables have the same attnums */
+    for (i = 0 ; i < SPI_processed ; i++)
+    {
+        int16 attnum;
+
+        heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc,
+                          values, nulls);
+        if (nulls[0])
+            elog(ERROR, "Failed 3");
+        attnum = DatumGetInt16(values[0]);
+
+        if (attnum > 0)
+            attnums[natts++] = attnum;
+    }
+    SPI_finish();
+    elog(DEBUG1, "%d attributes found", natts);
+}
diff --git a/contrib/catcachebench/catcachebench.control b/contrib/catcachebench/catcachebench.control
new file mode 100644
index 0000000000..3fc9d2e420
--- /dev/null
+++ b/contrib/catcachebench/catcachebench.control
@@ -0,0 +1,6 @@
+# catcachebench
+
+comment = 'benchmark for catcache pruning'
+default_version = '0.0'
+module_pathname = '$libdir/catcachebench'
+relocatable = true
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index fa2b49c676..11b94504af 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -740,6 +740,41 @@ CatalogCacheFlushCatalog(Oid catId)
     CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call");
 }
 
+
+/* FUNCTION FOR BENCHMARKING */
+void
+CatalogCacheFlushCatalog2(Oid catId)
+{
+    slist_iter    iter;
+
+    CACHE_elog(DEBUG2, "CatalogCacheFlushCatalog called for %u", catId);
+
+    slist_foreach(iter, &CacheHdr->ch_caches)
+    {
+        CatCache   *cache = slist_container(CatCache, cc_next, iter.cur);
+
+        /* Does this cache store tuples of the target catalog? */
+        if (cache->cc_reloid == catId)
+        {
+            /* Yes, so flush all its contents */
+            ResetCatalogCache(cache);
+
+            /* Tell inval.c to call syscache callbacks for this cache */
+            CallSyscacheCallbacks(cache->id, 0);
+
+            cache->cc_nbuckets = 128;
+            pfree(cache->cc_bucket);
+            cache->cc_bucket =
+                (dlist_head *) MemoryContextAllocZero(CacheMemoryContext,
+                                  cache->cc_nbuckets * sizeof(dlist_head));
+            elog(LOG, "Catcache reset");
+        }
+    }
+
+    CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call");
+}
+/* END: FUNCTION FOR BENCHMARKING */
+
 /*
  *        InitCatCache
  *
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index e4dc4ee34e..b60416ec63 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -994,7 +994,7 @@ static const struct cachedesc cacheinfo[] = {
     }
 };
 
-static CatCache *SysCache[SysCacheSize];
+CatCache *SysCache[SysCacheSize];
 
 static bool CacheInitialized = false;