Discussion: [HACKERS] shared memory based stat collector (was: Sharing record typmods between backends)
From
Andres Freund
Date:
Hi,

Since we're getting a bit into the weeds of a different topic, and since I think it's an interesting feature, I'm detaching this into a separate thread.

On 2017-08-12 23:37:27 -0400, Tom Lane wrote:
> >> On 2017-08-12 22:52:57 -0400, Robert Haas wrote:
> >>> I think it'd be pretty interesting to look at replacing parts of the
> >>> stats collector machinery with something DHT-based.
>
> On Sat, Aug 12, 2017 at 11:30 PM, Andres Freund <andres@anarazel.de> wrote:
> >> That seems to involve a lot more than this though, given that currently
> >> the stats collector data doesn't entirely have to be in memory. I've
> >> seen sites with a lot of databases with quite some per-database stats
> >> data. Don't think we can just require that to be in memory :(
>
> Robert Haas <robertmhaas@gmail.com> writes:
> > Hmm. I'm not sure it wouldn't end up being *less* memory. Don't we
> > end up caching 1 copy of it per backend, at least for the database to
> > which that backend is connected? Accessing a shared copy would avoid
> > that sort of thing.
>
> Yeah ... the collector itself has got all that in memory anyway.
> We do need to think about synchronization issues if we make that
> memory globally available, but I find it hard to see how that would
> lead to more memory consumption overall than what happens now.

You both are obviously right. Another point of potential concern could be that we'd pretty much fully rely on dsm/dht's being available for the server to function correctly. Are we ok with that? Right now postgres still works perfectly well, leaving parallelism aside, with dynamic_shared_memory_type = none.

What are your thoughts about how to actually implement this? It seems we'd have to do something like:

1) Keep the current per-backend & per-transaction state in each backend. That allows both to throw away the information and avoids increasing contention quite noticeably.

2) Some plain shared memory with metadata. A set of shared hashtables for per-database, per-relation contents.

3) Individual database/relation entries are either individual atomics (we don't rely on consistency anyway), or seqcount (like st_changecount) based.

4) Instead of sending stats at transaction end, copy them into a "pending" entry. Nontransactional contents can be moved to the pending entry more frequently.

5) Occasionally, try to flush the pending array into the global hash. The lookup in the table would be protected by something LWLockConditionalAcquire() based, to avoid blocking - don't want to introduce chokepoints due to commonly used tables and such. Updating the actual stats can happen without the partition locks being held.

I think there's two other relevant points here:

a) It'd be quite useful to avoid needing a whole cluster's stats in memory. Even if $subject would save memory, I'm hesitant committing to something requiring all stats to be in memory forever. As a first step it seems reasonable to e.g. not require state for all databases to be in memory. The first time per-database stats are required, it could be "paged in". Could even be more aggressive and do that on a per-table level and just have smaller placeholder entries for non-accessed tables, but that seems more work. On the other hand, autovacuum is likely going to make that approach useless anyway, given it's probably going to access otherwise unneeded stats regularly.

b) I think our tendency to dump all stats whenever we crash isn't really tenable, given how autovacuum etc are tied to them. We should think about ways to avoid that if we're going to do a major rewrite of the stats stuff, which this certainly sounds like.

If there weren't HS to worry about, these two points kinda sound like the data should be persisted into an actual table, rather than some weird other storage format. But HS seems to make that untenable.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> You both are obviously right. Another point of potential concern could
> be that we'd pretty much fully rely on dsm/dht's being available, for
> the server to function correctly. Are we ok with that? Right now
> postgres still works perfectly well, leaving parallelism aside, with
> dynamic_shared_memory_type = none.

In principle we could keep the existing mechanism as a fallback. Whether that's worth the trouble is debatable. The current code in initdb believes that every platform has some type of DSM support (see choose_dsm_implementation). Nobody's complained about that, and it certainly works on every buildfarm animal. So for all we know, dynamic_shared_memory_type = none is broken already.

> I think there's two other relevant points here:
> a) It'd be quite useful to avoid needing a whole cluster's stats in
> memory. Even if $subject would save memory, I'm hesitant committing
> to something requiring all stats to be in memory forever. As a first
> step it seems reasonable to e.g. not require state for all databases
> to be in memory. The first time per-database stats are required, it
> could be "paged in". Could even be more aggressive and do that on a
> per-table level and just have smaller placeholder entries for
> non-accessed tables, but that seems more work.

Huh? As long as we don't lock the shared memory into RAM, which on most platforms we couldn't do without being root anyway, ISTM the kernel should do just fine at paging out little-used areas of the stats. Let's not reinvent that wheel until there's demonstrable proof of need.

> b) I think our tendency to dump all stats whenever we crash isn't really
> tenable, given how autovacuum etc are tied to them.

Eh, maybe. I don't think crashes are really so common on production systems. As developers we have an inflated view of their frequency ;-)

regards, tom lane
On Sun, Aug 13, 2017 at 9:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> You both are obviously right. Another point of potential concern could
>> be that we'd pretty much fully rely on dsm/dht's being available, for
>> the server to function correctly. Are we ok with that? Right now
>> postgres still works perfectly well, leaving parallelism aside, with
>> dynamic_shared_memory_type = none.
>
> In principle we could keep the existing mechanism as a fallback.
> Whether that's worth the trouble is debatable. The current code
> in initdb believes that every platform has some type of DSM support
> (see choose_dsm_implementation). Nobody's complained about that,
> and it certainly works on every buildfarm animal. So for all we know,
> dynamic_shared_memory_type = none is broken already.

Actually, now that you mention it, I think it *is* broken already, or more to the point, if you configure that value, the autovacuum launcher hangs up, because of the AutovacuumWorkItem stuff that Alvaro added. When I just tested it, the AV launcher somehow ended up waiting for AutovacuumLock and I had to SIGQUIT the server to shut it down. That's actually not really entirely the fault of dynamic_shared_memory_type = none, though, because the code in autovacuum.c does this:

    AutoVacuumDSA = dsa_create(AutovacuumLock->tranche);
    /* make sure it doesn't go away even if we do */
    dsa_pin(AutoVacuumDSA);
    dsa_pin_mapping(AutoVacuumDSA);

Now, that's actually really broken because if dsa_create() throws an error of any kind, you're going to have already assigned the value to AutoVacuumDSA, but you will not have pinned the DSA or the DSA mapping. There's evidently some additional bug here because I'd sorta expect this code to just go into an infinite loop in this case, failing over and over trying to reattach the segment, but evidently something even worse is happening - perhaps the ERROR isn't releasing AutovacuumLock.

Anyway, this code has multiple error handling defects and IMHO it's pretty questionable whether DSA should have been used here at all. Allowing the number of autovacuum work items to grow without bound would not be a good design - and if we've got a bound somewhere, why not just put this in the main shared memory segment?

Leaving all that aside, when the DSM facility was introduced in 9.4, I was afraid that it would break things for people and that they'd demand a way to turn it off, and I included "none" as an option for that purpose. Well, it did break a few things initially but those seem to have mostly been fixed during the 9.4 cycle itself. I'm not aware of any subsequent problem reports caused by having DSM enabled (pointers welcome!) and given that we now have parallel query relying on DSM, people are much less likely to want to just give up on having DSM available. If somebody has a system where no other form of shared memory works, dynamic_shared_memory_type = mmap is still a thing, so the use case for "none" seems very thin indeed. I'd vote for just ripping it out in v11.

>> a) It'd be quite useful to avoid needing a whole cluster's stats in
>> memory. Even if $subject would save memory, I'm hesitant committing
>> to something requiring all stats to be in memory forever. As a first
>> step it seems reasonable to e.g. not require state for all databases
>> to be in memory. The first time per-database stats are required, it
>> could be "paged in". Could even be more aggressive and do that on a
>> per-table level and just have smaller placeholder entries for
>> non-accessed tables, but that seems more work.
>
> Huh? As long as we don't lock the shared memory into RAM, which on most
> platforms we couldn't do without being root anyway, ISTM the kernel should
> do just fine at paging out little-used areas of the stats. Let's not
> reinvent that wheel until there's demonstrable proof of need.

I think some systems aren't able to page out data stored in shared memory segments, and it would probably be pretty awful for performance if they did (there's a reason large sorts need to switch to using temp files ... and the same rationale seems to apply here).

That having been said, I don't see a need to change everything at once. If we moved the stats collector data from frequently-rewritten files to shared memory, we'd already be saving a lot of I/O and possibly memory utilization. If somebody then wanted to refine that further by spilling rarely used data out to disk and reloading it on demand, that could be done as a follow-on patch. But I think that would only be needed for people with *really* large numbers of tables, and it would only help them if most of those tables were barely ever touched.

>> b) I think our tendency to dump all stats whenever we crash isn't really
>> tenable, given how autovacuum etc are tied to them.
>
> Eh, maybe. I don't think crashes are really so common on production
> systems. As developers we have an inflated view of their frequency ;-)

Without taking a position on the point under debate, AFAIK it wouldn't be technically complex either under our current architecture or the proposed new one to dump out a new permanent stats file every 10 minutes or so. So if there is an issue here I think it might not be very hard to fix, whatever else we do.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On 2017-08-14 11:46:10 -0400, Robert Haas wrote:
> On Sun, Aug 13, 2017 at 9:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Andres Freund <andres@anarazel.de> writes:
> >> You both are obviously right. Another point of potential concern could
> >> be that we'd pretty much fully rely on dsm/dht's being available, for
> >> the server to function correctly. Are we ok with that? Right now
> >> postgres still works perfectly well, leaving parallelism aside, with
> >> dynamic_shared_memory_type = none.
> >
> > In principle we could keep the existing mechanism as a fallback.
> > Whether that's worth the trouble is debatable. The current code
> > in initdb believes that every platform has some type of DSM support
> > (see choose_dsm_implementation). Nobody's complained about that,
> > and it certainly works on every buildfarm animal. So for all we know,
> > dynamic_shared_memory_type = none is broken already.
>
> Actually, now that you mention it, I think it *is* broken already, or
> more to the point, if you configure that value, the autovacuum
> launcher hangs up, because of the AutovacuumWorkItem stuff that Alvaro
> added. When I just tested it, the AV launcher somehow ended up
> waiting for AutovacuumLock and I had to SIGQUIT the server to shut it
> down. That's actually not really entirely the fault of
> dynamic_shared_memory_type = none, though, because the code in
> autovacuum.c does this:
>
>     AutoVacuumDSA = dsa_create(AutovacuumLock->tranche);
>     /* make sure it doesn't go away even if we do */
>     dsa_pin(AutoVacuumDSA);
>     dsa_pin_mapping(AutoVacuumDSA);
>
> Now, that's actually really broken because if dsa_create() throws an
> error of any kind, you're going to have already assigned the value to
> AutoVacuumDSA, but you will not have pinned the DSA or the DSA
> mapping. There's evidently some additional bug here because I'd sorta
> expect this code to just go into an infinite loop in this case,
> failing over and over trying to reattach the segment, but evidently
> something even worse happening - perhaps the ERROR isn't releasing
> AutovacuumLock. Anyway, this code has multiple error handling defects
> and IMHO it's pretty questionable whether DSA should have been used
> here at all. Allowing the number of autovacuum work items to grow
> without bound would not be a good design - and if we've got a bound
> somewhere, why not just put this in the main shared memory segment?

CCing Alvaro. This seems like it should be an open item.

> Leaving all that aside, when the DSM facility was introduced in 9.4, I
> was afraid that it would break things for people and that they'd
> demand a way to turn it off, and I included "none" as an option for
> that purpose. Well, it did break a few things initially but those
> seem to have mostly been fixed during the 9.4 cycle itself. I'm not
> aware of any subsequent problem reports caused by having DSM enabled
> (pointers welcome!) and given that we now have parallel query relying
> on DSM, people are much less likely to want to just give up on having
> DSM available. If somebody has a system where no other form of shared
> memory works, dynamic_shared_memory_type = mmap is still a thing, so
> the use case for "none" seems very thin indeed. I'd vote for just
> ripping it out in v11.

It's possibly still useful for debugging - we'll continue not to be able to rely on allocations to succeed...

> >> a) It'd be quite useful to avoid needing a whole cluster's stats in
> >> memory. Even if $subject would save memory, I'm hesitant committing
> >> to something requiring all stats to be in memory forever. As a first
> >> step it seems reasonable to e.g. not require state for all databases
> >> to be in memory. The first time per-database stats are required, it
> >> could be "paged in". Could even be more aggressive and do that on a
> >> per-table level and just have smaller placeholder entries for
> >> non-accessed tables, but that seems more work.
> >
> > Huh? As long as we don't lock the shared memory into RAM, which on most
> > platforms we couldn't do without being root anyway, ISTM the kernel should
> > do just fine at paging out little-used areas of the stats. Let's not
> > reinvent that wheel until there's demonstrable proof of need.
>
> I think some systems aren't able to page out data stored in shared
> memory segments, and it would probably be pretty awful for performance
> if they did (there's a reason large sorts need to switch to using temp
> files ... and the same rationale seems to apply here).

Right.

> That having been said, I don't see a need to change everything at
> once. If we moved the stats collector data from frequently-rewritten
> files to shared memory, we'd already be saving a lot of I/O and
> possibly memory utilization.

Right.

> >> b) I think our tendency to dump all stats whenever we crash isn't really
> >> tenable, given how autovacuum etc are tied to them.
> >
> > Eh, maybe. I don't think crashes are really so common on production
> > systems. As developers we have an inflated view of their frequency ;-)

FWIW, in my experience crashes are far from rare. And most shutdowns / restarts are immediate ones in my experience.

> Without taking a position on the point under debate, AFAIK it wouldn't
> be technically complex either under our current architecture or the
> proposed new one to dump out a new permanent stats file every 10
> minutes or so. So if there is an issue here I think it might not be
> very hard to fix, whatever else we do.

There's some issues around that, primarily because of how dropped tables are handled. It's not entirely trivial to avoid having old entries around after a dropped table, which then even can lead to conflicts if oids are reused.

Greetings,

Andres Freund
Re: [HACKERS] shared memory based stat collector (was: Sharing record typmods between backends)
From
Alvaro Herrera
Date:
Robert Haas wrote: > Actually, now that you mention it, I think it *is* broken already, or > more to the point, if you configure that value, the autovacuum > launcher hangs up, because of the AutovacuumWorkItem stuff that Alvaro > added. When I just tested it, the AV launcher somehow ended up > waiting for AutovacuumLock and I had to SIGQUIT the server to shut it > down. That's actually not really entirely the fault of > dynamic_shared_memory_type = none, though, because the code in > autovacuum.c does this: > > AutoVacuumDSA = dsa_create(AutovacuumLock->tranche); > /* make sure it doesn't go away even if we do */ > dsa_pin(AutoVacuumDSA); > dsa_pin_mapping(AutoVacuumDSA); > > Now, that's actually really broken because if dsa_create() throws an > error of any kind, you're going to have already assigned the value to > AutoVacuumDSA, but you will not have pinned the DSA or the DSA > mapping. There's evidently some additional bug here because I'd sorta > expect this code to just go into an infinite loop in this case, > failing over and over trying to reattach the segment, but evidently > something even worse happening - perhaps the ERROR isn't releasing > AutovacuumLock. Looking into this now. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Aug 13, 2017 at 9:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In principle we could keep the existing mechanism as a fallback.
>> Whether that's worth the trouble is debatable. The current code
>> in initdb believes that every platform has some type of DSM support
>> (see choose_dsm_implementation). Nobody's complained about that,
>> and it certainly works on every buildfarm animal. So for all we know,
>> dynamic_shared_memory_type = none is broken already.

> Actually, now that you mention it, I think it *is* broken already, or
> more to the point, if you configure that value, the autovacuum
> launcher hangs up, because of the AutovacuumWorkItem stuff that Alvaro
> added. When I just tested it, the AV launcher somehow ended up
> waiting for AutovacuumLock and I had to SIGQUIT the server to shut it
> down.

Hmm, shouldn't that be an open item for v10?

> ... If somebody has a system where no other form of shared
> memory works, dynamic_shared_memory_type = mmap is still a thing, so
> the use case for "none" seems very thin indeed. I'd vote for just
> ripping it out in v11.

Just FYI, the only values being reported by buildfarm animals are "posix", "sysv", and "windows". So while mmap may be a thing, it's an untested thing.

 dsm_type | critters_reporting
----------+--------------------
 posix    |                 69
 sysv     |                 13
 windows  |                 10
(3 rows)

regards, tom lane
On Mon, Aug 14, 2017 at 12:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sun, Aug 13, 2017 at 9:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> In principle we could keep the existing mechanism as a fallback.
>>> Whether that's worth the trouble is debatable. The current code
>>> in initdb believes that every platform has some type of DSM support
>>> (see choose_dsm_implementation). Nobody's complained about that,
>>> and it certainly works on every buildfarm animal. So for all we know,
>>> dynamic_shared_memory_type = none is broken already.
>
>> Actually, now that you mention it, I think it *is* broken already, or
>> more to the point, if you configure that value, the autovacuum
>> launcher hangs up, because of the AutovacuumWorkItem stuff that Alvaro
>> added. When I just tested it, the AV launcher somehow ended up
>> waiting for AutovacuumLock and I had to SIGQUIT the server to shut it
>> down.
>
> Hmm, shouldn't that be an open item for v10?

I already added it.

>> ... If somebody has a system where no other form of shared
>> memory works, dynamic_shared_memory_type = mmap is still a thing, so
>> the use case for "none" seems very thin indeed. I'd vote for just
>> ripping it out in v11.
>
> Just FYI, the only values being reported by buildfarm animals are
> "posix", "sysv", and "windows". So while mmap may be a thing,
> it's an untested thing.

I'm pretty sure I dev-tested it before committing anything, but, certainly, having ongoing BF coverage wouldn't be a bad thing.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017-08-14 12:21:30 -0400, Robert Haas wrote:
> >> ... If somebody has a system where no other form of shared
> >> memory works, dynamic_shared_memory_type = mmap is still a thing, so
> >> the use case for "none" seems very thin indeed. I'd vote for just
> >> ripping it out in v11.
> >
> > Just FYI, the only values being reported by buildfarm animals are
> > "posix", "sysv", and "windows". So while mmap may be a thing,
> > it's an untested thing.
>
> I'm pretty sure I dev-tested it before committing anything, but,
> certainly, having ongoing BF coverage wouldn't be a bad thing.

Are there any platforms that require it? I thought posix, sysv and windows are sufficient? If we're going to start relying more on dsms we probably don't want to use mmap anyway...

- Andres
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Aug 14, 2017 at 12:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Just FYI, the only values being reported by buildfarm animals are
>> "posix", "sysv", and "windows". So while mmap may be a thing,
>> it's an untested thing.

> I'm pretty sure I dev-tested it before committing anything, but,
> certainly, having ongoing BF coverage wouldn't be a bad thing.

Looking closer, the reason those are the only reported values is that those are the only possible results from initdb's choose_dsm_implementation(). So the real question here is whether "mmap" should be considered to dominate "sysv" if it's available. If so, why isn't choose_dsm_implementation() trying it; and if not, why are we carrying it?

regards, tom lane
On 2017-08-14 12:28:39 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Mon, Aug 14, 2017 at 12:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Just FYI, the only values being reported by buildfarm animals are
> >> "posix", "sysv", and "windows". So while mmap may be a thing,
> >> it's an untested thing.
>
> > I'm pretty sure I dev-tested it before committing anything, but,
> > certainly, having ongoing BF coverage wouldn't be a bad thing.
>
> Looking closer, the reason those are the only reported values is
> that those are the only possible results from initdb's
> choose_dsm_implementation(). So the real question here is whether
> "mmap" should be considered to dominate "sysv" if it's available.

No, mmap isn't a good option - it's file-backed mmap, rather than anonymous mmap. To my knowledge there's no good portable way to use anonymous mmap to share memory across processes unless established before a fork().

> If so, why isn't choose_dsm_implementation() trying it; and if not,
> why are we carrying it?

I think the idea was that there might be platforms that require it, but ...

Greetings,

Andres Freund
Re: [HACKERS] shared memory based stat collector (was: Sharing record typmods between backends)
From
Magnus Hagander
Date:
On Mon, Aug 14, 2017 at 5:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Aug 13, 2017 at 9:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> b) I think our tendency to dump all stats whenever we crash isn't really
>> tenable, given how autovacuum etc are tied to them.
>
> Eh, maybe. I don't think crashes are really so common on production
> systems. As developers we have an inflated view of their frequency ;-)
From a stats perspective I think the crash problem is the small problem. The big problem is that we nuke them on replication promotion and we nuke them on PITR. If we solve those two, I'm pretty sure we would also solve the on-crash case in more or less the same way.
Without taking a position on the point under debate, AFAIK it wouldn't
be technically complex either under our current architecture or the
proposed new one to dump out a new permanent stats file every 10
minutes or so. So if there is an issue here I think it might not be
very hard to fix, whatever else we do.
I started working on that patch at some point, I think I have a rotting branch somewhere. That part was indeed fairly easy.
I got stalled when I feature-crept myself by realizing I wanted it snapshotted to WAL so it could fix the PITR and replication issues. And then properly bogged down when I realized that on the standby I'd want *both* the stats from the standby (while it's running) and the stats from the master (after switchover).
On Mon, Aug 14, 2017 at 12:36 PM, Andres Freund <andres@anarazel.de> wrote:
>> If so, why isn't choose_dsm_implementation() trying it; and if not,
>> why are we carrying it?
>
> I think the idea was that there might be platforms that require it, but
> ...

Right. So, for example, POSIX shared memory will fail on Linux if /dev/shm is inaccessible, and I've seen such systems in the wild. System V shared memory will fail if the kernel limits are too small. On *BSD, which lacks POSIX shared memory altogether, whether you can start PostgreSQL with the default configuration settings depends entirely on how the System V shared memory limits are configured.

I think it would be a bad idea to remove both dynamic_shared_memory_type=none and dynamic_shared_memory_type=mmap. If you do that, then somebody who doesn't have root and whose system configuration is messed up can't start PostgreSQL at all. While I understand that catering to rarely-used options has some cost, I don't think those costs are exorbitant. And, I think history shows that minimizing dependencies on operating-system settings is a win. Commit b0fc0df9364d2d2d17c0162cf3b8b59f6cb09f67 may not be my most-appreciated commit ever, but it undoubtedly had the highest appreciation-to-effort ratio.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017-08-14 18:57:39 +0200, Magnus Hagander wrote: > On Mon, Aug 14, 2017 at 5:46 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > > On Sun, Aug 13, 2017 at 9:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > >> b) I think our tendency to dump all stats whenever we crash isn't really > > >> tenable, given how autovacuum etc are tied to them. > > > > > > Eh, maybe. I don't think crashes are really so common on production > > > systems. As developers we have an inflated view of their frequency ;-) > > > > From a stats perspective I think the crash problem is the small problem. > The big problem is we nuke them on replication promotion and we nuke them > on PITR. If we solve those two, I'm pretty sure we would also solve the > on-crash more or less in the same thing. I don't think they're that rare, but the others are problems too. Unfortunately I think changing that is a bit more complicated, given that some stats are actually updated on standbys. I previously thought that an option to occasionally WAL log the stats file would be useful (e.g. just before a checkpoint). That'd make them persistent, and available on the standby. But that'd still require somehow dealing with stats being produced on the standby - I presume we'd need multiple stats files and provide functions for merging them. It'd be good if we somehow could figure out a way to WAL log the stats data in a way that's consistent, i.e. doesn't contain data about dumped relations. But I don't quite see how... > > Without taking a position on the point under debate, AFAIK it wouldn't > > be technically complex either under our current architecture or the > > proposed new one to dump out a new permanent stats file every 10 > > minutes or so. So if there is an issue here I think it might not be > > very hard to fix, whatever else we do. > > > > I started working on that patch at some point, I think I have a rotting > branch somewhere. That part was indeed fairly easy. 
> > I got stalled when I feature-crept myself by realizing I wanted it > snapshotted to WAL so it could fix the PITR and replication issues. And > then properly bogged down when I realized that on the standby I'd want > *both* the stats from the standby (while it's running) and the stats from > the master (after switchover). Hah, indeed. Greetings, Andres Freund
On Mon, Aug 14, 2017 at 1:02 PM, Andres Freund <andres@anarazel.de> wrote: > I previously thought that an option to occasionally WAL log the stats > file would be useful (e.g. just before a checkpoint). That'd make them > persistent, and available on the standby. But that'd still require > somehow dealing with stats being produced on the standby - I presume > we'd need multiple stats files and provide functions for merging them. Well, if you think the stats files might be too big to fit in memory, then you're probably imagining tens of gigabytes of data, and you're not going to want to write a WAL record that size, either. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017-08-14 13:06:48 -0400, Robert Haas wrote: > On Mon, Aug 14, 2017 at 1:02 PM, Andres Freund <andres@anarazel.de> wrote: > > I previously thought that an option to occasionally WAL log the stats > > file would be useful (e.g. just before a checkpoint). That'd make them > > persistent, and available on the standby. But that'd still require > > somehow dealing with stats being produced on the standby - I presume > > we'd need multiple stats files and provide functions for merging them. > > Well, if you think the stats files might be too big to fit in memory, > then you're probably imagining tens of gigabytes of data, and you're > not going to want to write a WAL record that size, either. Well, that's why I'm thinking of it having to be an option. You presumably don't want it on relatively idle servers, but if you have big databases where unnecessary vacuums are bad... I think again, this all'd be better if we'd figure out a way to make this use a proper table... We kind of have figured out how to store data in a queryable manner, with buffering etc. Kind of. Wonder if there's some chance of a hack where we have a shared table with a key like (dboid, reloid, inreplication). If we were to create those (once for replication, once for not) at table creation, we could heap_inplace_update them even on a standby... - Andres
Re: [HACKERS] shared memory based stat collector (was: Sharing record typmods between backends)
From
Alvaro Herrera
Date:
Robert Haas wrote: > Actually, now that you mention it, I think it *is* broken already, or > more to the point, if you configure that value, the autovacuum > launcher hangs up, because of the AutovacuumWorkItem stuff that Alvaro > added. When I just tested it, the AV launcher somehow ended up > waiting for AutovacuumLock and I had to SIGQUIT the server to shut it > down. That's actually not really entirely the fault of > dynamic_shared_memory_type = none, though, because the code in > autovacuum.c does this: > > AutoVacuumDSA = dsa_create(AutovacuumLock->tranche); > /* make sure it doesn't go away even if we do */ > dsa_pin(AutoVacuumDSA); > dsa_pin_mapping(AutoVacuumDSA); > > Now, that's actually really broken because if dsa_create() throws an > error of any kind, you're going to have already assigned the value to > AutoVacuumDSA, but you will not have pinned the DSA or the DSA > mapping. There's evidently some additional bug here because I'd sorta > expect this code to just go into an infinite loop in this case, > failing over and over trying to reattach the segment, but evidently > something even worse happening - perhaps the ERROR isn't releasing > AutovacuumLock. Yeah, the problem that lwlocks aren't released is because the launcher is not in a transaction at that point, so AbortCurrentTransaction() doesn't release locks like it normally would. The simplest fix (in the attached 0001 patch) is to add a LWLockReleaseAll() call to the jmp block, though I wonder if there should be some other cleanup functions called from there, or whether perhaps it'd be a better strategy to have the launcher run in a transaction at all times. The other problem is that there's no attempt to handle a failed DSA creation/attachment. The second patch just adds a PG_TRY block that sets a flag not to try the DSA calls again if the first one fails. It throws a single ERROR line, then autovacuum continues without workitem support. 
I intend to give these patches further thought before pushing anything;
I will update this thread no later than tomorrow 19:00 UTC-0300.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Attachments
Robert Haas <robertmhaas@gmail.com> writes:
> I think it would be a bad idea to remove both
> dynamic_shared_memory_type=none and dynamic_shared_memory_type=mmap.
> If you do that, then somebody who doesn't have root and whose system
> configuration is messed up can't start PostgreSQL at all.

I'd be happy to leave the mmap option in place if it were getting tested
regularly. As things stand, I'm doubtful that someone who was stuck in
this hypothetical bad place could rely on it to work at all.

			regards, tom lane
On Sun, Aug 13, 2017 at 05:56:56PM -0700, Andres Freund wrote:
> Hi,
>
> Since we're getting a bit into the weeds of a different topic, and since
> I think it's an interesting feature, I'm detaching this into a separate
> thread.
>
> On 2017-08-12 23:37:27 -0400, Tom Lane wrote:
> > >> On 2017-08-12 22:52:57 -0400, Robert Haas wrote:
> > >>> I think it'd be pretty interesting to look at replacing parts of the
> > >>> stats collector machinery with something DHT-based.
> >
> > On Sat, Aug 12, 2017 at 11:30 PM, Andres Freund <andres@anarazel.de> wrote:
> > >> That seems to involve a lot more than this though, given that currently
> > >> the stats collector data doesn't entirely have to be in memory. I've
> > >> seen sites with a lot of databases with quite some per-database stats
> > >> data. Don't think we can just require that to be in memory :(
> >
> > Robert Haas <robertmhaas@gmail.com> writes:
> > > Hmm. I'm not sure it wouldn't end up being *less* memory. Don't we
> > > end up caching 1 copy of it per backend, at least for the database to
> > > which that backend is connected? Accessing a shared copy would avoid
> > > that sort of thing.
> >
> > Yeah ... the collector itself has got all that in memory anyway.
> > We do need to think about synchronization issues if we make that
> > memory globally available, but I find it hard to see how that would
> > lead to more memory consumption overall than what happens now.
>
> You both are obviously right. Another point of potential concern could
> be that we'd pretty much fully rely on dsm/dht's being available for
> the server to function correctly. Are we ok with that? Right now
> postgres still works perfectly well, leaving parallelism aside, with
> dynamic_shared_memory_type = none.
>
> What are your thoughts about how to actually implement this? It seems
> we'd have to do something like:
>
> 1) Keep the current per-backend & per-transaction state in each
>    backend. That allows both to throw away the information and avoids
>    increasing contention quite noticeably.
>
> 2) Some plain shared memory with metadata. A set of shared hashtables
>    for per-database, per-relation contents.
>
> 3) Individual database/relation entries are either individual atomics
>    (we don't rely on consistency anyway), or seqcount (like
>    st_changecount) based.
>
> 4) Instead of sending stats at transaction end, copy them into a
>    "pending" entry. Nontransactional contents can be moved to
>    the pending entry more frequently.
>
> 5) Occasionally, try to flush the pending array into the global hash.
>    The lookup in the table would be protected by something
>    LWLockConditionalAcquire() based, to avoid blocking - don't want to
>    introduce chokepoints due to commonly used tables and such. Updating
>    the actual stats can happen without the partition locks being held.
>
> I think there's two other relevant points here:
>
> a) It'd be quite useful to avoid needing a whole cluster's stats in
>    memory. Even if $subject would save memory, I'm hesitant committing
>    to something requiring all stats to be in memory forever. As a first
>    step it seems reasonable to e.g. not require state for all databases
>    to be in memory. The first time per-database stats are required, it
>    could be "paged in". Could even be more aggressive and do that on a
>    per-table level and just have smaller placeholder entries for
>    non-accessed tables, but that seems more work.
>
>    On the other hand, autovacuum is likely going to make that approach
>    useless anyway, given it's probably going to access otherwise unneeded
>    stats regularly.
>
> b) I think our tendency to dump all stats whenever we crash isn't really
>    tenable, given how autovacuum etc are tied to them. We should think
>    about ways to avoid that if we're going to do a major rewrite of the
>    stats stuff, which this certainly sounds like.
>
> If there weren't HS to worry about, these two points kinda sound like
> the data should be persisted into an actual table, rather than some
> weird other storage format. But HS seems to make that untenable.
>
> Greetings,
>
> Andres Freund

As I recall, David Gould (hi!) had run across a case where there were
thousands of tables and the stats file became a pretty serious
bottleneck. There might even be a design or even code to address it.

Best,
David.

--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
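Point 3 of Andres's sketch mentions seqcount (st_changecount-style) entries. A self-contained toy of that read-retry protocol is below; the StatsEntry layout and all names are made up for illustration, and real code would also need the barrier discipline pgstat uses around the non-atomic payload:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* A stats entry guarded by a change counter: the writer bumps the
 * counter to an odd value before touching the payload and back to an
 * even value afterwards; readers retry until they observe the same
 * even value on both sides of their copy. No reader ever blocks. */
typedef struct StatsEntry
{
    atomic_uint changecount;
    uint64_t    tuples_inserted;    /* example payload fields */
    uint64_t    tuples_deleted;
} StatsEntry;

static void
stats_update(StatsEntry *e, uint64_t ins, uint64_t del)
{
    atomic_fetch_add(&e->changecount, 1);   /* counter is now odd */
    e->tuples_inserted += ins;
    e->tuples_deleted += del;
    atomic_fetch_add(&e->changecount, 1);   /* even again: stable */
}

static void
stats_read(StatsEntry *e, uint64_t *ins, uint64_t *del)
{
    unsigned int before, after;

    do
    {
        before = atomic_load(&e->changecount);
        *ins = e->tuples_inserted;
        *del = e->tuples_deleted;
        after = atomic_load(&e->changecount);
    } while (before != after || (before & 1)); /* retry on concurrent write */
}
```

The appeal for a shared-memory stats hash is that readers (monitoring queries, autovacuum) never take a lock that could stall the writers updating counters on hot tables.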
On Mon, Aug 14, 2017 at 1:12 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Yeah, the problem that lwlocks aren't released is because the launcher
> is not in a transaction at that point, so AbortCurrentTransaction()
> doesn't release locks like it normally would. The simplest fix (in the
> attached 0001 patch) is to add a LWLockReleaseAll() call to the jmp
> block, though I wonder if there should be some other cleanup functions
> called from there, or whether perhaps it'd be a better strategy to have
> the launcher run in a transaction at all times.

Well, if you're going to do aborts outside of a transaction, just
adding an LWLockReleaseAll() isn't really sufficient. You need to
look at something like CheckpointerMain() and figure out which of
those push-ups are needed here as well. Probably at least
ConditionVariableCancelSleep(), pgstat_report_wait_end(),
AbortBufferIO(), and UnlockBuffers() -- quite possibly some of those
other AtEOXact calls as well.

> The other problem is that there's no attempt to handle a failed DSA
> creation/attachment. The second patch just adds a PG_TRY block that
> sets a flag not to try the DSA calls again if the first one fails. It
> throws a single ERROR line, then autovacuum continues without workitem
> support.

Yeah, and the other question -- which Thomas asked before you
originally committed, and which I just now asked again -- is "Why in
the world are you using DSA for this at all?". There are serious
problems with that which both he and I have pointed out, and you
haven't explained why it's a good idea at any point, AFAICT. Among
those problems:

1. It doesn't work if dynamic_shared_memory_type=none. That's OK for
an optional subsystem, but autovacuum is not very optional.

2. It allows unbounded bloat if there's no limit on the number of work
items, and is pointless if there is, since you could then just use the
main shared memory segment.

I really think you should respond to those concerns, not just push a
minimal fix.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] shared memory based stat collector (was: Sharing record typmods between backends)
From
Alvaro Herrera
Date
Robert Haas wrote:
> On Mon, Aug 14, 2017 at 1:12 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Yeah, the problem that lwlocks aren't released is because the launcher
> > is not in a transaction at that point, so AbortCurrentTransaction()
> > doesn't release locks like it normally would. The simplest fix (in the
> > attached 0001 patch) is to add a LWLockReleaseAll() call to the jmp
> > block, though I wonder if there should be some other cleanup functions
> > called from there, or whether perhaps it'd be a better strategy to have
> > the launcher run in a transaction at all times.
>
> Well, if you're going to do aborts outside of a transaction, just
> adding an LWLockReleaseAll() isn't really sufficient. You need to
> look at something like CheckpointerMain() and figure out which of
> those push-ups are needed here as well. Probably at least
> ConditionVariableCancelSleep(), pgstat_report_wait_end(),
> AbortBufferIO(), and UnlockBuffers() -- quite possibly some of those
> other AtEOXact calls as well.

Agreed. I think a saner answer is to create a single function that does
it all, and use it in all places that need it, rather than keep adding
more copies of the same thing. The attached revised version of the
patch does things that way; I think it's good cleanup. (I had to make
the ResourceOwnerRelease call conditional on there being a resowner;
seems okay to me.)

I put the new cleanup routine in xact.c, which is not exactly the
perfect place (I had to add access/xact.h to a couple of files), but the
fact that the new routine is a cut-down version of another routine in
xact.c makes it clear to me that that's where it belongs. I first put
it in bootstrap, alongside AuxiliaryProcessMain, but after a while that
seemed wrong.

> > The other problem is that there's no attempt to handle a failed DSA
> > creation/attachment. The second patch just adds a PG_TRY block that
> > sets a flag not to try the DSA calls again if the first one fails. It
> > throws a single ERROR line, then autovacuum continues without
> > workitem support.
>
> Yeah, and the other question -- which Thomas asked before you
> originally committed, and which I just now asked again -- is "Why in
> the world are you using DSA for this at all?". There are serious
> problems with that which both he and I have pointed out, and you
> haven't explained why it's a good idea at any point, AFAICT.

The main reason is that I envision that the workitem stuff will be used
for other things in the future than just brin summarization, and it
seemed a lame idea to just use a fixed-size memory area in the standard
autovacuum shared memory area. I think unbounded growth is of course
going to be bad. The current coding doesn't allow for any growth beyond
the initial fixed size, but it's easier to extend the system from the
current point rather than wholly changing the shared memory usage
pattern while at it. I thought I *had* responded to Thomas in that
thread, BTW.

> Among those problems:
>
> 1. It doesn't work if dynamic_shared_memory_type=none. That's OK for
> an optional subsystem, but autovacuum is not very optional.

Autovacuum as a whole continues to work if there's no dynamic shared
memory; it's just the workitems stuff that stops working if there's no
DSA. (After fixing the bug that makes it crash in the case of
dynamic_shared_memory_type=none, of course.)

> 2. It allows unbounded bloat if there's no limit on the number of work
> items, and is pointless if there is, since you could then just use the
> main shared memory segment.

Yeah, there are probably better strategies than just growing the memory
area every time another entry is needed. Work-items as a whole need a
lot more development from the current point.

> I really think you should respond to those concerns, not just push a
> minimal fix.

We'll just continue to develop things from the current point.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments
On Mon, Aug 14, 2017 at 7:16 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> Yeah, and the other question -- which Thomas asked before you
>> originally committed, and which I just now asked again -- is "Why in
>> the world are you using DSA for this at all?". There are serious
>> problems with that which both he and I have pointed out, and you
>> haven't explained why it's a good idea at any point, AFAICT.
>
> The main reason is that I envision that the workitem stuff will be used
> for other things in the future than just brin summarization, and it
> seemed a lame idea to just use a fixed-size memory area in the standard
> autovacuum shared memory area. I think unbounded growth is of course
> going to be bad. The current coding doesn't allow for any growth beyond
> the initial fixed size, but it's easier to extend the system from the
> current point rather than wholly changing the shared memory usage
> pattern while at it.

I don't accept that argument. All the current code does is allocate
one fixed-size chunk of memory in DSA, so the whole pattern would have
to be changed *anyway* if you wanted to grow the array. It's not like
you are allocating the items one-by-one.

I really, really strongly encourage you to rip the use of DSA out here
entirely. It is reducing the reliability of a critical part of the
system for no actual benefit other than speculation that this is going
to be better in the future, and it adds a bunch of failure cases that
we could just as well live without.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> I really, really strongly encourage you to rip the use of DSA out here
> entirely. It is reducing the reliability of a critical part of the
> system for no actual benefit other than speculation that this is going
> to be better in the future, and it adds a bunch of failure cases that
> we could just as well live without.

FWIW, I vote with Robert on this. When and if you actually want to make
that array resizable, it'd be time to introduce use of a DSA. But right
now we need to be looking for simple and reliable solutions for v10.

In particular, since right now dsm_type = NONE is still considered
supported, we can't really have autovacuum facilities that are
dependent on being able to use DSM. That might change in future, but
not today.

			regards, tom lane
Re: [HACKERS] shared memory based stat collector (was: Sharing record typmods between backends)
From
Alvaro Herrera
Date
Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > I really, really strongly encourage you to rip the use of DSA out here
> > entirely. It is reducing the reliability of a critical part of the
> > system for no actual benefit other than speculation that this is going
> > to be better in the future, and it adds a bunch of failure cases that
> > we could just as well live without.
>
> FWIW, I vote with Robert on this. When and if you actually want to make
> that array resizable, it'd be time to introduce use of a DSA. But right
> now we need to be looking for simple and reliable solutions for v10.

Okay, I'll make it so.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services