Discussion: [HACKERS] shared memory based stat collector (was: Sharing record typmods between backends)
From
Andres Freund
Date:
Hi,

Since we're getting a bit into the weeds of a different topic, and since I think it's an interesting feature, I'm detaching this into a separate thread.

On 2017-08-12 23:37:27 -0400, Tom Lane wrote:
> >> On 2017-08-12 22:52:57 -0400, Robert Haas wrote:
> >>> I think it'd be pretty interesting to look at replacing parts of the
> >>> stats collector machinery with something DHT-based.
>
> On Sat, Aug 12, 2017 at 11:30 PM, Andres Freund <andres@anarazel.de> wrote:
> >> That seems to involve a lot more than this though, given that currently
> >> the stats collector data doesn't entirely have to be in memory. I've
> >> seen sites with a lot of databases with quite some per-database stats
> >> data. Don't think we can just require that to be in memory :(
>
> Robert Haas <robertmhaas@gmail.com> writes:
> > Hmm. I'm not sure it wouldn't end up being *less* memory. Don't we
> > end up caching 1 copy of it per backend, at least for the database to
> > which that backend is connected? Accessing a shared copy would avoid
> > that sort of thing.
>
> Yeah ... the collector itself has got all that in memory anyway.
> We do need to think about synchronization issues if we make that
> memory globally available, but I find it hard to see how that would
> lead to more memory consumption overall than what happens now.

You both are obviously right. Another point of potential concern could be that we'd pretty much fully rely on dsm/dht's being available for the server to function correctly. Are we ok with that? Right now postgres still works perfectly well, leaving parallelism aside, with dynamic_shared_memory_type = none.

What are your thoughts about how to actually implement this? It seems we'd have to do something like:

1) Keep the current per-backend & per-transaction state in each backend. That allows both to throw away the information and avoids increasing contention quite noticeably.

2) Some plain shared memory with metadata. A set of shared hashtables for per-database, per-relation contents.

3) Individual database/relation entries are either individual atomics (we don't rely on consistency anyway), or seqcount (like st_changecount) based.

4) Instead of sending stats at transaction end, copy them into a "pending" entry. Nontransactional contents can be moved to the pending entry more frequently.

5) Occasionally, try to flush the pending array into the global hash. The lookup in the table would be protected by something LWLockConditionalAcquire() based, to avoid blocking - don't want to introduce chokepoints due to commonly used tables and such. Updating the actual stats can happen without the partition locks being held.

I think there's two other relevant points here:

a) It'd be quite useful to avoid needing a whole cluster's stats in memory. Even if $subject would save memory, I'm hesitant committing to something requiring all stats to be in memory forever. As a first step it seems reasonable to e.g. not require state for all databases to be in memory. The first time per-database stats are required, it could be "paged in". Could even be more aggressive and do that on a per-table level and just have smaller placeholder entries for non-accessed tables, but that seems more work. On the other hand, autovacuum is likely going to make that approach useless anyway, given it's probably going to access otherwise unneeded stats regularly.

b) I think our tendency to dump all stats whenever we crash isn't really tenable, given how autovacuum etc are tied to them. We should think about ways to avoid that if we're going to do a major rewrite of the stats stuff, which this certainly sounds like.

If there weren't HS to worry about, these two points kinda sound like the data should be persisted into an actual table, rather than some weird other storage format. But HS seems to make that untenable.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> You both are obviously right. Another point of potential concern could
> be that we'd pretty much fully rely on dsm/dht's being available, for
> the server to function correctly. Are we ok with that? Right now
> postgres still works perfectly well, leaving parallelism aside, with
> dynamic_shared_memory_type = none.

In principle we could keep the existing mechanism as a fallback. Whether that's worth the trouble is debatable. The current code in initdb believes that every platform has some type of DSM support (see choose_dsm_implementation). Nobody's complained about that, and it certainly works on every buildfarm animal. So for all we know, dynamic_shared_memory_type = none is broken already.

> I think there's two other relevant points here:
> a) It'd be quite useful to avoid needing a whole cluster's stats in
> memory. Even if $subject would save memory, I'm hesitant committing
> to something requiring all stats to be in memory forever. As a first
> step it seems reasonable to e.g. not require state for all databases
> to be in memory. The first time per-database stats are required, it
> could be "paged in". Could even be more aggressive and do that on a
> per-table level and just have smaller placeholder entries for
> non-accessed tables, but that seems more work.

Huh? As long as we don't lock the shared memory into RAM, which on most platforms we couldn't do without being root anyway, ISTM the kernel should do just fine at paging out little-used areas of the stats. Let's not reinvent that wheel until there's demonstrable proof of need.

> b) I think our tendency to dump all stats whenever we crash isn't really
> tenable, given how autovacuum etc are tied to them.

Eh, maybe. I don't think crashes are really so common on production systems. As developers we have an inflated view of their frequency ;-)

regards, tom lane
On Sun, Aug 13, 2017 at 9:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> You both are obviously right. Another point of potential concern could
>> be that we'd pretty much fully rely on dsm/dht's being available, for
>> the server to function correctly. Are we ok with that? Right now
>> postgres still works perfectly well, leaving parallelism aside, with
>> dynamic_shared_memory_type = none.
>
> In principle we could keep the existing mechanism as a fallback.
> Whether that's worth the trouble is debatable. The current code
> in initdb believes that every platform has some type of DSM support
> (see choose_dsm_implementation). Nobody's complained about that,
> and it certainly works on every buildfarm animal. So for all we know,
> dynamic_shared_memory_type = none is broken already.

Actually, now that you mention it, I think it *is* broken already, or more to the point, if you configure that value, the autovacuum launcher hangs up, because of the AutovacuumWorkItem stuff that Alvaro added. When I just tested it, the AV launcher somehow ended up waiting for AutovacuumLock and I had to SIGQUIT the server to shut it down. That's actually not really entirely the fault of dynamic_shared_memory_type = none, though, because the code in autovacuum.c does this:

    AutoVacuumDSA = dsa_create(AutovacuumLock->tranche);
    /* make sure it doesn't go away even if we do */
    dsa_pin(AutoVacuumDSA);
    dsa_pin_mapping(AutoVacuumDSA);

Now, that's actually really broken because if dsa_create() throws an error of any kind, you're going to have already assigned the value to AutoVacuumDSA, but you will not have pinned the DSA or the DSA mapping. There's evidently some additional bug here because I'd sorta expect this code to just go into an infinite loop in this case, failing over and over trying to reattach the segment, but evidently something even worse is happening - perhaps the ERROR isn't releasing AutovacuumLock.

Anyway, this code has multiple error handling defects and IMHO it's pretty questionable whether DSA should have been used here at all. Allowing the number of autovacuum work items to grow without bound would not be a good design - and if we've got a bound somewhere, why not just put this in the main shared memory segment?

Leaving all that aside, when the DSM facility was introduced in 9.4, I was afraid that it would break things for people and that they'd demand a way to turn it off, and I included "none" as an option for that purpose. Well, it did break a few things initially but those seem to have mostly been fixed during the 9.4 cycle itself. I'm not aware of any subsequent problem reports caused by having DSM enabled (pointers welcome!) and given that we now have parallel query relying on DSM, people are much less likely to want to just give up on having DSM available. If somebody has a system where no other form of shared memory works, dynamic_shared_memory_type = mmap is still a thing, so the use case for "none" seems very thin indeed. I'd vote for just ripping it out in v11.

>> a) It'd be quite useful to avoid needing a whole cluster's stats in
>> memory. Even if $subject would save memory, I'm hesitant committing
>> to something requiring all stats to be in memory forever. As a first
>> step it seems reasonable to e.g. not require state for all databases
>> to be in memory. The first time per-database stats are required, it
>> could be "paged in". Could even be more aggressive and do that on a
>> per-table level and just have smaller placeholder entries for
>> non-accessed tables, but that seems more work.
>
> Huh? As long as we don't lock the shared memory into RAM, which on most
> platforms we couldn't do without being root anyway, ISTM the kernel should
> do just fine at paging out little-used areas of the stats. Let's not
> reinvent that wheel until there's demonstrable proof of need.

I think some systems aren't able to page out data stored in shared memory segments, and it would probably be pretty awful for performance if they did (there's a reason large sorts need to switch to using temp files ... and the same rationale seems to apply here).

That having been said, I don't see a need to change everything at once. If we moved the stats collector data from frequently-rewritten files to shared memory, we'd already be saving a lot of I/O and possibly memory utilization. If somebody then wanted to refine that further by spilling rarely used data out to disk and reloading it on demand, that could be done as a follow-on patch. But I think that would only be needed for people with *really* large numbers of tables, and it would only help them if most of those tables were barely ever touched.

>> b) I think our tendency to dump all stats whenever we crash isn't really
>> tenable, given how autovacuum etc are tied to them.
>
> Eh, maybe. I don't think crashes are really so common on production
> systems. As developers we have an inflated view of their frequency ;-)

Without taking a position on the point under debate, AFAIK it wouldn't be technically complex either under our current architecture or the proposed new one to dump out a new permanent stats file every 10 minutes or so. So if there is an issue here I think it might not be very hard to fix, whatever else we do.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On 2017-08-14 11:46:10 -0400, Robert Haas wrote:
> On Sun, Aug 13, 2017 at 9:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Andres Freund <andres@anarazel.de> writes:
> >> You both are obviously right. Another point of potential concern could
> >> be that we'd pretty much fully rely on dsm/dht's being available, for
> >> the server to function correctly. Are we ok with that? Right now
> >> postgres still works perfectly well, leaving parallelism aside, with
> >> dynamic_shared_memory_type = none.
> >
> > In principle we could keep the existing mechanism as a fallback.
> > Whether that's worth the trouble is debatable. The current code
> > in initdb believes that every platform has some type of DSM support
> > (see choose_dsm_implementation). Nobody's complained about that,
> > and it certainly works on every buildfarm animal. So for all we know,
> > dynamic_shared_memory_type = none is broken already.
>
> Actually, now that you mention it, I think it *is* broken already, or
> more to the point, if you configure that value, the autovacuum
> launcher hangs up, because of the AutovacuumWorkItem stuff that Alvaro
> added. When I just tested it, the AV launcher somehow ended up
> waiting for AutovacuumLock and I had to SIGQUIT the server to shut it
> down. That's actually not really entirely the fault of
> dynamic_shared_memory_type = none, though, because the code in
> autovacuum.c does this:
>
>     AutoVacuumDSA = dsa_create(AutovacuumLock->tranche);
>     /* make sure it doesn't go away even if we do */
>     dsa_pin(AutoVacuumDSA);
>     dsa_pin_mapping(AutoVacuumDSA);
>
> Now, that's actually really broken because if dsa_create() throws an
> error of any kind, you're going to have already assigned the value to
> AutoVacuumDSA, but you will not have pinned the DSA or the DSA
> mapping. There's evidently some additional bug here because I'd sorta
> expect this code to just go into an infinite loop in this case,
> failing over and over trying to reattach the segment, but evidently
> something even worse happening - perhaps the ERROR isn't releasing
> AutovacuumLock. Anyway, this code has multiple error handling defects
> and IMHO it's pretty questionable whether DSA should have been used
> here at all. Allowing the number of autovacuum work items to grow
> without bound would not be a good design - and if we've got a bound
> somewhere, why not just put this in the main shared memory segment?

CCing Alvaro. This seems like it should be an open item.

> Leaving all that aside, when the DSM facility was introduced in 9.4, I
> was afraid that it would break things for people and that they'd
> demand a way to turn it off, and I included "none" as an option for
> that purpose. Well, it did break a few things initially but those
> seem to have mostly been fixed during the 9.4 cycle itself. I'm not
> aware of any subsequent problem reports caused by having DSM enabled
> (pointers welcome!) and given that we now have parallel query relying
> on DSM, people are much less likely to want to just give up on having
> DSM available. If somebody has a system where no other form of shared
> memory works, dynamic_shared_memory_type = mmap is still a thing, so
> the use case for "none" seems very thin indeed. I'd vote for just
> ripping it out in v11.

It's possibly still useful for debugging - we'll continue not to be able to rely on allocations to succeed...

> >> a) It'd be quite useful to avoid needing a whole cluster's stats in
> >> memory. Even if $subject would save memory, I'm hesitant committing
> >> to something requiring all stats to be in memory forever. As a first
> >> step it seems reasonable to e.g. not require state for all databases
> >> to be in memory. The first time per-database stats are required, it
> >> could be "paged in". Could even be more aggressive and do that on a
> >> per-table level and just have smaller placeholder entries for
> >> non-accessed tables, but that seems more work.
> >
> > Huh? As long as we don't lock the shared memory into RAM, which on most
> > platforms we couldn't do without being root anyway, ISTM the kernel should
> > do just fine at paging out little-used areas of the stats. Let's not
> > reinvent that wheel until there's demonstrable proof of need.
>
> I think some systems aren't able to page out data stored in shared
> memory segments, and it would probably be pretty awful for performance
> if they did (there's a reason large sorts need to switch to using temp
> files ... and the same rationale seems to apply here).

Right.

> That having been said, I don't see a need to change everything at
> once. If we moved the stats collector data from frequently-rewritten
> files to shared memory, we'd already be saving a lot of I/O and
> possibly memory utilization.

Right.

> >> b) I think our tendency to dump all stats whenever we crash isn't really
> >> tenable, given how autovacuum etc are tied to them.
> >
> > Eh, maybe. I don't think crashes are really so common on production
> > systems. As developers we have an inflated view of their frequency ;-)

FWIW, in my experience crashes are far from rare. And most shutdowns / restarts are immediate ones in my experience.

> Without taking a position on the point under debate, AFAIK it wouldn't
> be technically complex either under our current architecture or the
> proposed new one to dump out a new permanent stats file every 10
> minutes or so. So if there is an issue here I think it might not be
> very hard to fix, whatever else we do.

There's some issues around that, primarily because of how dropped tables are handled. It's not entirely trivial to avoid having old entries around after a dropped table, which then even can lead to conflicts if oids are reused.

Greetings,

Andres Freund
Re: [HACKERS] shared memory based stat collector (was: Sharing record typmods between backends)
From
Alvaro Herrera
Date:
Robert Haas wrote: > Actually, now that you mention it, I think it *is* broken already, or > more to the point, if you configure that value, the autovacuum > launcher hangs up, because of the AutovacuumWorkItem stuff that Alvaro > added. When I just tested it, the AV launcher somehow ended up > waiting for AutovacuumLock and I had to SIGQUIT the server to shut it > down. That's actually not really entirely the fault of > dynamic_shared_memory_type = none, though, because the code in > autovacuum.c does this: > > AutoVacuumDSA = dsa_create(AutovacuumLock->tranche); > /* make sure it doesn't go away even if we do */ > dsa_pin(AutoVacuumDSA); > dsa_pin_mapping(AutoVacuumDSA); > > Now, that's actually really broken because if dsa_create() throws an > error of any kind, you're going to have already assigned the value to > AutoVacuumDSA, but you will not have pinned the DSA or the DSA > mapping. There's evidently some additional bug here because I'd sorta > expect this code to just go into an infinite loop in this case, > failing over and over trying to reattach the segment, but evidently > something even worse happening - perhaps the ERROR isn't releasing > AutovacuumLock. Looking into this now. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Aug 13, 2017 at 9:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In principle we could keep the existing mechanism as a fallback.
>> Whether that's worth the trouble is debatable. The current code
>> in initdb believes that every platform has some type of DSM support
>> (see choose_dsm_implementation). Nobody's complained about that,
>> and it certainly works on every buildfarm animal. So for all we know,
>> dynamic_shared_memory_type = none is broken already.

> Actually, now that you mention it, I think it *is* broken already, or
> more to the point, if you configure that value, the autovacuum
> launcher hangs up, because of the AutovacuumWorkItem stuff that Alvaro
> added. When I just tested it, the AV launcher somehow ended up
> waiting for AutovacuumLock and I had to SIGQUIT the server to shut it
> down.

Hmm, shouldn't that be an open item for v10?

> ... If somebody has a system where no other form of shared
> memory works, dynamic_shared_memory_type = mmap is still a thing, so
> the use case for "none" seems very thin indeed. I'd vote for just
> ripping it out in v11.

Just FYI, the only values being reported by buildfarm animals are "posix", "sysv", and "windows". So while mmap may be a thing, it's an untested thing.

 dsm_type | critters_reporting
----------+--------------------
 posix    |                 69
 sysv     |                 13
 windows  |                 10
(3 rows)

regards, tom lane
On Mon, Aug 14, 2017 at 12:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sun, Aug 13, 2017 at 9:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> In principle we could keep the existing mechanism as a fallback.
>>> Whether that's worth the trouble is debatable. The current code
>>> in initdb believes that every platform has some type of DSM support
>>> (see choose_dsm_implementation). Nobody's complained about that,
>>> and it certainly works on every buildfarm animal. So for all we know,
>>> dynamic_shared_memory_type = none is broken already.
>
>> Actually, now that you mention it, I think it *is* broken already, or
>> more to the point, if you configure that value, the autovacuum
>> launcher hangs up, because of the AutovacuumWorkItem stuff that Alvaro
>> added. When I just tested it, the AV launcher somehow ended up
>> waiting for AutovacuumLock and I had to SIGQUIT the server to shut it
>> down.
>
> Hmm, shouldn't that be an open item for v10?

I already added it.

>> ... If somebody has a system where no other form of shared
>> memory works, dynamic_shared_memory_type = mmap is still a thing, so
>> the use case for "none" seems very thin indeed. I'd vote for just
>> ripping it out in v11.
>
> Just FYI, the only values being reported by buildfarm animals are
> "posix", "sysv", and "windows". So while mmap may be a thing,
> it's an untested thing.

I'm pretty sure I dev-tested it before committing anything, but, certainly, having ongoing BF coverage wouldn't be a bad thing.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017-08-14 12:21:30 -0400, Robert Haas wrote:
> >> ... If somebody has a system where no other form of shared
> >> memory works, dynamic_shared_memory_type = mmap is still a thing, so
> >> the use case for "none" seems very thin indeed. I'd vote for just
> >> ripping it out in v11.
> >
> > Just FYI, the only values being reported by buildfarm animals are
> > "posix", "sysv", and "windows". So while mmap may be a thing,
> > it's an untested thing.
>
> I'm pretty sure I dev-tested it before committing anything, but,
> certainly, having ongoing BF coverage wouldn't be a bad thing.

Are there any platforms that require it? I thought posix, sysv and windows are sufficient? If we're going to start relying more on dsms we probably don't want to use mmap anyway...

- Andres
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Aug 14, 2017 at 12:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Just FYI, the only values being reported by buildfarm animals are
>> "posix", "sysv", and "windows". So while mmap may be a thing,
>> it's an untested thing.

> I'm pretty sure I dev-tested it before committing anything, but,
> certainly, having ongoing BF coverage wouldn't be a bad thing.

Looking closer, the reason those are the only reported values is that those are the only possible results from initdb's choose_dsm_implementation(). So the real question here is whether "mmap" should be considered to dominate "sysv" if it's available. If so, why isn't choose_dsm_implementation() trying it; and if not, why are we carrying it?

regards, tom lane
On 2017-08-14 12:28:39 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Mon, Aug 14, 2017 at 12:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Just FYI, the only values being reported by buildfarm animals are
> >> "posix", "sysv", and "windows". So while mmap may be a thing,
> >> it's an untested thing.
>
> > I'm pretty sure I dev-tested it before committing anything, but,
> > certainly, having ongoing BF coverage wouldn't be a bad thing.
>
> Looking closer, the reason those are the only reported values is
> that those are the only possible results from initdb's
> choose_dsm_implementation(). So the real question here is whether
> "mmap" should be considered to dominate "sysv" if it's available.

No, mmap isn't a good option - it's file-backed mmap, rather than anonymous mmap. To my knowledge there's no good portable way to use anonymous mmap to share memory across processes unless established before a fork().

> If so, why isn't choose_dsm_implementation() trying it; and if not,
> why are we carrying it?

I think the idea was that there might be platforms that require it, but ...

Greetings,

Andres Freund
Re: [HACKERS] shared memory based stat collector (was: Sharing record typmods between backends)
From
Magnus Hagander
Date:
On Mon, Aug 14, 2017 at 5:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Aug 13, 2017 at 9:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> b) I think our tendency to dump all stats whenever we crash isn't really
>> tenable, given how autovacuum etc are tied to them.
>
> Eh, maybe. I don't think crashes are really so common on production
> systems. As developers we have an inflated view of their frequency ;-)
From a stats perspective I think the crash problem is the small problem. The big problem is that we nuke them on replication promotion and we nuke them on PITR. If we solve those two, I'm pretty sure we would also solve the on-crash case in more or less the same way.
Without taking a position on the point under debate, AFAIK it wouldn't
be technically complex either under our current architecture or the
proposed new one to dump out a new permanent stats file every 10
minutes or so. So if there is an issue here I think it might not be
very hard to fix, whatever else we do.
I started working on that patch at some point, I think I have a rotting branch somewhere. That part was indeed fairly easy.
I got stalled when I feature-crept myself by realizing I wanted it snapshotted to WAL so it could fix the PITR and replication issues. And then properly bogged down when I realized that on the standby I'd want *both* the stats from the standby (while it's running) and the stats from the master (after switchover).
On Mon, Aug 14, 2017 at 12:36 PM, Andres Freund <andres@anarazel.de> wrote:
>> If so, why isn't choose_dsm_implementation() trying it; and if not,
>> why are we carrying it?
>
> I think the idea was that there might be platforms that require it, but
> ...

Right. So, for example, POSIX shared memory will fail on Linux if /dev/shm is inaccessible, and I've seen such systems in the wild. System V shared memory will fail if the kernel limits are too small. On *BSD, which lacks POSIX shared memory altogether, whether you can start PostgreSQL with the default configuration settings depends entirely on how the System V shared memory limits are configured.

I think it would be a bad idea to remove both dynamic_shared_memory_type=none and dynamic_shared_memory_type=mmap. If you do that, then somebody who doesn't have root and whose system configuration is messed up can't start PostgreSQL at all. While I understand that catering to rarely-used options has some cost, I don't think those costs are exorbitant. And, I think history shows that minimizing dependencies on operating-system settings is a win. Commit b0fc0df9364d2d2d17c0162cf3b8b59f6cb09f67 may not be my most-appreciated commit ever, but it undoubtedly had the highest appreciation-to-effort ratio.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017-08-14 18:57:39 +0200, Magnus Hagander wrote: > On Mon, Aug 14, 2017 at 5:46 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > > On Sun, Aug 13, 2017 at 9:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > >> b) I think our tendency to dump all stats whenever we crash isn't really > > >> tenable, given how autovacuum etc are tied to them. > > > > > > Eh, maybe. I don't think crashes are really so common on production > > > systems. As developers we have an inflated view of their frequency ;-) > > > > From a stats perspective I think the crash problem is the small problem. > The big problem is we nuke them on replication promotion and we nuke them > on PITR. If we solve those two, I'm pretty sure we would also solve the > on-crash more or less in the same thing. I don't think they're that rare, but the others are problems too. Unfortunately I think changing that is a bit more complicated, given that some stats are actually updated on standbys. I previously thought that an option to occasionally WAL log the stats file would be useful (e.g. just before a checkpoint). That'd make them persistent, and available on the standby. But that'd still require somehow dealing with stats being produced on the standby - I presume we'd need multiple stats files and provide functions for merging them. It'd be good if we somehow could figure out a way to WAL log the stats data in a way that's consistent, i.e. doesn't contain data about dumped relations. But I don't quite see how... > > Without taking a position on the point under debate, AFAIK it wouldn't > > be technically complex either under our current architecture or the > > proposed new one to dump out a new permanent stats file every 10 > > minutes or so. So if there is an issue here I think it might not be > > very hard to fix, whatever else we do. > > > > I started working on that patch at some point, I think I have a rotting > branch somewhere. That part was indeed fairly easy. 
> > I got stalled when I feature-crept myself by realizing I wanted it > snapshotted to WAL so it could fix the PITR and replication issues. And > then properly bogged down when I realized that on the standby I'd want > *both* the stats from the standby (while it's running) and the stats from > the master (after switchover). Hah, indeed. Greetings, Andres Freund
On Mon, Aug 14, 2017 at 1:02 PM, Andres Freund <andres@anarazel.de> wrote: > I previously thought that an option to occasionally WAL log the stats > file would be useful (e.g. just before a checkpoint). That'd make them > persistent, and available on the standby. But that'd still require > somehow dealing with stats being produced on the standby - I presume > we'd need multiple stats files and provide functions for merging them. Well, if you think the stats files might be too big to fit in memory, then you're probably imagining tens of gigabytes of data, and you're not going to want to write a WAL record that size, either. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017-08-14 13:06:48 -0400, Robert Haas wrote: > On Mon, Aug 14, 2017 at 1:02 PM, Andres Freund <andres@anarazel.de> wrote: > > I previously thought that an option to occasionally WAL log the stats > > file would be useful (e.g. just before a checkpoint). That'd make them > > persistent, and available on the standby. But that'd still require > > somehow dealing with stats being produced on the standby - I presume > > we'd need multiple stats files and provide functions for merging them. > > Well, if you think the stats files might be too big to fit in memory, > then you're probably imagining tens of gigabytes of data, and you're > not going to want to write a WAL record that size, either. Well, that's why I'm thinking of it having to be an option. You presumably don't want it on relatively idle servers, but if you have big databases where unnecessary vacuums are bad... I think again, this all'd be better if we'd figure out a way to make this use a proper table... We kind of have figured out how to store data in a queryable manner, with buffering etc. Kind of. Wonder if there's some chance of a hack where we have a shared table with a key like (dboid, reloid, inreplication). If we were to create those (once for replication, once for not) at table creation, we could heap_inplace_update them even on a standby... - Andres
Re: [HACKERS] shared memory based stat collector (was: Sharing record typmods between backends)
From
Alvaro Herrera
Date:
Robert Haas wrote: > Actually, now that you mention it, I think it *is* broken already, or > more to the point, if you configure that value, the autovacuum > launcher hangs up, because of the AutovacuumWorkItem stuff that Alvaro > added. When I just tested it, the AV launcher somehow ended up > waiting for AutovacuumLock and I had to SIGQUIT the server to shut it > down. That's actually not really entirely the fault of > dynamic_shared_memory_type = none, though, because the code in > autovacuum.c does this: > > AutoVacuumDSA = dsa_create(AutovacuumLock->tranche); > /* make sure it doesn't go away even if we do */ > dsa_pin(AutoVacuumDSA); > dsa_pin_mapping(AutoVacuumDSA); > > Now, that's actually really broken because if dsa_create() throws an > error of any kind, you're going to have already assigned the value to > AutoVacuumDSA, but you will not have pinned the DSA or the DSA > mapping. There's evidently some additional bug here because I'd sorta > expect this code to just go into an infinite loop in this case, > failing over and over trying to reattach the segment, but evidently > something even worse happening - perhaps the ERROR isn't releasing > AutovacuumLock. Yeah, the problem that lwlocks aren't released is because the launcher is not in a transaction at that point, so AbortCurrentTransaction() doesn't release locks like it normally would. The simplest fix (in the attached 0001 patch) is to add a LWLockReleaseAll() call to the jmp block, though I wonder if there should be some other cleanup functions called from there, or whether perhaps it'd be a better strategy to have the launcher run in a transaction at all times. The other problem is that there's no attempt to handle a failed DSA creation/attachment. The second patch just adds a PG_TRY block that sets a flag not to try the DSA calls again if the first one fails. It throws a single ERROR line, then autovacuum continues without workitem support. 
I intend to give these patches further thought before pushing anything;
I will update this thread no later than tomorrow 19:00 UTC-0300.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Attachments
Robert Haas <robertmhaas@gmail.com> writes:
> I think it would be a bad idea to remove both
> dynamic_shared_memory_type=none and dynamic_shared_memory_type=mmap.
> If you do that, then somebody who doesn't have root and whose system
> configuration is messed up can't start PostgreSQL at all.

I'd be happy to leave the mmap option in place if it were getting tested
regularly. As things stand, I'm doubtful that someone who was stuck in
this hypothetical bad place could rely on it to work at all.

			regards, tom lane
On Sun, Aug 13, 2017 at 05:56:56PM -0700, Andres Freund wrote:
> Hi,
>
> Since we're getting a bit into the weeds of a different topic, and since
> I think it's an interesting feature, I'm detaching this into a separate
> thread.
>
> On 2017-08-12 23:37:27 -0400, Tom Lane wrote:
> > >> On 2017-08-12 22:52:57 -0400, Robert Haas wrote:
> > >>> I think it'd be pretty interesting to look at replacing parts of the
> > >>> stats collector machinery with something DHT-based.
> >
> > On Sat, Aug 12, 2017 at 11:30 PM, Andres Freund <andres@anarazel.de> wrote:
> > >> That seems to involve a lot more than this though, given that currently
> > >> the stats collector data doesn't entirely have to be in memory. I've
> > >> seen sites with a lot of databases with quite some per-database stats
> > >> data. Don't think we can just require that to be in memory :(
> >
> > Robert Haas <robertmhaas@gmail.com> writes:
> > > Hmm. I'm not sure it wouldn't end up being *less* memory. Don't we
> > > end up caching 1 copy of it per backend, at least for the database to
> > > which that backend is connected? Accessing a shared copy would avoid
> > > that sort of thing.
> >
> > Yeah ... the collector itself has got all that in memory anyway.
> > We do need to think about synchronization issues if we make that
> > memory globally available, but I find it hard to see how that would
> > lead to more memory consumption overall than what happens now.
>
> You both are obviously right. Another point of potential concern could
> be that we'd pretty much fully rely on dsm/dht's being available for
> the server to function correctly. Are we ok with that? Right now
> postgres still works perfectly well, leaving parallelism aside, with
> dynamic_shared_memory_type = none.
>
> What are your thoughts about how to actually implement this? It seems
> we'd have to do something like:
>
> 1) Keep the current per-backend & per-transaction state in each
>    backend. That allows both to throw away the information and avoids
>    increasing contention quite noticeably.
>
> 2) Some plain shared memory with metadata. A set of shared hashtables
>    for per-database, per-relation contents.
>
> 3) Individual database/relation entries are either individual atomics
>    (we don't rely on consistency anyway), or seqcount (like
>    st_changecount) based.
>
> 4) Instead of sending stats at transaction end, copy them into a
>    "pending" entry. Nontransactional contents can be moved to
>    the pending entry more frequently.
>
> 5) Occasionally, try to flush the pending array into the global hash.
>    The lookup in the table would be protected by something
>    LWLockConditionalAcquire() based, to avoid blocking - don't want to
>    introduce chokepoints due to commonly used tables and such. Updating
>    the actual stats can happen without the partition locks being held.
>
> I think there's two other relevant points here:
>
> a) It'd be quite useful to avoid needing a whole cluster's stats in
>    memory. Even if $subject would save memory, I'm hesitant committing
>    to something requiring all stats to be in memory forever. As a first
>    step it seems reasonable to e.g. not require state for all databases
>    to be in memory. The first time per-database stats are required, it
>    could be "paged in". Could even be more aggressive and do that on a
>    per-table level and just have smaller placeholder entries for
>    non-accessed tables, but that seems more work.
>
>    On the other hand, autovacuum is likely going to make that approach
>    useless anyway, given it's probably going to access otherwise unneeded
>    stats regularly.
>
> b) I think our tendency to dump all stats whenever we crash isn't really
>    tenable, given how autovacuum etc are tied to them. We should think
>    about ways to avoid that if we're going to do a major rewrite of the
>    stats stuff, which this certainly sounds like.
>
> If there weren't HS to worry about, these two points kinda sound like
> the data should be persisted into an actual table, rather than some
> weird other storage format. But HS seems to make that untenable.
>
> Greetings,
>
> Andres Freund

As I recall, David Gould (hi!) had run across a case where there were
thousands of tables and the stats file became a pretty serious
bottleneck. There might even be a design or even code to address it.

Best,
David.

--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
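Point 3 of Andres's sketch mentions seqcount (st_changecount-style) entries. A self-contained toy of that read-retry protocol is below; the StatsEntry layout and all names are made up for illustration, and real code would also need the barrier discipline pgstat uses around the non-atomic payload:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* A stats entry guarded by a change counter: the writer bumps the
 * counter to an odd value before touching the payload and back to an
 * even value afterwards; readers retry until they observe the same
 * even value on both sides of their copy. No reader ever blocks. */
typedef struct StatsEntry
{
    atomic_uint changecount;
    uint64_t    tuples_inserted;    /* example payload fields */
    uint64_t    tuples_deleted;
} StatsEntry;

static void
stats_update(StatsEntry *e, uint64_t ins, uint64_t del)
{
    atomic_fetch_add(&e->changecount, 1);   /* counter is now odd */
    e->tuples_inserted += ins;
    e->tuples_deleted += del;
    atomic_fetch_add(&e->changecount, 1);   /* even again: stable */
}

static void
stats_read(StatsEntry *e, uint64_t *ins, uint64_t *del)
{
    unsigned int before, after;

    do
    {
        before = atomic_load(&e->changecount);
        *ins = e->tuples_inserted;
        *del = e->tuples_deleted;
        after = atomic_load(&e->changecount);
    } while (before != after || (before & 1)); /* retry on concurrent write */
}
```

The appeal for a shared-memory stats hash is that readers (monitoring queries, autovacuum) never take a lock that could stall the writers updating counters on hot tables.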
On Mon, Aug 14, 2017 at 1:12 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Yeah, the problem that lwlocks aren't released is because the launcher
> is not in a transaction at that point, so AbortCurrentTransaction()
> doesn't release locks like it normally would. The simplest fix (in the
> attached 0001 patch) is to add a LWLockReleaseAll() call to the jmp
> block, though I wonder if there should be some other cleanup functions
> called from there, or whether perhaps it'd be a better strategy to have
> the launcher run in a transaction at all times.

Well, if you're going to do aborts outside of a transaction, just
adding an LWLockReleaseAll() isn't really sufficient. You need to
look at something like CheckpointerMain() and figure out which of
those push-ups are needed here as well. Probably at least
ConditionVariableCancelSleep(), pgstat_report_wait_end(),
AbortBufferIO(), and UnlockBuffers() -- quite possibly some of those
other AtEOXact calls as well.

> The other problem is that there's no attempt to handle a failed DSA
> creation/attachment. The second patch just adds a PG_TRY block that
> sets a flag not to try the DSA calls again if the first one fails. It
> throws a single ERROR line, then autovacuum continues without workitem
> support.

Yeah, and the other question -- which Thomas asked before you
originally committed, and which I just now asked again -- is "Why in
the world are you using DSA for this at all?". There are serious
problems with that which both he and I have pointed out, and you
haven't explained why it's a good idea at any point, AFAICT. Among
those problems:

1. It doesn't work if dynamic_shared_memory_type=none. That's OK for
an optional subsystem, but autovacuum is not very optional.

2. It allows unbounded bloat if there's no limit on the number of work
items, and is pointless if there is, since you could then just use the
main shared memory segment.

I really think you should respond to those concerns, not just push a
minimal fix.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] shared memory based stat collector (was: Sharing record typmods between backends)
From
Alvaro Herrera
Date
Robert Haas wrote:
> On Mon, Aug 14, 2017 at 1:12 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Yeah, the problem that lwlocks aren't released is because the launcher
> > is not in a transaction at that point, so AbortCurrentTransaction()
> > doesn't release locks like it normally would. The simplest fix (in the
> > attached 0001 patch) is to add a LWLockReleaseAll() call to the jmp
> > block, though I wonder if there should be some other cleanup functions
> > called from there, or whether perhaps it'd be a better strategy to have
> > the launcher run in a transaction at all times.
>
> Well, if you're going to do aborts outside of a transaction, just
> adding an LWLockReleaseAll() isn't really sufficient. You need to
> look at something like CheckpointerMain() and figure out which of
> those push-ups are needed here as well. Probably at least
> ConditionVariableCancelSleep(), pgstat_report_wait_end(),
> AbortBufferIO(), and UnlockBuffers() -- quite possibly some of those
> other AtEOXact calls as well.

Agreed. I think a saner answer is to create a single function that does
it all, and use it in all places that need it, rather than keep adding
more copies of the same thing. The attached revised version of the
patch does things that way; I think it's good cleanup. (I had to make
the ResourceOwnerRelease call conditional on there being a resowner;
seems okay to me.)

I put the new cleanup routine in xact.c, which is not exactly the
perfect place (I had to add access/xact.h to a couple of files), but the
fact that the new routine is a cut-down version of another routine in
xact.c makes it clear to me that that's where it belongs. I first put
it in bootstrap, alongside AuxiliaryProcessMain, but after a while that
seemed wrong.

> > The other problem is that there's no attempt to handle a failed DSA
> > creation/attachment. The second patch just adds a PG_TRY block that
> > sets a flag not to try the DSA calls again if the first one fails. It
> > throws a single ERROR line, then autovacuum continues without
> > workitem support.
>
> Yeah, and the other question -- which Thomas asked before you
> originally committed, and which I just now asked again -- is "Why in
> the world are you using DSA for this at all?". There are serious
> problems with that which both he and I have pointed out, and you
> haven't explained why it's a good idea at any point, AFAICT.

The main reason is that I envision that the workitem stuff will be used
for other things in the future than just brin summarization, and it
seemed a lame idea to just use a fixed-size memory area in the standard
autovacuum shared memory area. I think unbounded growth is of course
going to be bad. The current coding doesn't allow for any growth beyond
the initial fixed size, but it's easier to extend the system from the
current point rather than wholly changing the shared memory usage
pattern while at it. I thought I *had* responded to Thomas in that
thread, BTW.

> Among those problems:
>
> 1. It doesn't work if dynamic_shared_memory_type=none. That's OK for
> an optional subsystem, but autovacuum is not very optional.

Autovacuum as a whole continues to work if there's no dynamic shared
memory; it's just the workitems stuff that stops working if there's no
DSA. (After fixing the bug that makes it crash in the case of
dynamic_shared_memory_type=none, of course.)

> 2. It allows unbounded bloat if there's no limit on the number of work
> items, and is pointless if there is, since you could then just use the
> main shared memory segment.

Yeah, there are probably better strategies than just growing the memory
area every time another entry is needed. Work-items as a whole need a
lot more development from the current point.

> I really think you should respond to those concerns, not just push a
> minimal fix.

We'll just continue to develop things from the current point.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments
On Mon, Aug 14, 2017 at 7:16 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> Yeah, and the other question -- which Thomas asked before you
>> originally committed, and which I just now asked again -- is "Why in
>> the world are you using DSA for this at all?". There are serious
>> problems with that which both he and I have pointed out, and you
>> haven't explained why it's a good idea at any point, AFAICT.
>
> The main reason is that I envision that the workitem stuff will be used
> for other things in the future than just brin summarization, and it
> seemed a lame idea to just use a fixed-size memory area in the standard
> autovacuum shared memory area. I think unbounded growth is of course
> going to be bad. The current coding doesn't allow for any growth beyond
> the initial fixed size, but it's easier to extend the system from the
> current point rather than wholly changing the shared memory usage
> pattern while at it.

I don't accept that argument. All the current code does is allocate
one fixed-size chunk of memory in DSA, so the whole pattern would have
to be changed *anyway* if you wanted to grow the array. It's not like
you are allocating the items one-by-one.

I really, really strongly encourage you to rip the use of DSA out here
entirely. It is reducing the reliability of a critical part of the
system for no actual benefit other than speculation that this is going
to be better in the future, and it adds a bunch of failure cases that
we could just as well live without.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> I really, really strongly encourage you to rip the use of DSA out here
> entirely. It is reducing the reliability of a critical part of the
> system for no actual benefit other than speculation that this is going
> to be better in the future, and it adds a bunch of failure cases that
> we could just as well live without.

FWIW, I vote with Robert on this. When and if you actually want to make
that array resizable, it'd be time to introduce use of a DSA. But right
now we need to be looking for simple and reliable solutions for v10.

In particular, since right now dsm_type = NONE is still considered
supported, we can't really have autovacuum facilities that are
dependent on being able to use DSM. That might change in future, but
not today.

			regards, tom lane
Re: [HACKERS] shared memory based stat collector (was: Sharing record typmods between backends)
From
Alvaro Herrera
Date
Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > I really, really strongly encourage you to rip the use of DSA out here
> > entirely. It is reducing the reliability of a critical part of the
> > system for no actual benefit other than speculation that this is going
> > to be better in the future, and it adds a bunch of failure cases that
> > we could just as well live without.
>
> FWIW, I vote with Robert on this. When and if you actually want to make
> that array resizable, it'd be time to introduce use of a DSA. But right
> now we need to be looking for simple and reliable solutions for v10.

Okay, I'll make it so.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services