Обсуждение: why doesn't DestroyPartitionDirectory hash_destroy?
Hi, I'm curious why DestroyPartitionDirectory doesn't do hash_destroy(pdir->pdir_hash)? Thanks, Amit
At Thu, 14 Mar 2019 16:13:23 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in <3ad792cd-0805-858e-595c-c09e9d1ce042@lab.ntt.co.jp> > Hi, > > I'm curious why DestroyPartitionDirectory doesn't do > hash_destroy(pdir->pdir_hash)? Maybe it is trashed involved in destruction of es_query_cxt or planner_cxt? -- Kyotaro Horiguchi NTT Open Source Software Center
On 2019/03/14 16:32, Kyotaro HORIGUCHI wrote: > At Thu, 14 Mar 2019 16:13:23 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in <3ad792cd-0805-858e-595c-c09e9d1ce042@lab.ntt.co.jp> >> Hi, >> >> I'm curious why DestroyPartitionDirectory doesn't do >> hash_destroy(pdir->pdir_hash)? > > Maybe it is trashed involved in destruction of es_query_cxt or > planner_cxt? Hmm, the executor's partition directory (its hash table) is indeed allocated under es_query_cxt. But, the planner's partition directory is not allocated under planner_cxt, it appears to be using memory under MessageContext. Thanks, Amit
On 2019/03/14 16:46, Amit Langote wrote: > On 2019/03/14 16:32, Kyotaro HORIGUCHI wrote: >> At Thu, 14 Mar 2019 16:13:23 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in <3ad792cd-0805-858e-595c-c09e9d1ce042@lab.ntt.co.jp> >>> Hi, >>> >>> I'm curious why DestroyPartitionDirectory doesn't do >>> hash_destroy(pdir->pdir_hash)? >> >> Maybe it is trashed involved in destruction of es_query_cxt or >> planner_cxt? > > Hmm, the executor's partition directory (its hash table) is indeed > allocated under es_query_cxt. But, the planner's partition directory is > not allocated under planner_cxt, it appears to be using memory under > MessageContext. I forgot to mention that it would be wrong to put it under planner_cxt, as it's the context for planning a given subquery, whereas a partition directory is maintained throughout the whole planning. Thanks, Amit
At Thu, 14 Mar 2019 17:18:29 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in <e7bcef25-317b-75fc-cc63-34e1fbec514b@lab.ntt.co.jp> > On 2019/03/14 16:46, Amit Langote wrote: > > On 2019/03/14 16:32, Kyotaro HORIGUCHI wrote: > >> At Thu, 14 Mar 2019 16:13:23 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in <3ad792cd-0805-858e-595c-c09e9d1ce042@lab.ntt.co.jp> > >>> Hi, > >>> > >>> I'm curious why DestroyPartitionDirectory doesn't do > >>> hash_destroy(pdir->pdir_hash)? > >> > >> Maybe it is trashed involved in destruction of es_query_cxt or > >> planner_cxt? > > > > Hmm, the executor's partition directory (its hash table) is indeed > > allocated under es_query_cxt. But, the planner's partition directory is > > not allocated under planner_cxt, it appears to be using memory under > > MessageContext. CurrentMemoryContext? It is PortalContext while planning CLUSTER scan. And it seems to be the same with planner_cxt with several narrow exceptions.. I think everything linked from PlannrInfo ought to be allocated in the top planner's planner_cxt if finally not expilcitly delinked and I *believe* subquery's planner_cxt is always same with that of the top-level subquery_planner. > I forgot to mention that it would be wrong to put it under planner_cxt, as > it's the context for planning a given subquery, whereas a partition > directory is maintained throughout the whole planning. So if the ParitionDirectory is allocaed explicitly in MessageContext, it would be danger on CLUSTER command. But as I see in the code, if it is in CurrentMemoryContext it would be safe but I think it should be fixed to use planner_cxt. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, Mar 14, 2019 at 3:13 AM Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > I'm curious why DestroyPartitionDirectory doesn't do > hash_destroy(pdir->pdir_hash)? What would be the point? It's more efficient to let context teardown take care of it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Mar 14, 2019 at 3:13 AM Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> I'm curious why DestroyPartitionDirectory doesn't do >> hash_destroy(pdir->pdir_hash)? > What would be the point? It's more efficient to let context teardown > take care of it. Agreed, but the comments in this area are crap. Why doesn't CreatePartitionDirectory say something like * The object lives inside the given memory context and will be * freed when that context is destroyed. Nonetheless, the caller * must *also* ensure that (unless the transaction is aborted) * DestroyPartitionDirectory is called before that happens, else * we may leak some relcache reference counts. It's completely not acceptable that every reader of this code should have to reverse-engineer these design assumptions, especially given how shaky they are. There's an independent question as to whether the planner's use of the feature is specifying a safe memory context. Has this code been exercised under GEQO? regards, tom lane
I wrote: > Agreed, but the comments in this area are crap. Actually, now that I've absorbed a bit more about 898e5e329, I don't like very much about it at all. I think having it try to hang onto pointers into the relcache is a completely wrongheaded design decision, and the right way for it to work is to just copy the PartitionDescs out of the relcache so that they're fully owned by the PartitionDirectory. I don't see a CopyPartitionDesc function anywhere (maybe it's named something else?) but it doesn't look like it'd be hard to build; most of the work is in partition_bounds_copy() which does exist already. Also, at least so far as the planner's usage is concerned, claiming that we're saving something by not copying is completely bogus, because if we look into set_relation_partition_info, what do we find but a partition_bounds_copy call. That wouldn't be necessary if we could rely on the PartitionDirectory to own the data structure. (Maybe it's not necessary today. But given what a house of cards this is, I wouldn't propose ripping it out, just moving it into the PartitionDirectory code.) regards, tom lane
On Thu, Mar 14, 2019 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Agreed, but the comments in this area are crap. Why doesn't > CreatePartitionDirectory say something like > > * The object lives inside the given memory context and will be > * freed when that context is destroyed. Nonetheless, the caller > * must *also* ensure that (unless the transaction is aborted) > * DestroyPartitionDirectory is called before that happens, else > * we may leak some relcache reference counts. > > It's completely not acceptable that every reader of this code should > have to reverse-engineer these design assumptions, especially given > how shaky they are. Well, one reason is that everything you just said is basically self-evident. If you spend 5 seconds looking at the header file, you'll see that there is a CreatePartitionDirectory() function and a DestroyPartitionDirectory() function, and so you'll probably figure out that the latter is intended to be called rather than just ignored. You will probably also guess that it doesn't need to be called if there's an ERROR, just like basically everything else in PostgreSQL. And if you want to know why, you can look at the code and you shouldn't have any trouble determining that it releases relcache ref counts, which may tip you off that if you don't call it, some relcache refcounts will not be released. But, look, I wrote the code. What's clear to me may not be clear to everybody. I have to admit I'm kinda surprised that this is the thing that is confusing to anybody, but if it is, then sure, let's add the comment! > There's an independent question as to whether the planner's use of > the feature is specifying a safe memory context. Has this code been > exercised under GEQO? Probably not. That's a good idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Mar 14, 2019 at 12:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> It's completely not acceptable that every reader of this code should >> have to reverse-engineer these design assumptions, especially given >> how shaky they are. > Well, one reason is that everything you just said is basically > self-evident. If you spend 5 seconds looking at the header file, > you'll see that there is a CreatePartitionDirectory() function and a > DestroyPartitionDirectory() function, and so you'll probably figure > out that the latter is intended to be called rather than just ignored. > You will probably also guess that it doesn't need to be called if > there's an ERROR, just like basically everything else in PostgreSQL. > And if you want to know why, you can look at the code and you > shouldn't have any trouble determining that it releases relcache ref > counts, which may tip you off that if you don't call it, some relcache > refcounts will not be released. So here's my problem with that argument: you're effectively saying that you needn't write any API spec for the PartitionDirectory functions because you intend that every person calling them will read their code, carefully and fully, before using them. This is not my idea of sound software engineering. If you need me to spell out why not, I will do so, but I'd like to think that I needn't explain abstraction to a senior committer. But anyway, it's somewhat moot because I think we need to change this API spec anyhow, per the other thread. PartitionDirectory should not be holding on to relcache refcounts, which would make DestroyPartitionDirectory unnecessary. regards, tom lane
On Thu, Mar 14, 2019 at 1:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Actually, now that I've absorbed a bit more about 898e5e329, > I don't like very much about it at all. I think having it > try to hang onto pointers into the relcache is a completely > wrongheaded design decision, and the right way for it to work > is to just copy the PartitionDescs out of the relcache so that > they're fully owned by the PartitionDirectory. I don't see > a CopyPartitionDesc function anywhere (maybe it's named something > else?) but it doesn't look like it'd be hard to build; most > of the work is in partition_bounds_copy() which does exist already. Yeah, we could do that. I have to admit that I don't necessarily understand why trying to hang onto pointers into the relcache is a bad idea. It is a bit complicated, but the savings in both memory and CPU time seem worth pursuing. There are a lot of users who wish we scaled to a million partitions rather than a hundred, and just copying everything all over the place all the time won't get us closer to that goal. More generally, I think get_relation_info() is becoming an increasingly nasty piece of work. It copies more and more stuff instead of just pointing to it, which is necessary mostly because it closes the table instead of arranging to do that at the end of query planning. If it did the opposite, the refcount held by the planner would make it unnecessary for the PartitionDirectory to hold one, and I bet we could also just point to a bunch of the other stuff in this function rather than copying that stuff, too. As time goes by, relcache entries are getting more and more complex, and the optimizer wants to use more and more data from them for planning purposes, but, probably partly because of inertia, we're clinging to an old design where everything has to be copied. Every time someone gets that wrong, and it's happened a number of times, we yell at them and tell them to copy more stuff instead of thinking up a design where stuff doesn't need to be copied. I think that's a mistake. You have previously disagreed with that position so you probably will now, too, but I still think it. > Also, at least so far as the planner's usage is concerned, claiming > that we're saving something by not copying is completely bogus, > because if we look into set_relation_partition_info, what do we > find but a partition_bounds_copy call. That wouldn't be necessary > if we could rely on the PartitionDirectory to own the data structure. > (Maybe it's not necessary today. But given what a house of cards > this is, I wouldn't propose ripping it out, just moving it into > the PartitionDirectory code.) Ugh, I didn't notice the partition_bounds_copy() call in that function. Oops. Given the foregoing griping, you won't be surprised to hear that I'd rather just remove the copying step. However, it sounds like we can do that no matter whether we stick with my design or switch to yours, because PartitionDirectory holds a relcache refcnt then the pointer will be stable, and if we deep-copy the whole data structure then the pointer will also be stable. Prior to the commit at issue, we weren't doing either of those things, so that copy was needed until very recently. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Mar 14, 2019 at 1:40 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > So here's my problem with that argument: you're effectively saying that > you needn't write any API spec for the PartitionDirectory functions > because you intend that every person calling them will read their code, > carefully and fully, before using them. This is not my idea of sound > software engineering. If you need me to spell out why not, I will do > so, but I'd like to think that I needn't explain abstraction to a senior > committer. I think you're attacking a straw man. I expect you don't seriously believe that I lack an understanding of abstraction. However, abstraction doesn't mean that the comment for CreatePartitionDirectory must describe what DestroyPartitionDirectory is going to do internally, as you seem to be proposing. Had I thought about this issue more sooner, I think I would have guessed you would be opposed to such a comment, since the chances of someone neglecting to update it when changing DestroyPartitionDirectory seem to be non-negligible. At the same time, and as I already said, I am also fine to improve the comments for these functions. The fact that both you and Amit found them inadequate - albeit in somewhat different ways - shows that they need improvement. However, that doesn't mean that what I did was flagrantly unreasonable or that I'm full of crap. Those things may be true, but this isn't believable evidence of it. If we're going to start talking about comments that make inadequate mention of important memory management details, I think a much better place to start than anything that I did in this commit would be the get_relation_info() function we were just discussing in a different part of this email thread. As I said over there, people keep failing to understand that any data you want to use during query planning has got to be copied out of the relcache in that function -- and this is a pretty subtle hazard, actually, because it works fine unless you get a cache flush at the wrong time or build with CLOBBER_CACHE_ALWAYS. Failing to call DestroyPartitionDirectory() breaks in a much more obvious way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2019/03/15 1:02, Robert Haas wrote: > On Thu, Mar 14, 2019 at 3:13 AM Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> I'm curious why DestroyPartitionDirectory doesn't do >> hash_destroy(pdir->pdir_hash)? > > What would be the point? It's more efficient to let context teardown > take care of it. Yeah, I only noticed that after posting my email. As I said in another reply, while the executor's partition directory is set up and torn down under a dedicated memory context used for execution (es_query_context), planner's is stuck into MessageContext. But all of the other stuff that planner allocates goes into it too, so maybe it's fine. Thanks, Amit
On 2019/03/15 2:56, Robert Haas wrote: > On Thu, Mar 14, 2019 at 1:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Actually, now that I've absorbed a bit more about 898e5e329, >> I don't like very much about it at all. I think having it >> try to hang onto pointers into the relcache is a completely >> wrongheaded design decision, and the right way for it to work >> is to just copy the PartitionDescs out of the relcache so that >> they're fully owned by the PartitionDirectory. I don't see >> a CopyPartitionDesc function anywhere (maybe it's named something >> else?) but it doesn't look like it'd be hard to build; most >> of the work is in partition_bounds_copy() which does exist already. > > Yeah, we could do that. I have to admit that I don't necessarily > understand why trying to hang onto pointers into the relcache is a bad > idea. It is a bit complicated, but the savings in both memory and CPU > time seem worth pursuing. There are a lot of users who wish we scaled > to a million partitions rather than a hundred, and just copying > everything all over the place all the time won't get us closer to that > goal. > > More generally, I think get_relation_info() is becoming an > increasingly nasty piece of work. It copies more and more stuff > instead of just pointing to it, which is necessary mostly because it > closes the table instead of arranging to do that at the end of query > planning. If it did the opposite, the refcount held by the planner > would make it unnecessary for the PartitionDirectory to hold one, and > I bet we could also just point to a bunch of the other stuff in this > function rather than copying that stuff, too. As time goes by, > relcache entries are getting more and more complex, and the optimizer > wants to use more and more data from them for planning purposes, but, > probably partly because of inertia, we're clinging to an old design > where everything has to be copied. Every time someone gets that > wrong, and it's happened a number of times, we yell at them and tell > them to copy more stuff instead of thinking up a design where stuff > doesn't need to be copied. I think that's a mistake. You have > previously disagreed with that position so you probably will now, too, > but I still think it. +1. >> Also, at least so far as the planner's usage is concerned, claiming >> that we're saving something by not copying is completely bogus, >> because if we look into set_relation_partition_info, what do we >> find but a partition_bounds_copy call. That wouldn't be necessary >> if we could rely on the PartitionDirectory to own the data structure. >> (Maybe it's not necessary today. But given what a house of cards >> this is, I wouldn't propose ripping it out, just moving it into >> the PartitionDirectory code.) > > Ugh, I didn't notice the partition_bounds_copy() call in that > function. Oops. Given the foregoing griping, you won't be surprised > to hear that I'd rather just remove the copying step. However, it > sounds like we can do that no matter whether we stick with my design > or switch to yours, because PartitionDirectory holds a relcache refcnt > then the pointer will be stable, and if we deep-copy the whole data > structure then the pointer will also be stable. Prior to the commit > at issue, we weren't doing either of those things, so that copy was > needed until very recently. One of the patches I've proposed in the "speed up planning with partitions" thread [1] gets rid of the partition_bounds_copy() call, because: 1) I too think it's unnecessary as the PartitionBoundInfo won't change (logically or physically) as long as we've got some lock on the table, which we do, 2) I've seen it become a significant bottleneck as the number of partitions crosses thousands. Thanks, Amit [1] https://commitfest.postgresql.org/22/1778/