Thread: SPGist "triple parity" concept doesn't work
I've been looking into the problem reported at
http://www.postgresql.org/message-id/519A5917.40704@qunar.com
and what I find is that we have spgist insertion operations deadlocking
against each other because one is descending from page A to page B while
the other descends from page B to page A.  According to the README file,
the "triple parity" page selection algorithm is supposed to prevent that:

: While descending the tree, the insertion algorithm holds exclusive lock on
: two tree levels at a time, ie both parent and child pages (parent and child
: pages can be the same, see notes above).  There is a possibility of deadlock
: between two insertions if there are cross-referenced pages in different
: branches.  That is, if inner tuple on page M has a child on page N while
: an inner tuple from another branch is on page N and has a child on page M,
: then two insertions descending the two branches could deadlock.  To prevent
: deadlocks we introduce a concept of "triple parity" of pages: if inner tuple
: is on page with BlockNumber N, then its child tuples should be placed on the
: same page, or else on a page with BlockNumber M where (N+1) mod 3 == M mod 3.
: This rule guarantees that tuples on page M will have no children on page N,
: since (M+1) mod 3 != N mod 3.

That would work fine as long as the invariant is maintained accurately.
However, there are at least two cases where the existing code fails to
maintain the invariant:

1. In spgAddNodeAction, if the enlarged inner tuple doesn't fit on the
current page anymore, we do this:

        /*
         * obtain new buffer with the same parity as current, since it will
         * be a child of same parent tuple
         */
        current->buffer = SpGistGetBuffer(index,
                                          GBUF_INNER_PARITY(current->blkno),
                                          ...

That's fine as long as the parent tuple wasn't also on the current page.
If it was on the current page, we end up re-downlinking the parent to a
page having the same parity it has, not one more as it should be.

I tried to fix this like so:

        /*
         * get a new buffer that has the right parity to store a child of
         * the current tuple's parent
         */
        current->buffer = SpGistGetBuffer(index,
                                          GBUF_INNER_PARITY(parent->blkno + 1),
                                          ...

but that just moves the problem somewhere else: the link from the parent
to the new inner tuple is now guaranteed to follow the parity rules, but
the downlinks leading out of it don't follow them anymore.

2. In spgSplitNodeAction, we split an inner tuple into a "prefix" tuple
that replaces that inner tuple, and a "postfix" tuple that contains the
same downlinks the original tuple did.  That's fine as long as we can
fit the postfix tuple on the same page.  If we can't, we assign it to a
page that's one parity level below the current page, and then its
outgoing links violate the parity rules.  (Keeping the postfix tuple on
the current page wouldn't make things better, since we'd still violate
the parity rules with respect to either the incoming or outgoing links
of the prefix tuple, if it has to go to another page.)

With a few repetitions of either of these cases, and some bad luck about
placement of the new tuples, you get into situations where two pages
each contain downlinks leading to the other; and then a deadlock is just
a matter of time.

I don't immediately see any good way to fix this.  I think the "triple
parity" rule as it stands is hopelessly broken, but I don't know what
to replace it with, even granting that we don't need to maintain on-disk
compatibility.  (We'd have to tell people to reindex SPGist indexes
anyway, because of the risk that they already contain circular links;
so switching to a new layout rule doesn't seem to add any more pain.)
Or we could try to modify the insertion algorithm so it doesn't lock
two levels of the tree at once, but that seems pretty risky.

Thoughts?

			regards, tom lane
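[For readers following the thread: the README's rule and the first failure
mode can be condensed into a tiny stand-alone sketch.  parity_ok() below is
an illustrative helper, not actual SP-GiST source; in the real code the
parity of a block is taken with the GBUF_INNER_PARITY macro.]

```c
#include <stdbool.h>

/*
 * The "triple parity" rule reduced to a predicate: a downlink from an
 * inner tuple on block number `parent` may point into the same page, or
 * into a page whose block-number parity is one greater, mod 3.
 */
static bool
parity_ok(unsigned parent, unsigned child)
{
    return parent == child || (parent + 1) % 3 == child % 3;
}
```

The spgAddNodeAction bug is then: when the parent tuple sat on the current
page, the relocated tuple is placed on a *different* page with the *same*
parity as the parent's, and parity_ok(parent, child) is false for such a
pair (e.g. blocks 3 and 6).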
On Thu, Jun 6, 2013 at 10:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> : To prevent
> : deadlocks we introduce a concept of "triple parity" of pages: if inner tuple
> : is on page with BlockNumber N, then its child tuples should be placed on the
> : same page, or else on a page with BlockNumber M where (N+1) mod 3 == M mod 3.
> : This rule guarantees that tuples on page M will have no children on page N,
> : since (M+1) mod 3 != N mod 3.

Even if the invariant was maintained, why doesn't that just mean you
need three concurrent inserts to create a deadlock?

-- 
greg
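[Concretely, Greg's observation is that the parity rule only forbids
2-cycles between distinct pages; a chain of three pages whose parities step
0 -> 1 -> 2 -> 0 satisfies the rule on every edge yet still forms a cycle.
A self-contained illustrative sketch, not actual SP-GiST code:]

```c
#include <stdbool.h>

/* The parity rule for a downlink between two *distinct* pages. */
static bool
parity_ok(unsigned parent, unsigned child)
{
    return (parent + 1) % 3 == child % 3;
}

/*
 * Blocks 3, 4, 5 have parities 0, 1, 2.  The downlinks 3 -> 4, 4 -> 5,
 * and 5 -> 3 are each individually legal, but together they are a cycle,
 * so three insertions each holding one page and waiting for the next
 * could deadlock even with the invariant perfectly maintained.
 */
```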
Greg Stark <stark@mit.edu> writes:
> On Thu, Jun 6, 2013 at 10:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> : This rule guarantees that tuples on page M will have no children on page N,
>> : since (M+1) mod 3 != N mod 3.

> Even if the invariant was maintained why doesn't that just mean you
> need three concurrent inserts to create a deadlock?

Hm, good point.  That reinforces my feeling that the page-number-based
approach isn't workable as a guarantee; though we might want to keep
that layout rule as a heuristic that would help reduce contention.

I thought a little bit about whether we could drop the requirement of
locking two tree levels during insertion descent, or at least recover
from deadlock if it did occur.

One simple fix would be to do a ConditionalLockBuffer on the child
level, and if it fails, just abandon the insertion attempt and start
over.  However that could lead to a lot of wasted work when there's
contention, so it's not terribly attractive as-is.

Another line of thought is to lock just the single parent tuple, not
its whole page, when descending --- then we can't deadlock unless there
are actual circularities in the index.  We'd probably have to use a
heavyweight lock for that, which might be problematic for performance,
and I'm not exactly sure about timing of when to take that lock
relative to acquiring the page's buffer lock (which we'd still need).

There are probably some other ways to attack this.

			regards, tom lane
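[The conditional-lock idea amounts to replacing the unconditional
child-page lock with a trylock and unwinding on failure.  A minimal
stand-alone sketch of that control flow, with a boolean flag standing in
for a buffer content lock; in PostgreSQL the trylock role would be played
by ConditionalLockBuffer().  This is a hypothetical illustration, not the
eventual patch:]

```c
#include <stdbool.h>

/* Toy stand-in for an exclusive buffer content lock on one page. */
typedef struct { bool locked; } page_lock;

static bool
trylock(page_lock *p)
{
    if (p->locked)
        return false;       /* held by a concurrent insertion */
    p->locked = true;
    return true;
}

static void
unlock(page_lock *p)
{
    p->locked = false;
}

/*
 * One descent step: the parent page is already locked.  Instead of
 * blocking on the child (and possibly deadlocking), try the lock; on
 * conflict, release the parent and tell the caller to abandon this
 * insertion attempt and restart it from the top.
 */
static bool
descend_step(page_lock *parent, page_lock *child)
{
    if (!trylock(child))
    {
        unlock(parent);     /* unwind everything we hold */
        return false;       /* caller retries the whole insertion */
    }
    return true;            /* now holding both levels, as today */
}
```

The wasted-work concern is visible in the sketch: every conflict throws
away the descent done so far rather than waiting its turn.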
Tom Lane wrote:
> I don't immediately see any good way to fix this.  I think the "triple
> parity" rule as it stands is hopelessly broken, but I don't know what
> to replace it with, even granting that we don't need to maintain on-disk
> compatibility.  (We'd have to tell people to reindex SPGist indexes
> anyway, because of the risk that they already contain circular links;
> so switching to a new layout rule doesn't seem to add any more pain.)
> Or we could try to modify the insertion algorithm so it doesn't lock
> two levels of the tree at once, but that seems pretty risky.

Is this the chance to add a metapage?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Is this the chance to add a metapage?

SPGist indexes do have a metapage already --- you're confusing them with
GiST.

			regards, tom lane
> That would work fine as long as the invariant is maintained accurately.
> However, there are at least two cases where the existing code fails to
> maintain the invariant:

Hmm.  Didn't catch that during development.

> Thoughts?

Give me some time to play around with it.  Will think.

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                      WWW: http://www.sigaev.ru/
On 7 June 2013 02:32, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hm, good point.  That reinforces my feeling that the page-number-based
> approach isn't workable as a guarantee; though we might want to keep
> that layout rule as a heuristic that would help reduce contention.

Can the locks just be taken in, say, numeric order of the pages
involved?
Teodor Sigaev <teodor@sigaev.ru> writes:
>> Thoughts?

> Give me some time to play around with it.  Will think.

I experimented a bit with the idea of taking a heavyweight lock to
represent the right to alter an inner tuple.  The results were pretty
grim: an spgist index build example went from 225 ms to 380 ms, and a
test case involving concurrent insertions (basically the complainant's
original example pgbench-ified) went from 5400 tps to 4300 tps.
I hadn't realized our heavyweight lock code was quite that expensive :-(

Anyway I now think that we might be better off with the other idea of
abandoning an insertion and retrying if we get a lock conflict.  That
would at least not create any performance penalty for non-concurrent
scenarios; and even in concurrent cases, I suspect you'd have to be
rather unlucky to get penalties as bad as the heavyweight-lock solution
is showing.

			regards, tom lane
> Anyway I now think that we might be better off with the other idea of
> abandoning an insertion and retrying if we get a lock conflict.  That
> would at least not create any performance penalty for non-concurrent
> scenarios; and even in concurrent cases, I suspect you'd have to be
> rather unlucky to get penalties as bad as the heavyweight-lock solution
> is showing.

Agree, it would be a better workaround for now.  I will be able to do
this this Friday.

I considered the idea of forbidding placement of a child on the same
page as its parent, but this implementation a) could significantly
increase the size of the index, b) doesn't solve Greg's point.  We
definitely need a new idea for the locking protocol, and I'll return to
this problem in autumn (sorry, I haven't time in summer to do this
research).

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                      WWW: http://www.sigaev.ru/
> Anyway I now think that we might be better off with the other idea of
> abandoning an insertion and retrying if we get a lock conflict.

Done, look at the patch.

I was faced with the fact that my mail is considered spam by
postgresql.org, so I repeat some thoughts from the previous mail:

I considered the idea of forbidding placement of a child on the same
page as its parent, but this implementation a) could significantly
increase the size of the index, b) doesn't solve Greg's point.  We
definitely need a new idea for the locking protocol, and I'll return to
this problem in autumn (sorry, I haven't time in summer to do this
research).

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                      WWW: http://www.sigaev.ru/
Attachments
Teodor Sigaev <teodor@sigaev.ru> writes:
>> Anyway I now think that we might be better off with the other idea of
>> abandoning an insertion and retrying if we get a lock conflict.

> Done, look at the patch.

Looks good, committed with some cosmetic adjustments.

> We definitely need a new idea for the locking protocol, and I'll
> return to this problem in autumn (sorry, I haven't time in summer to
> do this research).

OK.  I think the performance of this way will be okay, actually, in
most cases anyhow.  It'll do till we have a better idea.

			regards, tom lane