Re: [HACKERS] Sure enough, SI buffer overrun is broken
От | Tom Lane |
---|---|
Тема | Re: [HACKERS] Sure enough, SI buffer overrun is broken |
Дата | |
Msg-id | 29928.949175599@sss.pgh.pa.us обсуждение исходный текст |
Ответ на | RE: [HACKERS] Sure enough, SI buffer overrun is broken ("Hiroshi Inoue" <Inoue@tpf.co.jp>) |
Ответы |
RE: [HACKERS] Sure enough, SI buffer overrun is broken
("Hiroshi Inoue" <Inoue@tpf.co.jp>)
|
Список | pgsql-hackers |
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: >> I built the current sources with MAXNUMMESSAGES set to 32 in >> src/include/storage/sinvaladt.h. The regular regress tests >> run OK, with just a few NOTICEs about 'cache state reset' >> and 'SI buffer overflow' inserted in the normal outputs >> (as you'd expect, if SI overrun occurs). >> >> However, the parallel tests crash spectacularly, with weird errors >> and Assert() coredumps. > Is the call RelationCacheInvalidate(false not true) in ResetSys- > temCaches() right ? Relation descriptors would be destoryed if > ResetSystemCaches() occurs in CommandConterIncrement(). You are absolutely right. I have thought before that it was extremely dangerous for RelationCacheInvalidate *ever* to blow away a relcache entry with a positive refcount. I have changed ResetSystemCaches to pass TRUE, and have modified the comments for RelationCacheInvalidate to indicate that this is probably the only safe setting. I also changed RelationIdInvalidateRelationCacheByRelationId to unconditionally pass true to RelationFlushRelation. Now, RelationFlushRelation is *never* called with onlyFlushReferenceCountZero=false, so it will always attempt to rebuild a relcache entry that has positive refcount. I suspect that the "feature" of RelationFlushRelation to allow blowing away a relcache entry regardless of refcount was a hack put in back when relcache refcounts couldn't be trusted (because elog(ERROR) would leave refcounts positive). Now we have RelationCacheAbort to fix refcounts after elog, so I see no reason to take the risk of trying to destroy an open relcache entry. With these two changes in place, the parallel regress tests seem much more stable. There is still a big problem though, which is that the "portals" regress test sometimes fails. I traced this far enough to discover that the code is trying to use a TupleDesc that it's stored in the ScanTupleSlot of a plan, and this TupleDesc is no longer valid --- presumably the relcache entry it was gotten from has been flushed and rebuilt, leaving the plan with a dangling TupleDesc pointer. Ugh. I do not think it is very practical to try to change all of the places that assume that they can save pointers to the TupleDesc of a relcache entry. Instead I am thinking about solving the problem inside the relcache, as follows: During a relcache entry rebuild, do not simply free and reconstruct the TupleDesc. Instead, read the catalogs to build a new TupleDesc, and compare this one to the old one. If they are the same, free the new one instead (leaving the old one in place, and hence stored pointers to it are still valid). If they are not the same, then elog(ERROR). (elog may sound overly paranoid, but this condition indicates that the table's definition has actually changed since we grabbed the refcount on it --- remember we wouldn't be doing this at all if the relcache entry had zero refcount --- and therefore all kinds of derived information such as plans may be wrong. Pressing ahead will probably lead to crash.) We may need to do the same for any other substructures of the relcache entry that are visible from outside relcache.c. I know this sounds pretty grotty, but we are already doing it for the relcache entry itself --- rebuild re-uses the same physical entry rather than deleting and reallocating it, to ensure that pointers to an open Relation stay valid over an SI flush. We need to extend the same guarantee to the substructure of the relcache entry. Unless someone has a better idea, I'll work on that. >> Some of the unexpected messages in the >> postmaster log are: >> >> NOTICE: LockRelease: locktable lookup failed, no lock > I have seen this NOTICE only once or twice. > This seems because of setting MyProc->xid to InvalidTransa > ctionId in CommitTransaction() and AbortTransaction(). > There's a little time until AtCommit(Abort)_Locks. > I have no idea to solve this now. I am not seeing it after the change to never flush relcaches with positive refcount. I think the locks being complained of are probably the locks that should have been kept on flushed relations, and that it's not CommitTransaction's fault. regards, tom lane
В списке pgsql-hackers по дате отправления:
Предыдущее
От: "Jeff MacDonaldДата:
Сообщение: Re: [HACKERS] Postgres LOGO(was: Beta is February 15th)