Discussion: Backend crashes - what's going on here???
Hey,

the current snapshot dumps core on the 4th time doing

    REVOKE ALL ON pg_user FROM public;

It does too in other situations but this is the simplest to reproduce. The segmentation fault happens in nocachegetattr() due to a destroyed tuple descriptor (natts = 0!!! and the others don't look good either) for the syscache 21 (USENAME). But the destruction must happen somewhere else.

With the 02/13 snapshot I haven't got any problems on it. But cannot find the error with diff.

BTW: Doing last checks on view permissions - sending a patch soon.

Until later, Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #
> Hey,
>
> the current snapshot dumps core on the 4th time doing
>
>     REVOKE ALL ON pg_user FROM public;
>
> It does too in other situations but this is the simplest to
> reproduce. The segmentation fault happens in nocachegetattr()
> due to a destroyed tuple descriptor (natts = 0!!! and the
> others don't look good either) for the syscache 21 (USENAME).
> But the destruction must happen somewhere else.
>
> With the 02/13 snapshot I haven't got any problems on it.
> But cannot find the error with diff.
>
> BTW: Doing last checks on view permissions - sending a patch
> soon.

Yep, I saw this too when testing my password acl null patch. Couldn't reproduce it, so I thought it was a fluke.

--
Bruce Momjian
maillist@candle.pha.pa.us
Wow - gdb is a nice tool

> Yep, I saw this too when testing my password acl null patch. Couldn't
> reproduce it, so I thought it was a fluke.
>
> --
> Bruce Momjian
> maillist@candle.pha.pa.us

I have a clue now what causes the crash. It happens when pg_user is looked up in the syscache. It must have to do with the fact that during initialization in miscinit.c, SetUserId() fetches the user tuple using SearchSysCacheTuple(). This initializes SysCache entry 21, but later, at transaction start, the cache reset frees the memory for the cc_tupdesc in the cache. So I assume that when SetUserId() is called, the syscache is not ready for use yet.

I don't have a solution right now. Is someone more familiar with the handling of the syscache during startup? Is SetUserId() just called a little too early, or is the syscache unusable during InitPostgres at all?

But the fact that CatalogCacheInitializeCache() is called only for pg_user during startup makes me feel sure that looking up the user with SearchSysCacheTuple() is wrong at this time. I think it should be done without using the syscache.

Back on Monday - maybe with a solution.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #
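The hazard described in this message (a lazily initialized cache whose descriptor memory is freed by a later reset) can be illustrated with a small toy model. This is only a sketch under stated assumptions: the `Toy*` types and `toy_*` functions are hypothetical stand-ins invented for illustration, not the PostgreSQL structures or functions, and it is not the fix proposed in the thread. It shows one general way to keep such a cache safe: a reset that frees the descriptor also clears the pointer, so the next lookup re-initializes instead of reading freed memory.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins for illustration only; NOT the real PostgreSQL types. */
typedef struct ToyTupleDesc {
    int natts;                  /* number of attributes */
} ToyTupleDesc;

typedef struct ToyCatCache {
    ToyTupleDesc *cc_tupdesc;   /* NULL means "not initialized yet" */
    int           init_count;   /* how often lazy initialization ran */
} ToyCatCache;

/* Hypothetical lazy initialization, run on the first lookup. */
static void toy_cache_init(ToyCatCache *cache, int natts)
{
    cache->cc_tupdesc = malloc(sizeof(ToyTupleDesc));
    assert(cache->cc_tupdesc != NULL);
    cache->cc_tupdesc->natts = natts;
    cache->init_count++;
}

/* Lookup: initialize on demand, then read from the descriptor. */
static int toy_cache_lookup(ToyCatCache *cache, int natts)
{
    if (cache->cc_tupdesc == NULL)
        toy_cache_init(cache, natts);
    return cache->cc_tupdesc->natts;
}

/*
 * Reset: free the descriptor AND clear the pointer. Freeing without
 * clearing would leave a dangling pointer that the next lookup would
 * happily dereference, which is the kind of failure reported above.
 */
static void toy_cache_reset(ToyCatCache *cache)
{
    free(cache->cc_tupdesc);
    cache->cc_tupdesc = NULL;
}
```

In this model a lookup after a reset simply runs the initialization a second time, so an early lookup during startup is harmless. Whether the real syscache can be treated this way during InitPostgres is exactly the open question in the message.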
Uhhh - much more ugly than I thought first :-(

I wrote:
>
> I have a clue now what causes the crash. It happens when
> pg_user is looked up in the syscache. It must have to do with
> the fact that during initialization in miscinit.c, SetUserId()
> fetches the user tuple using SearchSysCacheTuple(). This
> initializes SysCache entry 21, but later, at transaction start,
> the cache reset frees the memory for the cc_tupdesc in the
> cache. So I assume that when SetUserId() is called, the
> syscache is not ready for use yet.
>
> I don't have a solution right now. Is someone more familiar
> with the handling of the syscache during startup? Is
> SetUserId() just called a little too early, or is the syscache
> unusable during InitPostgres at all?
>
> But the fact that CatalogCacheInitializeCache() is called
> only for pg_user during startup makes me feel sure that
> looking up the user with SearchSysCacheTuple() is wrong at
> this time. I think it should be done without using the
> syscache.
>
> Back on Monday - maybe with a solution.
The crash is due to the cache invalidations on updates to pg_class (and it can happen on updates to pg_attribute and others too). When a tuple in pg_class or one of the others is modified, its cache invalidation causes a RelationFlushRelation() for the affected relation. Revoking from pg_user, for example, means that RelationFlushRelation() is called for pg_user, and this frees the tuple descriptor. The same tuple descriptor is also used in the SysCache, and that one isn't flushed/freed!

There are more possible errors of this kind. A simple

    UPDATE pg_class SET relname = relname;

makes the backend crash on the very next command. And

    REVOKE ALL ON pg_class FROM public;

crashes immediately, because the cache invalidation needs the just-invalidated heap tuple for pg_class in pg_class. Sounds a bit hairy.

I think this is also the reason for the backend crashes I had when defining rewrite rules on relations that already exist (I expect others have noticed them too).

I still don't have the solution. But this must get fixed before releasing 6.3. I think a walk through the SysCache on RelationFlushRelation(), checking whether this relation is in the SysCache and resetting that cache if it is found, can help (except for the revoke on pg_class).

Append this to TODO!

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #
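The idea suggested above (walk the SysCache when a relation is flushed and reset any cache that uses that relation) can be sketched with a toy model. Everything here is an assumption made for illustration: the `Toy*` types and `toy_*` functions are hypothetical and do not reproduce the real RelationFlushRelation() or SysCache code. The sketch only shows the ordering that matters: caches that borrowed the relation's tuple descriptor are reset before the descriptor is freed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for illustration only; NOT the real PostgreSQL types. */
typedef struct ToyTupleDesc {
    int natts;                  /* number of attributes */
} ToyTupleDesc;

typedef struct ToyRelation {
    unsigned      relid;        /* hypothetical relation identifier */
    ToyTupleDesc *tupdesc;      /* owned by the relation entry */
} ToyRelation;

typedef struct ToySysCache {
    unsigned      relid;        /* relation this cache is built on */
    ToyTupleDesc *tupdesc;      /* borrowed pointer; NULL = needs re-init */
} ToySysCache;

/* Hypothetical helper: build a relation entry with its own descriptor. */
static ToyRelation toy_relation_open(unsigned relid, int natts)
{
    ToyRelation rel;

    rel.relid = relid;
    rel.tupdesc = malloc(sizeof(ToyTupleDesc));
    assert(rel.tupdesc != NULL);
    rel.tupdesc->natts = natts;
    return rel;
}

/*
 * Flush a relation: first reset every cache that still points at this
 * relation's descriptor, then free the descriptor itself. Returns the
 * number of caches that were reset.
 */
static int toy_relation_flush(ToyRelation *rel, ToySysCache *caches,
                              size_t ncaches)
{
    int reset = 0;

    for (size_t i = 0; i < ncaches; i++) {
        if (caches[i].relid == rel->relid && caches[i].tupdesc != NULL) {
            caches[i].tupdesc = NULL;   /* force re-initialization later */
            reset++;
        }
    }
    free(rel->tupdesc);
    rel->tupdesc = NULL;
    return reset;
}
```

Caches built on other relations are left untouched, so only the affected entries pay the cost of re-initialization. The model does not cover the pg_class case mentioned above, where the invalidation itself needs the tuple that was just invalidated.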