Обсуждение: Re: [BUGS] BUG #3245: PANIC: failed to re-find shared loc k o b j ect

Поиск
Список
Период
Сортировка

Re: [BUGS] BUG #3245: PANIC: failed to re-find shared loc k o b j ect

От
Heikki Linnakangas
Дата:
Heikki Linnakangas wrote:
> Tom Lane wrote:
>> Also, we have a generic issue that making fresh entries in a hashtable
>> might result in a concurrent hash_seq_search scan visiting existing
>> entries more than once; that's definitely not something any of the
>> existing callers are thinking about.
> 
> Ouch. Note that we can also miss some entries altogether, which is 
> probably even worse.

In case someone is wondering how that can happen, here's an example. 
We're scanning a bucket that contains four entries, and we split it 
after returning 1:

1 -> 2* -> 3 -> 4

* denotes the next entry the seq scan has stored.

If this is split goes example like this:

1 -> 3
2* -> 4

The seq scan will continue scanning from 2, then 4, and miss 3 altogether.

I briefly went through all callers of hash_seq_init. The only place 
where we explicitly rely on being able to add entries to a hash table 
while scanning it is in tbm_lossify. There's more complex loops in 
portalmem.c and relcache.c, which I think are safe, but would need to 
look closer. There's also the pg_prepared_statement 
set-returning-function that keeps a scan open across calls, which seems 
error-prone.

Should we document the fact that it's not safe to insert new entries to 
a hash table while scanning it, and fix the few call sites that do that, 
or does anyone see a better solution? One alternative would be to 
inhibit bucket splits while a scan is in progress, but then we'd need to 
take care to clean up after each scan.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: [BUGS] BUG #3245: PANIC: failed to re-find shared loc k o b j ect

От
Tom Lane
Дата:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> I briefly went through all callers of hash_seq_init. The only place 
> where we explicitly rely on being able to add entries to a hash table 
> while scanning it is in tbm_lossify. There's more complex loops in 
> portalmem.c and relcache.c, which I think are safe, but would need to 
> look closer. There's also the pg_prepared_statement 
> set-returning-function that keeps a scan open across calls, which seems 
> error-prone.

The pending-fsync stuff in md.c is also expecting to be able to add
entries during a scan.

I don't think we can go in the direction of forbidding insertions during
a scan --- as the case at hand shows, it's just not always obvious that
that could happen, and finding/fixing such a problem is nigh impossible.
(We were darn fortunate to be able to reproduce this one.)  Plus we have
a couple of places where it's really necessary to be able to do it,
anyway.

The only answer I can see that seems reasonably robust is to change
dynahash.c so that it tracks whether any seq_search scans are open on a
hashtable, and doesn't carry out any splits while one is.  This wouldn't
cost anything noticeable in performance, assuming that not very many
splits are postponed.  The PITA aspect of it is that we'd need to add
bookkeeping mechanisms to ensure that the count of active scans gets
cleaned up on error exit.  It's not like we've not got lots of those,
though.

Possibly we could simplify matters a bit by not worrying about cleaning
up leaked counts at subtransaction abort, ie, the list of open scans
would only get forced to empty at top transaction end.  This carries a
slightly higher risk of meaningful performance degradation, but in
practice I doubt it's a big problem.  If we agreed that then we'd not
need ResourceOwner support --- it could be handled like LWLock counts.

pg_prepared_statement is simply broken --- what if the next-to-scan
statement is deleted between calls?  It'll have to be changed.

Comments?
        regards, tom lane


Re: [BUGS] BUG #3245: PANIC: failed to re-find shared loc k o b j ect

От
Heikki Linnakangas
Дата:
Tom Lane wrote:
> The pending-fsync stuff in md.c is also expecting to be able to add
> entries during a scan.

No, mdsync starts the scan from scratch after calling AbsorbFsyncRequests.

> I don't think we can go in the direction of forbidding insertions during
> a scan --- as the case at hand shows, it's just not always obvious that
> that could happen, and finding/fixing such a problem is nigh impossible.
> (We were darn fortunate to be able to reproduce this one.)  Plus we have
> a couple of places where it's really necessary to be able to do it,
> anyway.
> 
> The only answer I can see that seems reasonably robust is to change
> dynahash.c so that it tracks whether any seq_search scans are open on a
> hashtable, and doesn't carry out any splits while one is.  This wouldn't
> cost anything noticeable in performance, assuming that not very many
> splits are postponed.  The PITA aspect of it is that we'd need to add
> bookkeeping mechanisms to ensure that the count of active scans gets
> cleaned up on error exit.  It's not like we've not got lots of those,
> though.

We could have two kinds of seq scans, with and without support for 
concurrent inserts. If you open a scan without that support, it acts 
just like today, and no extra bookkeeping or clean up by the caller is 
required. If you need concurrent inserts, we inhibit bucket splits, but 
it's up to the caller to explicitly close the scan, possibly with 
PG_TRY/CATCH. I'm not sure if that's simpler in the end, but we could 
get away without adding generic bookkeeping mechanism.

> Possibly we could simplify matters a bit by not worrying about cleaning
> up leaked counts at subtransaction abort, ie, the list of open scans
> would only get forced to empty at top transaction end.  This carries a
> slightly higher risk of meaningful performance degradation, but in
> practice I doubt it's a big problem.  If we agreed that then we'd not
> need ResourceOwner support --- it could be handled like LWLock counts.

Hmm. Unlike lwlocks, hash tables can live in different memory contexts, 
so we can't just have list of open scans similar to held_lwlocks array.

Do we need to support multiple simultaneous seq scans of a hash table? I 
suppose we do..

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: [BUGS] BUG #3245: PANIC: failed to re-find shared loc k o b j ect

От
Tom Lane
Дата:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> The pending-fsync stuff in md.c is also expecting to be able to add
>> entries during a scan.

> No, mdsync starts the scan from scratch after calling AbsorbFsyncRequests.

That was last month ;-).  It doesn't restart any more.

> We could have two kinds of seq scans, with and without support for 
> concurrent inserts.

Yeah, I considered that too, but it just seems too error-prone.  We
could maybe make it trustworthy by having hash_seq_search complain if
it noticed there had been any concurrent insertions --- but then you're
putting new overhead into hash_seq_search, which kind of defeats the
argument for it (and hash_seq_search is a bit of a bottleneck, so extra
cycles there matter).

> Hmm. Unlike lwlocks, hash tables can live in different memory contexts, 
> so we can't just have list of open scans similar to held_lwlocks array.

I had first thought about adding a scan counter to the hashtable control
struct, but the prospect of hash tables being deallocated while the
central list still has references to them seems pretty scary --- we
could find ourselves clobbering some other data structure entirely when
we go to decrement the count.  What seems better now is to have an array
or list of HTAB pointers, one for each active scan (so the same
hashtable might appear in the list multiple times).  When we are
considering whether to split, we have to look through the list to see if
our table is listed.  The list is unlikely to be long so this shouldn't
affect performance.  If a hash table is deallocated while we still
think it has an active scan, nothing very bad happens.  The absolute
worst possible consequence is if some new hash table gets allocated at
exactly the same spot; we'd inhibit splits on it, which still doesn't
break correctness though it might kill performance.  In any case we can
have checking code that complains about leaked scan pointers at
transaction end, so any such bug shouldn't survive long.

For shared hash tables, this design only works for scans being done by
the same backend doing insertion; but locking considerations would
probably require that no other backend inserts while we scan anyway
(you'd need something much more complicated than shared/exclusive locks
to manage it otherwise).
        regards, tom lane


Re: [BUGS] BUG #3245: PANIC: failed to re-find shared loc k o b j ect

От
Tom Lane
Дата:
I wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> We could have two kinds of seq scans, with and without support for 
>> concurrent inserts.

> Yeah, I considered that too, but it just seems too error-prone.  We
> could maybe make it trustworthy by having hash_seq_search complain if
> it noticed there had been any concurrent insertions --- but then you're
> putting new overhead into hash_seq_search, which kind of defeats the
> argument for it (and hash_seq_search is a bit of a bottleneck, so extra
> cycles there matter).

I just finished looking through the uses of hash_seq_search, and
realized that there is one place where it would be a bit painful to
convert to the insertion-safe approach I'm proposing; namely nodeAgg.c.
The places where the hashtable iteration is started and used are
scattered, and we don't really track whether the iteration is done or
not, so it's hard to be sure where to cancel the iteration.  It could
probably be made to work but it seems like it'd be fragile.

I still don't want to introduce more checking overhead into
hash_seq_search, though, so what I'm now thinking about is a new
dynahash primitive named something like "hash_freeze", which'd mark a
hashtable as disallowing insertions.  If the hashtable is frozen before
hash_seq_init then we don't add it to the central list of scans, and
therefore there is no cleanup to do at the end.  nodeAgg can use this
mode since it doesn't modify its hashtable anymore after beginning its
readout scan.

BTW, we didn't really get into details, but for the insertion-safe case
I'm envisioning adding a routine "hash_seq_term", which you would need
to call if and only if you abandon a hash_seq_search scan without
running it to completion (if you do the complete scan, hash_seq_search
will automatically call hash_seq_term before returning NULL).  All but
a very small number of places run their searches to completion and
therefore won't require any source code changes with this API.

Thoughts?
        regards, tom lane


Re: [BUGS] BUG #3245: PANIC: failed to re-find shared loc k o b j ect

От
Heikki Linnakangas
Дата:
Tom Lane wrote:
> I wrote:
>> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>>> We could have two kinds of seq scans, with and without support for 
>>> concurrent inserts.
> 
>> Yeah, I considered that too, but it just seems too error-prone.  We
>> could maybe make it trustworthy by having hash_seq_search complain if
>> it noticed there had been any concurrent insertions --- but then you're
>> putting new overhead into hash_seq_search, which kind of defeats the
>> argument for it (and hash_seq_search is a bit of a bottleneck, so extra
>> cycles there matter).
> 
> I just finished looking through the uses of hash_seq_search, and
> realized that there is one place where it would be a bit painful to
> convert to the insertion-safe approach I'm proposing; namely nodeAgg.c.
> The places where the hashtable iteration is started and used are
> scattered, and we don't really track whether the iteration is done or
> not, so it's hard to be sure where to cancel the iteration.  It could
> probably be made to work but it seems like it'd be fragile.
> 
> I still don't want to introduce more checking overhead into
> hash_seq_search, though, so what I'm now thinking about is a new
> dynahash primitive named something like "hash_freeze", which'd mark a
> hashtable as disallowing insertions.  If the hashtable is frozen before
> hash_seq_init then we don't add it to the central list of scans, and
> therefore there is no cleanup to do at the end.  nodeAgg can use this
> mode since it doesn't modify its hashtable anymore after beginning its
> readout scan.

This plan includes having the list of hash tables that mustn't be 
expanded? And the list would be cleaned up at the end of transaction, to 
avoid leaks.

> BTW, we didn't really get into details, but for the insertion-safe case
> I'm envisioning adding a routine "hash_seq_term", which you would need
> to call if and only if you abandon a hash_seq_search scan without
> running it to completion (if you do the complete scan, hash_seq_search
> will automatically call hash_seq_term before returning NULL).  All but
> a very small number of places run their searches to completion and
> therefore won't require any source code changes with this API.

Sounds good to me.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: [BUGS] BUG #3245: PANIC: failed to re-find shared loc k o b j ect

От
Tom Lane
Дата:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> I still don't want to introduce more checking overhead into
>> hash_seq_search, though, so what I'm now thinking about is a new
>> dynahash primitive named something like "hash_freeze", which'd mark a
>> hashtable as disallowing insertions.  If the hashtable is frozen before
>> hash_seq_init then we don't add it to the central list of scans, and
>> therefore there is no cleanup to do at the end.  nodeAgg can use this
>> mode since it doesn't modify its hashtable anymore after beginning its
>> readout scan.

> This plan includes having the list of hash tables that mustn't be 
> expanded? And the list would be cleaned up at the end of transaction, to 
> avoid leaks.

Right, all that's still the same.  This is just a way to exempt certain
scans from that machinery, by allowing the caller to declare he doesn't
need to modify the hashtable anymore.  AFAICS that covers our needs,
at least for the present.
        regards, tom lane