Re: dynamic shared memory

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: dynamic shared memory
Дата
Msg-id 20130920114445.GB5287@awork2.anarazel.de
обсуждение исходный текст
Ответ на Re: dynamic shared memory  (Robert Haas <robertmhaas@gmail.com>)
Ответы Re: dynamic shared memory  (Amit Kapila <amit.kapila16@gmail.com>)
Re: dynamic shared memory  (Robert Haas <robertmhaas@gmail.com>)
Список pgsql-hackers
Hi,


On 2013-09-19 11:44:34 -0400, Robert Haas wrote:
> On Wed, Sep 18, 2013 at 1:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:

> >> --- /dev/null
> >> +++ b/src/backend/storage/ipc/dsm.c
> >> +#define PG_DYNSHMEM_STATE_FILE                       PG_DYNSHMEM_DIR "/state"
> >> +#define PG_DYNSHMEM_NEW_STATE_FILE           PG_DYNSHMEM_DIR "/state.new"
> >
> > Hm, I guess you dont't want to add it  to global/ or so because of the
> > mmap implementation where you presumably scan the directory?
> 
> Yes, and also because I thought this way would make it easier to teach
> things like pg_basebackup (or anybody's home-brew scripts) to just
> skip that directory completely.  Actually, I was wondering if we ought
> to have a directory under pgdata whose explicit charter it was to
> contain files that shouldn't be copied as part of a base backup.
> pg_do_not_back_this_up.

Wondered exactly about that as soon as you've mentioned
pg_basebackup. pg_local/?

> >> +     /* Determine size for new control segment. */
> >> +     maxitems = PG_DYNSHMEM_FIXED_SLOTS
> >> +             + PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;
> >
> > It seems likely that MaxConnections would be sufficient?
> 
> I think we could argue about the best way to set this until the cows
> come home, but I don't think it probably matters much at this point.
> We can always change the formula later as we gain experience.
> However, I don't have a principled reason for assuming that only
> user-connected backends will create dynamic shared memory segments.

Hm, yes. I had MaxBackends down as MaxConnections + autovacuum stuff;
but max_worker_processes are in there now, so you're right that doesn't
make sense.

> >> +/*
> >> + * Read and parse the state file.
> >> + *
> >
> > Perhaps CRC32 the content?
> 
> I don't see the point.  If the file contents are garbage that happens
> to look like a number, we'll go "oh, there isn't any such segment" or
> "oh, there is such a segment but it doesn't look like a control
> segment, so forget it".   There are a lot of things we really ought to
> be CRCing to avoid corruption risk, but I can't see how this is
> remotely one of them.

I was worried about a partially written file or containing contents from
two different postmaster cycles, but it's actually far too small for
that...
I initially had thought you'd write the contents of the entire shared
control segment there, not just it's id.

> >> +     /* Create or truncate the file. */
> >> +     statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);
> >
> > Doesn't this need a | PG_BINARY?
> 
> It's a text file.  Do we need PG_BINARY anyway?

I'd say yes. Non binary mode stuff on windows does stuff like
transforming LF <=> CRLF on reading/writing, which makes sizes not match
up and similar ugliness.
Imo there's little reason to use non-binary mode for anything written
for postgres' own consumption.

> > Why are you using open() and not
> > BasicOpenFile or even OpenTransientFile?
> 
> Because those don't work in the postmaster.

Oh, that's news to me. Seems strange, especially for BasicOpenFile.

> >
> >> +     /* Write contents. */
> >> +     snprintf(statebuf, PG_DYNSHMEM_STATE_BUFSIZ, "%lu\n",
> >> +                      (unsigned long) dsm_control_handle);
> >
> > Why are we upcasting the length of dsm_control_handle here? Also,
> > doesn't this need the usual UINT64_FORMAT thingy?
> 
> dsm_handle is an alias for uint32.  Is that always exactly an unsigned
> int or can it sometimes be an unsigned long?  I thought the latter, so
> couldn't figure out how to write this portably without casting to a
> type that explicitly matched the format string.

That should always be an unsigned int on platforms we support. Note that
we've printed TransactionIds that way (i.e. %u) for a long time and they
are a uint32 as well.

> > Not sure whether it's sensible to only LOG in these cases. After all
> > there's something unexpected happening. The robustness argument doesn't
> > count since we're already shutting down.

> I see no point in throwing an error.   The fact that we're having
> trouble cleaning up one dynamic shared memory segment doesn't mean we
> shouldn't try to clean up others, or that any remaining postmaster
> shutdown hooks shouldn't be executed.

Well, it means we'll do a regular shutdown instead of PANICing
and *not* writing a checkpoint.
If something has corrupted our state to the point we cannot unregister
shared memory we registered, something has gone terribly wrong. Quite
possibly we've scribbled over our control structures or such. In that
case it's not proper to do a normal shutdown, we're quite possibly
writing bad data.

> >> +                     ereport(ERROR,
> >> +                                     (errcode(ERRCODE_INTERNAL_ERROR),
> >> +                                      errmsg("dynamic shared memory control segment is not valid")));
> >
> > Imo that's a PANIC or at the very least a FATAL.
> 
> Sure, that's a tempting option, but it doesn't seem to serve any very
> necessary point.  There's no data corruption problem if we proceed
> here.  Most likely either (1) there's a bug in the code, which
> panicking won't fix or (2) the DBA hand-edited the state file, in
> which case maybe he shouldn't have done that, but if he thinks the
> best way to recover from that is a cluster-wide restart, he can do
> that himself.

"There's no data corruption problem if we proceed" - but there likely
has been one leading to the current state.

> >> +dsm_segment *
> >> +dsm_create(uint64 size)
> >
> > Do we rely on being run in an environment with proper setup for lwlock
> > cleanup? I can imagine shared libraries doing this pretty early on...
> 
> Yes, we rely on that.  I don't really see that as a problem.  You'd
> better connect to the main shared memory segment before starting to
> create your own.

I am not talking about lwlocks itself being setup but an environment
that has resource owners defined and catches errors. I am specifically
asking because you're a) ereport()ing without releasing an LWLock b)
unconditionally relying on the fact that there's a current resource
owner.
In shared_preload_libraries neither is the case afair?

Now, you could very well argue that you don't need to use dsm for
shared_preload_libraries but there are enough libraries that you can use
per session or globally. Requiring them to use both implementation or
register stuff later seems like it would complicate things.

> >> +void *
> >> +dsm_resize(dsm_segment *seg, uint64 size)
> >> +{
> >> +     Assert(seg->control_slot != INVALID_CONTROL_SLOT);
> >> +     dsm_impl_op(DSM_OP_RESIZE, seg->handle, size, &seg->impl_private,
> >> +                             &seg->mapped_address, &seg->mapped_size, ERROR);
> >> +     return seg->mapped_address;
> >> +}
> >
> > Hm. That's valid when there are other backends attached? What are the
> > implications for already attached ones?
> 
> They'll continue to see the portion they have mapped, but must do
> dsm_remap() if they want to see the whole thing.

But resizing can shrink, can it not? And we do an ftruncate() in at
least the posix shmem case. Which means the other backend will get a
SIGSEGV accessing that memory IIRC.

> > Shouldn't we error out if !dsm_impl_can_resize()?
> 
> The implementation-specific code throws an error if it can't support
> resize.  Even if we put a secondary check here, I wouldn't want
> dsm_impl_op to behave in an undefined manner when asked to resize
> under an implementation that can't.  And there doesn't seem to be much
> point in having two checks.

Well, You have the check in dsm_remap() which seems strange to me.

> >> +void
> >> +dsm_detach(dsm_segment *seg)
> >> +{
> >
> > Why do we want to ignore errors like failing to unmap? ISTM that
> > indicates an actual problem...
> 
> Sure it does.  But what are you going to do about it?  In many cases,
> you're going to get here during a transaction abort caused by some
> other error.  If the transaction is already aborting, throwing an
> error here will just cause the original error to get discarded in
> favor of showing this one, or maybe it's the other way around.  I
> don't remember, but it's definitely one or the other, and neither is
> desirable.  Throwing a warning, on the other hand, will notify the
> user, which is what we want.
> 
> Now on the flip side we might not be aborting; maybe we're committing.
>  But we don't want to turn a commit into an abort just for this.  If
> resowner.c detects a buffer pin leak or a tuple descriptor leak, those
> are "just" warning as well.  They're serious warnings, of course, and
> if they happen it means there's a bug in the code that needs to be
> fixed.  But the severity of an ereport() isn't based just on how
> alarming the situation is; it's based on what you want to happen when
> that situation comes up.  And we've decided (correctly, I think) that
> resource leaks are not grounds for aborting a transaction that
> otherwise would have committed.

We're not talking about a missed munmap() but about one that failed. If
we unpin the leaked pins and notice that we haven't actually pinned it
anymore we do error (well, Assert) out. Same for TupleDescs.

If there were valid scenarios in which you could get into that
situation, maybe. But which would that be? ISTM we can only get there if
our internal state is messed up.

> >> + * several implementations of this facility.  Many systems implement
> >> + * POSIX shared memory (shm_open etc.), which is well-suited to our needs
> >> + * in this area, with the exception that shared memory identifiers live
> >> + * in a flat system-wide namespace, raising the uncomfortable prospect of
> >> + * name collisions with other processes (including other copies of
> >> + * PostgreSQL) running on the same system.
> >
> > Why isn't the port number part of the posix shmem identifiers? Sure, we
> > retry, but using a logic similar to sysv_shmem.c seems like a good idea.
> 
> According to the man page for shm_open on Solaris, "For maximum
> portability, name should include no more than 14 characters, but this
> limit is not enforced."

What about "/pgsql.%u" or something similar? That should still fit.

> >> +     /*
> >> +      * If we're reattaching or resizing, we must remove any existing mapping,
> >> +      * unless we've already got the right thing mapped.
> >> +      */
> >> +     if (*mapped_address != NULL)
> >> +     {
> >> +             if (*mapped_size == request_size)
> >> +                     return true;
> >
> > Hm. It could have gotten resized to the old size, or resized twice. In
> > that case it might not be at the same address before, so checking for
> > the size doesn't seem to be sufficient.
> 
> I don't understand your concern.  If someone resize the DSM to its
> already-current size, there is no need to remap it.  The old mapping
> is just fine.  And if some other backend resizes the DSM to a larger
> size and then back to the original size, and then we're asked to
> update the mapping, there is no need to change anything.

Yes, forget what I said. I was confusing myself.

> >> +static int
> >> +errcode_for_dynamic_shared_memory()
> >> +{
> >> +     if (errno == EFBIG || errno == ENOMEM)
> >> +             return errcode(ERRCODE_OUT_OF_MEMORY);
> >> +     else
> >> +             return errcode_for_file_access();
> >> +}
> >
> > Is EFBIG guaranteed to be defined?
> 
> I dunno.  We could put an #ifdef around that part.  Should we do that
> now or wait and see if it actually breaks anywhere?

A bit of googling around seems to indicate it's likely to be
available. Even on windows according to MSDN.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Heikki Linnakangas
Дата:
Сообщение: SSI freezing bug
Следующее
От: Andres Freund
Дата:
Сообщение: Re: SSI freezing bug