Re: dynamic shared memory

From:        Robert Haas
Subject:     Re: dynamic shared memory
Date:
Msg-id:      CA+TgmobKLeC8nYWaa6aTW5gvXktuRf6TRttoJbYroi=Nno4qUw@mail.gmail.com
In reply to: Re: dynamic shared memory (Andres Freund <andres@2ndquadrant.com>)
Responses:   Re: dynamic shared memory
List:        pgsql-hackers
On Wed, Sep 18, 2013 at 1:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Maybe also check for shm_unlink or is that too absurd?

Well, if we find that there are systems where shm_open is present and
shm_unlink is not, then we'll need such a test.  But I hesitate to decide
what the right thing to do on such systems is without knowing more.  And
the time it takes to run configure is a very substantial percentage of the
complete build time, at least on my box, so I don't really want to add
tests just because they might be needed somewhere.

>> --- /dev/null
>> +++ b/src/backend/storage/ipc/dsm.c
>> +#define PG_DYNSHMEM_STATE_FILE PG_DYNSHMEM_DIR "/state"
>> +#define PG_DYNSHMEM_NEW_STATE_FILE PG_DYNSHMEM_DIR "/state.new"
>
> Hm, I guess you don't want to add it to global/ or so because of the
> mmap implementation where you presumably scan the directory?

Yes, and also because I thought this way would make it easier to teach
things like pg_basebackup (or anybody's home-brew scripts) to just skip
that directory completely.  Actually, I was wondering if we ought to have
a directory under pgdata whose explicit charter it is to contain files
that shouldn't be copied as part of a base backup:
pg_do_not_back_this_up.

> Document that's backend local?
> And those are shared memory?

Sure, done.

>> + /* Determine size for new control segment. */
>> + maxitems = PG_DYNSHMEM_FIXED_SLOTS
>> + + PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;
>
> It seems likely that MaxConnections would be sufficient?

I think we could argue about the best way to set this until the cows come
home, but I don't think it probably matters much at this point.  We can
always change the formula later as we gain experience.  However, I don't
have a principled reason for assuming that only user-connected backends
will create dynamic shared memory segments.

> Comment that we loop endlessly to find an unused identifier.

Done.
> Why do we create the control segment in dynamic smem and not in the
> normal shmem? Presumably because this way it has the same lifetime? If
> so, that should be commented upon.

If it were part of the normal shared memory segment, I don't think there'd
be any good way to implement dsm_cleanup_using_control_segment().

> So, we leave it just hanging around... Well, it has precedent in our
> normal shared memory handling.

And, it could belong to some unrelated process.

> I still maintain that the extra infrastructure required isn't worth the
> gain of having the mmap implementation.

I know.

>> +/*
>> + * Read and parse the state file.
>> + *
>
> Perhaps CRC32 the content?

I don't see the point.  If the file contents are garbage that happens to
look like a number, we'll go "oh, there isn't any such segment" or "oh,
there is such a segment but it doesn't look like a control segment, so
forget it".  There are a lot of things we really ought to be CRCing to
avoid corruption risk, but I can't see how this is remotely one of them.

>> + /* Create or truncate the file. */
>> + statefd = open(PG_DYNSHMEM_NEW_STATE_FILE, O_RDWR|O_CREAT|O_TRUNC, 0600);
>
> Doesn't this need a | PG_BINARY?

It's a text file.  Do we need PG_BINARY anyway?

> Why are you using open() and not
> BasicOpenFile or even OpenTransientFile?

Because those don't work in the postmaster.

>> + /* Write contents. */
>> + snprintf(statebuf, PG_DYNSHMEM_STATE_BUFSIZ, "%lu\n",
>> + (unsigned long) dsm_control_handle);
>
> Why are we upcasting the length of dsm_control_handle here? Also,
> doesn't this need the usual UINT64_FORMAT thingy?

dsm_handle is an alias for uint32.  Is that always exactly an unsigned int
or can it sometimes be an unsigned long?  I thought the latter, so I
couldn't figure out how to write this portably without casting to a type
that explicitly matched the format string.

> But we don't even get called in that case, right?
> You're only registering the on_shmem_exit() handler after the same
> check in startup?

True.  Removed.

> I'd rename dsm_control_segment_sane to dsm_control_segment_looks_sane ;)

Meh.  I write too many really long function names as it is.

> I'd include the refcount here, might be helpful for debugging.

OK, done.

> Not sure whether it's sensible to only LOG in these cases. After all
> there's something unexpected happening. The robustness argument doesn't
> count since we're already shutting down.

I see no point in throwing an error.  The fact that we're having trouble
cleaning up one dynamic shared memory segment doesn't mean we shouldn't
try to clean up others, or that any remaining postmaster shutdown hooks
shouldn't be executed.

>> + ereport(ERROR,
>> + (errcode(ERRCODE_INTERNAL_ERROR),
>> + errmsg("dynamic shared memory control segment is not valid")));
>
> Imo that's a PANIC or at the very least a FATAL.

Sure, that's a tempting option, but it doesn't seem to serve any very
necessary point.  There's no data corruption problem if we proceed here.
Most likely either (1) there's a bug in the code, which panicking won't
fix, or (2) the DBA hand-edited the state file, in which case maybe he
shouldn't have done that, but if he thinks the best way to recover from
that is a cluster-wide restart, he can do that himself.

>> +dsm_segment *
>> +dsm_create(uint64 size)
>
> Do we rely on being run in an environment with proper setup for lwlock
> cleanup? I can imagine shared libraries doing this pretty early on...

Yes, we rely on that.  I don't really see that as a problem.  You'd
better connect to the main shared memory segment before starting to
create your own.

> Why can we see that state? Shouldn't locking prevent that?

We initialize the refcnt to 2 on creation, so that it ends up as 1 when
the last reference is gone.  This prevents race conditions: imagine that
one process creates a DSM and then passes the handle to some other
process.
But, before that second process gets around to mapping it, the first
process dies.  At that point, there are no remaining references to the
dynamic shared memory segment, so it goes away, as per spec.  But what if
we're in the middle of destroying it when the second process finally gets
around to trying to map it?  In such a situation, the behavior would be
implementation-specific.  I felt it was better not to find out whether
all operating systems and shared memory types handle such cases
similarly, because I'm pretty sure they don't.  The two-stage destruction
process means that once we've committed to destroying the segment, no one
else will attempt to map it.

>> +void *
>> +dsm_resize(dsm_segment *seg, uint64 size)
>> +{
>> + Assert(seg->control_slot != INVALID_CONTROL_SLOT);
>> + dsm_impl_op(DSM_OP_RESIZE, seg->handle, size, &seg->impl_private,
>> + &seg->mapped_address, &seg->mapped_size, ERROR);
>> + return seg->mapped_address;
>> +}
>
> Hm. That's valid when there are other backends attached? What are the
> implications for already attached ones?

They'll continue to see the portion they have mapped, but must do
dsm_remap() if they want to see the whole thing.

> Shouldn't we error out if !dsm_impl_can_resize()?

The implementation-specific code throws an error if it can't support
resize.  Even if we put a secondary check here, I wouldn't want
dsm_impl_op to behave in an undefined manner when asked to resize under
an implementation that can't.  And there doesn't seem to be much point in
having two checks.

>> +void
>> +dsm_detach(dsm_segment *seg)
>> +{
>
> Why do we want to ignore errors like failing to unmap? ISTM that
> indicates an actual problem...

Sure it does.  But what are you going to do about it?  In many cases,
you're going to get here during a transaction abort caused by some other
error.
If the transaction is already aborting, throwing an error here will just
cause the original error to get discarded in favor of showing this one,
or maybe it's the other way around.  I don't remember, but it's
definitely one or the other, and neither is desirable.  Throwing a
warning, on the other hand, will notify the user, which is what we want.

Now on the flip side, we might not be aborting; maybe we're committing.
But we don't want to turn a commit into an abort just for this.  If
resowner.c detects a buffer pin leak or a tuple descriptor leak, those
are "just" warnings as well.  They're serious warnings, of course, and if
they happen it means there's a bug in the code that needs to be fixed.
But the severity of an ereport() isn't based just on how alarming the
situation is; it's based on what you want to happen when that situation
comes up.  And we've decided (correctly, I think) that resource leaks are
not grounds for aborting a transaction that otherwise would have
committed.

> Yuck. So that's the answer to my earlier question about the legality of
> seeing a refcount of 1....

Read it and weep.

> *same

Fixed, thanks.

>> + * several implementations of this facility. Many systems implement
>> + * POSIX shared memory (shm_open etc.), which is well-suited to our needs
>> + * in this area, with the exception that shared memory identifiers live
>> + * in a flat system-wide namespace, raising the uncomfortable prospect of
>> + * name collisions with other processes (including other copies of
>> + * PostgreSQL) running on the same system.
>
> Why isn't the port number part of the posix shmem identifiers? Sure, we
> retry, but using a logic similar to sysv_shmem.c seems like a good idea.

According to the man page for shm_open on Solaris, "For maximum
portability, name should include no more than 14 characters, but this
limit is not enforced."
http://www.unix.com/man-page/OpenSolaris/3c/shm_open/
I'm unclear whether there are any real systems that have a problem with
this.
>> +bool
>> +dsm_impl_can_resize(void)
>> +{
>> + switch (dynamic_shared_memory_type)
>> + {
>> + case DSM_IMPL_NONE:
>> + return false;
>> + case DSM_IMPL_SYSV:
>> + return false;
>> + case DSM_IMPL_WINDOWS:
>> + return false;
>> + default:
>> + return true;
>> + }
>> +}
>
> Looks to me like the logic should be the reverse.

I've changed it to list all the cases explicitly.

>> + char name[64];
>> + int flags;
>> + int fd;
>> + char *address;
>> +
>> + snprintf(name, 64, "/PostgreSQL.%lu", (unsigned long) handle);
>
> Why wider than the handle?

Same as above - not sure that uint32 == unsigned int everywhere.

>> + /*
>> + * If we're reattaching or resizing, we must remove any existing mapping,
>> + * unless we've already got the right thing mapped.
>> + */
>> + if (*mapped_address != NULL)
>> + {
>> + if (*mapped_size == request_size)
>> + return true;
>
> Hm. It could have gotten resized to the old size, or resized twice. In
> that case it might not be at the same address before, so checking for
> the size doesn't seem to be sufficient.

I don't understand your concern.  If someone resizes the DSM to its
already-current size, there is no need to remap it.  The old mapping is
just fine.  And if some other backend resizes the DSM to a larger size
and then back to the original size, and then we're asked to update the
mapping, there is no need to change anything.

> Hm. You're elog(elevel) here, but the retry code in dsm_create() passes
> in ERROR?

Oh, that's bad.  Fixed.

>> +static int
>> +errcode_for_dynamic_shared_memory()
>> +{
>> + if (errno == EFBIG || errno == ENOMEM)
>> + return errcode(ERRCODE_OUT_OF_MEMORY);
>> + else
>> + return errcode_for_file_access();
>> +}
>
> Is EFBIG guaranteed to be defined?

I dunno.  We could put an #ifdef around that part.  Should we do that
now, or wait and see if it actually breaks anywhere?

> Not really an issue, but this will grow owner->dsm unnecessarily because
> ResourceOwnerEnlargeDSMs() will have been done previously.
I tried to copy the existing uses of resowner.c as closely as possible;
if you think that there's something I should be doing to mimic it more
closely, let me know.

> Not sure yet how happy I am with the separation of concerns between
> dsm.c and dsm_impl.c...

I was hoping for a cleaner abstraction break, but I couldn't make it work
out any better than this.  Even so, I think it's worthwhile having two
files; an imperfect separation of concerns still seems better than
concatenating them into one really long file.  (FWIW, the combined file
would be longer than 84% of the 622 .c files in the backend; as a fan of
keeping .c files relatively small, I'm not eager to be the cause of us
having more large ones.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company