Thread: Something fishy happening on frogmouth


Something fishy happening on frogmouth

From: Tom Lane
Date:
The last two buildfarm runs on frogmouth have failed in initdb,
like this:

creating directory d:/mingw-bf/root/HEAD/pgsql.2492/src/test/regress/./tmp_check/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... windows
creating configuration files ... ok
creating template1 database in d:/mingw-bf/root/HEAD/pgsql.2492/src/test/regress/./tmp_check/data/base/1 ... FATAL:
could not open shared memory segment "Global/PostgreSQL.851401618": Not enough space
child process exited with exit code 1

It shouldn't be failing like that, considering that we just finished
probing for acceptable max_connections and shared_buffers without hitting
any apparent limit.  I suppose it's possible that the final shm segment
size is a bit larger than what was tested at the shared_buffers step,
but that doesn't seem very likely to be the explanation.  What seems
considerably more probable is that the probe for a shared memory
implementation is screwing up the system state somehow.  It may not be
unrelated that this machine was happy before commit d2aecae went in.
        regards, tom lane



Re: Something fishy happening on frogmouth

From: Andrew Dunstan
Date:
On 10/29/2013 03:12 PM, Tom Lane wrote:
> The last two buildfarm runs on frogmouth have failed in initdb,
> like this:
>
> creating directory d:/mingw-bf/root/HEAD/pgsql.2492/src/test/regress/./tmp_check/data ... ok
> creating subdirectories ... ok
> selecting default max_connections ... 100
> selecting default shared_buffers ... 128MB
> selecting dynamic shared memory implementation ... windows
> creating configuration files ... ok
> creating template1 database in d:/mingw-bf/root/HEAD/pgsql.2492/src/test/regress/./tmp_check/data/base/1 ... FATAL:
> could not open shared memory segment "Global/PostgreSQL.851401618": Not enough space
> child process exited with exit code 1
>
> It shouldn't be failing like that, considering that we just finished
> probing for acceptable max_connections and shared_buffers without hitting
> any apparent limit.  I suppose it's possible that the final shm segment
> size is a bit larger than what was tested at the shared_buffers step,
> but that doesn't seem very likely to be the explanation.  What seems
> considerably more probable is that the probe for a shared memory
> implementation is screwing up the system state somehow.  It may not be
> unrelated that this machine was happy before commit d2aecae went in.
>
>             


I'll try a run with that reverted just to see if that's it.


This is a 32 bit compiler on a 32 bit (virtual) machine, so the change 
to Size is definitely more than cosmetic here.

cheers

andrew




Re: Something fishy happening on frogmouth

From: Andrew Dunstan
Date:
On 10/29/2013 03:47 PM, Andrew Dunstan wrote:
>
> On 10/29/2013 03:12 PM, Tom Lane wrote:
>>  It may not be
>> unrelated that this machine was happy before commit d2aecae went in.
>>
>>
>
>
> I'll try a run with that reverted just to see if that's it.
>
>
> This is a 32 bit compiler on a 32 bit (virtual) machine, so the change 
> to Size is definitely more than cosmetic here.
>
>


And with this reverted it's perfectly happy.

cheers

andrew




Re: Something fishy happening on frogmouth

From: Amit Kapila
Date:
On Wed, Oct 30, 2013 at 12:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The last two buildfarm runs on frogmouth have failed in initdb,
> like this:
>
> creating directory d:/mingw-bf/root/HEAD/pgsql.2492/src/test/regress/./tmp_check/data ... ok
> creating subdirectories ... ok
> selecting default max_connections ... 100
> selecting default shared_buffers ... 128MB
> selecting dynamic shared memory implementation ... windows
> creating configuration files ... ok
> creating template1 database in d:/mingw-bf/root/HEAD/pgsql.2492/src/test/regress/./tmp_check/data/base/1 ... FATAL:
> could not open shared memory segment "Global/PostgreSQL.851401618": Not enough space
> child process exited with exit code 1

In windows implementation of dynamic shared memory, Size calculation
for creating dynamic shared memory is assuming that requested size for
creation of dynamic shared memory segment is uint64, which is changed
by commit d2aecae, so we need to change that calculation as well.
Please find the attached patch to fix this problem.
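
For illustration, here is a minimal sketch of the kind of adjustment being
described (this is not the attached patch; the names are placeholders, and
size_t stands in for PostgreSQL's Size typedef).  CreateFileMapping() takes
the requested size as two 32-bit halves, so the value has to be widened to
64 bits before the high half is extracted:

    /* Illustrative sketch only, not the committed fix. */
    #include <windows.h>
    #include <stdint.h>

    static HANDLE
    create_mapping_sketch(const char *name, size_t request_size)
    {
        /*
         * request_size plays the role of PostgreSQL's Size, which is 32 bits
         * wide on a 32-bit build.  Shifting a 32-bit value by 32 is undefined,
         * so widen to 64 bits before taking the high half.
         */
        DWORD   size_high = (DWORD) (((uint64_t) request_size) >> 32);
        DWORD   size_low = (DWORD) request_size;

        return CreateFileMappingA(INVALID_HANDLE_VALUE, /* backed by the paging file */
                                  NULL,                 /* default security */
                                  PAGE_READWRITE,
                                  size_high,
                                  size_low,
                                  name);
    }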

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: Something fishy happening on frogmouth

From: Robert Haas
Date:
On Wed, Oct 30, 2013 at 1:22 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Oct 30, 2013 at 12:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The last two buildfarm runs on frogmouth have failed in initdb,
>> like this:
>>
>> creating directory d:/mingw-bf/root/HEAD/pgsql.2492/src/test/regress/./tmp_check/data ... ok
>> creating subdirectories ... ok
>> selecting default max_connections ... 100
>> selecting default shared_buffers ... 128MB
>> selecting dynamic shared memory implementation ... windows
>> creating configuration files ... ok
>> creating template1 database in d:/mingw-bf/root/HEAD/pgsql.2492/src/test/regress/./tmp_check/data/base/1 ... FATAL:
>> could not open shared memory segment "Global/PostgreSQL.851401618": Not enough space
>> child process exited with exit code 1
>
> In windows implementation of dynamic shared memory, Size calculation
> for creating dynamic shared memory is assuming that requested size for
> creation of dynamic shared memory segment is uint64, which is changed
> by commit d2aecae, so we need to change that calculation as well.
> Please find the attached patch to fix this problem.

I find it hard to believe this is the right fix.  I know we have
similar code in win32_shmem.c, but surely if size is a 32-bit unsigned
quantity then size >> 0 is simply 0 anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Something fishy happening on frogmouth

From: Robert Haas
Date:
On Wed, Oct 30, 2013 at 8:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Oct 30, 2013 at 1:22 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Wed, Oct 30, 2013 at 12:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> The last two buildfarm runs on frogmouth have failed in initdb,
>>> like this:
>>>
>>> creating directory d:/mingw-bf/root/HEAD/pgsql.2492/src/test/regress/./tmp_check/data ... ok
>>> creating subdirectories ... ok
>>> selecting default max_connections ... 100
>>> selecting default shared_buffers ... 128MB
>>> selecting dynamic shared memory implementation ... windows
>>> creating configuration files ... ok
>>> creating template1 database in d:/mingw-bf/root/HEAD/pgsql.2492/src/test/regress/./tmp_check/data/base/1 ... FATAL:
>>> could not open shared memory segment "Global/PostgreSQL.851401618": Not enough space
>>> child process exited with exit code 1
>>
>> In windows implementation of dynamic shared memory, Size calculation
>> for creating dynamic shared memory is assuming that requested size for
>> creation of dynamic shared memory segment is uint64, which is changed
>> by commit d2aecae, so we need to change that calculation as well.
>> Please find the attached patch to fix this problem.
>
> I find it hard to believe this is the right fix.  I know we have
> similar code in win32_shmem.c, but surely if size is a 32-bit unsigned
> quantity then size >> 0 is simply 0 anyway.

Err, rather, size >> 32 is simply 0 anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Something fishy happening on frogmouth

From: Robert Haas
Date:
On Tue, Oct 29, 2013 at 3:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The last two buildfarm runs on frogmouth have failed in initdb,
> like this:
>
> creating directory d:/mingw-bf/root/HEAD/pgsql.2492/src/test/regress/./tmp_check/data ... ok
> creating subdirectories ... ok
> selecting default max_connections ... 100
> selecting default shared_buffers ... 128MB
> selecting dynamic shared memory implementation ... windows
> creating configuration files ... ok
> creating template1 database in d:/mingw-bf/root/HEAD/pgsql.2492/src/test/regress/./tmp_check/data/base/1 ... FATAL:
> could not open shared memory segment "Global/PostgreSQL.851401618": Not enough space
> child process exited with exit code 1
>
> It shouldn't be failing like that, considering that we just finished
> probing for acceptable max_connections and shared_buffers without hitting
> any apparent limit.  I suppose it's possible that the final shm segment
> size is a bit larger than what was tested at the shared_buffers step,
> but that doesn't seem very likely to be the explanation.  What seems
> considerably more probable is that the probe for a shared memory
> implementation is screwing up the system state somehow.  It may not be
> unrelated that this machine was happy before commit d2aecae went in.

If I'm reading this correctly, the last three runs on frogmouth have
all failed, and all of them have failed with a complaint about,
specifically, Global/PostgreSQL.851401618.  Now, that really shouldn't
be happening, because the code to choose that number looks like this:
       dsm_control_handle = random();

One possibility that occurs to me is that if, for some reason, we're
using the same handle every time on Windows, and if Windows takes a
bit of time to reclaim the segment after the postmaster exits (which
is not hard to believe given some previous Windows behavior I've
seen), then running the postmaster lots of times in quick succession
(as initdb does) might fail.  I dunno what that has to do with the
patch, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Something fishy happening on frogmouth

From: Andres Freund
Date:
On 2013-10-30 08:45:03 -0400, Robert Haas wrote:
> If I'm reading this correctly, the last three runs on frogmouth have
> all failed, and all of them have failed with a complaint about,
> specifically, Global/PostgreSQL.851401618.  Now, that really shouldn't
> be happening, because the code to choose that number looks like this:
> 
>         dsm_control_handle = random();
> 
> One possibility that occurs to me is that if, for some reason, we're
> using the same handle every time on Windows, and if Windows takes a
> bit of time to reclaim the segment after the postmaster exits (which
> is not hard to believe given some previous Windows behavior I've
> seen), then running the postmaster lots of times in quick succession
> (as initdb does) might fail.  I dunno what that has to do with the
> patch, though.

Could it be that we haven't primed the random number generator with the
time or something like that yet?

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Something fishy happening on frogmouth

From: Robert Haas
Date:
On Wed, Oct 30, 2013 at 8:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I find it hard to believe this is the right fix.  I know we have
> similar code in win32_shmem.c, but surely if size is a 32-bit unsigned
> quantity then size >> 0 is simply 0 anyway.

Gosh, I stand corrected.  According to
http://msdn.microsoft.com/en-us/library/336xbhcz.aspx --

"The result is undefined if the right operand of a shift expression is
negative or if the right operand is greater than or equal to the
number of bits in the (promoted) left operand. No shift operation is
performed if the right operand is zero (0)."

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Something fishy happening on frogmouth

From: Robert Haas
Date:
On Wed, Oct 30, 2013 at 8:47 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-10-30 08:45:03 -0400, Robert Haas wrote:
>> If I'm reading this correctly, the last three runs on frogmouth have
>> all failed, and all of them have failed with a complaint about,
>> specifically, Global/PostgreSQL.851401618.  Now, that really shouldn't
>> be happening, because the code to choose that number looks like this:
>>
>>         dsm_control_handle = random();
>>
>> One possibility that occurs to me is that if, for some reason, we're
>> using the same handle every time on Windows, and if Windows takes a
>> bit of time to reclaim the segment after the postmaster exits (which
>> is not hard to believe given some previous Windows behavior I've
>> seen), then running the postmaster lots of times in quick succession
>> (as initdb does) might fail.  I dunno what that has to do with the
>> patch, though.
>
> Could it be that we haven't primed the random number generator with the
> time or something like that yet?

Yeah, I think that's probably what it is.  There's PostmasterRandom()
to initialize the random-number generator on first use, but that
doesn't help if some other module calls random().  I wonder if we
ought to just get rid of PostmasterRandom() and instead have the
postmaster run that initialization code very early in startup.  That'd
make the timing of the random number generator being initialized a bit
more predictable, perhaps, but if the dynamic shared memory code is
going to grab a random number during startup it's basically going to
be nailed to that event anyway.
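
As a rough illustration of the early-seeding idea (the function name and
where it would be called from are assumptions, not the eventual fix):

    #include <stdlib.h>
    #include <sys/time.h>
    #include <unistd.h>

    /*
     * Hypothetical helper: seed random() once, very early in startup, so any
     * module that calls random() before PostmasterRandom() still gets a
     * per-process sequence.  Mixes the time of day with the PID.
     */
    static void
    seed_random_early(void)
    {
        struct timeval tv;

        gettimeofday(&tv, NULL);
        srandom((unsigned int) (tv.tv_sec ^ tv.tv_usec ^ getpid()));
    }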

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Something fishy happening on frogmouth

From: Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> If I'm reading this correctly, the last three runs on frogmouth have
> all failed, and all of them have failed with a complaint about,
> specifically, Global/PostgreSQL.851401618.  Now, that really shouldn't
> be happening, because the code to choose that number looks like this:

>         dsm_control_handle = random();

Isn't this complaining about the main shm segment, not a DSM extension?

Also, why is the error "not enough space", rather than something about
a collision?  And if this is the explanation, why didn't the previous
runs probing for allowable shmem size fail?

BTW, regardless of the specific properties of random(), surely you ought
to have code in there that would cope with a name collision.
        regards, tom lane



Re: Something fishy happening on frogmouth

From: Andres Freund
Date:
On 2013-10-30 09:26:42 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > If I'm reading this correctly, the last three runs on frogmouth have
> > all failed, and all of them have failed with a complaint about,
> > specifically, Global/PostgreSQL.851401618.  Now, that really shouldn't
> > be happening, because the code to choose that number looks like this:
> 
> >         dsm_control_handle = random();
> 
> Isn't this complaining about the main shm segment, not a DSM extension?

Don't think so, that has a ":" in the name. But I think this touches a
fair point, I think we need to make all the dsm error messages more
distinctive. The history since this has been committed makes it likely
that there will be more errors.

> Also, why is the error "not enough space", rather than something about
> a collision?  And if this is the explanation, why didn't the previous
> runs probing for allowable shmem size fail?

Yea, I don't think this explains the issue but something independent
that needs to be fixed.

> BTW, regardless of the specific properties of random(), surely you ought
> to have code in there that would cope with a name collision.

There actually is code that retries, but only for EEXIST.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Something fishy happening on frogmouth

From: Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-10-30 09:26:42 -0400, Tom Lane wrote:
>> Isn't this complaining about the main shm segment, not a DSM extension?

> Don't think so, that has a ":" in the name.

If it *isn't* about the main memory segment, what the hell are we doing
creating random addon segments during bootstrap?  None of the DSM code
should even get control at this point, IMO.
        regards, tom lane



Re: Something fishy happening on frogmouth

From: Robert Haas
Date:
On Wed, Oct 30, 2013 at 9:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> If I'm reading this correctly, the last three runs on frogmouth have
>> all failed, and all of them have failed with a complaint about,
>> specifically, Global/PostgreSQL.851401618.  Now, that really shouldn't
>> be happening, because the code to choose that number looks like this:
>
>>         dsm_control_handle = random();
>
> Isn't this complaining about the main shm segment, not a DSM extension?

No.  That's why the identifier being assigned to has "dsm" in it.
I'll respond to this in more detail in a separate post.

> Also, why is the error "not enough space", rather than something about
> a collision?  And if this is the explanation, why didn't the previous
> runs probing for allowable shmem size fail?

Good questions.  I think that my previous theory was wrong, and that
the patch from Amit which I pushed a while ago should fix the
breakage.

> BTW, regardless of the specific properties of random(), surely you ought
> to have code in there that would cope with a name collision.

I do have code in there to cope with a name collision.  However, that
doesn't mean it's good for it to choose the same name for the segment
by default every time.  If we were going to do it that way I ought to
have just made it serial (PostgreSQL.0, 1, 2, 3, ...) instead of using
random numbers to name them.  The reason I didn't do that is to
minimize the chances of collisions actually happening - and especially
to minimize the chances of a large number of collisions happening.
Especially for System V shared memory, the namespace is rather
constrained, so bouncing around randomly through the namespace makes
it unlikely that we'll hit a whole bunch of identifiers in a row that
are all already in use by some other postmaster or, indeed, a process
unrelated to PostgreSQL.  A linear scan starting at any fixed value
wouldn't have that desirable property.
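
A stripped-down sketch of that strategy (the type and helper names here are
stand-ins, not the actual dsm.c code):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    typedef uint32_t dsm_handle;        /* assumption: matches the real typedef */

    /* Hypothetical stand-in for the platform-specific create call. */
    extern bool try_create_segment(dsm_handle handle);

    static dsm_handle
    choose_segment_handle(void)
    {
        for (;;)
        {
            dsm_handle  handle = (dsm_handle) random();

            if (handle == 0)
                continue;               /* treat 0 as invalid */
            if (try_create_segment(handle))
                return handle;
            /* name already in use: pick another random handle and retry */
        }
    }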

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Something fishy happening on frogmouth

From: Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Oct 30, 2013 at 9:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Also, why is the error "not enough space", rather than something about
>> a collision?  And if this is the explanation, why didn't the previous
>> runs probing for allowable shmem size fail?

> Good questions.  I think that my previous theory was wrong, and that
> the patch from Amit which I pushed a while ago should fix the
> breakage.

Indeed, I see frogmouth just went green, so Amit nailed it.

I'm still wondering why we try to create a DSM segment in bootstrap.
        regards, tom lane



Re: Something fishy happening on frogmouth

From: Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Yeah, I think that's probably what it is.  There's PostmasterRandom()
> to initialize the random-number generator on first use, but that
> doesn't help if some other module calls random().  I wonder if we
> ought to just get rid of PostmasterRandom() and instead have the
> postmaster run that initialization code very early in startup.

You could do arbitrary rearrangement of the postmaster's code and not
succeed in affecting this behavior in the slightest, because the
postmaster isn't running during bootstrap.  I continue to doubt that
there's a good reason to be creating DSM segment(s) here.
        regards, tom lane



Re: Something fishy happening on frogmouth

From: Robert Haas
Date:
On Wed, Oct 30, 2013 at 9:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> On 2013-10-30 09:26:42 -0400, Tom Lane wrote:
>>> Isn't this complaining about the main shm segment, not a DSM extension?
>
>> Don't think so, that has a ":" in the name.
>
> If it *isn't* about the main memory segment, what the hell are we doing
> creating random addon segments during bootstrap?  None of the DSM code
> should even get control at this point, IMO.

Here's a short summary of what I posted back in August: at system
startup time, the postmaster creates one dynamic shared segment,
called the control segment.  That segment sticks around for the
lifetime of the server and records the identity of any *other* dynamic
shared memory segments that are subsequently created.  If the server
dies a horrible death (e.g. kill -9), the next postmaster will find
the previous control segment (whose ID is written to a file in the
data directory) and remove any leftover shared memory segments from
the previous run; without this, such segments would live until the
next server reboot unless manually removed by the user (which isn't
even practical on all platforms; e.g. there doesn't seem to be any way
to list all extant POSIX shared memory segments on MacOS X, so a user
wouldn't know which segments to remove).
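
In outline, the cleanup path described above looks roughly like this (the
names and the control-segment accessors are assumptions for illustration; the
real logic lives in dsm_postmaster_startup() and
dsm_cleanup_using_control_segment()):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef uint32_t dsm_handle;

    /* Hypothetical stand-ins for the real state-file and segment primitives. */
    extern bool read_state_file(const char *path, dsm_handle *handle);
    extern void *attach_segment(dsm_handle handle);
    extern bool control_segment_sane(void *seg);
    extern uint32_t control_segment_nitems(void *seg);
    extern dsm_handle control_segment_item(void *seg, uint32_t i);
    extern void destroy_segment(dsm_handle handle);

    static void
    cleanup_previous_run(const char *state_file_path)
    {
        dsm_handle  old_control;
        void       *seg;

        /* No state file, or unreadable: nothing left over to clean up. */
        if (!read_state_file(state_file_path, &old_control))
            return;

        /* Attach to the old control segment, but only trust it if it looks sane. */
        seg = attach_segment(old_control);
        if (seg == NULL || !control_segment_sane(seg))
            return;

        /* Remove every segment the previous postmaster had registered... */
        for (uint32_t i = 0; i < control_segment_nitems(seg); i++)
            destroy_segment(control_segment_item(seg, i));

        /* ...and finally the old control segment itself. */
        destroy_segment(old_control);
    }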

For my previous posting on this topic, see the following link,
particularly the paragraph which begins "The actual implementation is
split up into two layers" and the following one.

http://www.postgresql.org/message-id/CA+TgmoaDqDUgt=4Zs_QPOnBt=EstEaVNP+5t+m=FPNWshiPR3A@mail.gmail.com

Now, you might ask why not store this control information that we need
for cleanup purposes in the *main* shared memory segment rather than
in a dynamic shared memory segment.  The basic problem is that I don't
know how to dig it out of there in any reasonable way.  The dsm
control segment is small and has a very simple structure; when the
postmaster uses the previous postmaster's leftover control segment to
clean up orphaned shared memory segments, it will ignore that old
control segment unless it passes various sanity tests.  But even if it
passes those sanity tests while being corrupted in some other way, nothing
that happens as a result will cause a fatal error, let alone a server
crash.  You're of course welcome to critique that logic, but I tried
my best to make it bulletproof.  See
dsm_cleanup_using_control_segment().

The structure of the main shared memory segment is way more
complicated.  If we latched onto an old main shared memory segment,
we'd presumably need to traverse ShmemIndex to even find that portion
of the shared memory segment where the DSM control information was
slated to be stored.  And there's no way that's going to be robust in
the face of a possibly-corrupted shared memory segment left over from
a previous run.   And that's actually making the assumption that we
could even do it that way, which we really can't: as of 9.3, things
like ShmemIndex are stored in the MAP_SHARED anonymous mapping, and
the System V shared memory segment is small and fixed-size.  We could
try to refactor the code so that we merge the control segment data
into the residual System V segment, but I think it'd be ugly and I'm
not sure what it really buys us.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Something fishy happening on frogmouth

From: Robert Haas
Date:
On Wed, Oct 30, 2013 at 12:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Yeah, I think that's probably what it is.  There's PostmasterRandom()
>> to initialize the random-number generator on first use, but that
>> doesn't help if some other module calls random().  I wonder if we
>> ought to just get rid of PostmasterRandom() and instead have the
>> postmaster run that initialization code very early in startup.
>
> You could do arbitrary rearrangement of the postmaster's code and not
> succeed in affecting this behavior in the slightest, because the
> postmaster isn't running during bootstrap.

Well, if you're telling me that it's not possible to find a way to
arrange things so that the random number is initialized before first
use, I'm gonna respectfully disagree.  If you're just critiquing my
particular suggestion about where to put that code - fair enough.
Maybe it really ought to live in our src/port implementation of
random() or pg_lrand48().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Something fishy happening on frogmouth

From: Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Oct 30, 2013 at 9:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If it *isn't* about the main memory segment, what the hell are we doing
>> creating random addon segments during bootstrap?  None of the DSM code
>> should even get control at this point, IMO.

> Here's a short summary of what I posted back in August: at system
> startup time, the postmaster creates one dynamic shared segment,
> called the control segment.

Well, as I've pointed out already in this thread, the postmaster does not
execute during bootstrap, which makes me think this code is getting called
from the wrong place.  What possible reason is there to create add-on shm
segments in bootstrap mode?  I'm even dubious that we should create them
in standalone backends, because there will be no other process to share
them with.

I'm inclined to think this initialization should be moved to the actual
postmaster (and I mean postmaster.c) from wherever it is now.  That might
fix the not-so-random name choice in itself, but if it doesn't, then we
could consider where to move the random-seed-initialization step to.
        regards, tom lane



Re: Something fishy happening on frogmouth

From: Robert Haas
Date:
On Wed, Oct 30, 2013 at 9:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Oct 30, 2013 at 9:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> If it *isn't* about the main memory segment, what the hell are we doing
>>> creating random addon segments during bootstrap?  None of the DSM code
>>> should even get control at this point, IMO.
>
>> Here's a short summary of what I posted back in August: at system
>> startup time, the postmaster creates one dynamic shared segment,
>> called the control segment.
>
> Well, as I've pointed out already in this thread, the postmaster does not
> execute during bootstrap, which makes me think this code is getting called
> from the wrong place.  What possible reason is there to create add-on shm
> segments in bootstrap mode?  I'm even dubious that we should create them
> in standalone backends, because there will be no other process to share
> them with.
>
> I'm inclined to think this initialization should be moved to the actual
> postmaster (and I mean postmaster.c) from wherever it is now.  That might
> fix the not-so-random name choice in itself, but if it doesn't, then we
> could consider where to move the random-seed-initialization step to.

The initialization code is currently called from
CreateSharedMemoryAndSemaphores(), like this:

    /* Initialize dynamic shared memory facilities. */
    if (!IsUnderPostmaster)
        dsm_postmaster_startup();

The reason I put it there is that if the postmaster does a
crash-and-restart cycle, we need create a new control segment just as
we need to create a new main shared memory segment.  (We also need to
make sure all dynamic shared memory segments left over from the
previous postmaster lifetime get nuked, but that happens earlier, as
part of the shmem_exit sequence.)

There may be a good reason to move it elsewhere, but by and large I
have not had good luck deviating from the pattern laid down for the
main shared memory segment.  My respect for that battle-tested code is
growing daily; every time I think I know better, I get burned.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Something fishy happening on frogmouth

From: Heikki Linnakangas
Date:
On 30.10.2013 18:52, Robert Haas wrote:
> Here's a short summary of what I posted back in August: at system
> startup time, the postmaster creates one dynamic shared segment,
> called the control segment.  That segment sticks around for the
> lifetime of the server and records the identity of any *other* dynamic
> shared memory segments that are subsequently created.  If the server
> dies a horrible death (e.g. kill -9), the next postmaster will find
> the previous control segment (whose ID is written to a file in the
> data directory) and remove any leftover shared memory segments from
> the previous run; without this, such segments would live until the
> next server reboot unless manually removed by the user (which isn't
> even practical on all platforms; e.g. there doesn't seem to be any way
> to list all extant POSIX shared memory segments on MacOS X, so a user
> wouldn't know which segments to remove).

Wait, that sounds horrible. If you kill -9 the server, and then rm -rf 
$PGDATA, the shared memory segment is leaked until next reboot? I find 
that unacceptable. There are many scenarios where you never restart 
postmaster after a crash. For example, if you have an automatic failover 
setup; you fail over to the standby in case of crash, and re-initialize 
the old master with e.g rsync.

- Heikki



Re: Something fishy happening on frogmouth

From: Andres Freund
Date:
Hi,

On 2013-10-31 11:33:28 +0200, Heikki Linnakangas wrote:
> Wait, that sounds horrible. If you kill -9 the server, and then rm -rf
> $PGDATA, the shared memory segment is leaked until next reboot? I find that
> unacceptable. There are many scenarios where you never restart postmaster
> after a crash. For example, if you have an automatic failover setup; you
> fail over to the standby in case of crash, and re-initialize the old master
> with e.g rsync.

Our main shared memory segment works the same way, doesn't it? And it
has for a long time.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Something fishy happening on frogmouth

From: Robert Haas
Date:
On Thu, Oct 31, 2013 at 5:50 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-10-31 11:33:28 +0200, Heikki Linnakangas wrote:
>> Wait, that sounds horrible. If you kill -9 the server, and then rm -rf
>> $PGDATA, the shared memory segment is leaked until next reboot? I find that
>> unacceptable. There are many scenarios where you never restart postmaster
>> after a crash. For example, if you have an automatic failover setup; you
>> fail over to the standby in case of crash, and re-initialize the old master
>> with e.g rsync.
>
> Our main shared memory segment works the same way, doesn't it? And it
> has for a long time.

It does, and what's the alternative, anyway?  I mean, if the user or
the system decides to terminate all of the postgres processes on the
machine with extreme prejudice, like kill -9, we can't do anything
afterwards, and we can't do anything beforehand, either.  Of course,
it would be nice if there were an operating system API that said -
give me a named shared memory segment that automatically goes away
when the last active reference is gone.  But, except on Windows, no
such API appears to exist.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Something fishy happening on frogmouth

From: Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Oct 31, 2013 at 5:50 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> On 2013-10-31 11:33:28 +0200, Heikki Linnakangas wrote:
>>> Wait, that sounds horrible. If you kill -9 the server, and then rm -rf
>>> $PGDATA, the shared memory segment is leaked until next reboot?

>> Our main shared memory segment works the same way, doesn't it? And it
>> has for a long time.

> It does, and what's the alternative, anyway?

Well, what we expect from the existing shmem code is that restarting the
postmaster will clean things up, ie find and destroy the leaked shmem.
It sounds to me like this may not work like that, in which case I agree
with Heikki that it's not really acceptable.

Maybe, rather than trying to make the control segment's name random,
we should derive it from the data directory inode number, or some such?
That way we could find it reliably during restart.
        regards, tom lane



Re: Something fishy happening on frogmouth

From: Andres Freund
Date:
On 2013-10-31 10:29:17 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Thu, Oct 31, 2013 at 5:50 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> On 2013-10-31 11:33:28 +0200, Heikki Linnakangas wrote:
> >>> Wait, that sounds horrible. If you kill -9 the server, and then rm -rf
> >>> $PGDATA, the shared memory segment is leaked until next reboot?
> 
> >> Our main shared memory segment works the same way, doesn't it? And it
> >> has for a long time.
> 
> > It does, and what's the alternative, anyway?
> 
> Well, what we expect from the existing shmem code is that restarting the
> postmaster will clean things up, ie find and destroy the leaked shmem.
> It sounds to me like this may not work like that, in which case I agree
> with Heikki that it's not really acceptable.

The code writes a state file containing the identity of the dsm "control
segment" used last time. That state file is read at startup and, if the
segment still exists, used to attach to the control segment, which contains
a list of "user defined" dsm segments so they can be cleaned up.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Something fishy happening on frogmouth

From: Robert Haas
Date:
On Thu, Oct 31, 2013 at 10:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Oct 31, 2013 at 5:50 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>>> On 2013-10-31 11:33:28 +0200, Heikki Linnakangas wrote:
>>>> Wait, that sounds horrible. If you kill -9 the server, and then rm -rf
>>>> $PGDATA, the shared memory segment is leaked until next reboot?
>
>>> Our main shared memory segment works the same way, doesn't it? And it
>>> has for a long time.
>
>> It does, and what's the alternative, anyway?
>
> Well, what we expect from the existing shmem code is that restarting the
> postmaster will clean things up, ie find and destroy the leaked shmem.
> It sounds to me like this may not work like that, in which case I agree
> with Heikki that it's not really acceptable.

I'm getting a little frustrated.  It *does* work like that.  I sent an
email explaining that yesterday, and Andres sent another one this
morning.

Let me say this again: the dynamic shared memory code *does* clean up
after itself.  If you kill -9 the postmaster and all of its children,
you'll orphan the main shared memory segment and any dynamic shared
memory segments that exist.  There is nothing we can do about that.
When you restart the postmaster, both the main shared memory segment
and any dynamic shared memory segments orphaned by the previous kill
will be cleaned up.  I spent a lot of time trying to make sure that
the handling of dynamic shared memory segments is, in all cases, as
parallel to the handling of the main shared memory segment as
possible.  There should be no cases where the main shared memory
segment gets cleaned up and the dynamic shared memory segments do not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Something fishy happening on frogmouth

From: Heikki Linnakangas
Date:
On 31.10.2013 16:43, Robert Haas wrote:
> Let me say this again: the dynamic shared memory code *does* clean up
> after itself.  If you kill -9 the postmaster and all of its children,
> you'll orphan the main shared memory segment and any dynamic shared
> memory segments that exist.  There is nothing we can do about that.
> When you restart the postmaster, both the main shared memory segment
> and any dynamic shared memory segments orphaned by the previous kill
> will be cleaned up.  I spent a lot of time trying to make sure that
> the handling of dynamic shared memory segments is, in all cases, as
> parallel to the handling of the main shared memory segment as
> possible.  There should be no cases where the main shared memory
> segment gets cleaned up and the dynamic shared memory segments do not.

1. initdb -D data1
2. initdb -D data2
3. postgres -D data1
4. killall -9 postgres
5. postgres -D data2

The system V shmem segment orphaned at step 4 will be cleaned up at step 
5. The DSM segment will not.

BTW, 9.3 actually made the situation a lot better for the main memory 
segment. You only leak the small interlock shmem segment, the large 
mmap'd block does get automatically free'd when the last process using 
it exits.

- Heikki



Re: Something fishy happening on frogmouth

From: Robert Haas
Date:
On Thu, Oct 31, 2013 at 7:48 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 31.10.2013 16:43, Robert Haas wrote:
>> There should be no cases where the main shared memory
>> segment gets cleaned up and the dynamic shared memory segments do not.
>
> 1. initdb -D data1
> 2. initdb -D data2
> 3. postgres -D data1
> 4. killall -9 postgres
> 5. postgres -D data2
>
> The system V shmem segment orphaned at step 4 will be cleaned up at step 5.
> The DSM segment will not.

OK, true.  However, that "works" only because you've got two postmasters
configured to run on the same port,
which in practice is a rather unlikely configuration.  And even if you
do have that configuration, I'm not sure that it's a feature that they
can interfere with each other like that.  Do you think it is?

If we want the behavior, we could mimic what the main shared memory
code does here: instead of choosing a random value for the control
segment identifier and saving it in a state file, start with something
like port * 100 + 1000000 (the main shared memory segment uses port *
100, so we'd want something at least slightly different) and search
forward one value at a time from there until we find an unused ID.
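
A sketch of what that scheme could look like for System V keys (the starting
constant mirrors the suggestion above; the probe helper is hypothetical):

    #include <stdbool.h>
    #include <sys/ipc.h>
    #include <sys/types.h>

    /* Hypothetical probe: true if the key is free (or safely recyclable). */
    extern bool try_sysv_key(key_t key);

    static key_t
    choose_dsm_control_key(int port)
    {
        /*
         * Start just above the main segment's customary key range (port * 100)
         * and walk forward, so a restarted postmaster retraces the same path
         * and can find any segment the previous one left behind.
         */
        key_t   key = (key_t) (port * 100 + 1000000);

        while (!try_sysv_key(key))
            key++;
        return key;
    }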

> BTW, 9.3 actually made the situation a lot better for the main memory
> segment. You only leak the small interlock shmem segment, the large mmap'd
> block does get automatically free'd when the last process using it exits.

Yeah.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Something fishy happening on frogmouth

From: Noah Misch
Date:
On Fri, Nov 01, 2013 at 12:27:31AM -0400, Robert Haas wrote:
> On Thu, Oct 31, 2013 at 7:48 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> > On 31.10.2013 16:43, Robert Haas wrote:
> >> There should be no cases where the main shared memory
> >> segment gets cleaned up and the dynamic shared memory segments do not.
> >
> > 1. initdb -D data1
> > 2. initdb -D data2
> > 3. postgres -D data1
> > 4. killall -9 postgres
> > 5. postgres -D data2
> >
> > The system V shmem segment orphaned at step 4 will be cleaned up at step 5.
> > The DSM segment will not.

Note that dynamic_shared_memory_type='mmap' will give the desired behavior.

> If we want the behavior, we could mimic what the main shared memory
> code does here: instead of choosing a random value for the control
> segment identifier and saving it in a state file, start with something
> like port * 100 + 1000000 (the main shared memory segment uses port *
> 100, so we'd want something at least slightly different) and search
> forward one value at a time from there until we find an unused ID.

This approach used for the main sysv segment has its own problems.  If the
first postmaster had to search forward but the second postmaster does not, the
second postmaster will not reach the old segment to clean it up.

It might be suitably-cheap insurance to store the DSM control segment handle
in PGShmemHeader.  Then if, by whatever means good or bad, we find a main sysv
segment to clean up, we can always clean up the associated DSM segment(s).
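
Concretely, the suggestion amounts to something like the following (the struct
shown is a simplified assumption for illustration, not the actual PGShmemHeader
definition in pg_shmem.h):

    #include <stdint.h>
    #include <sys/types.h>

    typedef uint32_t dsm_handle;

    /* Assumed, simplified layout of the main sysv segment's header. */
    typedef struct PGShmemHeader
    {
        int32_t     magic;          /* magic # to identify Postgres segments */
        pid_t       creatorPID;     /* PID of creating process */
        size_t      totalsize;      /* total size of segment */
        size_t      freeoffset;     /* offset of first free space in segment */
        dsm_handle  dsm_control;    /* proposed: handle of DSM control segment */
    } PGShmemHeader;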

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com



Re: Something fishy happening on frogmouth

From: Heikki Linnakangas
Date:
On 01.11.2013 18:22, Noah Misch wrote:
> On Fri, Nov 01, 2013 at 12:27:31AM -0400, Robert Haas wrote:
>> On Thu, Oct 31, 2013 at 7:48 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>>> On 31.10.2013 16:43, Robert Haas wrote:
>>>> There should be no cases where the main shared memory
>>>> segment gets cleaned up and the dynamic shared memory segments do not.
>>>
>>> 1. initdb -D data1
>>> 2. initdb -D data2
>>> 3. postgres -D data1
>>> 4. killall -9 postgres
>>> 5. postgres -D data2
>>>
>>> The system V shmem segment orphaned at step 4 will be cleaned up at step 5.
>>> The DSM segment will not.
>
> Note that dynamic_shared_memory_type='mmap' will give the desired behavior.

Hmm, here's another idea:

Postmaster creates the POSIX shared memory object at startup, by calling 
shm_open(), and immediately calls shm_unlink on it. That way, once all 
the processes have exited, the object will be removed automatically. 
Child processes inherit the file descriptor at fork(), and don't need to 
call shm_open, just mmap().

I'm not sure how dynamic these segments need to be, but if 1-2 such file 
descriptors are not enough, you could mmap() different offsets from the 
same shmem object for different segments.
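
Roughly, the proposed pattern looks like this (a sketch under POSIX shm
semantics; error handling trimmed, and portability to every supported platform
is exactly the open question):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /*
     * Create a POSIX shm object and unlink it immediately: the name disappears,
     * but the object survives as long as some process holds the fd or a
     * mapping, and is reclaimed automatically once the last one exits.
     * Children inherit the fd across fork() and just mmap() it.
     */
    static int
    create_unlinked_shm(const char *name, size_t size)
    {
        int     fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);

        if (fd < 0)
            return -1;
        shm_unlink(name);
        if (ftruncate(fd, (off_t) size) < 0)
        {
            close(fd);
            return -1;
        }
        return fd;
    }

    static void *
    map_shm(int fd, size_t size)
    {
        return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }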

- Heikki



Re: Something fishy happening on frogmouth

From: Andres Freund
Date:
On 2013-11-04 10:27:47 +0200, Heikki Linnakangas wrote:
> On 01.11.2013 18:22, Noah Misch wrote:
> >On Fri, Nov 01, 2013 at 12:27:31AM -0400, Robert Haas wrote:
> >>On Thu, Oct 31, 2013 at 7:48 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> >>>On 31.10.2013 16:43, Robert Haas wrote:
> >>>>There should be no cases where the main shared memory
> >>>>segment gets cleaned up and the dynamic shared memory segments do not.
> >>>
> >>>1. initdb -D data1
> >>>2. initdb -D data2
> >>>3. postgres -D data1
> >>>4. killall -9 postgres
> >>>5. postgres -D data2
> >>>
> >>>The system V shmem segment orphaned at step 4 will be cleaned up at step 5.
> >>>The DSM segment will not.
> >
> >Note that dynamic_shared_memory_type='mmap' will give the desired behavior.

Well, at the significant price of causing file I/O.

> Hmm, here's another idea:
> 
> Postmaster creates the POSIX shared memory object at startup, by calling
> shm_open(), and immediately calls shm_unlink on it. That way, once all the
> processes have exited, the object will be removed automatically. Child
> processes inherit the file descriptor at fork(), and don't need to call
> shm_open, just mmap().

Uh. Won't that completely and utterly remove the point of dsm which is
that you can create segments *after* startup? We surely don't want to
start overallocating enough shmem so we don't ever dynamically need to
allocate segments.
Also, I don't think it's portable across platforms to access segments
that already have been unlinked.

I think this is looking for a solution without an actually relevant
problem disregarding the actual problem space.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Something fishy happening on frogmouth

From: Heikki Linnakangas
Date:
On 04.11.2013 11:55, Andres Freund wrote:
> On 2013-11-04 10:27:47 +0200, Heikki Linnakangas wrote:
>> Hmm, here's another idea:
>>
>> Postmaster creates the POSIX shared memory object at startup, by calling
>> shm_open(), and immediately calls shm_unlink on it. That way, once all the
>> processes have exited, the object will be removed automatically. Child
>> processes inherit the file descriptor at fork(), and don't need to call
>> shm_open, just mmap().
>
> Uh. Won't that completely and utterly remove the point of dsm which is
> that you can create segments *after* startup? We surely don't want to
> start overallocating enough shmem so we don't ever dynamically need to
> allocate segments.

You don't need to allocate the shared memory beforehand, only create the 
file descriptor. Note that the size of the segment is specified not in the 
shm_open() call, but in the mmap() that comes later.

If we need a large amount of small segments, so that it's not feasible 
to shm_open() them all at postmaster startup, you could shm_open() just 
one segment, and carve out smaller segments from it by mmap()ing at 
different offsets.

> Also, I don't think it's portable across platforms to access segments
> that already have been unlinked.

See 
http://pubs.opengroup.org/onlinepubs/009695299/functions/shm_unlink.html: "If 
one or more references to the shared memory object exist when the object 
is unlinked, the name shall be removed before shm_unlink() returns, but 
the removal of the memory object contents shall be postponed until all 
open and map references to the shared memory object have been removed."

That doesn't explicitly say that a new shm_open() on the file descriptor 
must still work after shm_unlink, but I would be surprised if there is a 
platform where it doesn't.

> I think this is looking for a solution without an actually relevant
> problem disregarding the actual problem space.

I agree. What are these dynamic shared memory segments supposed to be 
used for?

- Heikki



Re: Something fishy happening on frogmouth

From: Andres Freund
Date:
On 2013-11-04 13:13:27 +0200, Heikki Linnakangas wrote:
> On 04.11.2013 11:55, Andres Freund wrote:
> >On 2013-11-04 10:27:47 +0200, Heikki Linnakangas wrote:
> >>Postmaster creates the POSIX shared memory object at startup, by calling
> >>shm_open(), and immediately calls shm_unlink on it. That way, once all the
> >>processes have exited, the object will be removed automatically. Child
> >>processes inherit the file descriptor at fork(), and don't need to call
> >>shm_open, just mmap().
> >
> >Uh. Won't that completely and utterly remove the point of dsm which is
> >that you can create segments *after* startup? We surely don't want to
> >start overallocating enough shmem so we don't ever dynamically need to
> >allocate segments.

> You don't need to allocate the shared memory beforehand, only create the
> file descriptor. Note that the size of the segment is specified in the
> shm_open() call, but the mmap() that comes later.
>
> If we need a large amount of small segments, so that it's not feasible to
> shm_open() them all at postmaster startup, you could shm_open() just one
> segment, and carve out smaller segments from it by mmap()ing at different
> offsets.

That quickly will result in fragmentation which we don't have the tools
to handle.

> >Also, I don't think it's portable across platforms to access segments
> >that already have been unlinked.
> 
> See
> http://pubs.opengroup.org/onlinepubs/009695299/functions/shm_unlink.html:
> "If one or more references to the shared memory object exist when the object
> is unlinked, the name shall be removed before shm_unlink() returns, but the
> removal of the memory object contents shall be postponed until all open and
> map references to the shared memory object have been removed."

We also support sysv shmem and have the same cleanup problem there.

> That doesn't explicitly say that a new shm_open() on the file descriptor
> must still work after shm_unlink, but I would be surprised if there is a
> platform where it doesn't.

Probably true.

> >I think this is looking for a solution without an actually relevant
> >problem disregarding the actual problem space.

To make that clearer: I think the discussions about making it impossible
to leak segments after rm -rf are the irrelevant problem.

> I agree. What are these dynamic shared memory segments supposed to be used
> for?

Parallel sort and stuff like that.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Something fishy happening on frogmouth

From: Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-11-04 13:13:27 +0200, Heikki Linnakangas wrote:
>> On 04.11.2013 11:55, Andres Freund wrote:
>>> Also, I don't think it's portable across platforms to access segments
>>> that already have been unlinked.

>> See
>> http://pubs.opengroup.org/onlinepubs/009695299/functions/shm_unlink.html:
>> "If one or more references to the shared memory object exist when the object
>> is unlinked, the name shall be removed before shm_unlink() returns, but the
>> removal of the memory object contents shall be postponed until all open and
>> map references to the shared memory object have been removed."

> We also support sysv shmem and have the same cleanup problem there.

And what about Windows?
        regards, tom lane



Re: Something fishy happening on frogmouth

From: Noah Misch
Date:
On Wed, Oct 30, 2013 at 09:07:43AM -0400, Robert Haas wrote:
> On Wed, Oct 30, 2013 at 8:47 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-10-30 08:45:03 -0400, Robert Haas wrote:
> >> If I'm reading this correctly, the last three runs on frogmouth have
> >> all failed, and all of them have failed with a complaint about,
> >> specifically, Global/PostgreSQL.851401618.  Now, that really shouldn't
> >> be happening, because the code to choose that number looks like this:
> >>
> >>         dsm_control_handle = random();

> > Could it be that we haven't primed the random number generator with the
> > time or something like that yet?
> 
> Yeah, I think that's probably what it is.

I experienced a variation of this, namely a RHEL 7 system where initdb always
says "selecting dynamic shared memory implementation ... sysv".  Each initdb
is rejecting posix shm by probing the same ten segments:

$ strace initdb -D scratch 2>&1 | grep /dev/shm/P
open("/dev/shm/PostgreSQL.1804289383", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW|O_CLOEXEC, 0600) = -1 EEXIST (File exists)
open("/dev/shm/PostgreSQL.846930886", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW|O_CLOEXEC, 0600) = -1 EEXIST (File exists)
open("/dev/shm/PostgreSQL.1681692777", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW|O_CLOEXEC, 0600) = -1 EEXIST (File exists)
open("/dev/shm/PostgreSQL.1714636915", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW|O_CLOEXEC, 0600) = -1 EEXIST (File exists)
open("/dev/shm/PostgreSQL.1957747793", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW|O_CLOEXEC, 0600) = -1 EEXIST (File exists)
open("/dev/shm/PostgreSQL.424238335", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW|O_CLOEXEC, 0600) = -1 EEXIST (File exists)
open("/dev/shm/PostgreSQL.719885386", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW|O_CLOEXEC, 0600) = -1 EEXIST (File exists)
open("/dev/shm/PostgreSQL.1649760492", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW|O_CLOEXEC, 0600) = -1 EEXIST (File exists)
open("/dev/shm/PostgreSQL.596516649", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW|O_CLOEXEC, 0600) = -1 EEXIST (File exists)
open("/dev/shm/PostgreSQL.1189641421", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW|O_CLOEXEC, 0600) = -1 EEXIST (File exists)

Regular postmaster runs choose a random segment, but initdb, bootstrap
postgres, and single-user postgres all start with the same segment.  These
segments are months old.  Perhaps I was testing something that caused a
bootstrap postgres to crash.  After ten such crashes, future initdb runs
considered posix shm unusable.
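
Those ten names are exactly what an unseeded random() produces on glibc; a
tiny standalone program makes the point:

    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        /*
         * With no srandom() call, random() behaves as if srandom(1) had been
         * done, so every run prints the same values -- on glibc, the same
         * sequence seen in the strace output above, starting with 1804289383.
         */
        for (int i = 0; i < 10; i++)
            printf("%ld\n", random());
        return 0;
    }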

> There's PostmasterRandom()
> to initialize the random-number generator on first use, but that
> doesn't help if some other module calls random().  I wonder if we
> ought to just get rid of PostmasterRandom() and instead have the
> postmaster run that initialization code very early in startup.

Usually, the first srandom() call happens early in PostmasterMain().  I plan
to add one to InitStandaloneProcess(), which substitutes for several tasks
otherwise done in PostmasterMain().  That seems like a good thing even if DSM
weren't in the picture.  Also, initdb needs an srandom() somewhere;
choose_dsm_implementation() itself seems fine.  Attached.  With this, "make
-j20 check-world" selected posix shm and passed even when I forced DSM
creation to fail on unseeded random():

--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -249,2 +249,5 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
 
+    if (handle == 1804289383)
+        elog(ERROR, "generated handle with no randomness");
+
     snprintf(name, 64, "/PostgreSQL.%u", handle);

Attachments

Re: Something fishy happening on frogmouth

From: Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> Usually, the first srandom() call happens early in PostmasterMain().  I plan
> to add one to InitStandaloneProcess(), which substitutes for several tasks
> otherwise done in PostmasterMain().  That seems like a good thing even if DSM
> weren't in the picture.  Also, initdb needs an srandom() somewhere;
> choose_dsm_implementation() itself seems fine.  Attached.

+1, but some comments would be good.

            regards, tom lane


Re: Something fishy happening on frogmouth

From: Kyotaro HORIGUCHI
Date:
Thank you for finding and fixing this.

At Sat, 15 Sep 2018 18:21:52 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <15555.1537050112@sss.pgh.pa.us>
> Noah Misch <noah@leadboat.com> writes:
> > Usually, the first srandom() call happens early in PostmasterMain().  I plan
> > to add one to InitStandaloneProcess(), which substitutes for several tasks
> > otherwise done in PostmasterMain().  That seems like a good thing even if DSM
> > weren't in the picture.  Also, initdb needs an srandom() somewhere;
> > choose_dsm_implementation() itself seems fine.  Attached.
> 
> +1, but some comments would be good.
> 
>             regards, tom lane
> 

+1, too.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Something fishy happening on frogmouth

From: Noah Misch
Date:
On Wed, Sep 19, 2018 at 01:03:44PM +0900, Kyotaro HORIGUCHI wrote:
> Thank you for finding and fixing this.
> 
> At Sat, 15 Sep 2018 18:21:52 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <15555.1537050112@sss.pgh.pa.us>
> > Noah Misch <noah@leadboat.com> writes:
> > > Usually, the first srandom() call happens early in PostmasterMain().  I plan
> > > to add one to InitStandaloneProcess(), which substitutes for several tasks
> > > otherwise done in PostmasterMain().  That seems like a good thing even if DSM
> > > weren't in the picture.  Also, initdb needs an srandom() somewhere;
> > > choose_dsm_implementation() itself seems fine.  Attached.
> > 
> > +1, but some comments would be good.

> +1, too.

Thanks for reviewing.  I pushed with some comments.