Thread: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

[HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Thomas Munro
Date:
Hi,

The subscription test 002_types.pl sometimes hangs for a while and
then times out when run on a Travis CI build VM running Ubuntu Trusty
if --enable-coverage is used.  I guess it might be a timing/race
problem because I can't think of any mechanism specific to coverage
instrumentation except that it slows stuff down, but I don't know.
The only other thing I can think of is that perhaps the instrumented
code is writing something to stdout or stderr and that's finding its
way into some protocol stream and confusing things, but I can't
reproduce this on any of my usual development machines.  I haven't
tried that exact operating system.  Maybe it's a bug in the toolchain,
but that's an Ubuntu LTS release so I assume other people use it
(though I don't see it in the buildfarm).

Example:

t/001_rep_changes.pl .. ok
t/002_types.pl ........ #
# Looks like your test exited with 29 before it could output anything.
t/002_types.pl ........ Dubious, test returned 29 (wstat 7424, 0x1d00)
Failed 3/3 subtests
t/003_constraints.pl .. ok
t/004_sync.pl ......... ok
t/005_encoding.pl ..... ok
t/007_ddl.pl .......... ok
Test Summary Report
-------------------
t/002_types.pl      (Wstat: 7424 Tests: 0 Failed: 0)
  Non-zero exit status: 29
  Parse errors: Bad plan.  You planned 3 tests but ran 0.

Before I figured out that --coverage was relevant, I wondered if the
recent commit 8edacab209957520423770851351ab4013cb0167 which landed
around the time I spotted this might have something to do with it, but
I tried reverting the code change in there and it didn't help.  Here's
a build log:
   https://travis-ci.org/macdice/postgres/jobs/276752803

As you can see, the script used was:

   ./configure --enable-debug --enable-cassert --enable-tap-tests --enable-coverage &&
   make -j4 all contrib &&
   make -C src/test/subscription check

In this build you can see the output of the following at the end,
which might provide clues to the initiated.  You might need to click a
small triangle to unfold the commands' output.
   cat ./src/test/subscription/tmp_check/log/002_types_publisher.log
   cat ./src/test/subscription/tmp_check/log/002_types_subscriber.log
   cat ./src/test/subscription/tmp_check/log/regress_log_002_types

There are messages like this:
   WARNING:  terminating connection because of crash of another server process
   DETAIL:  The postmaster has commanded this server process to roll back the
   current transaction and exit, because another server process exited
   abnormally and possibly corrupted shared memory.
   HINT:  In a moment you should be able to reconnect to the database and
   repeat your command.

As far as I know these are misleading; it's really just an immediate
shutdown.  There is no core file, so I don't believe anything actually
crashed.

Here's a version that's the same except it doesn't have
--enable-coverage.  It passes:
   https://travis-ci.org/macdice/postgres/jobs/276771654

Any ideas?

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Andres Freund
Date:
Hi,

On 2017-09-18 21:57:04 +1200, Thomas Munro wrote:
> The subscription test 002_types.pl sometimes hangs for a while and
> then times out when run on a Travis CI build VM running Ubuntu Trusty
> if --enable-coverage is used.

Yea, I saw that too.


> I guess it might be a timing/race
> problem because I can't think of any mechanism specific to coverage
> instrumentation except that it slows stuff down, but I don't know.
> The only other thing I can think of is that perhaps the instrumented
> code is writing something to stdout or stderr and that's finding its
> way into some protocol stream and confusing things, but I can't
> reproduce this on any of my usual development machines.  I haven't
> tried that exact operating system.  Maybe it's a bug in the toolchain,
> but that's an Ubuntu LTS release so I assume other people use it
> (though I don't see it in the buildfarm).

I've run this locally through a number of iterations with coverage
enabled; I think I reproduced it once, but unfortunately I'd continued
because I was working on something else at that moment.


It might be worthwhile to play around with replacing the "or die" in

    my $synced_query =
      "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('s', 'r');";
    $node_subscriber->poll_query_until('postgres', $synced_query)
      or die "Timed out while waiting for subscriber to synchronize data";

with something like

    or diag($node_subscriber->safe_psql('postgres', 'SELECT * FROM pg_subscription_rel'))

just to know where to go from here.


>     WARNING:  terminating connection because of crash of another server process
>     DETAIL:  The postmaster has commanded this server process to roll
> back the current transaction and exit, because another server process
> exited abnormally and possibly corrupted shared memory.
>     HINT:  In a moment you should be able to reconnect to the database
> and repeat your command.
> 
> As far as I know these are misleading, it's really just an immediate
> shutdown.  There is no core file, so I don't believe anything actually
> crashed.

I was about to complain about these, for entirely unrelated reasons.  I
think it's a bad idea to emit these warnings, and there are a couple of
complaints about them on the lists too.  It's not entirely trivial to
fix though :(

Greetings,

Andres Freund



Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> In this build you can see the output of the following at the end,
> which might provide clues to the initiated.  You might need to click a
> small triangle to unfold the commands' output.

>     cat ./src/test/subscription/tmp_check/log/002_types_publisher.log
>     cat ./src/test/subscription/tmp_check/log/002_types_subscriber.log
>     cat ./src/test/subscription/tmp_check/log/regress_log_002_types

The subscriber log includes
2017-09-18 08:43:08.240 UTC [15672] WARNING:  out of background worker slots
2017-09-18 08:43:08.240 UTC [15672] HINT:  You might need to increase max_worker_processes.

Maybe that's harmless, but I'm suspicious that it's a smoking gun.
I think perhaps this reflects a failed attempt to launch a worker,
which the caller does not realize has failed to launch because of the
lack of worker-fork-failure error recovery I bitched about months ago
[1], leading to subscription startup waiting forever for a worker that's
never going to report finishing.

I see Amit K. just posted a patch in that area [2], haven't looked at it
yet.
        regards, tom lane

[1] https://www.postgresql.org/message-id/4905.1492813727@sss.pgh.pa.us
[2] https://www.postgresql.org/message-id/CAA4eK1KDfKkvrjxsKJi3WPyceVi3dH1VCkbTJji2fuwKuB=3uw@mail.gmail.com



Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Tom Lane
Date:
I wrote:
> The subscriber log includes
> 2017-09-18 08:43:08.240 UTC [15672] WARNING:  out of background worker slots
> 2017-09-18 08:43:08.240 UTC [15672] HINT:  You might need to increase max_worker_processes.

> Maybe that's harmless, but I'm suspicious that it's a smoking gun.

Actually: looking at the place where this is issued, namely
logicalrep_worker_launch, and comparing it to the cleanup logic
in WaitForReplicationWorkerAttach, it looks to me like the problem
is we're failing to do logicalrep_worker_cleanup() in this case.
We have initialized a logical rep worker slot, and we're just
walking away from it.
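
For concreteness, a minimal sketch of the kind of cleanup that appears to be
missing on that path (paraphrasing logicalrep_worker_launch(); the surrounding
slot setup and locking are assumed, and this is not an actual patch):

    /* Sketch only: the register-failure path in logicalrep_worker_launch(). */
    if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
    {
        /*
         * The bgworker request failed, so release the LogicalRepWorker
         * slot claimed above; nobody else is going to do it for us.
         */
        LWLockAcquire(LogicalRepWorkerLock, LW_EXCLUSIVE);
        logicalrep_worker_cleanup(worker);
        LWLockRelease(LogicalRepWorkerLock);

        ereport(WARNING,
                (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
                 errmsg("out of background worker slots"),
                 errhint("You might need to increase max_worker_processes.")));
        return;
    }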

I'll go see if I can't reproduce this by injecting random failures right
there.
        regards, tom lane



Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Amit Kapila
Date:
On Mon, Sep 18, 2017 at 7:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> In this build you can see the output of the following at the end,
>> which might provide clues to the initiated.  You might need to click a
>> small triangle to unfold the commands' output.
>
>>     cat ./src/test/subscription/tmp_check/log/002_types_publisher.log
>>     cat ./src/test/subscription/tmp_check/log/002_types_subscriber.log
>>     cat ./src/test/subscription/tmp_check/log/regress_log_002_types
>
> The subscriber log includes
> 2017-09-18 08:43:08.240 UTC [15672] WARNING:  out of background worker slots
> 2017-09-18 08:43:08.240 UTC [15672] HINT:  You might need to increase max_worker_processes.
>
> Maybe that's harmless, but I'm suspicious that it's a smoking gun.
> I think perhaps this reflects a failed attempt to launch a worker,
> which the caller does not realize has failed to launch because of the
> lack of worker-fork-failure error recovery I bitched about months ago
> [1], leading to subscription startup waiting forever for a worker that's
> never going to report finishing.
>
> I see Amit K. just posted a patch in that area [2], haven't looked at it
> yet.
>

I noticed while fixing the issue you are talking about that this path
is also susceptible to such problems.  WaitForReplicationWorkerAttach()
relies on GetBackgroundWorkerPid() to know the status of the worker,
which won't give the correct status in case of fork failure.  The patch
I have posted has a fix for that issue; however, this could be an
entirely different issue altogether, as it appears from your next email
in this thread.


[1] - https://www.postgresql.org/message-id/CAA4eK1KDfKkvrjxsKJi3WPyceVi3dH1VCkbTJji2fuwKuB%3D3uw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:
> On Mon, Sep 18, 2017 at 7:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The subscriber log includes
>> 2017-09-18 08:43:08.240 UTC [15672] WARNING:  out of background worker slots
>> Maybe that's harmless, but I'm suspicious that it's a smoking gun.

> I have noticed while fixing the issue you are talking that this path
> is also susceptible to such problems.  In
> WaitForReplicationWorkerAttach(), it relies on
> GetBackgroundWorkerPid() to know the status of the worker which won't
> give the correct status in case of fork failure.  The patch I have
> posted has a fix for the issue,

Your GetBackgroundWorkerPid fix looks good as far as it goes, but
I feel that WaitForReplicationWorkerAttach's problems go way deeper
than that --- in fact, that function should not exist at all IMO.

The way that launcher.c is set up, it's relying on either the calling
process or the called process to free the logicalrep slot when done.
The called process might never exist at all, so the second half of
that is obviously unreliable; but WaitForReplicationWorkerAttach
can hardly be relied on either, seeing it's got a big fat exit-on-
SIGINT in it.  I thought about putting a PG_TRY, or more likely
PG_ENSURE_ERROR_CLEANUP, into it, but that doesn't fix the basic
problem: you can't assume that the worker isn't ever going to start,
just because you got a signal that means you shouldn't wait anymore.
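
For illustration, the rejected approach would look roughly like this (a sketch
only; the callback name is made up, and the wait function's signature is
paraphrased):

    /* Hypothetical cleanup callback of the usual on-exit shape. */
    static void
    lrep_slot_cleanup(int code, Datum arg)
    {
        LogicalRepWorker *worker = (LogicalRepWorker *) DatumGetPointer(arg);

        LWLockAcquire(LogicalRepWorkerLock, LW_EXCLUSIVE);
        logicalrep_worker_cleanup(worker);
        LWLockRelease(LogicalRepWorkerLock);
    }

    PG_ENSURE_ERROR_CLEANUP(lrep_slot_cleanup, PointerGetDatum(worker));
    {
        WaitForReplicationWorkerAttach(worker, bgw_handle);
    }
    PG_END_ENSURE_ERROR_CLEANUP(lrep_slot_cleanup, PointerGetDatum(worker));

    /*
     * The hole remains: the forked worker can still attach to the slot
     * after we've errored out and freed it here.
     */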

Now, this rickety business might be fine if it were only about the
states of the caller and called processes, but we've got long-lived
shared data involved (ie the worker slots); failing to clean it up
is not an acceptable outcome.

So, frankly, I think we would be best off losing the "logical rep
worker slot" business altogether, and making do with just bgworker
slots.  The alternative is getting the postmaster involved in cleaning
up lrep slots as well as bgworker slots, and I'm going to resist
any such idea strongly.  That way madness lies, or at least an
unreliable postmaster --- if we need lrep slots, what other forty-seven
data structures are we going to need the postmaster to clean up
in the future?

I haven't studied the code enough to know why it grew lrep worker
slots in the first place.  Maybe someone could explain?
        regards, tom lane



Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Petr Jelinek
Date:
On 18/09/17 18:42, Tom Lane wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>> On Mon, Sep 18, 2017 at 7:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> The subscriber log includes
>>> 2017-09-18 08:43:08.240 UTC [15672] WARNING:  out of background worker slots
>>> Maybe that's harmless, but I'm suspicious that it's a smoking gun.
> 
>> I have noticed while fixing the issue you are talking that this path
>> is also susceptible to such problems.  In
>> WaitForReplicationWorkerAttach(), it relies on
>> GetBackgroundWorkerPid() to know the status of the worker which won't
>> give the correct status in case of fork failure.  The patch I have
>> posted has a fix for the issue,
> 
> Your GetBackgroundWorkerPid fix looks good as far as it goes, but
> I feel that WaitForReplicationWorkerAttach's problems go way deeper
> than that --- in fact, that function should not exist at all IMO.
> 
> The way that launcher.c is set up, it's relying on either the calling
> process or the called process to free the logicalrep slot when done.
> The called process might never exist at all, so the second half of
> that is obviously unreliable; but WaitForReplicationWorkerAttach
> can hardly be relied on either, seeing it's got a big fat exit-on-
> SIGINT in it.  I thought about putting a PG_TRY, or more likely
> PG_ENSURE_ERROR_CLEANUP, into it, but that doesn't fix the basic
> problem: you can't assume that the worker isn't ever going to start,
> just because you got a signal that means you shouldn't wait anymore.
> 
> Now, this rickety business might be fine if it were only about the
> states of the caller and called processes, but we've got long-lived
> shared data involved (ie the worker slots); failing to clean it up
> is not an acceptable outcome.
> 

We'll clean it up eventually if somebody requests creation of a new
logical replication worker and that slot is needed.  See the "garbage
collection" part in logicalrep_worker_launch().  I know it's not an
ideal solution, but it's the best one I could think of given the below.
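
(For reference, that garbage collection is roughly of this shape; a
paraphrase, not the exact source:)

    /*
     * Rough shape of the "garbage collection" pass in
     * logicalrep_worker_launch(): reclaim slots whose worker was marked
     * in-use but never managed to attach within the timeout.
     */
    if (worker == NULL)
    {
        for (i = 0; i < max_logical_replication_workers; i++)
        {
            LogicalRepWorker *w = &LogicalRepCtx->workers[i];

            if (w->in_use && !w->proc &&
                TimestampDifferenceExceeds(w->launch_time, now,
                                           wal_receiver_timeout))
            {
                elog(WARNING,
                     "logical replication worker took too long to start; canceled");
                logicalrep_worker_cleanup(w);
                worker = w;     /* reuse the reclaimed slot */
                break;
            }
        }
    }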

> So, frankly, I think we would be best off losing the "logical rep
> worker slot" business altogether, and making do with just bgworker
> slots.  The alternative is getting the postmaster involved in cleaning
> up lrep slots as well as bgworker slots, and I'm going to resist
> any such idea strongly.  That way madness lies, or at least an
> unreliable postmaster --- if we need lrep slots, what other forty-seven
> data structures are we going to need the postmaster to clean up
> in the future?
> 
> I haven't studied the code enough to know why it grew lrep worker
> slots in the first place.  Maybe someone could explain?
> 

I am not quite sure I understand this question; we need to store
additional info about the workers in shared memory, so we need slots for that.
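
(The per-worker info in question looks roughly like this, abridged from
src/include/replication/worker_internal.h:)

    typedef struct LogicalRepWorker
    {
        TimestampTz launch_time;    /* when the worker was started */
        bool        in_use;         /* is this slot used or free? */
        uint16      generation;     /* bumped each time the slot is reused */
        PGPROC     *proc;           /* NULL until the worker attaches */
        Oid         dbid;           /* database to connect to */
        Oid         userid;         /* connection user */
        Oid         subid;          /* subscription this worker serves */
        Oid         relid;          /* nonzero for table-sync workers */
        /* ... sync state, LSN positions, reply/ping timestamps ... */
    } LogicalRepWorker;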

If you are asking why they are not identified by the
BackgroundWorkerHandle, then it's because that's a private struct which
can't be shared with other processes, so there is no way to link the
logical worker info with a bgworker directly.  I mentioned that I was
not happy about this during the original development, but it didn't
gather any attention, which I took to mean that nobody minds.

-- 
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Amit Kapila
Date:
On Tue, Sep 19, 2017 at 3:34 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
> On 18/09/17 18:42, Tom Lane wrote:
>
>> So, frankly, I think we would be best off losing the "logical rep
>> worker slot" business altogether, and making do with just bgworker
>> slots.

I think that would be cleaner than what we have now.

>>  The alternative is getting the postmaster involved in cleaning
>> up lrep slots as well as bgworker slots, and I'm going to resist
>> any such idea strongly.  That way madness lies, or at least an
>> unreliable postmaster --- if we need lrep slots, what other forty-seven
>> data structures are we going to need the postmaster to clean up
>> in the future?
>>
>> I haven't studied the code enough to know why it grew lrep worker
>> slots in the first place.  Maybe someone could explain?
>>
>
> I am not quite sure I understand this question, we need to store
> additional info about workers in shared memory so we need slots for that.
>

Yeah, but you could have used the approach we use for parallel query,
where we set up a DSM segment and share all such information through
it.  You can check the logic of execParallel.c and parallel.c to see
how we do all such stuff for parallel query.
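
(A minimal sketch of that pattern; the magic number, key, and SharedState
payload are made up for illustration, and sizing via shm_toc_estimate() is
elided:)

    /* Establishing side: create a DSM segment with a table of contents. */
    #define MYMODULE_SHM_MAGIC  0x79fb2447          /* illustrative */
    #define MYMODULE_KEY_STATE  UINT64CONST(1)      /* illustrative */

    dsm_segment *seg = dsm_create(segsize, 0);
    shm_toc    *toc = shm_toc_create(MYMODULE_SHM_MAGIC,
                                     dsm_segment_address(seg), segsize);

    SharedState *state = shm_toc_allocate(toc, sizeof(SharedState));
    /* ... initialize *state ... */
    shm_toc_insert(toc, MYMODULE_KEY_STATE, state);

    /* Worker side: given the dsm_handle, map it and look the state up. */
    dsm_segment *wseg = dsm_attach(handle);
    shm_toc    *wtoc = shm_toc_attach(MYMODULE_SHM_MAGIC,
                                      dsm_segment_address(wseg));
    SharedState *wstate = shm_toc_lookup(wtoc, MYMODULE_KEY_STATE, false);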


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Petr Jelinek
Date:
On 19/09/17 14:33, Amit Kapila wrote:
> On Tue, Sep 19, 2017 at 3:34 PM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
>> On 18/09/17 18:42, Tom Lane wrote:
>>
>>> So, frankly, I think we would be best off losing the "logical rep
>>> worker slot" business altogether, and making do with just bgworker
>>> slots.
> 
> I think that would be cleaner as compared to what we have now.
> 
>>>  The alternative is getting the postmaster involved in cleaning
>>> up lrep slots as well as bgworker slots, and I'm going to resist
>>> any such idea strongly.  That way madness lies, or at least an
>>> unreliable postmaster --- if we need lrep slots, what other forty-seven
>>> data structures are we going to need the postmaster to clean up
>>> in the future?
>>>
>>> I haven't studied the code enough to know why it grew lrep worker
>>> slots in the first place.  Maybe someone could explain?
>>>
>>
>> I am not quite sure I understand this question, we need to store
>> additional info about workers in shared memory so we need slots for that.
>>
> 
> Yeah, but you could have used the way we do for parallel query where
> we setup dsm and share all such information.  You can check the logic
> of execparallel.c and parallel.c to see how we do all such stuff for
> parallel query.
> 

I don't understand how DSM helps in this case (except perhaps that the
memory allocation becomes dynamic rather than static).  We still need
this shared memory area to be accessible from places other than the
launcher (parallel query has one leader which manages everything;
that's not true for logical replication), and we need it to survive a
restart of the launcher, for example, so all the current issues will stay.

-- 
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Amit Kapila
Date:
On Tue, Sep 19, 2017 at 6:29 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
> On 19/09/17 14:33, Amit Kapila wrote:
>> On Tue, Sep 19, 2017 at 3:34 PM, Petr Jelinek
>> <petr.jelinek@2ndquadrant.com> wrote:
>>> On 18/09/17 18:42, Tom Lane wrote:
>>>
>>>> So, frankly, I think we would be best off losing the "logical rep
>>>> worker slot" business altogether, and making do with just bgworker
>>>> slots.
>>
>> I think that would be cleaner as compared to what we have now.
>>
>>>>  The alternative is getting the postmaster involved in cleaning
>>>> up lrep slots as well as bgworker slots, and I'm going to resist
>>>> any such idea strongly.  That way madness lies, or at least an
>>>> unreliable postmaster --- if we need lrep slots, what other forty-seven
>>>> data structures are we going to need the postmaster to clean up
>>>> in the future?
>>>>
>>>> I haven't studied the code enough to know why it grew lrep worker
>>>> slots in the first place.  Maybe someone could explain?
>>>>
>>>
>>> I am not quite sure I understand this question, we need to store
>>> additional info about workers in shared memory so we need slots for that.
>>>
>>
>> Yeah, but you could have used the way we do for parallel query where
>> we setup dsm and share all such information.  You can check the logic
>> of execparallel.c and parallel.c to see how we do all such stuff for
>> parallel query.
>>
>
> I don't understand how DSM helps in this case (except perhaps the memory
> allocation being dynamic rather than static). We still need this shared
> memory area to be accessible from other places than launcher (the
> parallel query has one leader which manages everything, that's not true
> for logical replication)
>

I am not much aware of this area.  Can you explain what other usages
it has, apart from in the process that launched the worker and in the
worker itself?

> and we need it to survive restart of launcher
> for example, so all the current issues will stay.
>

Do you mean to say that you want to save this part of shared memory
across a restart of the launcher?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Petr Jelinek
Date:
On 19/09/17 15:08, Amit Kapila wrote:
> On Tue, Sep 19, 2017 at 6:29 PM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
>> On 19/09/17 14:33, Amit Kapila wrote:
>>> On Tue, Sep 19, 2017 at 3:34 PM, Petr Jelinek
>>> <petr.jelinek@2ndquadrant.com> wrote:
>>>> On 18/09/17 18:42, Tom Lane wrote:
>>>>
>>>>> So, frankly, I think we would be best off losing the "logical rep
>>>>> worker slot" business altogether, and making do with just bgworker
>>>>> slots.
>>>
>>> I think that would be cleaner as compared to what we have now.
>>>
>>>>>  The alternative is getting the postmaster involved in cleaning
>>>>> up lrep slots as well as bgworker slots, and I'm going to resist
>>>>> any such idea strongly.  That way madness lies, or at least an
>>>>> unreliable postmaster --- if we need lrep slots, what other forty-seven
>>>>> data structures are we going to need the postmaster to clean up
>>>>> in the future?
>>>>>
>>>>> I haven't studied the code enough to know why it grew lrep worker
>>>>> slots in the first place.  Maybe someone could explain?
>>>>>
>>>>
>>>> I am not quite sure I understand this question, we need to store
>>>> additional info about workers in shared memory so we need slots for that.
>>>>
>>>
>>> Yeah, but you could have used the way we do for parallel query where
>>> we setup dsm and share all such information.  You can check the logic
>>> of execparallel.c and parallel.c to see how we do all such stuff for
>>> parallel query.
>>>
>>
>> I don't understand how DSM helps in this case (except perhaps the memory
>> allocation being dynamic rather than static). We still need this shared
>> memory area to be accessible from other places than launcher (the
>> parallel query has one leader which manages everything, that's not true
>> for logical replication)
>>
> 
> I am not much aware of this area.  Can you explain what other usages
> it has apart from in the process that has launched the worker and in
> worker itself?
> 

The subscription "DDL" commands and monitoring functions need access to
that info. Note that the synchronization workers are not even started by
launcher (I experimented with doing that but it slows the process of
synchronization quite considerably) so it can't manage them unless the
handle is accessible via IPC.

>> and we need it to survive restart of launcher
>> for example, so all the current issues will stay.
>>
> 
> Do you mean to say that you want to save this part of shared memory
> across restart of launcher?
> 

Yes.  There is no reason why replication should stop because of a
launcher restart.

-- 
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Amit Kapila
Date:
On Tue, Sep 19, 2017 at 3:34 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
> On 18/09/17 18:42, Tom Lane wrote:
>> Amit Kapila <amit.kapila16@gmail.com> writes:
>>> On Mon, Sep 18, 2017 at 7:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> The subscriber log includes
>>>> 2017-09-18 08:43:08.240 UTC [15672] WARNING:  out of background worker slots
>>>> Maybe that's harmless, but I'm suspicious that it's a smoking gun.
>>
>>> I have noticed while fixing the issue you are talking that this path
>>> is also susceptible to such problems.  In
>>> WaitForReplicationWorkerAttach(), it relies on
>>> GetBackgroundWorkerPid() to know the status of the worker which won't
>>> give the correct status in case of fork failure.  The patch I have
>>> posted has a fix for the issue,
>>
>> Your GetBackgroundWorkerPid fix looks good as far as it goes, but
>> I feel that WaitForReplicationWorkerAttach's problems go way deeper
>> than that --- in fact, that function should not exist at all IMO.
>>
>> The way that launcher.c is set up, it's relying on either the calling
>> process or the called process to free the logicalrep slot when done.
>> The called process might never exist at all, so the second half of
>> that is obviously unreliable; but WaitForReplicationWorkerAttach
>> can hardly be relied on either, seeing it's got a big fat exit-on-
>> SIGINT in it.  I thought about putting a PG_TRY, or more likely
>> PG_ENSURE_ERROR_CLEANUP, into it, but that doesn't fix the basic
>> problem: you can't assume that the worker isn't ever going to start,
>> just because you got a signal that means you shouldn't wait anymore.
>>
>> Now, this rickety business might be fine if it were only about the
>> states of the caller and called processes, but we've got long-lived
>> shared data involved (ie the worker slots); failing to clean it up
>> is not an acceptable outcome.
>>
>
> We'll clean it up eventually if somebody requests creation of new
> logical replication worker and that slot is needed. See the "garbage
> collection" part in logicalrep_worker_launch(). I know it's not ideal
> solution, but it's the best one I could think of given the below.
>
>> So, frankly, I think we would be best off losing the "logical rep
>> worker slot" business altogether, and making do with just bgworker
>> slots.  The alternative is getting the postmaster involved in cleaning
>> up lrep slots as well as bgworker slots, and I'm going to resist
>> any such idea strongly.  That way madness lies, or at least an
>> unreliable postmaster --- if we need lrep slots, what other forty-seven
>> data structures are we going to need the postmaster to clean up
>> in the future?
>>
>> I haven't studied the code enough to know why it grew lrep worker
>> slots in the first place.  Maybe someone could explain?
>>
>
> I am not quite sure I understand this question, we need to store
> additional info about workers in shared memory so we need slots for that.
>
> If you are asking why they are not identified by the
> BackgroundWorkerHandle, then it's because it's private struct and can't
> be shared with other processes so there is no way to link the logical
> worker info with bgworker directly.
>

Not sure, but can't we identify the actual worker slot if we just have
the pid of the background worker?  IIUC, you need access to the
BackgroundWorkerHandle from other processes so that you can stop them,
as in the case of the DROP SUBSCRIPTION command.  If so, then maybe
storing the pid can serve the purpose.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Amit Kapila
Date:
On Tue, Sep 19, 2017 at 6:51 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
> On 19/09/17 15:08, Amit Kapila wrote:
>>
>> I am not much aware of this area.  Can you explain what other usages
>> it has apart from in the process that has launched the worker and in
>> worker itself?
>>
>
> The subscription "DDL" commands and monitoring functions need access to
> that info.
>

I got your point, but I'm still not sure we need to maintain additional
information that replicates something similar to the bgworker machinery.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Petr Jelinek
Date:
On 19/09/17 16:30, Amit Kapila wrote:
> On Tue, Sep 19, 2017 at 3:34 PM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
>> On 18/09/17 18:42, Tom Lane wrote:
>>> Amit Kapila <amit.kapila16@gmail.com> writes:
>>>> On Mon, Sep 18, 2017 at 7:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> So, frankly, I think we would be best off losing the "logical rep
>>> worker slot" business altogether, and making do with just bgworker
>>> slots.  The alternative is getting the postmaster involved in cleaning
>>> up lrep slots as well as bgworker slots, and I'm going to resist
>>> any such idea strongly.  That way madness lies, or at least an
>>> unreliable postmaster --- if we need lrep slots, what other forty-seven
>>> data structures are we going to need the postmaster to clean up
>>> in the future?
>>>
>>> I haven't studied the code enough to know why it grew lrep worker
>>> slots in the first place.  Maybe someone could explain?
>>>
>>
>> I am not quite sure I understand this question, we need to store
>> additional info about workers in shared memory so we need slots for that.
>>
>> If you are asking why they are not identified by the
>> BackgroundWorkerHandle, then it's because it's private struct and can't
>> be shared with other processes so there is no way to link the logical
>> worker info with bgworker directly.
>>
> 
> Not sure, but can't we identify the actual worker slot if we just have
> pid of background worker?  IIUC, you need access to
> BackgroundWorkerHandle by other processes, so that you can stop them
> like in case of drop subscription command.  If so, then, might be
> storing pid can serve the purpose.
> 

I don't think the pid can be trusted to belong to the same process
between calls; if it could, we would not need BackgroundWorkerHandle.

-- 
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Thomas Munro
Date:
On Mon, Sep 18, 2017 at 10:18 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-09-18 21:57:04 +1200, Thomas Munro wrote:
>>     WARNING:  terminating connection because of crash of another server process
>>     DETAIL:  The postmaster has commanded this server process to roll
>> back the current transaction and exit, because another server process
>> exited abnormally and possibly corrupted shared memory.
>>     HINT:  In a moment you should be able to reconnect to the database
>> and repeat your command.
>>
>> As far as I know these are misleading, it's really just an immediate
>> shutdown.  There is no core file, so I don't believe anything actually
>> crashed.
>
> I was about to complain about these, for entirely unrelated reasons.  I
> think it's a bad idea to emit these warnings, and there are a couple of
> complaints about them on the lists too.  It's not entirely trivial to fix though :(

This type of violent shutdown seems to be associated with occasional
corruption of .gcda files (the files output by GCC coverage builds).
The symptoms are that if you use --enable-coverage and make
check-world you'll very occasionally get a spurious TAP test failure
like this:

#   Failed test 'pg_ctl start: no stderr'
#   at /home/travis/build/postgresql-cfbot/postgresql/src/bin/pg_ctl/../../../src/test/perl/TestLib.pm
line 301.
#          got:
'profiling:/home/travis/build/postgresql-cfbot/postgresql/src/backend/nodes/copyfuncs.gcda:Merge
mismatch for function 94
# '
#     expected: ''

I'm not sure of the exact mechanism though.  GCC supplies a function
__gcov_flush() that normally runs at exit or execve, so if you're
killed without reaching those you don't get any .gcda data.  Perhaps
we are in exit (or fork/exec) partway through writing out coverage
data in __gcov_flush(), and at that moment we are killed.  Then a
subsequent run of instrumented code will find the half-written file
and print the "Merge mismatch" message.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> This type of violent shutdown seems to be associated with occasional
> corruption of .gcda files (the files output by GCC coverage builds).
> The symptoms are that if you use --enable-coverage and make
> check-world you'll very occasionally get a spurious TAP test failure
> like this:

> #   Failed test 'pg_ctl start: no stderr'
> #   at /home/travis/build/postgresql-cfbot/postgresql/src/bin/pg_ctl/../../../src/test/perl/TestLib.pm
> line 301.
> #          got:
> 'profiling:/home/travis/build/postgresql-cfbot/postgresql/src/backend/nodes/copyfuncs.gcda:Merge
> mismatch for function 94
> # '
> #     expected: ''

> I'm not sure of the exact mechanism though.  GCC supplies a function
> __gcov_flush() that normally runs at exit or execve, so if you're
> killed without reaching those you don't get any .gcda data.  Perhaps
> we are in exit (or fork/exec) partway through writing out coverage
> data in __gcov_flush(), and at that moment we are killed.  Then a
> subsequent run of instrumented code will find the half-written file
> and print the "Merge mismatch" message.

On a slow/loaded machine, perhaps it could be that the postmaster loses
patience and SIGKILLs a backend that's still writing its .gcda data?
If so, maybe we could make SIGKILL_CHILDREN_AFTER_SECS longer in
coverage builds?  Or bite the bullet and make it configurable ...
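
(A sketch of the compile-time variant; the coverage guard symbol is
hypothetical, since --enable-coverage doesn't currently define one:)

    /* In postmaster.c: stretch the SIGKILL grace period for coverage builds. */
    #ifdef COVERAGE_BUILD                       /* hypothetical symbol */
    #define SIGKILL_CHILDREN_AFTER_SECS  30
    #else
    #define SIGKILL_CHILDREN_AFTER_SECS  5
    #endif
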
        regards, tom lane



Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Andres Freund
Date:
On 2017-09-19 17:20:49 -0400, Tom Lane wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
> > This type of violent shutdown seems to be associated with occasional
> > corruption of .gcda files (the files output by GCC coverage builds).
> > The symptoms are that if you use --enable-coverage and make
> > check-world you'll very occasionally get a spurious TAP test failure
> > like this:
> 
> > #   Failed test 'pg_ctl start: no stderr'
> > #   at /home/travis/build/postgresql-cfbot/postgresql/src/bin/pg_ctl/../../../src/test/perl/TestLib.pm
> > line 301.
> > #          got:
> > 'profiling:/home/travis/build/postgresql-cfbot/postgresql/src/backend/nodes/copyfuncs.gcda:Merge
> > mismatch for function 94
> > # '
> > #     expected: ''
> 
> > I'm not sure of the exact mechanism though.  GCC supplies a function
> > __gcov_flush() that normally runs at exit or execve, so if you're
> > killed without reaching those you don't get any .gcda data.  Perhaps
> > we are in exit (or fork/exec) partway through writing out coverage
> > data in __gcov_flush(), and at that moment we are killed.  Then a
> > subsequent run of instrumented code will find the half-written file
> > and print the "Merge mismatch" message.

Note that newer GCCs (7+) have a feature to avoid such corruption, by
renaming the files atomically.  Possibly the fix here is to just upgrade
to such a version...

Greetings,

Andres Freund



Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Craig Ringer
Date:
On 19 September 2017 at 18:04, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:

> If you are asking why they are not identified by the
> BackgroundWorkerHandle, then it's because it's private struct and can't
> be shared with other processes so there is no way to link the logical
> worker info with bgworker directly.

I really want BackgroundWorkerHandle to be public, strong +1 from me.

It'd be very beneficial when working with shm_mq, too. Right now you cannot pass a BackgroundWorkerHandle through shmem to another process, either via a static shmem region or via shm_mq. This means you can't supply it to shm_mq_attach and have shm_mq handle lifetime issues for you based on the worker handle.
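
(What the handle buys you there, roughly; a sketch that assumes you already
hold one:)

    /*
     * With a BackgroundWorkerHandle, shm_mq can detect a dead counterparty
     * instead of blocking forever in shm_mq_wait_internal().
     */
    shm_mq_handle *mqh = shm_mq_attach(mq, seg, handle);

    /* or, if you attached first and obtained the handle later: */
    shm_mq_set_handle(mqh, handle);

    /*
     * Now shm_mq_send()/shm_mq_receive() return SHM_MQ_DETACHED rather
     * than waiting forever if the other side dies before attaching.
     */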

TBH I think there's a fair bit of work needed in the bgworker infrastructure, but making BackgroundWorkerHandle public is a small and simple start that'd be a big help.

At some point we'll also want to be able to enumerate background workers and get handles for existing workers. Also, let background workers recover from errors without exiting, which means factoring a bunch of stuff out of PostgresMain. But both of those are bigger jobs.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Craig Ringer
Date:
On 19 September 2017 at 20:33, Amit Kapila <amit.kapila16@gmail.com> wrote:

> Yeah, but you could have used the way we do for parallel query where
> we setup dsm and share all such information.  You can check the logic
> of execparallel.c and parallel.c to see how we do all such stuff for
> parallel query.

Parallel query has a very clearly scoped lifetime, which seems to help. The parallel workers are started by a leader, whose lifetime entirely encompasses that of the workers. They're strictly children of the leader process.

In logical replication, the logical replication manager doesn't start the walsenders, they're started by the postmaster in response to inbound connections.

But the logical replication launcher does start the managers, and the managers start the apply workers (IIRC). If we don't mind the whole thing dying and restarting if the launcher dies, or workers for a db dying if a database manager dies, then using dsm segments and detach notifications does seem viable.
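
(The detach notifications mentioned here are the on_dsm_detach() callbacks
from dsm.c; a minimal sketch, with illustrative names:)

    /*
     * Runs in *this* backend when it detaches from the segment, including
     * during error or exit cleanup; shm_mq uses exactly this mechanism to
     * mark a queue detached and wake the peer.
     */
    static void
    my_detach_callback(dsm_segment *seg, Datum arg)
    {
        MySharedState *st = (MySharedState *) DatumGetPointer(arg);  /* hypothetical */

        st->alive = false;          /* peers polling this can bail out */
        SetLatch(&st->peer_latch);  /* wake anyone waiting on us */
    }

    on_dsm_detach(seg, my_detach_callback, PointerGetDatum(st));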

IIRC when we did something similar in pglogical we ran into problems with (a) inability to handle an ERROR in a bgworker without exiting and being restarted by the postmaster; and (b) the postmaster remembering bgworker registrations across crash restart with no way to tell it not to. Maybe Petr remembers the details?

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Tom Lane
Date:
Craig Ringer <craig@2ndquadrant.com> writes:
> On 19 September 2017 at 18:04, Petr Jelinek <petr.jelinek@2ndquadrant.com>
> wrote:
>> If you are asking why they are not identified by the
>> BackgroundWorkerHandle, then it's because it's private struct and can't
>> be shared with other processes so there is no way to link the logical
>> worker info with bgworker directly.

> I really want BackgroundWorkerHandle to be public, strong +1 from me.

I'm confused about what you think that would accomplish.  AFAICS, the
point of BackgroundWorkerHandle is to allow the creator/requestor of
a bgworker to verify whether or not the slot in question is still
"owned" by its request.  This is necessarily not useful to any other
process, since they didn't make the request.

The thought I had in mind upthread was to get rid of logicalrep slots
in favor of expanding the underlying bgworker slot with some additional
fields that would carry whatever extra info we need about a logicalrep
worker.  Such fields could also be repurposed for additional info about
other kinds of bgworkers, when (not if) we find out we need that.
        regards, tom lane



Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Amit Kapila
Date:
On Wed, Sep 20, 2017 at 9:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Craig Ringer <craig@2ndquadrant.com> writes:
>> On 19 September 2017 at 18:04, Petr Jelinek <petr.jelinek@2ndquadrant.com>
>> wrote:
>>> If you are asking why they are not identified by the
>>> BackgroundWorkerHandle, then it's because it's private struct and can't
>>> be shared with other processes so there is no way to link the logical
>>> worker info with bgworker directly.
>
>> I really want BackgroundWorkerHandle to be public, strong +1 from me.
>
> I'm confused about what you think that would accomplish.  AFAICS, the
> point of BackgroundWorkerHandle is to allow the creator/requestor of
> a bgworker to verify whether or not the slot in question is still
> "owned" by its request.
>

Right, but if we have access to the BackgroundWorkerHandle, can we
avoid maintaining the additional information (in_use, generation, ...)
in LogicalRepWorker that duplicates the bgworker machinery (and in turn
avoid all the housekeeping for those variables)?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Craig Ringer
Date:
On 20 September 2017 at 11:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Craig Ringer <craig@2ndquadrant.com> writes:
>> On 19 September 2017 at 18:04, Petr Jelinek <petr.jelinek@2ndquadrant.com>
>> wrote:
>>> If you are asking why they are not identified by the
>>> BackgroundWorkerHandle, then it's because it's private struct and can't
>>> be shared with other processes so there is no way to link the logical
>>> worker info with bgworker directly.
>
>> I really want BackgroundWorkerHandle to be public, strong +1 from me.
>
> I'm confused about what you think that would accomplish.  AFAICS, the
> point of BackgroundWorkerHandle is to allow the creator/requestor of
> a bgworker to verify whether or not the slot in question is still
> "owned" by its request.  This is necessarily not useful to any other
> process, since they didn't make the request.

I'm using shm_mq in a sort of "connection accepter" role, where a pool of shm_mq's are attached to by a long lived bgworker. Other backends, both bgworkers and normal user backends, can find a free slot and attach to it to talk to the long lived bgworker. These other backends are not necessarily started by the long lived worker, so it doesn't have a BackgroundWorkerHandle for them.

Currently, if the long lived bgworker dies and a peer attempts to attach to the slot, they'll hang forever in shm_mq_wait_internal, because it cannot use shm_mq_set_handle as described in https://www.postgresql.org/message-id/E1XbwOs-0002Fd-H9%40gemulon.postgresql.org to protect itself.


If the BackgroundWorkerHandle for the long-lived bgworker is copied to a small static control shmem segment, the connecting workers can use that to reliably bail out if the long-lived worker dies.
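
(Roughly the bail-out a connecting backend would then be able to do; a sketch
assuming a hypothetical shared control struct "ctl" holding the copied handle:)

    /* Wait for the pool worker to attach, but notice if it's dead. */
    for (;;)
    {
        pid_t   pid;

        if (shm_mq_get_receiver(mq) != NULL)
            break;                              /* pool worker attached */

        if (GetBackgroundWorkerPid(ctl->pool_worker_handle, &pid) == BGWH_STOPPED)
            ereport(ERROR,
                    (errmsg("connection-pool worker is gone, giving up")));

        (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT, 1000L,
                         WAIT_EVENT_MQ_INTERNAL);
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }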
 
> The thought I had in mind upthread was to get rid of logicalrep slots
> in favor of expanding the underlying bgworker slot with some additional
> fields that would carry whatever extra info we need about a logicalrep
> worker.  Such fields could also be repurposed for additional info about
> other kinds of bgworkers, when (not if) we find out we need that.

That sounds OK to me personally for in-core logical rep, but it's really Petr and Peter who need to have a say here, not me.
 
--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Craig Ringer
Date:
On 20 September 2017 at 12:06, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Sep 20, 2017 at 9:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Craig Ringer <craig@2ndquadrant.com> writes:
>>> On 19 September 2017 at 18:04, Petr Jelinek <petr.jelinek@2ndquadrant.com>
>>> wrote:
>>>> If you are asking why they are not identified by the
>>>> BackgroundWorkerHandle, then it's because it's private struct and can't
>>>> be shared with other processes so there is no way to link the logical
>>>> worker info with bgworker directly.
>>
>>> I really want BackgroundWorkerHandle to be public, strong +1 from me.
>>
>> I'm confused about what you think that would accomplish.  AFAICS, the
>> point of BackgroundWorkerHandle is to allow the creator/requestor of
>> a bgworker to verify whether or not the slot in question is still
>> "owned" by its request.
>
> Right, but can we avoid maintaining additional information (in_use,
> generation,..) in LogicalRepWorker which is similar to bgworker worker
> machinery (which in turn can also avoid all the housekeeping for those
> variables) if we have access to BackgroundWorkerHandle?

As far as I can see, probably not.

Because struct BackgroundWorker only has a single by-value Datum argument, you can't pass much more information to a bgworker than "here's a slot number to look up what you need to know to configure yourself". That can be a ToC position for a shm_mq, or some kind of shmem array, but you need something. You can't pass a pointer because of ASLR and EXEC_BACKEND. If the launcher lives significantly longer than its workers, where a worker dying isn't fatal to the launcher, it needs to be able to re-use slots in the fixed-size shmem array workers use to self-configure. Which means some kind of generation counter, like we have in BackgroundWorkerHandle, and an in_use flag.
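
(Concretely, the handshake is limited to something like this; the slot array
and names are illustrative:)

    /* Launcher side: the single Datum can carry only a slot index. */
    BackgroundWorker bgw;
    memset(&bgw, 0, sizeof(bgw));
    bgw.bgw_flags = BGWORKER_SHMEM_ACCESS;
    bgw.bgw_start_time = BgWorkerStart_RecoveryFinished;
    snprintf(bgw.bgw_library_name, BGW_MAXLEN, "mymodule");
    snprintf(bgw.bgw_function_name, BGW_MAXLEN, "my_worker_main");
    snprintf(bgw.bgw_name, BGW_MAXLEN, "my worker %d", slotno);
    bgw.bgw_main_arg = Int32GetDatum(slotno);   /* all we get to pass */
    bgw.bgw_notify_pid = MyProcPid;

    RegisterDynamicBackgroundWorker(&bgw, &handle);

    /* Worker side: everything else comes from the shmem slot array. */
    void
    my_worker_main(Datum main_arg)
    {
        int     slotno = DatumGetInt32(main_arg);
        MySlot *slot = &MyCtl->slots[slotno];   /* hypothetical array */

        SpinLockAcquire(&slot->mutex);
        slot->proc = MyProc;                    /* attach: we're alive */
        SpinLockRelease(&slot->mutex);

        /* ... main loop ... */
    }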

Knowing a logical worker's bgworker has started isn't enough. WaitForReplicationWorkerAttach needs to know it has come up successfully as a logical replication worker. To do that, it needs to know that the entry it's looking at in LogicalRepWorker isn't for some prior logical replication worker that previously had the same LogicalRepWorker slot, though not necessarily the same BackgroundWorker slot. There's no 1:1 mapping of LogicalRepWorker slot to BackgroundWorker slot to rely on.

So unless we're willing to extend struct BackgroundWorker with some kind of union that can be used to store extra info for logical rep workers and whatever else we might later want, I don't see LogicalRepWorker going away. If we do that, it'll make extending the shmem for logical replication workers harder since it'll grow the entry for every background worker, not just logical replication workers.

Parallel query gets away without most of this because as far as I can see parallel workers don't need to enumerate each other or look up each others' state, which logical rep can need to do for things like initial table sync. It doesn't appear to re-launch workers unless it has a clean slate and can clobber the entire parallel DSM context, either. So it doesn't need to worry about whether the worker it's talking to is the one it just launched, or some old worker that hasn't gone away yet. The DSM segment acts as a generation counter of sorts, with all workers in the same generation.


If we wanted to do away with LogicalRepWorker shmem slots, I suspect we'd have to broker all inter-worker communications via shm_mq pairs, where worker A asks the launcher for the status of worker B, or to block until worker B does X, or whatever. Do everything over shm_mq messaging not shmem variables. That's a pretty major change, and makes the launcher responsible for all synchronisation and IPC. It also wouldn't work properly unless we nuked the DSM segment containing all the queues whenever any worker died and restarted the whole lot, which would be terribly costly in terms of restarting transaction replays etc. Or we'd have to keep some kind of per-shm_mq-pair "in use" flag so we knew when to nuke and reset the queue pair for a new worker to connect, at which point we're halfway back to where we started...


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Craig Ringer
Date:
On 20 September 2017 at 12:16, Craig Ringer <craig@2ndquadrant.com> wrote:

>> The thought I had in mind upthread was to get rid of logicalrep slots
>> in favor of expanding the underlying bgworker slot with some additional
>> fields that would carry whatever extra info we need about a logicalrep
>> worker.  Such fields could also be repurposed for additional info about
>> other kinds of bgworkers, when (not if) we find out we need that.
>
> That sounds OK to me personally for in-core logical rep, but it's really Petr and Peter who need to have a say here, not me.


Actually, I take that back. It'd bloat struct BackgroundWorker significantly, and lead to arguments whenever logical replication needed new fields, which it surely will. Every bgworker would pay the price. If we added some kind of union to struct BackgroundWorker, other worker types could later use the same space, offsetting the cost somewhat. But that'd be no use to out-of-core workers, workers that don't actually need the extra room, etc.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Alvaro Herrera
Date:
Craig Ringer wrote:

> IIRC when we did something similar in pglogical we ran into problems with
> (a) inability to handle an ERROR in a bgworker without exiting and being
> restarted by the postmaster;

I don't understand why this would be; surely you can just set up another
longjmp block for elog(ERROR) to jump to?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Petr Jelinek
Date:
On 20/09/17 05:53, Tom Lane wrote:
> Craig Ringer <craig@2ndquadrant.com> writes:
>> On 19 September 2017 at 18:04, Petr Jelinek <petr.jelinek@2ndquadrant.com>
>> wrote:
>>> If you are asking why they are not identified by the
>>> BackgroundWorkerHandle, then it's because it's private struct and can't
>>> be shared with other processes so there is no way to link the logical
>>> worker info with bgworker directly.
> 
>> I really want BackgroundWorkerHandle to be public, strong +1 from me.
> 
> I'm confused about what you think that would accomplish.  AFAICS, the
> point of BackgroundWorkerHandle is to allow the creator/requestor of
> a bgworker to verify whether or not the slot in question is still
> "owned" by its request.  This is necessarily not useful to any other
> process, since they didn't make the request.

It works pretty well as a unique identifier of the bgworker, which can
be used to check that a specific bgworker process is the one we think
it is, and that would work even across processes if it could be shared.
Bgworker slots don't help with that, as they can be replaced by a
different worker between calls; the same goes for the PID.  It
technically does not have to be a public struct, just
serializable/copyable, so knowing the size of the struct ought to be
enough for that.
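
(I.e., something like this; the size function is hypothetical, since today
the struct is private to bgworker.c:)

    /* Launcher publishes the handle, as opaque bytes, into shared memory. */
    Size    sz = GetBackgroundWorkerHandleSize();   /* does not exist today */
    memcpy(ctl->handle_bytes, bgw_handle, sz);

    /* Any other backend can copy it back out and use it normally. */
    BackgroundWorkerHandle *h = palloc(sz);
    memcpy(h, ctl->handle_bytes, sz);

    pid_t           pid;
    BgwHandleStatus status = GetBackgroundWorkerPid(h, &pid);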

> The thought I had in mind upthread was to get rid of logicalrep slots
> in favor of expanding the underlying bgworker slot with some additional
> fields that would carry whatever extra info we need about a logicalrep
> worker.  Such fields could also be repurposed for additional info about
> other kinds of bgworkers, when (not if) we find out we need that.
> 

Well, that would be nice, but those slots are in shared memory, so it's
not exactly easy to allow other modules to add arbitrary data there.
Maybe we could add just one field that holds a DSM handle (or DSA
handle) pointing to whatever dynamic shared memory is allocated for
that slot.  I don't know enough about the inner workings of DSM to say
for sure whether that's feasible, though.

-- 
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

