Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

From: Craig Ringer
Subject: Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment
Date:
Msg-id: CAMsr+YF2TkbifMwxCE4DxMtTN=B5M_uHBLehBW6C4o_xu9yWKQ@mail.gmail.com
In reply to: Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment  (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
On 20 September 2017 at 12:06, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Sep 20, 2017 at 9:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Craig Ringer <craig@2ndquadrant.com> writes:
>> On 19 September 2017 at 18:04, Petr Jelinek <petr.jelinek@2ndquadrant.com>
>> wrote:
>>> If you are asking why they are not identified by the
>>> BackgroundWorkerHandle, then it's because it's private struct and can't
>>> be shared with other processes so there is no way to link the logical
>>> worker info with bgworker directly.
>
>> I really want BackgroundWorkerHandle to be public, strong +1 from me.
>
> I'm confused about what you think that would accomplish.  AFAICS, the
> point of BackgroundWorkerHandle is to allow the creator/requestor of
> a bgworker to verify whether or not the slot in question is still
> "owned" by its request.
>

Right, but can we avoid maintaining additional information (in_use,
generation,..) in LogicalRepWorker which is similar to bgworker worker
machinery (which in turn can also avoid all the housekeeping for those
variables) if we have access to BackgroundWorkerHandle?

As far as I can see, probably not.

Because struct BackgroundWorker only has a single by-value Datum argument, you can't pass much more information to a bgworker than "here's a slot number; look up what you need to know to configure yourself". That can be a ToC position for a shm_mq, or some kind of shmem array, but you need something. You can't pass a pointer because of ASLR and EXEC_BACKEND. If the launcher lives significantly longer than its workers (a worker dying isn't fatal to the launcher), it needs to be able to re-use slots in the fixed-size shmem array workers use to self-configure. That means some kind of generation counter, like the one in BackgroundWorkerHandle, plus an in_use flag.
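
For concreteness, this is roughly the shape of what logicalrep_worker_launch() has to do today (paraphrased from memory, not verbatim; treat the details as approximate):

    BackgroundWorker bgw;
    BackgroundWorkerHandle *handle;

    memset(&bgw, 0, sizeof(bgw));
    bgw.bgw_flags = BGWORKER_SHMEM_ACCESS |
                    BGWORKER_BACKEND_DATABASE_CONNECTION;
    bgw.bgw_start_time = BgWorkerStart_RecoveryFinished;
    bgw.bgw_restart_time = BGW_NEVER_RESTART;
    snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres");
    snprintf(bgw.bgw_function_name, BGW_MAXLEN, "ApplyWorkerMain");
    snprintf(bgw.bgw_name, BGW_MAXLEN, "logical replication worker");
    /* the only payload we get to pass: an index into our shmem array */
    bgw.bgw_main_arg = Int32GetDatum(slot);
    bgw.bgw_notify_pid = MyProcPid;

    if (!RegisterDynamicBackgroundWorker(&bgw, &handle))
        return false;        /* no free bgworker slots */

and the worker does the reverse:

    void
    ApplyWorkerMain(Datum main_arg)
    {
        int     slot = DatumGetInt32(main_arg);

        /* everything else has to come from that shmem entry */
        MyLogicalRepWorker = &LogicalRepCtx->workers[slot];
        /* attach, then proceed as an apply or sync worker */
    }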

Knowing a logical worker's bgworker has started isn't enough. WaitForReplicationWorkerAttach needs to know it has come up successfully as a logical replication worker. To do that, it has to be sure the entry it's looking at in the LogicalRepWorker array isn't left over from some prior logical replication worker that happened to get the same LogicalRepWorker slot, though not necessarily the same BackgroundWorker slot. There's no 1:1 mapping of LogicalRepWorker slot to BackgroundWorker slot to rely on.
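
Roughly, per iteration it has to check something like this (a simplified sketch of the logic, not the actual code; assume `generation` is the value the launcher recorded when it claimed the slot):

    /* with LogicalRepWorkerLock held */
    if (!worker->in_use || worker->generation != generation)
        return false;      /* slot was freed and recycled for someone else */

    if (worker->proc != NULL)
        return true;       /* it attached as a logical rep worker */

    /*
     * Not attached yet: ask the bgworker machinery, via
     * GetBackgroundWorkerPid(handle, &pid), whether it died before
     * attaching; otherwise release the lock, sleep, and retry.
     */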

So unless we're willing to extend struct BackgroundWorker with some kind of union that can be used to store extra info for logical rep workers and whatever else we might later want, I don't see LogicalRepWorker going away. If we do that, it'll make extending the shmem for logical replication workers harder since it'll grow the entry for every background worker, not just logical replication workers.
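
To be explicit, I mean something like this (entirely hypothetical, not a proposal):

    typedef struct BackgroundWorker
    {
        /* ... existing fields: bgw_name, bgw_flags, bgw_main_arg, ... */
        union
        {
            LogicalRepWorker logicalrep;
            /* space for other worker types' private state later */
        }           bgw_private;        /* hypothetical field */
    } BackgroundWorker;

Every BackgroundWorkerSlot in shmem would then carry that union, whether or not it belongs to a logical replication worker.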

Parallel query gets away without most of this because, as far as I can see, parallel workers don't need to enumerate each other or look up each other's state, which logical rep can need to do for things like initial table sync. It also doesn't appear to re-launch workers unless it has a clean slate and can clobber the entire parallel DSM context. So it doesn't need to worry about whether the worker it's talking to is the one it just launched or some old worker that hasn't gone away yet. The DSM segment acts as a generation counter of sorts, with all workers in the same generation.


If we wanted to do away with LogicalRepWorker shmem slots, I suspect we'd have to broker all inter-worker communication via shm_mq pairs, where worker A asks the launcher for the status of worker B, or asks to block until worker B does X, or whatever: do everything over shm_mq messaging, not shmem variables. That's a pretty major change, and it makes the launcher responsible for all synchronisation and IPC. It also wouldn't work properly unless we nuked the DSM segment containing all the queues whenever any worker died, and restarted the whole lot, which would be terribly costly in terms of restarting transaction replay etc. Or we'd have to keep some kind of per-shm_mq-pair "in use" flag so we knew when to nuke and reset the queue pair for a new worker to connect, at which point we're halfway back to where we started...
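
To give an idea of the shape of that, every "what's the state of worker B" question becomes a round trip over the asking worker's queue pair to the launcher, something like (the message structs are made up for illustration; the shm_mq calls are the real API):

    /* in worker A; to_launcher/from_launcher are its shm_mq handles */
    WorkerStatusRequest req;            /* hypothetical message type */
    WorkerStatusReply  *reply;
    Size        nbytes;
    void       *data;

    req.type = WORKER_STATUS_REQUEST;
    req.subid = subid;
    req.relid = relid;

    /* a real version must check both results for SHM_MQ_DETACHED */
    (void) shm_mq_send(to_launcher, sizeof(req), &req, false);
    (void) shm_mq_receive(from_launcher, &nbytes, &data, false);
    reply = (WorkerStatusReply *) data;

    /*
     * The launcher is then the only process that reads or writes worker
     * state, answering these requests from its main loop.
     */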


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
