Discussion: Minor postmaster state machine bugs

Minor postmaster state machine bugs

From
Tom Lane
Date:
In pursuit of the problem with standby servers sometimes not responding
to fast shutdowns [1], I spent a while staring at the postmaster's
state-machine logic.  I have not found a cause for that problem,
but I have identified some other things that seem like bugs:


1. sigusr1_handler ignores PMSIGNAL_ADVANCE_STATE_MACHINE unless the
current state is PM_WAIT_BACKUP or PM_WAIT_BACKENDS.  This restriction
seems useless and shortsighted: PostmasterStateMachine should behave
sanely regardless of our state, and sigusr1_handler really has no
business assuming anything about why a child is asking for a state
machine reconsideration.  But it's not just not future-proof, it's a
live bug even for the one existing use-case, which is that a new
walsender sends this signal after it's re-marked itself as being a
walsender rather than a normal backend.  Consider this sequence of
events:
* System is running as a hot standby and allowing cascaded replication.
There are no live backends.
* New replication connection request is received and forked off.
(At this point the postmaster thinks this child is a normal session
backend.)
* SIGTERM (Smart Shutdown) is received.  Postmaster will transition
to PM_WAIT_READONLY.  I don't think it would have autovac or bgworker
or bgwriter or walwriter children, but if so, assume they all exit
before the next step.  Postmaster will continue to sleep, waiting for
its one "normal" child backend to finish.
* Replication connection request completes, so child re-marks itself
as a walsender and sends PMSIGNAL_ADVANCE_STATE_MACHINE.
* Postmaster ignores signal because it's in the "wrong" state, so it
doesn't realize it now has no normal backend children.
* Postmaster waits forever, or at least till DBA loses patience and
sends a stronger signal.

This scenario doesn't explain the buildfarm failures since those don't
involve smart shutdowns (and I think they don't involve cascaded
replication either).  Still, it's clearly a bug, which I think
we should fix by removing the pointless restriction on whether
PostmasterStateMachine can be called.
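The sequence above can be replayed in a toy model. This is only a sketch, not the postmaster.c code: the TOY_-prefixed names, the two counters, and the single modeled transition (leaving PM_WAIT_READONLY once no normal backends remain) are all simplifying assumptions made for illustration. The restrict_by_state flag switches between the current signal filter and the proposed unconditional call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy subset of postmaster states. The TOY_ prefix marks these as
 * illustrative; their semantics are simplified relative to postmaster.c. */
typedef enum
{
    TOY_PM_HOT_STANDBY,
    TOY_PM_WAIT_BACKUP,
    TOY_PM_WAIT_READONLY,
    TOY_PM_WAIT_BACKENDS
} ToyPMState;

typedef struct
{
    ToyPMState state;
    int normal_backends;    /* children currently counted as normal sessions */
    int walsenders;         /* children that have re-marked themselves */
} ToyPostmaster;

/* Stand-in for PostmasterStateMachine(): models only the step out of
 * PM_WAIT_READONLY, which requires that no normal backends remain. */
void toy_state_machine(ToyPostmaster *pm)
{
    assert(pm != NULL);
    if (pm->state == TOY_PM_WAIT_READONLY && pm->normal_backends == 0)
        pm->state = TOY_PM_WAIT_BACKENDS;
}

/* Stand-in for the PMSIGNAL_ADVANCE_STATE_MACHINE branch of sigusr1_handler.
 * With restrict_by_state the signal is honored only in the two states named
 * in the report; without it the state machine always runs. */
void toy_sigusr1_advance(ToyPostmaster *pm, bool restrict_by_state)
{
    assert(pm != NULL);
    if (!restrict_by_state ||
        pm->state == TOY_PM_WAIT_BACKUP ||
        pm->state == TOY_PM_WAIT_BACKENDS)
        toy_state_machine(pm);
}

/* Replays the reported sequence and returns the state the postmaster is
 * left in after the walsender's signal has been handled. */
ToyPMState toy_run_scenario(bool restrict_by_state)
{
    /* New replication connection forked off; still counted as normal. */
    ToyPostmaster pm = { TOY_PM_HOT_STANDBY, 1, 0 };

    /* Smart shutdown arrives; one "normal" child keeps us waiting. */
    pm.state = TOY_PM_WAIT_READONLY;
    toy_state_machine(&pm);

    /* Child re-marks itself as a walsender and signals the postmaster. */
    pm.normal_backends--;
    pm.walsenders++;
    toy_sigusr1_advance(&pm, restrict_by_state);

    return pm.state;
}
```

With the filter in place the toy postmaster stays in TOY_PM_WAIT_READONLY; with the filter removed it advances to TOY_PM_WAIT_BACKENDS.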

Also, I'm inclined to think that that should be the *last* step in
sigusr1_handler, not randomly somewhere in the middle.  As coded,
it's basically assuming that no later action in sigusr1_handler
could affect anything that PostmasterStateMachine cares about, which
even if it's true today is another highly not-future-proof assumption.


2. MaybeStartWalReceiver will clear the WalReceiverRequested flag
even if it fails to launch a child process for some reason.  This
is just dumb; it should leave the flag set so that we'll try again
next time through the postmaster's idle loop.
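A minimal sketch of the intended behavior, with the flag cleared only after a successful launch. The ToyLauncher struct, the callback type, and the two stub launchers are hypothetical stand-ins for the postmaster's globals and its real child-start routine; the real function also checks the postmaster state, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef int toy_pid_t;

/* Hypothetical stand-in for the postmaster's walreceiver bookkeeping. */
typedef struct
{
    bool wal_receiver_requested;    /* someone asked for a walreceiver */
    toy_pid_t wal_receiver_pid;     /* 0 means no child is running */
} ToyLauncher;

/* Hypothetical launcher callback: returns the child PID, or 0 on failure. */
typedef toy_pid_t (*toy_start_fn)(void);

/* Stub launchers used to exercise both outcomes. */
toy_pid_t toy_start_fails(void) { return 0; }
toy_pid_t toy_start_ok(void)    { return 4242; }

void toy_maybe_start_walreceiver(ToyLauncher *l, toy_start_fn start)
{
    assert(l != NULL && start != NULL);

    if (l->wal_receiver_pid == 0 && l->wal_receiver_requested)
    {
        l->wal_receiver_pid = start();
        if (l->wal_receiver_pid != 0)
            l->wal_receiver_requested = false;
        /* else leave the flag set, so the next idle-loop pass retries */
    }
}
```

Calling the function first with the failing stub and then with the succeeding one shows the request surviving the failed attempt and being consumed only by the successful one.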


3. PostmasterStateMachine's handling of PM_SHUTDOWN_2 is:

    if (pmState == PM_SHUTDOWN_2)
    {
        /*
         * PM_SHUTDOWN_2 state ends when there's no other children than
         * dead_end children left. There shouldn't be any regular backends
         * left by now anyway; what we're really waiting for is walsenders and
         * archiver.
         *
         * Walreceiver should normally be dead by now, but not when a fast
         * shutdown is performed during recovery.
         */
        if (PgArchPID == 0 && CountChildren(BACKEND_TYPE_ALL) == 0 &&
            WalReceiverPID == 0)
        {
            pmState = PM_WAIT_DEAD_END;
        }
    }

The comment about walreceivers is confusing, and it's also wrong.  Maybe
it was valid when written, but today it's easy to trace the logic and see
that we can only get to PM_SHUTDOWN_2 state from PM_SHUTDOWN state, and
we can only get to PM_SHUTDOWN state when there is no live walreceiver
(cf processing of PM_WAIT_BACKENDS state), and we won't attempt to launch
a new walreceiver while in PM_SHUTDOWN or PM_SHUTDOWN_2 state, so it's
impossible for there to be any walreceiver here.  I think we should just
remove that comment and the WalReceiverPID == 0 test.


Comments?  I think at least the first two points need to be back-patched.

            regards, tom lane

[1] https://www.postgresql.org/message-id/20190416070119.GK2673@paquier.xyz



How and at what stage to stop FDW to generate plan with JOIN.

From
Ibrar Ahmed
Date:

Hi,

I am working on an FDW for a database that does not support any operator other than "=" in a JOIN condition. Some queries are generating a plan whose pushed-down JOIN uses the "<" operator. How, and at what stage, can I stop the FDW from making such a plan? Here is my sample query.



tpch=# select
    l_orderkey,
    sum(l_extendedprice * (1 - l_discount)) as revenue,
    o_orderdate,
    o_shippriority
from
    customer,
    orders,
    lineitem
where
    c_mktsegment = 'BUILDING'
    and c_custkey = o_custkey
    and l_orderkey = o_orderkey
    and o_orderdate < date '1995-03-22'
    and l_shipdate > date '1995-03-22'
group by
    l_orderkey,
    o_orderdate,
    o_shippriority
order by
    revenue,
    o_orderdate
LIMIT 10;



QUERY PLAN
...
Merge Cond: (orders.o_orderkey = lineitem.l_orderkey)
->  Foreign Scan  (cost=1.00..-1.00 rows=1000 width=50)
Output: orders.o_orderdate, orders.o_shippriority, orders.o_orderkey
Relations: (customer) INNER JOIN (orders)
Remote SQL: SELECT r2.o_orderdate, r2.o_shippriority, r2.o_orderkey FROM  db.customer r1 ALL INNER JOIN db.orders r2 ON (((r1.c_custkey = r2.o_custkey)) AND ((r2.o_orderdate < '1995-03-22')) AND ((r1.c_mktsegment = 'BUILDING'))) ORDER BY r2.o_orderkey, r2.o_orderdate, r2.o_shippriority
...


--

Ibrar Ahmed



Re: How and at what stage to stop FDW to generate plan with JOIN.

From
Tom Lane
Date:
Ibrar Ahmed <ibrar.ahmad@gmail.com> writes:
> I am working on an FDW where the database does not support any operator
> other than "=" in JOIN condition. Some queries are generating the plan with
> JOIN having "<" operator. How and at what stage I can stop FDW to not make
> such a plan. Here is my sample query.

What exactly do you think should happen instead?  You can't just tell
users not to ask such a query.  (Well, you can try, but they'll probably
go looking for a less broken FDW.)

If what you really mean is you don't want to generate pushed-down
foreign join paths containing non-equality conditions, the answer is
to just not do that.  That'd be the FDW's own fault, not that of
the core planner, if it creates a path representing a join it
can't actually implement.  You'll end up running the join locally,
which might not be great, but if you have no other alternative
then that's what you gotta do.

If what you mean is you don't know how to inspect the join quals
to see if you can implement them, take a look at postgres_fdw
to see how it handles the same issue.
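As a rough illustration of "just not do that", the check can be thought of as a filter that runs before the FDW offers a pushed-down join path. The sketch below is a toy model only: ToyJoinClause and toy_join_is_pushable are hypothetical names, and a flat (column, operator, column) triple stands in for the expression trees that a real FDW would have to inspect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, flattened join clause. Real planner clauses are expression
 * trees; this triple exists only to illustrate the decision. */
typedef struct
{
    const char *left;
    const char *op;
    const char *right;
} ToyJoinClause;

/* Returns true only if every clause uses "=", i.e. the remote side could
 * evaluate the whole join. When it returns false the FDW would simply not
 * offer a pushed-down join path, and the join would run locally instead. */
bool toy_join_is_pushable(const ToyJoinClause *clauses, size_t n)
{
    assert(clauses != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        if (strcmp(clauses[i].op, "=") != 0)
            return false;
    }
    return true;
}
```

For the sample query, a clause list holding only c_custkey = o_custkey would pass, while one that also carries o_orderdate < '1995-03-22' would be rejected for push-down.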

            regards, tom lane



Re: How and at what stage to stop FDW to generate plan with JOIN.

From
Ibrar Ahmed
Date:


On Wed, Apr 24, 2019 at 1:15 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ibrar Ahmed <ibrar.ahmad@gmail.com> writes:
> > I am working on an FDW where the database does not support any operator
> > other than "=" in JOIN condition. Some queries are generating the plan with
> > JOIN having "<" operator. How and at what stage I can stop FDW to not make
> > such a plan. Here is my sample query.
>
> What exactly do you think should happen instead?  You can't just tell
> users not to ask such a query.  (Well, you can try, but they'll probably
> go looking for a less broken FDW.)

I know that.

> If what you really mean is you don't want to generate pushed-down
> foreign join paths containing non-equality conditions, the answer is
> to just not do that.  That'd be the FDW's own fault, not that of
> the core planner, if it creates a path representing a join it
> can't actually implement.  You'll end up running the join locally,
> which might not be great, but if you have no other alternative
> then that's what you gotta do.

Yes, that is what I was thinking. For a non-equality condition, joining
locally is the only solution. I was just confirming.

> If what you mean is you don't know how to inspect the join quals
> to see if you can implement them, take a look at postgres_fdw
> to see how it handles the same issue.

I don't really know whether postgres_fdw has the same issue, but yes,
postgres_fdw is always my starting point.

>             regards, tom lane


--
Ibrar Ahmed