Thread: [HACKERS] pg_basebackup behavior on non-existent slot

[HACKERS] pg_basebackup behavior on non-existent slot

From: Jeff Janes
Date: September 5, 2017, 04:21:16

If I tell pg_basebackup to use a non-existent slot, it immediately reports an error. And then it exits with an error, but only after streaming the entire database contents.

If you are doing this interactively and are on the ball, of course, you can hit ctrl-C when you see the error message.

I don't know if this is exactly a bug, but it seems rather unfortunate.

Should the parent process of pg_basebackup be made to respond to SIGCHLD? Or call waitpid(bgchild, &status, WNOHANG) in some strategic loop?

$ /usr/local/pgsql9_6/bin/pg_basebackup -D data_replica -P --slot foobar -Xs

pg_basebackup: could not send replication command "START_REPLICATION": ERROR: replication slot "foobar" does not exist

22384213/22384213 kB (100%), 1/1 tablespace

pg_basebackup: child process exited with error 1

pg_basebackup: removing data directory "data_replica"

Cheers,

Jeff

Re: [HACKERS] pg_basebackup behavior on non-existent slot

From: Magnus Hagander
Date: September 6, 2017, 05:25:59

On Mon, Sep 4, 2017 at 3:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

> If I tell pg_basebackup to use a non-existent slot, it immediately reports an error. And then it exits with an error, but only after streaming the entire database contents.
>
> If you are doing this interactively and are on the ball, of course, you can hit ctrl-C when you see the error message.
>
> I don't know if this is exactly a bug, but it seems rather unfortunate.

I think that should qualify as a bug.

In 10 it will automatically create a transient slot in this case, but there might still be a case where you can provoke this.

> Should the parent process of pg_basebackup be made to respond to SIGCHLD? Or call waitpid(bgchild, &status, WNOHANG) in some strategic loop?

I think it's ok to just call waitpid() -- we don't need to react super quickly, but we should react. And we should then exit the main process with an error before actually streaming everything.

Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
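
A minimal sketch of the waitpid() polling idea discussed above, assuming a POSIX system. The helper names child_has_exited and spawn_child are hypothetical and are not taken from the pg_basebackup source; spawn_child only stands in for the background WAL-streaming child that fails right away, as it does with a bad slot name.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Poll a child process without blocking.
 *
 * Returns 0 if the child is still running, 1 if it has terminated (its exit
 * code is stored in *exit_code, or -1 if it was killed by a signal), and -1
 * if waitpid() itself failed.
 */
static int
child_has_exited(pid_t child, int *exit_code)
{
    int   status;
    pid_t r = waitpid(child, &status, WNOHANG);

    if (r == 0)
        return 0;               /* child has not changed state yet */
    if (r < 0)
        return -1;              /* waitpid() failed, e.g. no such child */

    *exit_code = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    return 1;
}

/*
 * Hypothetical stand-in for the WAL-streaming child: it exits at once with
 * the given code. Returns the child's PID in the parent, or -1 if fork()
 * failed.
 */
static pid_t
spawn_child(int exit_code)
{
    pid_t pid = fork();

    if (pid == 0)
        _exit(exit_code);       /* child: terminate immediately */
    return pid;
}
```

The parent would call child_has_exited() at some convenient point in its copy loop, for example once per received chunk, and abort the base backup as soon as it returns 1. The cost is one system call per check, which is the concern raised in the next reply.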

Re: [HACKERS] pg_basebackup behavior on non-existent slot

From: Alvaro Herrera
Date: September 6, 2017, 15:50:04

Magnus Hagander wrote:
> On Mon, Sep 4, 2017 at 3:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

> > Should the parent process of pg_basebackup be made to respond to SIGCHLD?
> > Or call waitpid(bgchild, &status, WNOHANG) in some strategic loop?
> 
> I think it's ok to just call waitpid() -- we don't need to react super
> quickly, but we should react.

Hmm, not sure about that ... in the normal case (slotname is correct)
you'd be doing thousands of useless waitpid() system calls during the
whole operation, no?  I think it'd be better to have a SIGCHLD handler
that sets a flag (just once), which can be quickly checked without
accessing kernel space.

> And we should then exit the main process with an error before actually
> streaming everything.

Right.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] pg_basebackup behavior on non-existent slot

From: Magnus Hagander
Date: September 12, 2017, 14:40:15

On Wed, Sep 6, 2017 at 11:50 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> Magnus Hagander wrote:
> > On Mon, Sep 4, 2017 at 3:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> >
> > > Should the parent process of pg_basebackup be made to respond to SIGCHLD?
> > > Or call waitpid(bgchild, &status, WNOHANG) in some strategic loop?
> >
> > I think it's ok to just call waitpid() -- we don't need to react super
> > quickly, but we should react.
>
> Hmm, not sure about that ... in the normal case (slotname is correct)
> you'd be doing thousands of useless waitpid() system calls during the
> whole operation, no? I think it'd be better to have a SIGCHLD handler
> that sets a flag (just once), which can be quickly checked without
> accessing kernel space.

Good point.

So the question is what to do for Windows. I'd rather not have to bring the whole extra thread and socket emulation stuff into pg_basebackup if it can be avoided. But I guess we could code up something Windows-specific in just that one (since it's threaded rather than process-based on Windows, it's easier than the backend). I think that means we'd have to rewrite it to use the async libpq APIs, don't you?

The other option would be to just kill the process from the child thread. Since they're threads, we can do that. However, that would leave us in a position where we can't clean up after the error (as in, remove files and directories); I'm not sure that's good.

Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

Re: [HACKERS] pg_basebackup behavior on non-existent slot

From: Jeff Janes
Date: September 12, 2017, 23:35:58

On Wed, Sep 6, 2017 at 2:50 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> Magnus Hagander wrote:
> > On Mon, Sep 4, 2017 at 3:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> >
> > > Should the parent process of pg_basebackup be made to respond to SIGCHLD?
> > > Or call waitpid(bgchild, &status, WNOHANG) in some strategic loop?
> >
> > I think it's ok to just call waitpid() -- we don't need to react super
> > quickly, but we should react.
>
> Hmm, not sure about that ... in the normal case (slotname is correct)
> you'd be doing thousands of useless waitpid() system calls during the
> whole operation, no? I think it'd be better to have a SIGCHLD handler
> that sets a flag (just once), which can be quickly checked without
> accessing kernel space.

If we don't want polling by waitpid, then my next thought would be to move the data copy into another process, then have the main process do nothing but wait for the first child to exit. If the first to exit is the WAL receiver, then we must have an error and the data receiver can be killed. I don't know how to translate that to Windows, however.

Cheers,

Jeff
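
The two-child arrangement described above could be sketched as follows for POSIX systems. The names spawn_child and wait_for_first_exit are hypothetical, and the children here only sleep and exit, standing in for the WAL receiver and the data copier. It also assumes the parent has no other children.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for a worker process: it sleeps for delay_seconds
 * and then exits with exit_code. Returns the child's PID in the parent, or
 * -1 if fork() failed.
 */
static pid_t
spawn_child(int exit_code, unsigned int delay_seconds)
{
    pid_t pid = fork();

    if (pid == 0)
    {
        sleep(delay_seconds);
        _exit(exit_code);
    }
    return pid;
}

/*
 * Block until the first of the two children terminates.
 *
 * Returns 1 if the WAL child went first; that is treated as an error, so the
 * data child is terminated and reaped. Returns 0 if the data child went
 * first; the WAL child is left running for the caller to deal with.
 * Returns -1 if wait() failed or reported some other child.
 */
static int
wait_for_first_exit(pid_t wal_child, pid_t data_child)
{
    pid_t first = wait(NULL);   /* sleeps until any child terminates */

    if (first == wal_child)
    {
        kill(data_child, SIGTERM);
        waitpid(data_child, NULL, 0);
        return 1;
    }
    if (first == data_child)
        return 0;
    return -1;
}
```

With this layout the parent makes no polling calls at all: it sleeps in wait() until something happens, and it remains the single place responsible for cleanup, such as removing the partially written data directory.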

Re: [HACKERS] pg_basebackup behavior on non-existent slot

From: Magnus Hagander
Date: September 15, 2017, 00:30:20

On Tue, Sep 12, 2017 at 7:35 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

> On Wed, Sep 6, 2017 at 2:50 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > Magnus Hagander wrote:
> > > On Mon, Sep 4, 2017 at 3:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> > >
> > > > Should the parent process of pg_basebackup be made to respond to SIGCHLD?
> > > > Or call waitpid(bgchild, &status, WNOHANG) in some strategic loop?
> > >
> > > I think it's ok to just call waitpid() -- we don't need to react super
> > > quickly, but we should react.
> >
> > Hmm, not sure about that ... in the normal case (slotname is correct)
> > you'd be doing thousands of useless waitpid() system calls during the
> > whole operation, no? I think it'd be better to have a SIGCHLD handler
> > that sets a flag (just once), which can be quickly checked without
> > accessing kernel space.
>
> If we don't want polling by waitpid, then my next thought would be to move the data copy into another process, then have the main process do nothing but wait for the first child to exit. If the first to exit is the WAL receiver, then we must have an error and the data receiver can be killed. I don't know how to translate that to Windows, however.

Well, we could do something similar -- run the main process and the streamer in separate threads on Windows and have a main thread wait on both. The main thread would have to be in charge of cleanup as well, of course. But I think that's likely going to be more complicated than using the non-blocking libpq APIs.

Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
