Обсуждение: BUG #13888: pg_dump write error

Поиск
Список
Период
Сортировка

BUG #13888: pg_dump write error

От
kunschikov@gmail.com
Дата:
The following bug has been logged on the website:

Bug reference:      13888
Logged by:          Vladimir Kunschikov
Email address:      kunschikov@gmail.com
PostgreSQL version: 9.4.5
Operating system:   Linux
Description:

Database backup in heavy loaded environment sometimes fails with the
following error:

 pg_dump: [parallel archiver] could not write to output file: Success

Subsequent rerun of the pg_dump solves that problem.
More elegant and reliable solution:  WRITE_ERROR_EXIT macro replacement in
ahwrite() function at src/bin/pg_dump/pg_backup_archiver.c source file.
There can be some error checking instead of this macro.

Re: BUG #13888: pg_dump write error

От
Alvaro Herrera
Дата:
kunschikov@gmail.com wrote:

> Database backup in heavy loaded environment sometimes fails with the
> following error:
>
>  pg_dump: [parallel archiver] could not write to output file: Success
>
> Subsequent rerun of the pg_dump solves that problem.
> More elegant and reliable solution:  WRITE_ERROR_EXIT macro replacement in
> ahwrite() function at src/bin/pg_dump/pg_backup_archiver.c source file.
> There can be some error checking instead of this macro.

Yeah, I noticed this and similar lacks of error checks in pg_dump in
code review, which I didn't get around to patching.  Care to submit a
patch?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #13888: pg_dump write error

От
Michael Paquier
Дата:
On Tue, Jan 26, 2016 at 1:08 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> kunschikov@gmail.com wrote:
>
>> Database backup in heavy loaded environment sometimes fails with the
>> following error:
>>
>>  pg_dump: [parallel archiver] could not write to output file: Success
>>
>> Subsequent rerun of the pg_dump solves that problem.
>> More elegant and reliable solution:  WRITE_ERROR_EXIT macro replacement in
>> ahwrite() function at src/bin/pg_dump/pg_backup_archiver.c source file.
>> There can be some error checking instead of this macro.
>
> Yeah, I noticed this and similar lacks of error checks in pg_dump in
> code review, which I didn't get around to patching.  Care to submit a
> patch?

Indeed, with a closer look there are things like tarWrite that can
return 0 and trigger WRITE_ERROR_EXIT with the same thing. Couldn't we
simply check for errno = 0 and generate a more generic error message
instead? Or are you willing at replacing all those things with just
exit_horribly()?
--
Michael

Re: BUG #13888: pg_dump write error

От
Tom Lane
Дата:
Michael Paquier <michael.paquier@gmail.com> writes:
> On Tue, Jan 26, 2016 at 1:08 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> Yeah, I noticed this and similar lacks of error checks in pg_dump in
>> code review, which I didn't get around to patching.  Care to submit a
>> patch?

> Indeed, with a closer look there are things like tarWrite that can
> return 0 and trigger WRITE_ERROR_EXIT with the same thing. Couldn't we
> simply check for errno = 0 and generate a more generic error message
> instead? Or are you willing at replacing all those things with just
> exit_horribly()?

I do not understand these claims that there isn't an error check there.
There surely is.  But fwrite() didn't set errno.

The real question is why did he get a short write in the first place.
We don't make any attempt to support filesystems that require retries,
which seems to be what is going on here.  Should we?

            regards, tom lane

Re: BUG #13888: pg_dump write error

От
Vladimir Kunschikov
Дата:
Attached small refactoring of the ahwrite() which includes checking of the errno.
It was made with postgres-9.4.5 sources but applicable to 9.4.1 and even to 9.5.0 with small fuss at ExecuteSqlCommandBuf() call
It could be done  with small check of the errno with rerun without any refactoring.

On Mon, Jan 25, 2016 at 7:08 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
kunschikov@gmail.com wrote:

> Database backup in heavy loaded environment sometimes fails with the
> following error:
>
>  pg_dump: [parallel archiver] could not write to output file: Success
>
> Subsequent rerun of the pg_dump solves that problem.
> More elegant and reliable solution:  WRITE_ERROR_EXIT macro replacement in
> ahwrite() function at src/bin/pg_dump/pg_backup_archiver.c source file.
> There can be some error checking instead of this macro.

Yeah, I noticed this and similar lacks of error checks in pg_dump in
code review, which I didn't get around to patching.  Care to submit a
patch?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: BUG #13888: pg_dump write error

От
Michael Paquier
Дата:
On Tue, Jan 26, 2016 at 11:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> On Tue, Jan 26, 2016 at 1:08 AM, Alvaro Herrera
>> <alvherre@2ndquadrant.com> wrote:
>>> Yeah, I noticed this and similar lacks of error checks in pg_dump in
>>> code review, which I didn't get around to patching.  Care to submit a
>>> patch?
>
>> Indeed, with a closer look there are things like tarWrite that can
>> return 0 and trigger WRITE_ERROR_EXIT with the same thing. Couldn't we
>> simply check for errno = 0 and generate a more generic error message
>> instead? Or are you willing at replacing all those things with just
>> exit_horribly()?
>
> I do not understand these claims that there isn't an error check there.
> There surely is.  But fwrite() didn't set errno.

Yep, and the error message provided let's the user think that there
has been a failure, with a success. I am wondering about something
like that:
-#define WRITE_ERROR_EXIT \
+#define WRITE_ERROR_EXIT(errno) \
        do { \
-               exit_horribly(modulename, "could not write to output
file: %s\n", \
-                                         strerror(errno)); \
+               if (errno != 0) \
+                       exit_horribly(modulename, "could not write to
output file: %s\n", \
+                                                 strerror(errno)); \
+               else \
+                       exit_horribly(modulename, "could not write to
output file\n"); \
        } while (0)
A more invasive approach would be to actually replace most of the
READ/WRITE_ERROR_EXIT stuff by more appropriate error checks with
customized error messages. And I would prefer that, this helps having
a deeper understanding of exactly where the error occurred when
running pg_dump.

> The real question is why did he get a short write in the first place.
> We don't make any attempt to support filesystems that require retries,
> which seems to be what is going on here.  Should we?

I would keep the code simple, we don't do that in the backend I recall.
--
Michael

Re: BUG #13888: pg_dump write error

От
Alvaro Herrera
Дата:
Tom Lane wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
> > On Tue, Jan 26, 2016 at 1:08 AM, Alvaro Herrera
> > <alvherre@2ndquadrant.com> wrote:
> >> Yeah, I noticed this and similar lacks of error checks in pg_dump in
> >> code review, which I didn't get around to patching.  Care to submit a
> >> patch?
>
> > Indeed, with a closer look there are things like tarWrite that can
> > return 0 and trigger WRITE_ERROR_EXIT with the same thing. Couldn't we
> > simply check for errno = 0 and generate a more generic error message
> > instead? Or are you willing at replacing all those things with just
> > exit_horribly()?
>
> I do not understand these claims that there isn't an error check there.
> There surely is.  But fwrite() didn't set errno.

Yeah, that's what I was remembering actually:
http://www.postgresql.org/message-id/20150608174336.GM133018@postgresql.org

> The real question is why did he get a short write in the first place.
> We don't make any attempt to support filesystems that require retries,
> which seems to be what is going on here.  Should we?

Sounds likely.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #13888: pg_dump write error

От
Vladimir Kunschikov
Дата:
pg_dump: could not write to output file: Success
>We don't make any attempt to support filesystems that require retries

Filesystem: VFAT on usb media.Seems like short write was caused not by
filesystem but by gzwrite(), since pg_dump was called with `-F d` switch:

pg_dump -U *** -d **** -v -j 2 -t "alert_counters" -t "aggr_alert_counters"
-t "alerts*" -t "payloads*" -F d -f
"/media/usb0/backup_963871436_2016-01-18_09-16-12"

I think there was EAGAIN error. We are solving this problem now by
suggesting to users  rerun of the backup script. I've attached small patch
to the ahwrite() function in my previous posting (it is waiting for the
moderation  approval ).

On Tue, Jan 26, 2016 at 5:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Michael Paquier <michael.paquier@gmail.com> writes:
> > On Tue, Jan 26, 2016 at 1:08 AM, Alvaro Herrera
> > <alvherre@2ndquadrant.com> wrote:
> >> Yeah, I noticed this and similar lacks of error checks in pg_dump in
> >> code review, which I didn't get around to patching.  Care to submit a
> >> patch?
>
> > Indeed, with a closer look there are things like tarWrite that can
> > return 0 and trigger WRITE_ERROR_EXIT with the same thing. Couldn't we
> > simply check for errno = 0 and generate a more generic error message
> > instead? Or are you willing at replacing all those things with just
> > exit_horribly()?
>
> I do not understand these claims that there isn't an error check there.
> There surely is.  But fwrite() didn't set errno.
>
> The real question is why did he get a short write in the first place.
> We don't make any attempt to support filesystems that require retries,
> which seems to be what is going on here.  Should we?
>
>                         regards, tom lane
>

Re: BUG #13888: pg_dump write error

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> I do not understand these claims that there isn't an error check there.
>> There surely is.  But fwrite() didn't set errno.

> Yeah, that's what I was remembering actually:
> http://www.postgresql.org/message-id/20150608174336.GM133018@postgresql.org

>> The real question is why did he get a short write in the first place.
>> We don't make any attempt to support filesystems that require retries,
>> which seems to be what is going on here.  Should we?

> Sounds likely.

Actually, now that I think about it while not in an airport, what I
think we really need is logic along the lines of

      errno = 0;
      nbytes = fwrite(...);
      if (nbytes != expected && errno == 0)
           errno = ENOSPC;

ie, assume a short write implies out-of-disk-space.  I believe that
is what we do in most (hopefully all) cases in the backend; see for
example UpdateControlFile() in xlog.c.

IMO it is not really WRITE_ERROR_EXIT's place to assume that errno == 0
should be replaced by ENOSPC; in particular, that would be incorrect,
or at least insufficient, unless there's an explicit initialization of
"errno = 0;" before the fwrite call.  So I'm inclined to put the
responsibility to do all of the above logic on the fwrite call sites,
rather than splitting it between callers and WRITE_ERROR_EXIT.

            regards, tom lane

Re: BUG #13888: pg_dump write error

От
Andres Freund
Дата:
On February 2, 2016 10:12:54 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:

>Actually, now that I think about it while not in an airport, what I
>think we really need is logic along the lines of
>
>      errno = 0;
>      nbytes = fwrite(...);
>      if (nbytes != expected && errno == 0)
>           errno = ENOSPC;
>
>ie, assume a short write implies out-of-disk-space.  I believe that
>is what we do in most (hopefully all) cases in the backend; see for
>example UpdateControlFile() in xlog.c.

There's an exception: when writing WAL we intentionally retry on short writes. IIRC Heikki added that after we found a
casewhere large writes returned short, but non zero, and trying again to finish the rest works. I'm nor sure of there
aren'tother cases where that should be done, die to large writes. 

Andres

---
Please excuse brevity and formatting - I am writing this on my mobile phone.

Re: BUG #13888: pg_dump write error

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> On February 2, 2016 10:12:54 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ie, assume a short write implies out-of-disk-space.  I believe that
>> is what we do in most (hopefully all) cases in the backend; see for
>> example UpdateControlFile() in xlog.c.

> There's an exception: when writing WAL we intentionally retry on short writes. IIRC Heikki added that after we found
acase where large writes returned short, but non zero, and trying again to finish the rest works. I'm nor sure of there
aren'tother cases where that should be done, die to large writes. 

I could support making pg_dump do that if there's a way to do it when
writing through zlib (which is probably the normal case these days).
I'm not at all sure that gzwrite() is retryable after a partial write;
that would likely have consequences for the internal state of the
compressor.

In any case, if you get nbytes == 0 and errno == 0, substituting ENOSPC
seems like the right thing.

            regards, tom lane