Обсуждение: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

Поиск
Список
Период
Сортировка
Hello, hackers!

Recently I've been building postgres with different cflags and cppflags.
And suddenly on REL_15_STABLE, REL_16_STABLE and master
I faced a failure of a src/test/subscription/t/029_on_error.pl test when
      CPPFLAGS="-DWAL_DEBUG"
and
      printf "wal_debug = on\n" >> "${TEMP_CONFIG}"
(or when both publisher and subscriber or only subscriber are run with wal_debug=on)

So I propose a little fix to the test.


Kind regards,
Ian Ilyasov.

Junior Software Developer at Postgres Professional
Вложения

Re: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

От
Michael Paquier
Дата:
On Tue, May 14, 2024 at 10:22:29AM +0000, Ilyasov Ian wrote:
> Hello, hackers!
>
> Recently I've been building postgres with different cflags and cppflags.
> And suddenly on REL_15_STABLE, REL_16_STABLE and master
> I faced a failure of a src/test/subscription/t/029_on_error.pl test when
>       CPPFLAGS="-DWAL_DEBUG"
> and
>       printf "wal_debug = on\n" >> "${TEMP_CONFIG}"
> (or when both publisher and subscriber or only subscriber are run with wal_debug=on)
>
> So I propose a little fix to the test.

Rather than assuming that the last line is the one to check, wouldn't
it be better to grab the data from the CONTEXT line located just after
the ERROR reporting the primary key violation?

A multi-line regexp, grabbing the LSN with more matching context based
on the ERROR and the DETAIL strings generating the CONTEXT we want
seems like a more stable alternative to me than grabbing the last line
of the logs.
--
Michael

Вложения
On Wed, May 15, 2024 at 9:26 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, May 14, 2024 at 10:22:29AM +0000, Ilyasov Ian wrote:
> > Hello, hackers!
> >
> > Recently I've been building postgres with different cflags and cppflags.
> > And suddenly on REL_15_STABLE, REL_16_STABLE and master
> > I faced a failure of a src/test/subscription/t/029_on_error.pl test when
> > CPPFLAGS="-DWAL_DEBUG"
> > and
> > printf "wal_debug = on\n" >> "${TEMP_CONFIG}"
> > (or when both publisher and subscriber or only subscriber are run with wal_debug=on)
> >
> > So I propose a little fix to the test.
>
> Rather than assuming that the last line is the one to check, wouldn't
> it be better to grab the data from the CONTEXT line located just after
> the ERROR reporting the primary key violation?
>

I guess it could be more work if we want to enhance the test for
ERRORs other than the primary key violation. One simple fix is to
update the log_offset to the location of the LOG after successful
replication of un-conflicted data. For example, the Log location after
we execute the below line in the test:

# Check replicated data
               my $res =
                 $node_subscriber->safe_psql('postgres', "SELECT
count(*) FROM tbl");
               is($res, $expected, $msg);

--
With Regards,
Amit Kapila.



RE: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Amit, Ian,

> I guess it could be more work if we want to enhance the test for
> ERRORs other than the primary key violation. One simple fix is to
> update the log_offset to the location of the LOG after successful
> replication of un-conflicted data. For example, the Log location after
> we execute the below line in the test:
> 
> # Check replicated data
>                my $res =
>                  $node_subscriber->safe_psql('postgres', "SELECT
> count(*) FROM tbl");
>                is($res, $expected, $msg);

I made a patch for confirmation purpose. This worked well on my environment.
Ian, how about you?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 


Вложения
Dear Hayato,

> I made a patch for confirmation purpose. This worked well on my environment.
> Ian, how about you?

I checked this patch on my environment. It also works well.
I like this change, but as I see it makes a different approach from Michael's advice.
Honesly, I do not know what would be better for this test.

Kind regards,
Ian Ilyasov.


Junior Software Developer at Postgres Professional



Re: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

От
Michael Paquier
Дата:
On Wed, May 15, 2024 at 05:58:18PM +0530, Amit Kapila wrote:
> I guess it could be more work if we want to enhance the test for
> ERRORs other than the primary key violation.

And?  You could pass the ERROR string expected as argument of the
function if more flexibility is wanted at some point, no?  It happens
that you don't require that now, which is fine for the three scenarios
calling test_skip_lsn.

> One simple fix is to
> update the log_offset to the location of the LOG after successful
> replication of un-conflicted data. For example, the Log location after
> we execute the below line in the test:
>
> # Check replicated data
>                my $res =
>                  $node_subscriber->safe_psql('postgres', "SELECT
> count(*) FROM tbl");
>                is($res, $expected, $msg);

That still looks like a shortcut to me, weak to race conditions on
slow machines where more log entries could be generated in-between.
So it seems to me that you'd still want to make sure that the CONTEXT
from which the LSN is retrieved maps to the sole expected error.  This
is not going to be stable unless there are stronger checks to avoid
log entries that can parasite the output, and a stronger matching
ensures that.
--
Michael

Вложения
On Thu, May 16, 2024 at 3:43 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, May 15, 2024 at 05:58:18PM +0530, Amit Kapila wrote:
> > I guess it could be more work if we want to enhance the test for
> > ERRORs other than the primary key violation.
>
> And?  You could pass the ERROR string expected as argument of the
> function if more flexibility is wanted at some point, no?
>

Normally, we consider error_codes for comparison as they are less
prone to change but here it may be okay to use error_string as we can
change it, if required. But let's discuss a bit more on the other
solution being discussed below.

>  It happens
> that you don't require that now, which is fine for the three scenarios
> calling test_skip_lsn.
>
> > One simple fix is to
> > update the log_offset to the location of the LOG after successful
> > replication of un-conflicted data. For example, the Log location after
> > we execute the below line in the test:
> >
> > # Check replicated data
> >                my $res =
> >                  $node_subscriber->safe_psql('postgres', "SELECT
> > count(*) FROM tbl");
> >                is($res, $expected, $msg);
>
> That still looks like a shortcut to me, weak to race conditions on
> slow machines where more log entries could be generated in-between.
> So it seems to me that you'd still want to make sure that the CONTEXT
> from which the LSN is retrieved maps to the sole expected error.  This
> is not going to be stable unless there are stronger checks to avoid
> log entries that can parasite the output, and a stronger matching
> ensures that.
>

This can only be a problem if the apply worker generates more LOG
entries with the required context but it won't do that unless there is
an operation on the publisher to replicate. If we see any such danger
then I agree this can break on some slow machines but as of now, I
don't see any such race condition.

--
With Regards,
Amit Kapila.



Re: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

От
Michael Paquier
Дата:
On Thu, May 16, 2024 at 09:00:47AM +0530, Amit Kapila wrote:
> This can only be a problem if the apply worker generates more LOG
> entries with the required context but it won't do that unless there is
> an operation on the publisher to replicate. If we see any such danger
> then I agree this can break on some slow machines but as of now, I
> don't see any such race condition.

Perhaps, still I'm not completely sure if this assumption is going to
always stand for all the possible configurations we support.  So,
rather than coming back to this test again, I would choose to make the
test as stable as possible from the start.  That's what mapping with
the error message would offer when grabbing the LSN from the CONTEXT
part of it, because there's a one-one mapping between the expected
ERROR and its CONTEXT from which the information used by the test is
retrieved.
--
Michael

Вложения
On Fri, May 17, 2024 at 5:25 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, May 16, 2024 at 09:00:47AM +0530, Amit Kapila wrote:
> > This can only be a problem if the apply worker generates more LOG
> > entries with the required context but it won't do that unless there is
> > an operation on the publisher to replicate. If we see any such danger
> > then I agree this can break on some slow machines but as of now, I
> > don't see any such race condition.
>
> Perhaps, still I'm not completely sure if this assumption is going to
> always stand for all the possible configurations we support.
>

As per my understanding, this will primarily rely on the apply worker
design, not the other configurations, whether we see additional LOG.

>  So,
> rather than coming back to this test again, I would choose to make the
> test as stable as possible from the start.  That's what mapping with
> the error message would offer when grabbing the LSN from the CONTEXT
> part of it, because there's a one-one mapping between the expected
> ERROR and its CONTEXT from which the information used by the test is
> retrieved.
>

I was slightly hesitant to do an ERROR string-based check because the
error string can change and it doesn't seem to bring additional
stability for this particular test. But if you and others are still
not convinced with the simple fix suggested by me then feel free to
proceed with an error-string based check.

--
With Regards,
Amit Kapila.



Dear Michael, Amit, Hayato

I corrected my patch according to what I think
Michael wanted. I attached the new patch to the letter.

--
Kind regards,
Ian Ilyasov.

Junior Software Developer at Postgres Professional
Вложения

Re: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

От
Michael Paquier
Дата:
On Wed, May 22, 2024 at 02:24:37PM +0000, Ilyasov Ian wrote:
> I corrected my patch according to what I think
> Michael wanted. I attached the new patch to the letter.

Thanks for compiling this patch.  Yes, that's the idea.

-      qr/processing remote data for replication origin \"pg_\d+\" during message type "INSERT" for replication target
relation"public.tbl" in transaction \d+, finished at ([[:xdigit:]]+\/[[:xdigit:]]+)/
 
+      qr/ERROR:  duplicate key.*\n.*DETAIL:.*\n.*CONTEXT:.* finished at ([[:xdigit:]]+\/[[:xdigit:]]+)/m

There are three CONTEXT strings that could map to this context.  It
seems to me that we should keep the 'for replication target relation
"public.tbl" in transaction \d+,', before the "finished at" so as it
is possible to make a difference with the context that has a column
name and the context where there is no target relation.  That makes
the regexp longer, but it is not that bad.
--
Michael

Вложения
> It seems to me that we should keep the 'for replication target relation
"public.tbl" in transaction \d+,', before the "finished at" so as it
is possible to make a difference with the context that has a column
name and the context where there is no target relation.

I agree. Attached the updated patch.

--
Kind regards,
Ian Ilyasov.

Junior Software Developer at Postgres Professional
Вложения

Re: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

От
Michael Paquier
Дата:
On Thu, May 23, 2024 at 08:12:07AM +0000, Ilyasov Ian wrote:
> > It seems to me that we should keep the 'for replication target relation
> "public.tbl" in transaction \d+,', before the "finished at" so as it
> is possible to make a difference with the context that has a column
> name and the context where there is no target relation.
>
> I agree. Attached the updated patch.

One issue that you have here is that the regexp detection would still
fail when setting log_error_verbosity = verbose because of the extra
error code added between the ERROR and the string.  This caused the
LSN to not be fetched properly.

At the end, I've come up with a regexp that checks for a match of the
error string after the ERROR to not confuse the last part getting the
xdigits, and applied that down to 15.  Perhaps I would have added the
first "ERROR" part in the check, but could not come down to it for the
readability of the thing.
--
Michael

Вложения