Discussion: PATCH: logical_work_mem and logical streaming of large in-progress transactions

PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
Hi all,

Attached is a patch series that adds two features to logical
replication - the ability to define a memory limit for the reorderbuffer
(responsible for building the decoded transactions), and the ability to
stream large in-progress transactions (exceeding the memory limit).

I'm submitting those two changes together, because one builds on the
other, and it's beneficial to discuss them together.


PART 1: adding logical_work_mem memory limit (0001)
---------------------------------------------------

Currently, limiting the amount of memory consumed by logical decoding is
tricky (or you might say impossible) for several reasons:

* The value is hard-coded, so it's not possible to customize it.

* The amount of decoded changes to keep in memory is restricted by the
number of changes. It's not very clear how this relates to memory
consumption, as the change size depends on table structure, etc.

* The number is "per (sub)transaction", so a transaction with many
subtransactions may easily consume a significant amount of memory without
actually hitting the limit.

So the patch does two things. Firstly, it introduces logical_work_mem, a
GUC restricting memory consumed by all transactions currently kept in
the reorder buffer.

Secondly, it adds simple memory accounting, tracking the amount of
memory used in total (for the whole reorder buffer, to compare against
logical_work_mem) and per transaction (so that we can quickly pick a
transaction to spill to disk).
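
To make the accounting concrete, here is a minimal sketch of the idea;
the function name is made up and the "size" fields are the counters
described above, so this is an illustration rather than the patch's
actual code:

    /*
     * Sketch of the accounting: a per-transaction counter plus a
     * buffer-wide counter, both updated whenever a change is added or
     * removed.  Assumes the "size" fields described above are added to
     * ReorderBuffer and ReorderBufferTXN; illustrative only.
     */
    #include "postgres.h"
    #include "replication/reorderbuffer.h"

    static void
    update_memory_accounting(ReorderBuffer *rb, ReorderBufferTXN *txn,
                             Size sz, bool addition)
    {
        if (addition)
        {
            txn->size += sz;    /* per-transaction: used to pick a victim */
            rb->size += sz;     /* total: compared against logical_work_mem */
        }
        else
        {
            Assert(txn->size >= sz && rb->size >= sz);
            txn->size -= sz;    /* e.g. after spilling the change to disk */
            rb->size -= sz;
        }
    }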

The one wrinkle in the patch is that the memory limit can't be enforced
when reading changes spilled to disk - with multiple subtransactions, we
can't easily predict how many changes to pre-read for each of them. At
that point we still use the existing max_changes_in_memory limit.

Luckily, changes introduced in the other parts of the patch should allow
addressing this deficiency.


PART 2: streaming of large in-progress transactions (0002-0006)
---------------------------------------------------------------

Note: This part is split into multiple smaller chunks, addressing
different parts of the logical decoding infrastructure. That's mostly to
allow easier reviews, though. Ultimately, it's just one patch.

Processing large transactions often results in significant apply lag,
for a couple of reasons. One reason is network bandwidth - while we do
decode the changes incrementally (as we read the WAL), we keep them
locally, either in memory, or spilled to files. Then at commit time, all
the changes get sent to the downstream (and applied) at the same time.
For large transactions the time to do the network transfer may be
significant, causing apply lag.

This patch extends the logical replication infrastructure (output plugin
API, reorder buffer, pgoutput, replication protocol etc.) to allow
streaming of in-progress transactions instead of spilling them to local
files.

The extensions to the API are pretty straightforward. Aside from adding
methods to stream changes/messages and commit a streamed transaction,
the API needs a function to abort a streamed (sub)transaction, and
functions to demarcate a block of streamed changes.
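
Roughly, the API additions amount to a set of callbacks along these
lines (the names and exact signatures are illustrative, not necessarily
what the patch defines):

    /*
     * Sketch of the output plugin API additions.  The callback names
     * and exact signatures here are illustrative; the patch's
     * definitions may differ.
     */
    #include "postgres.h"
    #include "replication/reorderbuffer.h"

    /* demarcate a block of streamed changes of one in-progress transaction */
    typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
                                                ReorderBufferTXN *txn);
    typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
                                               ReorderBufferTXN *txn);

    /* stream one change of an in-progress transaction */
    typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
                                                 ReorderBufferTXN *txn,
                                                 Relation relation,
                                                 ReorderBufferChange *change);

    /* commit a previously streamed transaction */
    typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
                                                 ReorderBufferTXN *txn,
                                                 XLogRecPtr commit_lsn);

    /* abort a streamed (sub)transaction */
    typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
                                                ReorderBufferTXN *txn,
                                                XLogRecPtr abort_lsn);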

To decode a transaction, we need to know all its subtransactions and
invalidations. Currently, those are only known at commit time - some
assignments may be known earlier, but invalidations are only ever
written in the commit record.

So far that was fine, because we only decode/replay transactions at
commit time, when all of this is known (because it's either in commit
record, or written before it).

But for in-progress transactions (i.e. the subject of interest here),
that is not the case. So the patch modifies WAL-logging to ensure those
two bits of information are written immediately (for wal_level=logical).

For assignments that was fairly simple, thanks to existing caching. For
invalidations, it requires a new WAL record type and a couple of changes
in inval.c.
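
For illustration, such a standalone invalidation record might carry
little more than the messages themselves (a sketch; the actual record
added by the patch may differ):

    /*
     * Sketch of a WAL record carrying invalidation messages on their
     * own, instead of only as part of the commit record.  The struct
     * name is illustrative only.
     */
    #include "postgres.h"
    #include "storage/sinval.h"

    typedef struct xl_xact_invalidations
    {
        int         nmsgs;      /* number of shared inval messages */
        SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
    } xl_xact_invalidations;

    #define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)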

On the apply side, we simply receive the streamed changes and write them
into a file (one file per toplevel transaction, which is possible thanks
to the assignments being known immediately). Then at commit time the
changes are replayed locally, without having to copy a large chunk of
data over the network.
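
A rough sketch of that apply-side flow, with all helpers being
hypothetical placeholders rather than actual worker code:

    /*
     * Rough sketch of the apply-side flow described above.  All the
     * helpers declared below are hypothetical (shown only as
     * prototypes); only the overall flow mirrors the description.
     */
    #include "postgres.h"
    #include "lib/stringinfo.h"
    #include "storage/buffile.h"

    /* hypothetical helpers, not part of the patch or PostgreSQL */
    static BufFile *open_changes_file(TransactionId xid);
    static void append_change_to_file(BufFile *file, StringInfo change);
    static bool read_change_from_file(BufFile *file, StringInfo change);
    static void remove_changes_file(TransactionId xid);
    static void apply_change(StringInfo change);

    /* a streamed change arrives: just append it to the per-xact file */
    static void
    handle_stream_change(TransactionId xid, StringInfo change)
    {
        /* one file per toplevel transaction, keyed by the (known) xid */
        BufFile    *file = open_changes_file(xid);

        append_change_to_file(file, change);
    }

    /* the streamed transaction commits: replay the buffered changes locally */
    static void
    handle_stream_commit(TransactionId xid)
    {
        BufFile    *file = open_changes_file(xid);
        StringInfoData change;

        while (read_change_from_file(file, &change))
            apply_change(&change);

        remove_changes_file(xid);
    }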


WAL overhead
------------

Of course, these changes to WAL logging are not for free - logging
assignments individually (instead of multiple subtransactions at once)
means higher xlog record overhead. Similarly, (sub)transactions doing a
lot of DDL may result in a lot of invalidations written to WAL (again,
with full xlog record overhead per invalidation).

I've done a number of tests to measure the impact, and for extreme
corner cases the additional amount of WAL is about 40% in both cases.

By an "extreme corner case" I mean a workloads intentionally triggering
many assignments/invalidations, without doing a lot of meaningful work.

For assignments, imagine a single-row table (no indexes), and a
transaction like this one:

    BEGIN;
    UPDATE t SET v = v + 1;
    SAVEPOINT s1;
    UPDATE t SET v = v + 1;
    SAVEPOINT s2;
    UPDATE t SET v = v + 1;
    SAVEPOINT s3;
    ...
    UPDATE t SET v = v + 1;
    SAVEPOINT s10;
    UPDATE t SET v = v + 1;
    COMMIT;

For invalidations, add a CREATE TEMPORARY TABLE to each subtransaction.

For more realistic workloads (large table with indexes, runs long enough
to generate FPIs, etc.) the overhead drops below 5%, which is much more
acceptable, of course, although not perfect.

In both cases, there was pretty much no measurable impact on performance
(as measured by tps).

I do not think there's a way around this requirement (having assignments
and invalidations), if we want to decode in-progress transactions. But
perhaps it would be possible to do some sort of caching (say, at command
level), to reduce the xlog record overhead? Not sure.

All ideas are welcome, of course. In the worst case, I think we can add
a GUC enabling this additional logging - when disabled, streaming of
in-progress transactions would not be possible.


Simplifying ReorderBuffer
-------------------------

One interesting consequence of having assignments is that we could get
rid of the ReorderBuffer iterator, used to merge changes from subxacts.
The assignments allow us to keep changes for each toplevel transaction
in a single list, in LSN order, and just walk it. Abort can be performed
by remembering the position of the first change in each subxact, and just
discarding the tail.
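
For illustration, the abort handling could then be as simple as this
sketch (a real implementation would jump straight to the remembered
position and also free the discarded changes; the function name and the
"first LSN" parameter are assumptions, not the patch's code):

    /*
     * Sketch of subxact abort with a single LSN-ordered change list per
     * toplevel transaction: everything from the subxact's first change
     * onwards is the tail to discard.  Illustrative only.
     */
    #include "postgres.h"
    #include "replication/reorderbuffer.h"

    static void
    discard_subxact_changes(ReorderBufferTXN *toptxn, XLogRecPtr subxact_first_lsn)
    {
        dlist_mutable_iter iter;

        dlist_foreach_modify(iter, &toptxn->changes)
        {
            ReorderBufferChange *change =
                dlist_container(ReorderBufferChange, node, iter.cur);

            /* everything at or after the subxact's first change is its tail */
            if (change->lsn >= subxact_first_lsn)
                dlist_delete(&change->node);
        }
    }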

This is what the apply worker does with the streamed changes and aborts.

It would also allow us to enforce the memory limit while restoring
transactions spilled to disk, because we would not have the problem with
restoring changes for many subtransactions.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Erikjan Rijkers
Date:
On 2017-12-23 05:57, Tomas Vondra wrote:
> Hi all,
> 
> Attached is a patch series that implements two features to the logical
> replication - ability to define a memory limit for the reorderbuffer
> (responsible for building the decoded transactions), and ability to
> stream large in-progress transactions (exceeding the memory limit).
> 

logical replication of 2 instances is OK but 3 and up fail with:

TRAP: FailedAssertion("!(last_lsn < change->lsn)", File: 
"reorderbuffer.c", Line: 1773)

I can cobble up a script but I hope you have enough from the assertion 
to see what's going wrong...


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:

On 12/23/2017 03:03 PM, Erikjan Rijkers wrote:
> On 2017-12-23 05:57, Tomas Vondra wrote:
>> Hi all,
>>
>> Attached is a patch series that implements two features to the logical
>> replication - ability to define a memory limit for the reorderbuffer
>> (responsible for building the decoded transactions), and ability to
>> stream large in-progress transactions (exceeding the memory limit).
>>
> 
> logical replication of 2 instances is OK but 3 and up fail with:
> 
> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
> "reorderbuffer.c", Line: 1773)
> 
> I can cobble up a script but I hope you have enough from the assertion
> to see what's going wrong...

The assertion says that the iterator produces changes in an order that does
not correlate with LSN. But I have a hard time understanding how that
could happen, particularly because according to the line number this
happens in ReorderBufferCommit(), i.e. the current (non-streaming) case.

So instructions to reproduce the issue would be very helpful.

Attached is v2 of the patch series, fixing two bugs I discovered today.
I don't think either of these is related to your issue, though.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Erik Rijkers
Date:
On 2017-12-23 21:06, Tomas Vondra wrote:
> On 12/23/2017 03:03 PM, Erikjan Rijkers wrote:
>> On 2017-12-23 05:57, Tomas Vondra wrote:
>>> Hi all,
>>> 
>>> Attached is a patch series that implements two features to the 
>>> logical
>>> replication - ability to define a memory limit for the reorderbuffer
>>> (responsible for building the decoded transactions), and ability to
>>> stream large in-progress transactions (exceeding the memory limit).
>>> 
>> 
>> logical replication of 2 instances is OK but 3 and up fail with:
>> 
>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
>> "reorderbuffer.c", Line: 1773)
>> 
>> I can cobble up a script but I hope you have enough from the assertion
>> to see what's going wrong...
> 
> The assertion says that the iterator produces changes in order that 
> does
> not correlate with LSN. But I have a hard time understanding how that
> could happen, particularly because according to the line number this
> happens in ReorderBufferCommit(), i.e. the current (non-streaming) 
> case.
> 
> So instructions to reproduce the issue would be very helpful.

Using:

0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
0006-Add-support-for-streaming-to-built-in-replication-v2.patch

As you expected the problem is the same with these new patches.

I have now tested more, and seen that it does not always fail.  I guess
it fails here about 3 times out of 4.  But the laptop I'm using at the
moment is old and slow -- that may well be a factor, as we've seen before [1].

Attached is the bash script that I put together.  I tested with
NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails 
often.  This same program run with HEAD never seems to fail (I tried a 
few dozen times).

thanks,

Erik Rijkers


[1] 
https://www.postgresql.org/message-id/3897361c7010c4ac03f358173adbcd60%40xs4all.nl


Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:

On 12/23/2017 11:23 PM, Erik Rijkers wrote:
> On 2017-12-23 21:06, Tomas Vondra wrote:
>> On 12/23/2017 03:03 PM, Erikjan Rijkers wrote:
>>> On 2017-12-23 05:57, Tomas Vondra wrote:
>>>> Hi all,
>>>>
>>>> Attached is a patch series that implements two features to the logical
>>>> replication - ability to define a memory limit for the reorderbuffer
>>>> (responsible for building the decoded transactions), and ability to
>>>> stream large in-progress transactions (exceeding the memory limit).
>>>>
>>>
>>> logical replication of 2 instances is OK but 3 and up fail with:
>>>
>>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
>>> "reorderbuffer.c", Line: 1773)
>>>
>>> I can cobble up a script but I hope you have enough from the assertion
>>> to see what's going wrong...
>>
>> The assertion says that the iterator produces changes in order that does
>> not correlate with LSN. But I have a hard time understanding how that
>> could happen, particularly because according to the line number this
>> happens in ReorderBufferCommit(), i.e. the current (non-streaming) case.
>>
>> So instructions to reproduce the issue would be very helpful.
> 
> Using:
> 
> 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
> 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
> 0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
> 0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
> 0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
> 0006-Add-support-for-streaming-to-built-in-replication-v2.patch
> 
> As you expected the problem is the same with these new patches.
> 
> I have now tested more, and seen that it not always fails.  I guess that
> it here fails 3 times out of 4.  But the laptop I'm using at the moment
> is old and slow -- it may well be a factor as we've seen before [1].
> 
> Attached is the bash that I put together.  I tested with
> NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails
> often.  This same program run with HEAD never seems to fail (I tried a
> few dozen times).
> 

Thanks. Unfortunately I still can't reproduce the issue. I even tried
running it in valgrind, to see if there are some memory access issues
(which should also slow it down significantly).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Craig Ringer
Date:
On 23 December 2017 at 12:57, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi all,

Attached is a patch series that implements two features to the logical
replication - ability to define a memory limit for the reorderbuffer
(responsible for building the decoded transactions), and ability to
stream large in-progress transactions (exceeding the memory limit).

I'm submitting those two changes together, because one builds on the
other, and it's beneficial to discuss them together.


PART 1: adding logical_work_mem memory limit (0001)
---------------------------------------------------

Currently, limiting the amount of memory consumed by logical decoding is
tricky (or you might say impossible) for several reasons:

* The value is hard-coded, so it's not quite possible to customize it.

* The amount of decoded changes to keep in memory is restricted by
number of changes. It's not very unclear how this relates to memory
consumption, as the change size depends on table structure, etc.

* The number is "per (sub)transaction", so a transaction with many
subtransactions may easily consume significant amount of memory without
actually hitting the limit.

Also, even without subtransactions, we assemble a ReorderBufferTXN per transaction. Since transactions usually occur concurrently, systems with many concurrent txns can face lots of memory use.

We can't exclude tables that won't actually be replicated at the reorder buffering phase either. So txns use memory whether or not they do anything interesting as far as a given logical decoding session is concerned. Even if we'll throw all the data away we must buffer and assemble it first so we can make that decision.

Because logical decoding considers snapshots and cid increments even from other DBs (at least when the txn makes catalog changes) the memory use can get BIG too. I was recently working with a system that had accumulated 2GB of snapshots ... on each slot. With 7 slots, one for each DB.

So there's lots of room for difficulty with unpredictable memory use.

So the patch does two things. Firstly, it introduces logical_work_mem, a
GUC restricting memory consumed by all transactions currently kept in
the reorder buffer

Does this consider the (currently high, IIRC) overhead of tracking serialized changes?
 
--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Erik Rijkers
Date:
>>>> 
>>>> logical replication of 2 instances is OK but 3 and up fail with:
>>>> 
>>>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
>>>> "reorderbuffer.c", Line: 1773)
>>>> 
>>>> I can cobble up a script but I hope you have enough from the 
>>>> assertion
>>>> to see what's going wrong...
>>> 
>>> The assertion says that the iterator produces changes in order that 
>>> does
>>> not correlate with LSN. But I have a hard time understanding how that
>>> could happen, particularly because according to the line number this
>>> happens in ReorderBufferCommit(), i.e. the current (non-streaming) 
>>> case.
>>> 
>>> So instructions to reproduce the issue would be very helpful.
>> 
>> Using:
>> 
>> 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
>> 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
>> 0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
>> 0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
>> 0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
>> 0006-Add-support-for-streaming-to-built-in-replication-v2.patch
>> 
>> As you expected the problem is the same with these new patches.
>> 
>> I have now tested more, and seen that it not always fails.  I guess 
>> that
>> it here fails 3 times out of 4.  But the laptop I'm using at the 
>> moment
>> is old and slow -- it may well be a factor as we've seen before [1].
>> 
>> Attached is the bash that I put together.  I tested with
>> NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which 
>> fails
>> often.  This same program run with HEAD never seems to fail (I tried a
>> few dozen times).
>> 
> 
> Thanks. Unfortunately I still can't reproduce the issue. I even tried
> running it in valgrind, to see if there are some memory access issues
> (which should also slow it down significantly).

One wonders again if 2ndquadrant shouldn't invest in some old hardware 
;)

Another Good Thing would be if there was a provision in the buildfarm to 
test patches like these.

But I'm probably not the first one to suggest that; no doubt it'll be
possible someday.  In the meantime I'll try to repeat this crash on 
other machines (but that will be after the holidays).


Erik Rijkers


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:

On 12/24/2017 05:51 AM, Craig Ringer wrote:
> On 23 December 2017 at 12:57, Tomas Vondra <tomas.vondra@2ndquadrant.com
> <mailto:tomas.vondra@2ndquadrant.com>> wrote:
> 
>     Hi all,
> 
>     Attached is a patch series that implements two features to the logical
>     replication - ability to define a memory limit for the reorderbuffer
>     (responsible for building the decoded transactions), and ability to
>     stream large in-progress transactions (exceeding the memory limit).
> 
>     I'm submitting those two changes together, because one builds on the
>     other, and it's beneficial to discuss them together.
> 
> 
>     PART 1: adding logical_work_mem memory limit (0001)
>     ---------------------------------------------------
> 
>     Currently, limiting the amount of memory consumed by logical decoding is
>     tricky (or you might say impossible) for several reasons:
> 
>     * The value is hard-coded, so it's not quite possible to customize it.
> 
>     * The amount of decoded changes to keep in memory is restricted by
>     number of changes. It's not very unclear how this relates to memory
>     consumption, as the change size depends on table structure, etc.
> 
>     * The number is "per (sub)transaction", so a transaction with many
>     subtransactions may easily consume significant amount of memory without
>     actually hitting the limit.
> 
> 
> Also, even without subtransactions, we assemble a ReorderBufferTXN
> per transaction. Since transactions usually occur concurrently,
> systems with many concurrent txns can face lots of memory use.
> 

I don't see how that could be a problem, considering the number of
toplevel transactions is rather limited (to max_connections or so).

> We can't exclude tables that won't actually be replicated at the reorder
> buffering phase either. So txns use memory whether or not they do
> anything interesting as far as a given logical decoding session is
> concerned. Even if we'll throw all the data away we must buffer and
> assemble it first so we can make that decision.

Yep.

> Because logical decoding considers snapshots and cid increments even
> from other DBs (at least when the txn makes catalog changes) the memory
> use can get BIG too. I was recently working with a system that had
> accumulated 2GB of snapshots ... on each slot. With 7 slots, one for
> each DB.
> 
> So there's lots of room for difficulty with unpredictable memory use.
> 

Yep.

>     So the patch does two things. Firstly, it introduces logical_work_mem, a
>     GUC restricting memory consumed by all transactions currently kept in
>     the reorder buffer
> 
> 
> Does this consider the (currently high, IIRC) overhead of tracking
> serialized changes?
>  

Consider in what sense?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:

On 12/24/2017 10:00 AM, Erik Rijkers wrote:
>>>>>
>>>>> logical replication of 2 instances is OK but 3 and up fail with:
>>>>>
>>>>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
>>>>> "reorderbuffer.c", Line: 1773)
>>>>>
>>>>> I can cobble up a script but I hope you have enough from the assertion
>>>>> to see what's going wrong...
>>>>
>>>> The assertion says that the iterator produces changes in order that
>>>> does
>>>> not correlate with LSN. But I have a hard time understanding how that
>>>> could happen, particularly because according to the line number this
>>>> happens in ReorderBufferCommit(), i.e. the current (non-streaming)
>>>> case.
>>>>
>>>> So instructions to reproduce the issue would be very helpful.
>>>
>>> Using:
>>>
>>> 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
>>> 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
>>> 0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
>>> 0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
>>> 0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
>>> 0006-Add-support-for-streaming-to-built-in-replication-v2.patch
>>>
>>> As you expected the problem is the same with these new patches.
>>>
>>> I have now tested more, and seen that it not always fails.  I guess that
>>> it here fails 3 times out of 4.  But the laptop I'm using at the moment
>>> is old and slow -- it may well be a factor as we've seen before [1].
>>>
>>> Attached is the bash that I put together.  I tested with
>>> NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails
>>> often.  This same program run with HEAD never seems to fail (I tried a
>>> few dozen times).
>>>
>>
>> Thanks. Unfortunately I still can't reproduce the issue. I even tried
>> running it in valgrind, to see if there are some memory access issues
>> (which should also slow it down significantly).
> 
> One wonders again if 2ndquadrant shouldn't invest in some old hardware ;)
> 

Well, I've done tests on various machines, including some really slow
ones, and I still haven't managed to reproduce the failures using your
script. So I don't think that would really help. But I have reproduced
it by using a custom stress test script.

Turns out the asserts are overly strict - instead of

  Assert(prev_lsn < current_lsn);

it should have been

  Assert(prev_lsn <= current_lsn);

because some XLOG records may contain multiple rows (e.g. MULTI_INSERT).

The attached v3 fixes this issue, and also a couple of other thinkos:

1) The AssertChangeLsnOrder assert check was somewhat broken.

2) We've been sending aborts for all subtransactions, even those not yet
streamed. So downstream got confused and fell over because of an assert.

3) The streamed transactions were written to /tmp, using filenames based
on the subscription OID and the XID of the toplevel transaction. That's
fine, as long as there's just a single replica running - if there are
more, the filenames will clash, causing really strange failures. So the
files are moved to base/pgsql_tmp, where regular temporary files are
written. I'm not claiming this is perfect, perhaps we need to invent
another location.

FWIW I believe the relation sync cache is somewhat broken by the
streaming. I thought resetting it would be good enough, but it's more
complicated (and trickier) than that. I'm aware of it, and I'll look
into that next - but probably not before 2018.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Erik Rijkers
Date:
That indeed fixed the problem: running that same pgbench test, I see no 
crashes anymore (on any of 3 different machines, and with several 
pgbench parameters).

Thank you,

Erik Rijkers


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dmitry Dolgov
Date:
> On 25 December 2017 at 18:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> The attached v3 fixes this issue, and also a couple of other thinkos

Thank you for the patch; it looks quite interesting. After a quick look at it
(mostly the first one so far, but I'm going to continue) I have a few questions:

> + * XXX With many subtransactions this might be quite slow, because we'll have
> + * to walk through all of them. There are some options how we could improve
> + * that: (a) maintain some secondary structure with transactions sorted by
> + * amount of changes, (b) not looking for the entirely largest transaction,
> + * but e.g. for transaction using at least some fraction of the memory limit,
> + * and (c) evicting multiple transactions at once, e.g. to free a given portion
> + * of the memory limit (e.g. 50%).

Do you want to address these possible alternatives in this patch, or leave
them for later? Maybe it makes sense to apply some combination of
them, e.g. maintain a secondary structure with relatively large transactions,
and then start evicting them. If that's somehow not enough, then start evicting
multiple transactions at once (option "c").

> + /*
> + * We clamp manually-set values to at least 64kB. The maintenance_work_mem
> + * uses a higher minimum value (1MB), so this is OK.
> + */
> + if (*newval < 64)
> + *newval = 64;
> +

I'm not sure what's recommended practice here, but maybe it makes sense to
emit a warning here when the value is clamped to 64kB? Otherwise it can be
unexpected.

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Peter Eisentraut
Date:
On 12/22/17 23:57, Tomas Vondra wrote:
> PART 1: adding logical_work_mem memory limit (0001)
> ---------------------------------------------------

The documentation in this patch contains some references to later
features (streaming).  Perhaps that could be separated so that the
patches can be applied independently.

I don't see the need to tie this setting to maintenance_work_mem.
maintenance_work_mem is often set to very large values, which could then
have undesirable side effects on this use.

Moreover, the name logical_work_mem makes it sound like it's a logical
version of work_mem.  Maybe we could think of another name.

I think we need a way to report on how much memory is actually used, so
the setting can be tuned.  Something analogous to log_temp_files perhaps.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:

On 01/02/2018 04:07 PM, Peter Eisentraut wrote:
> On 12/22/17 23:57, Tomas Vondra wrote:
>> PART 1: adding logical_work_mem memory limit (0001)
>> ---------------------------------------------------
> 
> The documentation in this patch contains some references to later
> features (streaming).  Perhaps that could be separated so that the
> patches can be applied independently.
> 

Yeah, that's probably a good idea. But now that you mention it, I wonder
if "streaming" is really a good term. We already use it for "streaming
replication" and it may be quite confusing to use it for another feature
(particularly when it's streaming within logical streaming replication).

But I can't really think of a better name ...

> I don't see the need to tie this setting to maintenance_work_mem. 
> maintenance_work_mem is often set to very large values, which could
> then have undesirable side effects on this use.
> 

Well, we need to pick some default value, and we can either use a fixed
value (not sure what would be a good default) or tie it to an existing
GUC. We only really have work_mem and maintenance_work_mem, and the
walsender process will never use more than one such buffer. Which seems
to be closer to maintenance_work_mem.

Pretty much any default value can have undesirable side effects.

> Moreover, the name logical_work_mem makes it sound like it's a logical
> version of work_mem.  Maybe we could think of another name.
> 

I won't object to a better name, of course. Any proposals?

> I think we need a way to report on how much memory is actually used,
> so the setting can be tuned. Something analogous to log_temp_files
> perhaps.
> 

Yes, I agree. I'm just about to submit an updated version of the patch
series, which also introduces a few columns into pg_stat_replication, tracking
this type of stats (amount of data spilled to disk or streamed, etc.).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
Hi,

attached is v4 of the patch series, with a couple of changes:

1) Fixes a bunch of bugs I discovered during stress testing.

I'm not going to go into details, but the main fixes are related to
properly updating progress from the worker, and not streaming when
creating the logical replication slot.

2) Introduces columns into pg_stat_replication.

The new columns track various kinds of statistics (number of xacts,
bytes, ...) about spill-to-disk/streaming. This will be useful when
tuning the GUC memory limit.

3) Two temporary bugfixes that make the patch series work.

The first one (0008) makes sure is_known_subxact is set properly for all
subtransactions, and there's a separate fix in the CF. So this will
eventually go away.

The second one (0009) fixes an issue that is specific to streaming. It
does fix the issue, but I need a bit more time to think about it before
merging it into 0005.

This does pass extensive stress testing with a workload mixing DML, DDL,
subtransactions, aborts, etc. under valgrind. I'm working on extending
the test coverage, and introducing various error conditions (e.g.
walsender/walreceiver timeouts, failures on both ends, etc.).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:

On 01/03/2018 09:06 PM, Tomas Vondra wrote:
> Hi,
> 
> attached is v4 of the patch series, with a couple of changes:
> 
> 1) Fixes a bunch of bugs I discovered during stress testing.
> 
> I'm not going to go into details, but the main fixes are related to
> properly updating progress from the worker, and not streaming when
> creating the logical replication slot.
> 
> 2) Introduces columns into pg_stat_replication.
> 
> The new columns track various kinds of statistics (number of xacts,
> bytes, ...) about spill-to-disk/streaming. This will be useful when
> tuning the GUC memory limit.
> 
> 3) Two temporary bugfixes that make the patch series work.
> 

Forgot to mention that v4 also extends CREATE SUBSCRIPTION to
allow customizing the streaming and memory limit. So you can do

    CREATE SUBSCRIPTION ... WITH (streaming=on, work_mem=1024)

and this subscription will allow streaming, and logical_work_mem (on the
provider) will be set to 1MB.

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Peter Eisentraut
Date:
On 1/3/18 14:53, Tomas Vondra wrote:
>> I don't see the need to tie this setting to maintenance_work_mem. 
>> maintenance_work_mem is often set to very large values, which could
>> then have undesirable side effects on this use.
> 
> Well, we need to pick some default value, and we can either use a fixed
> value (not sure what would be a good default) or tie it to an existing
> GUC. We only really have work_mem and maintenance_work_mem, and the
> walsender process will never use more than one such buffer. Which seems
> to be closer to maintenance_work_mem.
> 
> Pretty much any default value can have undesirable side effects.

Let's just make it an independent setting unless we know any better.  We
don't have a lot of settings that depend on other settings, and the ones
we do have involve a very specific relationship.

>> Moreover, the name logical_work_mem makes it sound like it's a logical
>> version of work_mem.  Maybe we could think of another name.
> 
> I won't object to a better name, of course. Any proposals?

logical_decoding_[work_]mem?

>> I think we need a way to report on how much memory is actually used,
>> so the setting can be tuned. Something analogous to log_temp_files
>> perhaps.
> 
> Yes, I agree. I'm just about to submit an updated version of the patch
> series, that also introduces a few columns pg_stat_replication, tracking
> this type of stats (amount of data spilled to disk or streamed, etc.).

That seems OK.  Perhaps we could bring forward the part of that patch
that applies to this feature.

That would also help with testing *this* feature and determining what
appropriate settings are.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Peter Eisentraut
Date:
On 1/3/18 15:13, Tomas Vondra wrote:
> Forgot to mention that the v4 also extends the CREATE SUBSCRIPTION to
> allow customizing the streaming and memory limit. So you can do
> 
>     CREATE SUBSCRIPTION ... WITH (streaming=on, work_mem=1024)
> 
> and this subscription will allow streaming, and the logica_work_mem (on
> provider) will be set to 1MB.

I was wondering already during PG10 development whether we should give
subscriptions a generic configuration array, like databases and roles
have, so we don't have to hardcode a bunch of similar stuff every time
we add an option like this.  At the time we only had synchronous_commit,
but now we're adding more.

Also, instead of sticking this into the START_REPLICATION command, could
we just run a SET command?  That should work over replication
connections as well.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Peter Eisentraut
Date:
On 12/22/17 23:57, Tomas Vondra wrote:
> PART 1: adding logical_work_mem memory limit (0001)
> ---------------------------------------------------
> 
> Currently, limiting the amount of memory consumed by logical decoding is
> tricky (or you might say impossible) for several reasons:

I would like to see some more discussion on this, but I think not a lot
of people understand the details, so I'll try to write up an explanation
here.  This code is also somewhat new to me, so please correct me if
there are inaccuracies, while keeping in mind that I'm trying to simplify.

The data in the WAL is written as it happens, so the changes belonging
to different transactions are all mixed together.  One of the jobs of
logical decoding is to reassemble the changes belonging to each
transaction.  The top-level data structure for that is the infamous
ReorderBuffer.  So as it reads the WAL and sees something about a
transaction, it keeps a copy of that change in memory, indexed by
transaction ID (ReorderBufferChange).  When the transaction commits, the
accumulated changes are passed to the output plugin and then freed.  If
the transaction aborts, then changes are just thrown away.

So when logical decoding is active, a copy of the changes for each
active transaction is kept in memory (once per walsender).

More precisely, the above happens for each subtransaction.  When the
top-level transaction commits, it finds all its subtransactions in the
ReorderBuffer, reassembles everything in the right order, then invokes
the output plugin.

All this could end up using an unbounded amount of memory, so there is a
mechanism to spill changes to disk.  The way this currently works is
hardcoded, and this patch proposes to change that.

Currently, when a transaction or subtransaction has accumulated 4096
changes, it is spilled to disk.  When the top-level transaction commits,
things are read back from disk to do the final processing mentioned above.

This all works mostly fine, but you can construct some more extreme
cases where this can blow up.

Here is a mundane example.  Let's say a change entry takes 100 bytes (it
might contain a new row, or an update key and some new column values,
for example).  If you have 100 concurrent active sessions and no
subtransactions, then logical decoding memory is bounded by 4096 * 100 *
100 = 40 MB (per walsender) before things spill to disk.

Now let's say you are using a lot of subtransactions, because you are
using PL functions, exception handling, triggers, doing batch updates.
If you have 200 subtransactions on average per concurrent session, the
memory usage bound in that case would be 4096 * 100 * 100 * 200 = 8 GB
(per walsender).  And so on.  If you have more concurrent sessions or
larger changes or more subtransactions, you'll use much more than those
8 GB.  And if you don't have those 8 GB, then you're stuck at this point.
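
To spell out the assumptions behind those numbers, here is the same
arithmetic with the inputs as variables (illustrative only; all values
are the ones used in the examples above):

    /* Back-of-the-envelope check of the bounds above (illustrative only). */
    #include <stdio.h>

    int
    main(void)
    {
        long long changes_per_xact = 4096;  /* hardcoded spill threshold */
        long long change_size = 100;        /* assumed bytes per change */
        long long sessions = 100;           /* concurrent active sessions */
        long long subxacts = 200;           /* subtransactions per session */

        long long flat = changes_per_xact * change_size * sessions;
        long long nested = flat * subxacts;

        printf("no subxacts:  %lld bytes (~%lld MB)\n", flat, flat / 1000000);
        printf("200 subxacts: %lld bytes (~%lld GB)\n", nested, nested / 1000000000);
        return 0;
    }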

That is the consideration when we record changes, but we also need
memory when we do the final processing at commit time.  That is slightly
less problematic because we only process one top-level transaction at a
time, so the formula is only 4096 * avg_size_of_changes * nr_subxacts
(without the concurrent sessions factor).

So, this patch proposes to improve this as follows:

- We compute the actual size of each ReorderBufferChange and keep a
running tally for each transaction, instead of just counting the number
of changes.

- We have a configuration setting that allows us to change the limit
instead of the hardcoded 4096.  The configuration setting is also in
terms of memory, not in number of changes.

- The configuration setting is for the total memory usage per decoding
session, not per subtransaction.  (So we also keep a running tally for
the entire ReorderBuffer.)

There are two open issues with this patch:

One, this mechanism only applies when recording changes.  The processing
at commit time still uses the previous hardcoded mechanism.  The reason
for this is, AFAIU, that as things currently work, you have to have all
subtransactions in memory to do the final processing.  There are some
proposals to change this as well, but they are more involved.  Arguably,
per my explanation above, memory use at commit time is less likely to be
a problem.

Two, what to do when the memory limit is reached.  With the old
accounting, this was easy, because we'd decide for each subtransaction
independently whether to spill it to disk, when it has reached its 4096
limit.  Now, we are looking at a global limit, so we have to find a
transaction to spill in some other way.  The proposed patch searches
through the entire list of transactions to find the largest one.  But as
the patch says:

"XXX With many subtransactions this might be quite slow, because we'll
have to walk through all of them. There are some options how we could
improve that: (a) maintain some secondary structure with transactions
sorted by amount of changes, (b) not looking for the entirely largest
transaction, but e.g. for transaction using at least some fraction of
the memory limit, and (c) evicting multiple transactions at once, e.g.
to free a given portion of the memory limit (e.g. 50%)."

(a) would create more overhead for the case where everything fits into
memory, so it seems unattractive.  Some combination of (b) and (c) seems
useful, but we'd have to come up with something concrete.
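
For reference, the victim selection currently proposed is essentially a
linear scan along these lines (a sketch that assumes the per-transaction
size counter added by the patch; it is not the patch's actual code):

    /*
     * Sketch of the naive victim selection: walk all toplevel
     * transactions and pick the one using the most memory.
     * Subtransactions are ignored here for simplicity, and the "size"
     * field is the patch's accounting counter.  Illustrative only.
     */
    #include "postgres.h"
    #include "replication/reorderbuffer.h"

    static ReorderBufferTXN *
    pick_largest_transaction(ReorderBuffer *rb)
    {
        dlist_iter  iter;
        ReorderBufferTXN *largest = NULL;

        dlist_foreach(iter, &rb->toplevel_by_lsn)
        {
            ReorderBufferTXN *txn =
                dlist_container(ReorderBufferTXN, node, iter.cur);

            if (largest == NULL || txn->size > largest->size)
                largest = txn;
        }

        return largest;
    }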

Thoughts?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Greg Stark
Date:
On 11 January 2018 at 19:41, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

> Two, what to do when the memory limit is reached.  With the old
> accounting, this was easy, because we'd decide for each subtransaction
> independently whether to spill it to disk, when it has reached its 4096
> limit.  Now, we are looking at a global limit, so we have to find a
> transaction to spill in some other way.  The proposed patch searches
> through the entire list of transactions to find the largest one.  But as
> the patch says:
>
> "XXX With many subtransactions this might be quite slow, because we'll
> have to walk through all of them. There are some options how we could
> improve that: (a) maintain some secondary structure with transactions
> sorted by amount of changes, (b) not looking for the entirely largest
> transaction, but e.g. for transaction using at least some fraction of
> the memory limit, and (c) evicting multiple transactions at once, e.g.
> to free a given portion of the memory limit (e.g. 50%)."

AIUI spilling to disk doesn't affect absorbing future updates; we
would just keep accumulating them in memory, right? We won't need to
unspill until it comes time to commit.

Is there any actual advantage to picking the largest transaction? It
means fewer spills and fewer unspills at commit time, but that's just a
bigger spike of I/O and more of a chance of spilling more than
necessary to get by. In the end it'll be more or less the same amount
of data read back, just all in one big spike when spilling and one big
spike when committing. If you spilled smaller transactions you would
have a small amount of I/O more frequently and have to read back small
amounts for many commits. But it would add up to the same amount of
I/O (or less, if you avoid spilling more than necessary).

The real aim should be to try to pick the transaction that will be
committed furthest in the future. That gives you the most memory to
use for live transactions for the longest time and could let you
process the maximum amount of transactions without spilling them. So
either the oldest transaction (in the expectation that it's been open
a while and appears to be a long-lived batch job that will stay open
for a long time) or the youngest transaction (in the expectation that
all transactions are more or less equally long-lived) might make
sense.



-- 
greg


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Peter Eisentraut
Date:
On 1/11/18 18:23, Greg Stark wrote:
> AIUI spilling to disk doesn't affect absorbing future updates, we
> would just keep accumulating them in memory right? We won't need to
> unspill until it comes time to commit.

Once a transaction has been serialized, future updates keep accumulating
in memory, until perhaps it gets serialized again.  But then at commit
time, if a transaction has been partially serialized at all, all the
remaining changes are also serialized before the whole thing is read
back in (see reorderbuffer.c line 855).

So one optimization would be to specially keep track of all transactions
that have been serialized already and pick those first for further
serialization, because it will be done eventually anyway.

But this is only a secondary optimization, because it doesn't help in
the extreme cases that either no (or few) transactions have been
serialized or all (or most) transactions have been serialized.

> The real aim should be to try to pick the transaction that will be
> committed furthest in the future. That gives you the most memory to
> use for live transactions for the longest time and could let you
> process the maximum amount of transactions without spilling them. So
> either the oldest transaction (in the expectation that it's been open
> a while and appears to be a long-lived batch job that will stay open
> for a long time) or the youngest transaction (in the expectation that
> all transactions are more or less equally long-lived) might make
> sense.

Yes, that makes sense.  We'd still need to keep a separate ordered list
of transactions somewhere, but that might be easier if we just order
them in the order we see them.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:

On 01/11/2018 08:41 PM, Peter Eisentraut wrote:
> On 12/22/17 23:57, Tomas Vondra wrote:
>> PART 1: adding logical_work_mem memory limit (0001)
>> ---------------------------------------------------
>>
>> Currently, limiting the amount of memory consumed by logical decoding is
>> tricky (or you might say impossible) for several reasons:
> 
> I would like to see some more discussion on this, but I think not a lot
> of people understand the details, so I'll try to write up an explanation
> here.  This code is also somewhat new to me, so please correct me if
> there are inaccuracies, while keeping in mind that I'm trying to simplify.
> 
> ... snip ...

Thanks for a comprehensive summary of the patch!

> 
> "XXX With many subtransactions this might be quite slow, because we'll
> have to walk through all of them. There are some options how we could
> improve that: (a) maintain some secondary structure with transactions
> sorted by amount of changes, (b) not looking for the entirely largest
> transaction, but e.g. for transaction using at least some fraction of
> the memory limit, and (c) evicting multiple transactions at once, e.g.
> to free a given portion of the memory limit (e.g. 50%)."
> 
> (a) would create more overhead for the case where everything fits into
> memory, so it seems unattractive.  Some combination of (b) and (c) seems
> useful, but we'd have to come up with something concrete.
> 

Yeah, when writing that comment I was worried that (a) might get rather
expensive. I was thinking about maintaining a dlist of transactions
sorted by size (ReorderBuffer now only has a hash table), so that we
could evict transactions from the beginning of the list.

But while that speeds up the choice of transactions to evict, the added
cost is rather high, particularly when most transactions are roughly of
the same size. Because in that case we probably have to move the nodes
around in the list quite often. So it seems wiser to just walk the list
once when looking for a victim.

What I'm thinking about instead is tracking just some approximated
version of this - it does not really matter whether we evict the really
largest transaction or one that is a couple of kilobytes smaller. What
we care about is an answer to this question:

    Is there some very large transaction that we could evict to free
    a lot of memory, or are all transactions fairly small?

So perhaps we can define some "size classes" and track to which of them
each transaction belongs. For example, we could split the memory limit
into 100 buckets, each representing a 1% size increment.

A transaction would not switch the class very often, and it would be
trivial to pick the largest transaction. When all the transactions are
squashed in the smallest classes, we may switch to some alternative
strategy. Not sure.
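
A minimal sketch of that bucketing idea (illustrative only, not part of
the patch; a transaction would only move between per-class lists when
its class changes, and picking a victim would mean scanning from the top
class down to the first non-empty one):

    /*
     * Sketch of the "size classes" idea: map a transaction's current
     * size to one of 100 buckets covering the memory limit.
     * Illustrative only.
     */
    #include "postgres.h"

    #define NUM_SIZE_CLASSES 100

    static int
    txn_size_class(Size txn_size, Size memory_limit)
    {
        Size    cls;

        Assert(memory_limit > 0);
        cls = (txn_size * NUM_SIZE_CLASSES) / memory_limit;

        /* transactions at or above the limit all land in the top class */
        return (int) Min(cls, NUM_SIZE_CLASSES - 1);
    }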

In any case, I don't really know how expensive the selection actually
is, and if it's an issue. I'll do some measurements.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:

On 01/12/2018 05:35 PM, Peter Eisentraut wrote:
> On 1/11/18 18:23, Greg Stark wrote:
>> AIUI spilling to disk doesn't affect absorbing future updates, we
>> would just keep accumulating them in memory right? We won't need to
>> unspill until it comes time to commit.
> 
> Once a transaction has been serialized, future updates keep accumulating
> in memory, until perhaps it gets serialized again.  But then at commit
> time, if a transaction has been partially serialized at all, all the
> remaining changes are also serialized before the whole thing is read
> back in (see reorderbuffer.c line 855).
> 
> So one optimization would be to specially keep track of all transactions
> that have been serialized already and pick those first for further
> serialization, because it will be done eventually anyway.
> 
> But this is only a secondary optimization, because it doesn't help in
> the extreme cases that either no (or few) transactions have been
> serialized or all (or most) transactions have been serialized.
> 
>> The real aim should be to try to pick the transaction that will be
>> committed furthest in the future. That gives you the most memory to
>> use for live transactions for the longest time and could let you
>> process the maximum amount of transactions without spilling them. So
>> either the oldest transaction (in the expectation that it's been open
>> a while and appears to be a long-lived batch job that will stay open
>> for a long time) or the youngest transaction (in the expectation that
>> all transactions are more or less equally long-lived) might make
>> sense.
> 
> Yes, that makes sense.  We'd still need to keep a separate ordered list
> of transactions somewhere, but that might be easier if we just order
> them in the order we see them.
> 

Wouldn't the 'toplevel_by_lsn' be suitable for this? Subtransactions
don't really commit independently, but as part of the toplevel xact. And
that list is ordered by LSN, which is pretty much exactly the order in
which we see the transactions.

I feel somewhat uncomfortable about evicting the oldest (or youngest)
transactions based on some assumed correlation with the commit
order. I'm pretty sure that will bite us badly for some workloads.

Another somewhat non-intuitive detail is that because ReorderBuffer
switched to Generation allocator for changes (which usually represent
99% of the memory used during decoding), it does not reuse memory the
way AllocSet does. Actually, it does not reuse memory at all, aiming to
eventually give the memory back to libc (which AllocSet can't do).

Because of this, evicting the youngest transactions seems like quite a
bad idea, because those chunks will not be reused and there may be other
chunks on the blocks, preventing their release.

Yeah, complicated stuff.

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Peter Eisentraut
Date:
On 1/12/18 23:19, Tomas Vondra wrote:
> Wouldn't the 'toplevel_by_lsn' be suitable for this? Subtransactions
> don't really commit independently, but as part of the toplevel xact. And
> that list is ordered by LSN, which is pretty much exactly the order in
> which we see the transactions.

Yes indeed.  There is even ReorderBufferGetOldestTXN().

> Another somewhat non-intuitive detail is that because ReorderBuffer
> switched to Generation allocator for changes (which usually represent
> 99% of the memory used during decoding), it does not reuse memory the
> way AllocSet does. Actually, it does not reuse memory at all, aiming to
> eventually give the memory back to libc (which AllocSet can't do).
> 
> Because of this evicting the youngest transactions seems like a quite
> bad idea, because those chunks will not be reused and there may be other
> chunks on the blocks, preventing their release.

Right.  But this raises the question of whether we are doing the memory
accounting at the right level.  If we are doing all this tracking based
on ReorderBufferChanges, but serializing changes possibly doesn't
actually free any memory in the operating system, that's no good.  Can
we get some usage statistics out of the memory context?  It seems like
we need to keep serializing transactions until we actually see the
memory context size drop.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
On 01/19/2018 03:34 PM, Tomas Vondra wrote:
> Attached is v5, fixing a silly bug in part 0006, causing segfault when
> creating a subscription.
> 

Meh, there was a bug in the sgml docs (<variable> vs. <varname>),
causing another failure. Hopefully v6 will pass the CI build; it does
pass a build with the same parameters on my system.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Masahiko Sawada
Date:
On Sat, Jan 20, 2018 at 7:08 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 01/19/2018 03:34 PM, Tomas Vondra wrote:
>> Attached is v5, fixing a silly bug in part 0006, causing segfault when
>> creating a subscription.
>>
>
> Meh, there was a bug in the sgml docs (<variable> vs. <varname>),
> causing another failure. Hopefully v6 will pass the CI build, it does
> pass a build with the same parameters on my system.

Thank you for working on this. This patch would be helpful for
synchronous replication.

I haven't looked at the code deeply yet, but I've reviewed the v6
patch set, especially on the subscriber side. All of the patches apply
cleanly to current HEAD. Here are my review comments.

----
CREATE SUBSCRIPTION commands accept work_mem < 64, but it leads to an ERROR
on the publisher side when starting replication. We should probably check
the value on the subscriber side as well.

----
When streaming = on, if we drop the subscription in the middle of
receiving streamed changes, DROP SUBSCRIPTION could leak tmp files
(the .changes file and the .subxacts file). The same also happens when a
transaction on the upstream aborts without an abort record.

----
Since we can change both the streaming option and the work_mem option with
ALTER SUBSCRIPTION, the documentation of ALTER SUBSCRIPTION needs to be updated.

----
If we create a subscription without any options, both
pg_subscription.substream and pg_subscription.subworkmem are set to
null. However, since GetSubscription isn't aware of NULL, we start the
replication with invalid options, as follows.
LOG:  received replication command: START_REPLICATION SLOT "hoge_sub"
LOGICAL 0/0 (proto_version '2', work_mem '893219954', streaming 'on',
publication_names '"hoge_pub"')

I think we can set substream to false and subworkmem to -1 instead of
null, and then make libpqrcv_startstreaming not set the streaming option
if stream is -1.

----
Some WARNING messages appeared. Maybe these are for debugging purposes?

WARNING:  updating stream stats 0x1c12ef8 4 3 65604
WARNING: UpdateSpillStats: updating stats 0x1c12ef8 0 0 0 39 41 2632080

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Peter Eisentraut
Date:
To close out this commit fest, I'm setting both of these patches as
returned with feedback, as there are apparently significant issues to be
addressed.  Feel free to move them to the next commit fest when you
think they are ready to be continued.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
On 01/31/2018 07:53 AM, Masahiko Sawada wrote:
> On Sat, Jan 20, 2018 at 7:08 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 01/19/2018 03:34 PM, Tomas Vondra wrote:
>>> Attached is v5, fixing a silly bug in part 0006, causing segfault when
>>> creating a subscription.
>>>
>>
>> Meh, there was a bug in the sgml docs (<variable> vs. <varname>),
>> causing another failure. Hopefully v6 will pass the CI build, it does
>> pass a build with the same parameters on my system.
> 
> Thank you for working on this. This patch would be helpful for
> synchronous replication.
> 
> I haven't looked at the code deeply yet, but I've reviewed the v6
> patch set especially on subscriber side. All of the patches can be
> applied to current HEAD cleanly. Here is review comment.
> 
> ----
> CREATE SUBSCRIPTION commands accept work_mem < 64, but it leads to an ERROR
> on publisher side when starting replication. Probably we should check
> the value on the subscriber side as well.
> 
> ----
> When streaming = on, if we drop subscription in the middle of
> receiving stream changes, DROP SUBSCRIPTION could leak tmp files
> (.changes file and .subxacts file). It also happens when a
> transaction on the upstream aborts without an abort record.
> 
> ----
> Since we can change both streaming option and work_mem option by ALTER
> SUBSCRIPTION, documentation of ALTER SUBSCRIPTION needs to be updated.
> 
> ----
> If we create a subscription without any options, both
> pg_subscription.substream and pg_subscription.subworkmem are set to
> null. However, since GetSubscription isn't aware of NULLs, we start the
> replication with invalid options like follows.
> LOG:  received replication command: START_REPLICATION SLOT "hoge_sub"
> LOGICAL 0/0 (proto_version '2', work_mem '893219954', streaming 'on',
> publication_names '"hoge_pub"')
> 
> I think we can set substream to false and subworkmem to -1 instead of
> null, and then makes libpqrcv_startstreaming not set streaming option
> if stream is -1.
> 
> ----
> Some WARNING messages appeared. Maybe these are for debug purpose?
> 
> WARNING:  updating stream stats 0x1c12ef8 4 3 65604
> WARNING: UpdateSpillStats: updating stats 0x1c12ef8 0 0 0 39 41 2632080
> 
> Regards,
> 

Thanks for the review! I'll address the issues in the next version of
the patch.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
On 02/01/2018 03:51 PM, Peter Eisentraut wrote:
> To close out this commit fest, I'm setting both of these patches as
> returned with feedback, as there are apparently significant issues to be
> addressed.  Feel free to move them to the next commit fest when you
> think they are ready to be continued.
> 

Will do. Thanks for the feedback.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Andres Freund
Date:
On 2018-02-01 23:51:55 +0100, Tomas Vondra wrote:
> On 02/01/2018 03:51 PM, Peter Eisentraut wrote:
> > To close out this commit fest, I'm setting both of these patches as
> > returned with feedback, as there are apparently significant issues to be
> > addressed.  Feel free to move them to the next commit fest when you
> > think they are ready to be continued.
> > 
> 
> Will do. Thanks for the feedback.

Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48,
but I don't see a newer version posted?

Greetings,

Andres Freund


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
On 03/02/2018 02:12 AM, Andres Freund wrote:
> On 2018-02-01 23:51:55 +0100, Tomas Vondra wrote:
>> On 02/01/2018 03:51 PM, Peter Eisentraut wrote:
>>> To close out this commit fest, I'm setting both of these patches as
>>> returned with feedback, as there are apparently significant issues to be
>>> addressed.  Feel free to move them to the next commit fest when you
>>> think they are ready to be continued.
>>>
>>
>> Will do. Thanks for the feedback.
> 
> Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48,
> but I don't see a newer version posted?
> 

Ah, apologies - that's due to moving the patch from the last CF (it was
marked as RWF so I had to reopen it before moving it). I'll submit a new
version of the patch shortly, please mark it as WOA until then.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
David Steele
Date:
Hi Tomas.

On 3/1/18 9:33 PM, Tomas Vondra wrote:
> On 03/02/2018 02:12 AM, Andres Freund wrote:
>> On 2018-02-01 23:51:55 +0100, Tomas Vondra wrote:
>>> On 02/01/2018 03:51 PM, Peter Eisentraut wrote:
>>>> To close out this commit fest, I'm setting both of these patches as
>>>> returned with feedback, as there are apparently significant issues to be
>>>> addressed.  Feel free to move them to the next commit fest when you
>>>> think they are ready to be continued.
>>>>
>>>
>>> Will do. Thanks for the feedback.
>>
>> Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48,
>> but I don't see a newer version posted?
>>
> 
> Ah, apologies - that's due to moving the patch from the last CF (it was
> marked as RWF so I had to reopen it before moving it). I'll submit a new
> version of the patch shortly, please mark it as WOA until then.

Marked as Waiting on Author.

-- 
-David
david@pgmasters.net


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Andres Freund
Date:
Hi,

On 2018-03-01 21:39:36 -0500, David Steele wrote:
> On 3/1/18 9:33 PM, Tomas Vondra wrote:
> > On 03/02/2018 02:12 AM, Andres Freund wrote:
> > > Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48,
> > > but I don't see a newer version posted?
> > > 
> > 
> > Ah, apologies - that's due to moving the patch from the last CF (it was
> > marked as RWF so I had to reopen it before moving it). I'll submit a new
> > version of the patch shortly, please mark it as WOA until then.
> 
> Marked as Waiting on Author.

Sorry to be the hard-ass, but given this patch hasn't been moved forward
since 2018-01-19, I'm not sure why it's eligible to be in this CF in the
first place?

Greetings,

Andres Freund


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Robert Haas
Date:
On Thu, Mar 1, 2018 at 9:33 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Ah, apologies - that's due to moving the patch from the last CF (it was
> marked as RWF so I had to reopen it before moving it). I'll submit a new
> version of the patch shortly, please mark it as WOA until then.

So, the way it's supposed to work is you resubmit the patch first and
then re-activate the CF entry.  If you get to re-activate the CF entry
without actually updating the patch, and then submit the patch
afterwards, then the CF deadline becomes largely meaningless.  I think
a new patch should be rejected as untimely.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
David Steele
Date:
On 3/2/18 3:06 PM, Robert Haas wrote:
> On Thu, Mar 1, 2018 at 9:33 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Ah, apologies - that's due to moving the patch from the last CF (it was
>> marked as RWF so I had to reopen it before moving it). I'll submit a new
>> version of the patch shortly, please mark it as WOA until then.
> 
> So, the way it's supposed to work is you resubmit the patch first and
> then re-activate the CF entry.  If you get to re-activate the CF entry
> without actually updating the patch, and then submit the patch
> afterwards, then the CF deadline becomes largely meaningless.  I think
> a new patch should be rejected as untimely.

Hmmm, I missed that implication last night.  I'll mark this Returned
with Feedback.

Tomas, please move to the next CF once you have an updated patch.

Thanks,
-- 
-David
david@pgmasters.net


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
On 03/02/2018 09:21 PM, David Steele wrote:
> On 3/2/18 3:06 PM, Robert Haas wrote:
>> On Thu, Mar 1, 2018 at 9:33 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> Ah, apologies - that's due to moving the patch from the last CF (it was
>>> marked as RWF so I had to reopen it before moving it). I'll submit a new
>>> version of the patch shortly, please mark it as WOA until then.
>>
>> So, the way it's supposed to work is you resubmit the patch first and
>> then re-activate the CF entry.  If you get to re-activate the CF entry
>> without actually updating the patch, and then submit the patch
>> afterwards, then the CF deadline becomes largely meaningless.  I think
>> a new patch should be rejected as untimely.
> 
> Hmmm, I missed that implication last night.  I'll mark this Returned
> with Feedback.
> 
> Tomas, please move to the next CF once you have an updated patch.
> 

Can you guys please point me to the CF rules that say this? Because my
understanding (and not just mine, AFAICS) was obviously different.
Clearly there's a disconnect somewhere.

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
Hi there,

attached is an updated patch fixing all the reported issues (a bit more
about those below).

The main change in this patch version is reworked logging of subxact
assignments, which needs to be done immediately for incremental decoding
to work properly.

The previous patch versions did that by logging a separate xlog record,
which however had rather noticeable space overhead (~40% on a worst-case
test - tiny table, no FPWs, ...). While in practice the overhead would
be much closer to 0%, it still seemed unacceptable.

Andres proposed doing something like we do with replication origins in
XLogRecordAssemble, i.e. inventing a special block, and embedding the
assignment info into that (in the next xlog record). This turned out to
work quite well, and the worst-case space overhead dropped to ~5%.
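
For illustration, a rough sketch of that part of XLogRecordAssemble(), where
scratch is the record-assembly buffer cursor; the block id and variable names
are assumptions, not necessarily what the patch uses:

/* attach the top-level XID to the next record, like replication origins */
if (include_toplevel_xid && TransactionIdIsValid(toplevel_xid))
{
    *(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
    memcpy(scratch, &toplevel_xid, sizeof(TransactionId));
    scratch += sizeof(TransactionId);
}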

I have attempted to do something like that with the invalidations, which
is the other thing that needs to be logged immediately for incremental
decoding to work correctly. The plan was to use the same approach as for
assignments, i.e. embed the invalidations into the next xlog record and
stop sending them in the commit message. That however turned out to be
much more complicated - the embedding is fairly trivial, of course, but
unlike assignments the invalidations are needed for hot standbys. If we
only send them incrementally, I think the standby would have to collect
them from the WAL records, and store them in a way that survives restarts.

So for invalidations the patch uses the original approach with a new
xlog record type (ignored by the standby), and still logs the
invalidations in the commit record (which is what the standby relies on).


On 02/01/2018 11:50 PM, Tomas Vondra wrote:
> On 01/31/2018 07:53 AM, Masahiko Sawada wrote:
> ...
>> ----
>> CREATE SUBSCRIPTION commands accept work_mem < 64, but it leads to an ERROR
>> on publisher side when starting replication. Probably we should check
>> the value on the subscriber side as well.
>>

Added.
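
(For illustration only, a sketch of what that check might look like on the
subscriber side; the variable names are assumptions, not the actual patch code:)

if (work_mem_given && work_mem < 64)
    ereport(ERROR,
            (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
             errmsg("work_mem must be at least 64 kB")));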

>> ----
>> When streaming = on, if we drop subscription in the middle of
>> receiving stream changes, DROP SUBSCRIPTION could leak tmp files
>> (.changes file and .subxacts file). It also happens when a
>> transaction on the upstream aborts without an abort record.
>>

Right. The files would get cleaned up eventually during restart (just
like other temporary files), but leaking them after DROP SUBSCRIPTION is
not cool. So I've added a simple tracking of files (or rather streamed
XIDs) in the worker, and clean them explicitly on exit.
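
Conceptually something like this sketch (stream_cleanup_files comes from the
patch, but its signature and the surrounding names here are assumptions):

static TransactionId *streamed_xids = NULL;
static int  nstreamed_xids = 0;

/* drop temporary files of all streamed transactions on worker exit */
static void
stream_files_on_exit(int code, Datum arg)
{
    for (int i = 0; i < nstreamed_xids; i++)
        stream_cleanup_files(MyLogicalRepWorker->subid, streamed_xids[i]);
}

/* registered once during apply worker startup */
static void
register_stream_cleanup(void)
{
    before_shmem_exit(stream_files_on_exit, (Datum) 0);
}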

>> ----
>> Since we can change both streaming option and work_mem option by ALTER
>> SUBSCRIPTION, documentation of ALTER SUBSCRIPTION needs to be updated.
>>

Yep, I've added a note that work_mem and streaming can also be changed.
Those changes won't be applied to the already running worker, though.

>> ----
>> If we create a subscription without any options, both
>> pg_subscription.substream and pg_subscription.subworkmem are set to
>> null. However, since GetSubscription isn't aware of NULLs, we start the
>> replication with invalid options like follows.
>> LOG:  received replication command: START_REPLICATION SLOT "hoge_sub"
>> LOGICAL 0/0 (proto_version '2', work_mem '893219954', streaming 'on',
>> publication_names '"hoge_pub"')
>>
>> I think we can set substream to false and subworkmem to -1 instead of
>> null, and then makes libpqrcv_startstreaming not set streaming option
>> if stream is -1.
>>

Good catch! I've done pretty much what you suggested here, i.e. store
-1/false instead and then handle that in libpqrcv_startstreaming.

>> ----
>> Some WARNING messages appeared. Maybe these are for debug purpose?
>>
>> WARNING:  updating stream stats 0x1c12ef8 4 3 65604
>> WARNING: UpdateSpillStats: updating stats 0x1c12ef8 0 0 0 39 41 2632080
>>

Yeah, those should be removed.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
On 03/02/2018 09:05 PM, Andres Freund wrote:
> Hi,
> 
> On 2018-03-01 21:39:36 -0500, David Steele wrote:
>> On 3/1/18 9:33 PM, Tomas Vondra wrote:
>>> On 03/02/2018 02:12 AM, Andres Freund wrote:
>>>> Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48,
>>>> but I don't see a newer version posted?
>>>>
>>>
>>> Ah, apologies - that's due to moving the patch from the last CF (it was
>>> marked as RWF so I had to reopen it before moving it). I'll submit a new
>>> version of the patch shortly, please mark it as WOA until then.
>>
>> Marked as Waiting on Author.
> 
> Sorry to be the hard-ass, but given this patch hasn't been moved forward
> since 2018-01-19, I'm not sure why it's eligible to be in this CF in the
> first place?
> 

That is somewhat misleading, I think. You're right the last version was
submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e.
right at the end of the CF. So it's not like the patch was sitting there
with unresolved issues. Based on that review the patch was marked as RWF
and thus not moved to 2018-03 automatically.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Andres Freund
Date:
On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:
> That is somewhat misleading, I think. You're right the last version was
> submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e.
> right at the end of the CF. So it's not like the patch was sitting there
> with unresolved issues. Based on that review the patch was marked as RWF
> and thus not moved to 2018-03 automatically.

I don't see how this changes anything.

- Andres


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
On 03/03/2018 02:01 AM, Andres Freund wrote:
> On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:
>> That is somewhat misleading, I think. You're right the last version
>> was submitted on 2018-01-19, but the next review arrived on
>> 2018-01-31, i.e. right at the end of the CF. So it's not like the
>> patch was sitting there with unresolved issues. Based on that
>> review the patch was marked as RWF and thus not moved to 2018-03
>> automatically.
> 
> I don't see how this changes anything.
> 

You've used "The patch hasn't moved forward since 2018-01-19," as an
argument why the patch is not eligible for 2018-03. I suggest that
argument is misleading, because patches generally do not move without
reviews, and it's difficult to respond to a review that arrives on the
last day of a commitfest.

Consider that without the review, the patch would end up with NR status,
and would be moved to the next CF automatically. Isn't that a bit weird?


kind regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Andres Freund
Date:
On 2018-03-03 02:34:06 +0100, Tomas Vondra wrote:
> On 03/03/2018 02:01 AM, Andres Freund wrote:
> > On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:
> >> That is somewhat misleading, I think. You're right the last version
> >> was submitted on 2018-01-19, but the next review arrived on
> >> 2018-01-31, i.e. right at the end of the CF. So it's not like the
> >> patch was sitting there with unresolved issues. Based on that
> >> review the patch was marked as RWF and thus not moved to 2018-03
> >> automatically.
> > 
> > I don't see how this changes anything.
> > 
> 
> You've used "The patch hasn't moved forward since 2018-01-19," as an
> argument why the patch is not eligible for 2018-03. I suggest that
> argument is misleading, because patches generally do not move without
> reviews, and it's difficult to respond to a review that arrives on the
> last day of a commitfest.
> 
> Consider that without the review, the patch would end up with NR status,
> and would be moved to the next CF automatically. Isn't that a bit weird?

Not sure I follow. The point is that nobody would have complained if
you'd moved the patch into this fest if you'd updated it *before* it
started?

Greetings,

Andres Freund


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
David Steele
Date:
On 3/2/18 8:01 PM, Andres Freund wrote:
> On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:
>> That is somewhat misleading, I think. You're right the last version was
>> submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e.
>> right at the end of the CF. So it's not like the patch was sitting there
>> with unresolved issues. Based on that review the patch was marked as RWF
>> and thus not moved to 2018-03 automatically.
> 
> I don't see how this changes anything.

I agree that things could be clearer, and Andres has produced a great
document that we can build on.  The old one had gotten a bit stale.

However, I think it's pretty obvious that a CF entry should be
accompanied with a patch.  It sounds like the timing was awkward but you
still had 28 days to produce a new patch.

I also notice that you submitted 7 patches in this CF but are reviewing
zero.

-- 
-David
david@pgmasters.net


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
On 03/03/2018 02:37 AM, David Steele wrote:
> On 3/2/18 8:01 PM, Andres Freund wrote:
>> On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:
>>> That is somewhat misleading, I think. You're right the last version was
>>> submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e.
>>> right at the end of the CF. So it's not like the patch was sitting there
>>> with unresolved issues. Based on that review the patch was marked as RWF
>>> and thus not moved to 2018-03 automatically.
>>
>> I don't see how this changes anything.
> 
> I agree that things could be clearer, and Andres has produced a great
> document that we can build on.  The old one had gotten a bit stale.
> 
> However, I think it's pretty obvious that a CF entry should be 
> accompanied with a patch. It sounds like the timing was awkward but
> you still had 28 days to produce a new patch.
> 

Based on internal discussion I'm not so sure about the "pretty obvious"
part. It certainly wasn't that obvious to me, otherwise I'd submit the
revised patch earlier - hindsight is 20/20.

> I also notice that you submitted 7 patches in this CF but are
> reviewing zero.
> 

I've volunteered to review a couple of patches at the FOSDEM Developer
Meeting - I thought Stephen was entering that into the CF app, not sure
where it got lost.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
David Steele
Date:
On 3/2/18 8:54 PM, Tomas Vondra wrote:
> On 03/03/2018 02:37 AM, David Steele wrote:
>> On 3/2/18 8:01 PM, Andres Freund wrote:
>>> On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:
>>>> That is somewhat misleading, I think. You're right the last version was
>>>> submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e.
>>>> right at the end of the CF. So it's not like the patch was sitting there
>>>> with unresolved issues. Based on that review the patch was marked as RWF
>>>> and thus not moved to 2018-03 automatically.
>>>
>>> I don't see how this changes anything.
>>
>> I agree that things could be clearer, and Andres has produced a great
>> document that we can build on.  The old one had gotten a bit stale.
>>
>> However, I think it's pretty obvious that a CF entry should be 
>> accompanied with a patch. It sounds like the timing was awkward but
>> you still had 28 days to produce a new patch.
> 
> Based on internal discussion I'm not so sure about the "pretty obvious"
> part. It certainly wasn't that obvious to me, otherwise I'd submit the
> revised patch earlier - hindsight is 20/20.

Indeed it is.  Be assured that nobody takes pleasure in pushing patches,
but we have limited resources and must make some choices.

>> I also notice that you submitted 7 patches in this CF but are
>> reviewing zero.
> 
> I've volunteered to review a couple of patches at the FOSDEM Developer
> Meeting - I thought Stephen was entering that into the CF app, not sure
> where it got lost.

There are plenty of patches that need review, so go for it.

Regards,
-- 
-David
david@pgmasters.net


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Erik Rijkers
Date:
On 2018-03-03 01:55, Tomas Vondra wrote:
> Hi there,
> 
> attached is an updated patch fixing all the reported issues (a bit more
> about those below).

Hi,

0007-Track-statistics-for-streaming-spilling.patch  won't apply.  All 
the other patches apply ok.

patch complains with:

patching file doc/src/sgml/monitoring.sgml
patching file src/backend/catalog/system_views.sql
Hunk #1 succeeded at 734 (offset 2 lines).
patching file src/backend/replication/logical/reorderbuffer.c
patching file src/backend/replication/walsender.c
patching file src/include/catalog/pg_proc.h
Hunk #1 FAILED at 2903.
1 out of 1 hunk FAILED -- saving rejects to file 
src/include/catalog/pg_proc.h.rej
patching file src/include/replication/reorderbuffer.h
patching file src/include/replication/walsender_private.h
patching file src/test/regress/expected/rules.out
Hunk #1 succeeded at 1861 (offset 2 lines).

Attached the produced reject file.


thanks,

Erik Rijkers
Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:

On 03/03/2018 06:19 AM, Erik Rijkers wrote:
> On 2018-03-03 01:55, Tomas Vondra wrote:
>> Hi there,
>>
>> attached is an updated patch fixing all the reported issues (a bit more
>> about those below).
> 
> Hi,
> 
> 0007-Track-statistics-for-streaming-spilling.patch  won't apply.  All
> the other patches apply ok.
> 
> patch complains with:
> 
> patching file doc/src/sgml/monitoring.sgml
> patching file src/backend/catalog/system_views.sql
> Hunk #1 succeeded at 734 (offset 2 lines).
> patching file src/backend/replication/logical/reorderbuffer.c
> patching file src/backend/replication/walsender.c
> patching file src/include/catalog/pg_proc.h
> Hunk #1 FAILED at 2903.
> 1 out of 1 hunk FAILED -- saving rejects to file
> src/include/catalog/pg_proc.h.rej
> patching file src/include/replication/reorderbuffer.h
> patching file src/include/replication/walsender_private.h
> patching file src/test/regress/expected/rules.out
> Hunk #1 succeeded at 1861 (offset 2 lines).
> 
> Attached the produced reject file.
> 

Yeah, that's due to fd1a421fe66 which changed columns in pg_proc.h.
Attached is a rebased patch, fixing this.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Peter Eisentraut
Date:
I think this patch is not going to be ready for PG11.

- It depends on some work in the thread "logical decoding of two-phase
transactions", which is still in progress.

- Various details in the logical_work_mem patch (0001) are unresolved.

- This being partially a performance feature, we haven't seen any
performance tests (e.g., which settings result in which latencies under
which workloads).

That said, the feature seems useful and desirable, and the
implementation makes sense.  There are documentation and tests.  But
there is a significant amount of design and coding work still necessary.

Attached is a fixup patch that I needed to make it compile.

The last two patches in your series (0008, 0009) are labeled as bug
fixes.  Would you like to argue that they should be applied
independently of the rest of the feature?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Konstantin Knizhnik
Date:

On 11.01.2018 22:41, Peter Eisentraut wrote:
> On 12/22/17 23:57, Tomas Vondra wrote:
>> PART 1: adding logical_work_mem memory limit (0001)
>> ---------------------------------------------------
>>
>> Currently, limiting the amount of memory consumed by logical decoding is
>> tricky (or you might say impossible) for several reasons:
> I would like to see some more discussion on this, but I think not a lot
> of people understand the details, so I'll try to write up an explanation
> here.  This code is also somewhat new to me, so please correct me if
> there are inaccuracies, while keeping in mind that I'm trying to simplify.
>
> The data in the WAL is written as it happens, so the changes belonging
> to different transactions are all mixed together.  One of the jobs of
> logical decoding is to reassemble the changes belonging to each
> transaction.  The top-level data structure for that is the infamous
> ReorderBuffer.  So as it reads the WAL and sees something about a
> transaction, it keeps a copy of that change in memory, indexed by
> transaction ID (ReorderBufferChange).  When the transaction commits, the
> accumulated changes are passed to the output plugin and then freed.  If
> the transaction aborts, then changes are just thrown away.
>
> So when logical decoding is active, a copy of the changes for each
> active transaction is kept in memory (once per walsender).
>
> More precisely, the above happens for each subtransaction.  When the
> top-level transaction commits, it finds all its subtransactions in the
> ReorderBuffer, reassembles everything in the right order, then invokes
> the output plugin.
>
> All this could end up using an unbounded amount of memory, so there is a
> mechanism to spill changes to disk.  The way this currently works is
> hardcoded, and this patch proposes to change that.
>
> Currently, when a transaction or subtransaction has accumulated 4096
> changes, it is spilled to disk.  When the top-level transaction commits,
> things are read back from disk to do the final processing mentioned above.
>
> This all works mostly fine, but you can construct some more extreme
> cases where this can blow up.
>
> Here is a mundane example.  Let's say a change entry takes 100 bytes (it
> might contain a new row, or an update key and some new column values,
> for example).  If you have 100 concurrent active sessions and no
> subtransactions, then logical decoding memory is bounded by 4096 * 100 *
> 100 = 40 MB (per walsender) before things spill to disk.
>
> Now let's say you are using a lot of subtransactions, because you are
> using PL functions, exception handling, triggers, doing batch updates.
> If you have 200 subtransactions on average per concurrent session, the
> memory usage bound in that case would be 4096 * 100 * 100 * 200 = 8 GB
> (per walsender).  And so on.  If you have more concurrent sessions or
> larger changes or more subtransactions, you'll use much more than those
> 8 GB.  And if you don't have those 8 GB, then you're stuck at this point.
>
> That is the consideration when we record changes, but we also need
> memory when we do the final processing at commit time.  That is slightly
> less problematic because we only process one top-level transaction at a
> time, so the formula is only 4096 * avg_size_of_changes * nr_subxacts
> (without the concurrent sessions factor).
>
> So, this patch proposes to improve this as follows:
>
> - We compute the actual size of each ReorderBufferChange and keep a
> running tally for each transaction, instead of just counting the number
> of changes.
>
> - We have a configuration setting that allows us to change the limit
> instead of the hardcoded 4096.  The configuration setting is also in
> terms of memory, not in number of changes.
>
> - The configuration setting is for the total memory usage per decoding
> session, not per subtransaction.  (So we also keep a running tally for
> the entire ReorderBuffer.)
>
> There are two open issues with this patch:
>
> One, this mechanism only applies when recording changes.  The processing
> at commit time still uses the previous hardcoded mechanism.  The reason
> for this is, AFAIU, that as things currently work, you have to have all
> subtransactions in memory to do the final processing.  There are some
> proposals to change this as well, but they are more involved.  Arguably,
> per my explanation above, memory use at commit time is less likely to be
> a problem.
>
> Two, what to do when the memory limit is reached.  With the old
> accounting, this was easy, because we'd decide for each subtransaction
> independently whether to spill it to disk, when it has reached its 4096
> limit.  Now, we are looking at a global limit, so we have to find a
> transaction to spill in some other way.  The proposed patch searches
> through the entire list of transactions to find the largest one.  But as
> the patch says:
>
> "XXX With many subtransactions this might be quite slow, because we'll
> have to walk through all of them. There are some options how we could
> improve that: (a) maintain some secondary structure with transactions
> sorted by amount of changes, (b) not looking for the entirely largest
> transaction, but e.g. for transaction using at least some fraction of
> the memory limit, and (c) evicting multiple transactions at once, e.g.
> to free a given portion of the memory limit (e.g. 50%)."
>
> (a) would create more overhead for the case where everything fits into
> memory, so it seems unattractive.  Some combination of (b) and (c) seems
> useful, but we'd have to come up with something concrete.
>
> Thoughts?
>

I am very sorry that I did not notice this thread before.
Spilling to files in the reorder buffer is the main factor limiting the
speed of importing data in multimaster and shardman (sharding based on
FDW with redundancy provided by LR).
This is why we think a lot about possible ways of addressing this issue.
Right now the data of a huge transaction is written to disk three times
before it is applied at the replica, and obviously it is also read three
times. First it is saved in the WAL, then spilled to disk by the reorder
buffer, and once again spilled to disk at the replica before being
assigned to a particular apply worker (the last step is specific to
multimaster, which can apply received transactions concurrently).

We considered three different approaches:
1. Streaming. It is similar to the proposed patch; the main difference
is that we do not want to spill the transaction to a temporary file at
the replica, but apply it immediately in a separate backend and abort
that transaction if it is aborted at the master. Certainly it will only
work with 2PC.
2. Elimination of spilling by rescanning the WAL.
3. Bypassing the WAL: add hooks to heapam to buffer and propagate changes
immediately to the replica and apply them in a dedicated backend.
I have implemented a prototype of such replication. With one replica it
shows about a 1.5x slowdown compared with standalone/async LR and about a
2-3x improvement compared with sync LR. For two replicas the result is 2x
slower than async LR and 2-8 times faster than sync LR (depending on the
number of concurrent connections).

Approach 3) seems to be specific to multimaster/shardman, so most likely
it cannot be considered for general LR.
So I want to compare 1 and 2. Did you ever think about something like 2?

Right now the proposed patch just moves the spilling to files from the
master to the replica. It can still make sense to avoid memory overflow
and reduce disk IO at the master. But if we have just one huge
transaction (COPY) importing gigabytes of data into the database, then
performance will be almost the same with or without your patch. The only
difference is where we serialize the transaction: at the master or at the
replica side. In this sense the patch doesn't solve the problem of slow
loading of large bulks of data through LR.

Alternatively (approach 2) we can have a small in-memory buffer for
decoding the transaction and remember the LSN and snapshot of this
transaction's start. In case of buffer overflow we just continue the WAL
traversal until we reach the end of the transaction. After that we
restart scanning the WAL from the beginning of this transaction and, in
this second pass, send the changes directly to the output plugin. So we
have to scan the WAL several times, but we do not need to spill anything
to disk, neither at the publisher nor at the subscriber side.
Certainly this approach will be inefficient if we have several long
interleaving transactions. But in most customer use cases we have
observed until now there is just one huge transaction performing a bulk
load.
Maybe I missed something, but this approach seems easier to implement
than transaction streaming, and it doesn't require any changes to the
output plugin API.
I realize that it is a little bit late to ask this question once your
patch is almost ready, but what do you think about it? Are there some
pitfalls with this approach?

There is one more aspect and performance problem with LR we have faced
with shardman: if there are several publications for different subsets of
tables at one instance, then the WAL senders have to do a lot of useless
work. They are decoding transactions which have no relation to their
publication, but a WAL sender doesn't know that until it reaches the end
of the transaction. What is worse: if the transaction is huge, then all
WAL senders will spill it to disk even though only one of them actually
needs it. So the data of a huge transaction is written not three times,
but N times, where N is the number of publications. The only solution to
the problem we can imagine is to let the backend somehow inform the WAL
sender (through a shared message queue?) about the LSNs it should
consider. In this case the WAL sender can skip large portions of WAL
without decoding. We would also like to know 2ndQuadrant's opinion about
this idea.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Peter Eisentraut
Date:
This patch set was not updated for the 2018-07 commitfest, so moved to -09.


On 09.03.18 17:07, Peter Eisentraut wrote:
> I think this patch is not going to be ready for PG11.
> 
> - It depends on some work in the thread "logical decoding of two-phase
> transactions", which is still in progress.
> 
> - Various details in the logical_work_mem patch (0001) are unresolved.
> 
> - This being partially a performance feature, we haven't seen any
> performance tests (e.g., which settings result in which latencies under
> which workloads).
> 
> That said, the feature seems useful and desirable, and the
> implementation makes sense.  There are documentation and tests.  But
> there is a significant amount of design and coding work still necessary.
> 
> Attached is a fixup patch that I needed to make it compile.
> 
> The last two patches in your series (0008, 0009) are labeled as bug
> fixes.  Would you like to argue that they should be applied
> independently of the rest of the feature?
> 


-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Michael Paquier
Date:
On Sat, Mar 03, 2018 at 03:52:40PM +0100, Tomas Vondra wrote:
> Yeah, that's due to fd1a421fe66 which changed columns in pg_proc.h.
> Attached is a rebased patch, fixing this.

The latest patch set does not apply anymore, and had no activity for the
last two months, so I am marking it as returned with feedback.
--
Michael

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
Hi,

Attached is an updated version of this patch series. It's meant to be
applied on top of the 2pc decoding patch [1], because streaming of
in-progress transactions requires handling of concurrent aborts. So it
may or may not apply directly to master, I'm not sure - unfortunately
that's likely to confuse the cputube thing, but I don't want to include
the 2pc decoding bits here because that would be just confusing.

If needed, the part introducing logical_work_mem limit for ReorderBuffer
can be separated and committed independently, but I do expect this to be
committed after the 2pc decoding patch so I've left it like this.

This new version is mostly just a rebase to current master (or almost,
because 2pc decoding only applies to 29180e5d78 due to minor bitrot),
but it also addresses the new stuff committed since last version (most
importantly decoding of TRUNCATE). It also fixes a bug in WAL-logging of
subxact assignments, where the assignment was included in records with
XID=0, essentially failing to track the subxact properly.

For the logical_work_mem part, I think this is quite solid. The main
question is how to pick transactions for eviction. For now it uses the
same approach as master (i.e. picking the largest top-level transaction,
although measured by amount of memory and not just number of changes).

But I've realized that may not work that great with the Generation
context, because unlike AllocSet it does not reuse the memory. That's
nice as it allows freeing old blocks (which AllocSet can't), but it means
a small transaction can have a change on an old block, preventing its
free(). That is something we have in pg11 already, because that's where
the Generation context got introduced - I haven't seen this issue in
practice, but we might need to do something about it.

In any case, I'm thinking we may need to pick a different eviction
algorithm - say using a transaction with the oldest change (and loop
until we release at least one block in the Generation context), or maybe
look for block mixing changes from the smallest number of transactions,
or something like that. Other ideas are welcome. I don't think the exact
algorithm is particularly critical, because it's meant to be triggered
only very rarely (i.e. pick logical_work_mem high enough).
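
To make that concrete, a sketch of the eviction loop as described above
(function and field names are modeled on the patch, so treat them as
assumptions); the alternative algorithms would only change how the victim
transaction is picked:

static void
ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
{
    /* keep evicting until the accounted size drops below the limit */
    while (rb->size >= (Size) logical_work_mem * 1024)
    {
        ReorderBufferTXN *txn = ReorderBufferLargestTXN(rb);

        if (txn == NULL)
            break;              /* nothing left to evict */

        /* spill the victim's changes; this reduces the accounted size */
        ReorderBufferSerializeTXN(rb, txn);
    }
}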

The in-progress streaming is mostly mechanical extension of existing
functionality (new methods in various APIs, ...) and refactoring of
ReorderBuffer to handle incremental decoding. I'm sure it'd benefit from
reviews, of course.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
FWIW the original CF entry in 2018-07 [1] was marked as RWF. I'm not
sure what's the right way to resubmit such patches, so I've created a
new entry in 2019-01 [2] referencing the same hackers thread (and with
the same authors/reviewers metadata).

[1] https://commitfest.postgresql.org/19/1429/
[2] https://commitfest.postgresql.org/21/1927/

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Alexey Kondratov
Date:
Hi Tomas,

> This new version is mostly just a rebase to current master (or almost,
> because 2pc decoding only applies to 29180e5d78 due to minor bitrot),
> but it also addresses the new stuff committed since last version (most
> importantly decoding of TRUNCATE). It also fixes a bug in WAL-logging of
> subxact assignments, where the assignment was included in records with
> XID=0, essentially failing to track the subxact properly.

I started reviewing your patch about a month ago and tried to do an 
in-depth review, since I am very interested in this patch too. The new 
version is not applicable to master 29180e5d78, but everything is OK 
after applying 2pc patch before. Anyway, I guess it may complicate 
further testing and review, since any potential reviewer has to take 
into account both patches at once. Previous version was applicable to 
master and was working fine for me separately (excepting a few 
patch-specific issues, which I try to explain below).


Patch review
========

First of all, I want to say thank you for such a huge work done. Here 
are some problems, which I have found and hopefully fixed with my 
additional patch (please, find attached, it should be applicable to the 
last commit of your newest patch version):

1) The most important issue is that your tap tests were broken: the
option "WITH (streaming=true)" was missing in the subscription creation
statement. Therefore, the spilling mechanism has been tested rather than
streaming.

2) After fixing the tests, the first one with simple streaming
immediately fails because of a logical replication worker segmentation
fault. It happens since the worker tries to call stream_cleanup_files
inside stream_open_file at the stream start while nxids is zero; it then
goes negative and everything crashes. Something similar may happen with
the xids array, so I added two checks there.

3) The next problem is much more critical and concerns historic MVCC
visibility rules. Previously, the walsender started to decode a
transaction at commit and we were able to resolve all xmin, xmax,
combocids to cmin/cmax, build the tuplecids hash and so on, but now we
start doing all these things on the fly.

Thus, a rather difficult situation arises: HeapTupleSatisfiesHistoricMVCC
is trying to validate catalog tuples which are currently in the future
relative to the current decoder position inside the transaction, e.g. we
may want to resolve cmin/cmax of a tuple, which was created with cid 3 
and deleted with cid 5, while we are currently at cid 4, so our 
tuplecids hash is not complete to handle such a case.

I have updated HeapTupleSatisfiesHistoricMVCC visibility rules with two 
options:

/*
 * If we accidentally see a tuple from our transaction but cannot resolve
 * its cmin, it is probably from the future, thus drop it.
 */
if (!resolved)
    return false;

and

/*
 * If we accidentally see a tuple from our transaction but cannot resolve
 * its cmax, or cmax == InvalidCommandId, it is probably still valid,
 * thus accept it.
 */
if (!resolved || cmax == InvalidCommandId)
    return true;

4) There was a problem with marking top-level transaction as having 
catalog changes if one of its subtransactions has. It was causing a 
problem with DDL statements just after subtransaction start (savepoint), 
so data from new columns is not replicated.

5) Similar issue with schema send. You send schema only once per each 
sub/transaction (IIRC), while we have to update schema on each catalog 
change: invalidation execution, snapshot rebuild, adding new tuple cids. 
So I ended up with adding is_schema_send flag to ReorderBufferTXN, since 
it is easy to set it inside RB and read in the output plugin. Probably, 
we have to choose a better place for this flag.

6) To better handle all these tricky cases I added a new tap test,
014_stream_tough_ddl.pl, which consists of a really tough combination
of DDL, DML, savepoints and ROLLBACK/RELEASE in a single transaction.

I marked all my fixes and every questionable place with a comment and a
"TOCHECK:" label for easy search. Removing pretty much any of these fixes
leads to test failures due to segmentation faults or replication
mismatches. Though I mostly read and tested the old version of the patch,
after a quick look it seems that all these fixes are applicable to the
new version of the patch as well.


Performance
========

I have also performed a series of performance tests, and found that 
patch adds a huge overhead in the case of a large transaction consisting 
of many small rows, e.g.:

CREATE TABLE large_test (num1 bigint, num2 double precision, num3 double 
precision);

EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3)
SELECT round(random()*10), random(), random()*142
FROM generate_series(1, 1000000) s(i);

Execution Time: 2407.709 ms
Total Time: 11494,238 ms (00:11,494)

With synchronous_standby_names and 64 MB logical_work_mem it takes up to
5x longer, while without the patch it is about 2x. Thus, logical
replication streaming is approximately 4x slower for similar
transactions.

However, dealing with large transactions consisting of a small number of 
large rows is much better:

CREATE TABLE large_text (t TEXT);

EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 1000000)) FROM generate_series(1, 125);

Execution Time: 3545.642 ms
Total Time: 7678,617 ms (00:07,679)

It is around the same 2x as without the patch. If someone is interested,
I have also added flame graphs of the walsender and the logical
replication worker during the processing of the first (numeric)
transaction.


Regards

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company


Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
Hi Alexey,

Thanks for the thorough and extremely valuable review!

On 12/17/18 5:23 PM, Alexey Kondratov wrote:
> Hi Tomas,
> 
>> This new version is mostly just a rebase to current master (or almost,
>> because 2pc decoding only applies to 29180e5d78 due to minor bitrot),
>> but it also addresses the new stuff committed since last version (most
>> importantly decoding of TRUNCATE). It also fixes a bug in WAL-logging of
>> subxact assignments, where the assignment was included in records with
>> XID=0, essentially failing to track the subxact properly.
> 
> I started reviewing your patch about a month ago and tried to do an
> in-depth review, since I am very interested in this patch too. The new
> version is not applicable to master 29180e5d78, but everything is OK
> after applying 2pc patch before. Anyway, I guess it may complicate
> further testing and review, since any potential reviewer has to take
> into account both patches at once. Previous version was applicable to
> master and was working fine for me separately (excepting a few
> patch-specific issues, which I try to explain below).
> 

I agree it's somewhat annoying, but I don't think there's a better way,
unfortunately. Decoding in-progress transactions does require safe
handling of concurrent aborts, so it has to be committed after the 2pc
decoding patch (which makes that possible). But the 2pc patch also
touches the same places as this patch series (it reworks the reorder
buffer for example).

> 
> Patch review
> ========
> 
> First of all, I want to say thank you for such a huge work done. Here
> are some problems, which I have found and hopefully fixed with my
> additional patch (please, find attached, it should be applicable to the
> last commit of your newest patch version):
> 
> 1) The most important issue is that your tap tests were broken—there was
> missing option "WITH (streaming=true)" in the subscription creating
> statement. Therefore, spilling mechanism has been tested rather than
> streaming.
> 

D'oh!

> 2) After fixing tests the first one with simple streaming is immediately
> failed, because of logical replication worker segmentation fault. It
> happens, since worker tries to call stream_cleanup_files inside
> stream_open_file at the stream start, while nxids is zero, then it goes
> to the negative value and everything crashes. Something similar may
> happen with xids array, so I added two checks there.
> 
> 3) The next problem is much more critical and is dedicated to historic
> MVCC visibility rules. Previously, walsender was starting to decode
> transaction on commit and we were able to resolve all xmin, xmax,
> combocids to cmin/cmax, build tuplecids hash and so on, but now we start
> doing all these things on the fly.
> 
> Thus, rather difficult situation arises: HeapTupleSatisfiesHistoricMVCC
> is trying to validate catalog tuples, which are currently in the future
> relatively to the current decoder position inside transaction, e.g. we
> may want to resolve cmin/cmax of a tuple, which was created with cid 3
> and deleted with cid 5, while we are currently at cid 4, so our
> tuplecids hash is not complete to handle such a case.
> 

Damn it! I ran into those two issues some time ago and I fixed it, but
I've forgotten to merge that fix into the patch. I'll merge those fixes
and compare them to your proposed fix, and send a new version tomorrow.

> 
> 4) There was a problem with marking top-level transaction as having
> catalog changes if one of its subtransactions has. It was causing a
> problem with DDL statements just after subtransaction start (savepoint),
> so data from new columns is not replicated.
> 
> 5) Similar issue with schema send. You send schema only once per each
> sub/transaction (IIRC), while we have to update schema on each catalog
> change: invalidation execution, snapshot rebuild, adding new tuple cids.
> So I ended up with adding is_schema_send flag to ReorderBufferTXN, since
> it is easy to set it inside RB and read in the output plugin. Probably,
> we have to choose a better place for this flag.
> 

Hmm. Can you share an example how to trigger these issues?

> 6) To better handle all these tricky cases I added new tap
> test—014_stream_tough_ddl.pl—which consist of really tough combination
> of DDL, DML, savepoints and ROLLBACK/RELEASE in a single transaction.
> 

Thanks!

> I marked all my fixes and every questionable place with comment and
> "TOCHECK:" label for easy search. Removing of pretty any of these fixes
> leads to the tests fail due to the segmentation fault or replication
> mismatch. Though I mostly read and tested old version of patch, but
> after a quick look it seems that all these fixes are applicable to the
> new version of patch as well.
> 

Thanks. I'll go through your patch tomorrow.

> 
> Performance
> ========
> 
> I have also performed a series of performance tests, and found that
> patch adds a huge overhead in the case of a large transaction consisting
> of many small rows, e.g.:
> 
> CREATE TABLE large_test (num1 bigint, num2 double precision, num3 double
> precision);
> 
> EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3)
> SELECT round(random()*10), random(), random()*142
> FROM generate_series(1, 1000000) s(i);
> 
> Execution Time: 2407.709 ms
> Total Time: 11494,238 ms (00:11,494)
> 
> With synchronous_standby_names and 64 MB logical_work_mem it takes up to
> x5 longer, while without patch it is about x2. Thus, logical replication
> streaming is approximately x4 as slower for similar transactions.
> 
> However, dealing with large transactions consisting of a small number of
> large rows is much better:
> 
> CREATE TABLE large_text (t TEXT);
> 
> EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_text
> SELECT (SELECT string_agg('x', ',')
> FROM generate_series(1, 1000000)) FROM generate_series(1, 125);
> 
> Execution Time: 3545.642 ms
> Total Time: 7678,617 ms (00:07,679)
> 
> It is around the same x2 as without patch. If someone is interested I
> also added flamegraphs of walsender and logical replication worker
> during first numerical transaction processing.
> 

Interesting. Any idea where does the extra overhead in this particular
case come from? It's hard to deduce that from the single flame graph,
when I don't have anything to compare it with (i.e. the flame graph for
the "normal" case).

I'll investigate this (probably not this week), but in general it's good
to keep in mind a couple of things:

1) Some overhead is expected, due to doing things incrementally.

2) The memory limit should be set to sufficiently high value to be hit
only infrequently.

3) And when the limit is actually hit, it's an alternative to spilling
large amounts of data locally (to disk) or incurring significant
replication lag later.

So I'm not particularly worried, but I'll look into that. I'd be much
more worried if there was measurable overhead in cases when there's no
streaming happening (either because it's disabled or the memory limit
was not hit).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Alexey Kondratov
Date:
On 18.12.2018 1:28, Tomas Vondra wrote:
>> 4) There was a problem with marking top-level transaction as having
>> catalog changes if one of its subtransactions has. It was causing a
>> problem with DDL statements just after subtransaction start (savepoint),
>> so data from new columns is not replicated.
>>
>> 5) Similar issue with schema send. You send schema only once per each
>> sub/transaction (IIRC), while we have to update schema on each catalog
>> change: invalidation execution, snapshot rebuild, adding new tuple cids.
>> So I ended up with adding is_schema_send flag to ReorderBufferTXN, since
>> it is easy to set it inside RB and read in the output plugin. Probably,
>> we have to choose a better place for this flag.
>>
> Hmm. Can you share an example how to trigger these issues?

Test cases inside 014_stream_tough_ddl.pl and old ones (with 
streaming=true option added) should reproduce all these issues. In 
general, it happens in a txn like:

INSERT
SAVEPOINT
ALTER TABLE ... ADD COLUMN
INSERT

then the second insert may discover old version of catalog.

> Interesting. Any idea where does the extra overhead in this particular
> case come from? It's hard to deduce that from the single flame graph,
> when I don't have anything to compare it with (i.e. the flame graph for
> the "normal" case).

I guess the bottleneck is in disk operations. You can check the
logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
writes (~26%) take around 35% of CPU time in total. For comparison,
please see the attached flame graph for the following transaction:

INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);

Execution Time: 44519.816 ms
Time: 98333,642 ms (01:38,334)

where disk IO is only ~7-8% in total. So we get very roughly the same
~4-5x performance drop here. JFYI, I am using a machine with an SSD for
the tests.

Therefore, probably you may write changes on receiver in bigger chunks, 
not each change separately.

> So I'm not particularly worried, but I'll look into that. I'd be much
> more worried if there was measurable overhead in cases when there's no
> streaming happening (either because it's disabled or the memory limit
> was not hit).

What I have also just found is that if a table row is large enough to 
be TOASTed, e.g.:

INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);

then the logical_work_mem limit is not hit and we neither stream nor 
spill this transaction to disk, even though it is still large. In 
contrast, the transaction above (with 1000000 smaller rows), which is 
comparable in size, is streamed. I am not sure that it is easy to add 
proper accounting of TOAST-able columns, but it seems worth it.
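
For reference, each value generated by that query is roughly 2 MB (1000000 
'x' characters plus separators), i.e. far above the TOAST threshold of 
about 2 kB, so it is stored out of line. Something like the following can 
be used to confirm the per-row size (the stored size may be much smaller 
because of compression):

SELECT octet_length(t) AS datum_size,
       pg_column_size(t) AS stored_size
FROM large_text
LIMIT 1;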

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company


Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Tomas Vondra
Дата:
Hi Alexey,

Attached is an updated version of the patches, with all the fixes I've
done in the past. I believe it should fix at least some of the issues
you reported - certainly the problem with stream_cleanup_files, but
perhaps some of the other issues too.

I'm a bit confused by the changes to TAP tests. Per the patch summary,
some .pl files get renamed (not sure why), a new one is added, etc. So
I've instead enabled streaming subscriptions in all tests, which with
this patch produces two failures:

Test Summary Report
-------------------
t/004_sync.pl                    (Wstat: 7424 Tests: 1 Failed: 0)
  Non-zero exit status: 29
  Parse errors: Bad plan.  You planned 7 tests but ran 1.
t/011_stream_ddl.pl              (Wstat: 256 Tests: 2 Failed: 1)
  Failed test:  2
  Non-zero exit status: 1

So yeah, there's more stuff to fix. But I can't directly apply your
fixes because the updated patches are somewhat different.


On 12/18/18 3:07 PM, Alexey Kondratov wrote:
> On 18.12.2018 1:28, Tomas Vondra wrote:
>>> 4) There was a problem with marking top-level transaction as having
>>> catalog changes if one of its subtransactions has. It was causing a
>>> problem with DDL statements just after subtransaction start (savepoint),
>>> so data from new columns is not replicated.
>>>
>>> 5) Similar issue with schema send. You send schema only once per each
>>> sub/transaction (IIRC), while we have to update schema on each catalog
>>> change: invalidation execution, snapshot rebuild, adding new tuple cids.
>>> So I ended up with adding is_schema_send flag to ReorderBufferTXN, since
>>> it is easy to set it inside RB and read in the output plugin. Probably,
>>> we have to choose a better place for this flag.
>>>
>> Hmm. Can you share an example how to trigger these issues?
> 
> Test cases inside 014_stream_tough_ddl.pl and old ones (with
> streaming=true option added) should reproduce all these issues. In
> general, it happens in a txn like:
> 
> INSERT
> SAVEPOINT
> ALTER TABLE ... ADD COLUMN
> INSERT
> 
> then the second insert may discover old version of catalog.
> 

Yeah, that's the issue I've discovered before and thought it got fixed.

>> Interesting. Any idea where does the extra overhead in this particular
>> case come from? It's hard to deduce that from the single flame graph,
>> when I don't have anything to compare it with (i.e. the flame graph for
>> the "normal" case).
> 
> I guess that bottleneck is in disk operations. You can check
> logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
> writes (~26%) take around 35% of CPU time in summary. To compare,
> please, see attached flame graph for the following transaction:
> 
> INSERT INTO large_text
> SELECT (SELECT string_agg('x', ',')
> FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
> 
> Execution Time: 44519.816 ms
> Time: 98333,642 ms (01:38,334)
> 
> where disk IO is only ~7-8% in total. So we get very roughly the same
> ~x4-5 performance drop here. JFYI, I am using a machine with SSD for tests.
> 
> Therefore, probably you may write changes on receiver in bigger chunks,
> not each change separately.
> 

Possibly - I/O certainly is a possible culprit, although we should be
using buffered I/O and there certainly are no fsyncs here. So I'm
not sure why it would be cheaper to do the writes in batches.

BTW does this mean you see the overhead on the apply side? Or are you
running this on a single machine, and it's difficult to decide?

>> So I'm not particularly worried, but I'll look into that. I'd be much
>> more worried if there was measurable overhead in cases when there's no
>> streaming happening (either because it's disabled or the memory limit
>> was not hit).
> 
> What I have also just found, is that if a table row is large enough to
> be TOASTed, e.g.:
> 
> INSERT INTO large_text
> SELECT (SELECT string_agg('x', ',')
> FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
> 
> then logical_work_mem limit is not hit and we neither stream, nor spill
> to disk this transaction, while it is still large. In contrast, the
> transaction above (with 1000000 smaller rows) being comparable in size
> is streamed. Not sure, that it is easy to add proper accounting of
> TOAST-able columns, but it worth it.
> 

That's certainly strange and possibly a bug in the memory accounting
code. I'm not sure why that would happen, though, because TOAST data
look just like regular INSERT changes. Interesting. I wonder if it's
already fixed in this updated version, but it's a bit too late to
investigate that today.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Alexey Kondratov
Дата:
Hi Tomas,

> I'm a bit confused by the changes to TAP tests. Per the patch summary,
> some .pl files get renamed (nor sure why), a new one is added, etc.

I added a new TAP test case, added the streaming=true option to the old 
stream_* ones, and incremented the streaming test numbers (+2) because of 
the collision between 009_matviews.pl / 009_stream_simple.pl and 
010_truncate.pl / 010_stream_subxact.pl. At least in the previous 
version of the patch they were under the same numbers. Nothing special, 
but for simplicity please find my new TAP test attached separately.

>   So
> I've instead enabled streaming subscriptions in all tests, which with
> this patch produces two failures:
>
> Test Summary Report
> -------------------
> t/004_sync.pl                    (Wstat: 7424 Tests: 1 Failed: 0)
>    Non-zero exit status: 29
>    Parse errors: Bad plan.  You planned 7 tests but ran 1.
> t/011_stream_ddl.pl              (Wstat: 256 Tests: 2 Failed: 1)
>    Failed test:  2
>    Non-zero exit status: 1
>
> So yeah, there's more stuff to fix. But I can't directly apply your
> fixes because the updated patches are somewhat different.

The fixes should apply cleanly to the previous version of your patch. 
Also, I am not sure that it is a good idea to simply enable streaming 
subscriptions in all tests (e.g. the pre-streaming t/004_sync.pl), 
since then they no longer exercise the non-streaming code path.

>>> Interesting. Any idea where does the extra overhead in this particular
>>> case come from? It's hard to deduce that from the single flame graph,
>>> when I don't have anything to compare it with (i.e. the flame graph for
>>> the "normal" case).
>> I guess that bottleneck is in disk operations. You can check
>> logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
>> writes (~26%) take around 35% of CPU time in summary. To compare,
>> please, see attached flame graph for the following transaction:
>>
>> INSERT INTO large_text
>> SELECT (SELECT string_agg('x', ',')
>> FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
>>
>> Execution Time: 44519.816 ms
>> Time: 98333,642 ms (01:38,334)
>>
>> where disk IO is only ~7-8% in total. So we get very roughly the same
>> ~x4-5 performance drop here. JFYI, I am using a machine with SSD for tests.
>>
>> Therefore, probably you may write changes on receiver in bigger chunks,
>> not each change separately.
>>
> Possibly, I/O is certainly a possible culprit, although we should be
> using buffered I/O and there certainly are not any fsyncs here. So I'm
> not sure why would it be cheaper to do the writes in batches.
>
> BTW does this mean you see the overhead on the apply side? Or are you
> running this on a single machine, and it's difficult to decide?

I run this on a single machine, but the walsender and the apply worker are 
each utilizing almost 100% of a CPU all the time, and on the apply side 
I/O syscalls take about 1/3 of the CPU time. I am still not sure, but to 
me this result links the performance drop to problems on the receiver 
side.

Writing in batches was just a hypothesis, and to validate it I performed a 
test with a large txn consisting of a smaller number of wide rows. That 
test does not exhibit any significant performance drop, even though it was 
streamed too. So the hypothesis seems to hold. Anyway, I do not have other 
reasonable ideas besides that right now.


Regards

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company


Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Tomas Vondra
Дата:
Hi,

Attached is an updated patch series, merging fixes and changes to TAP
tests proposed by Alexey. I've merged the fixes into the appropriate
patches, and I've kept the TAP changes / new tests as separate patches
towards the end of the series.

I'm a bit unhappy with two aspects of the current patch series:

1) We now track schema changes in two ways - using the pre-existing
schema_sent flag in RelationSyncEntry, and the (newly added) flag in
ReorderBuffer. While those options are used for regular vs. streamed
transactions, fundamentally it's the same thing and so having two
competing ways seems like a bad idea. Not sure what's the best way to
resolve this, though.

2) We've removed quite a few asserts, particularly those ensuring sanity of
cmin/cmax values. To some extent that's expected, because allowing
decoding of in-progress transactions relaxes some of those rules. But
I'd be much happier if some of those asserts could be reinstated, even
if only in a weaker form.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Michael Paquier
Дата:
On Mon, Jan 14, 2019 at 07:23:31PM +0100, Tomas Vondra wrote:
> Attached is an updated patch series, merging fixes and changes to TAP
> tests proposed by Alexey. I've merged the fixes into the appropriate
> patches, and I've kept the TAP changes / new tests as separate patches
> towards the end of the series.

Patch 4 of the latest set fails to apply, so I have moved the patch to
next CF, waiting on author.
--
Michael

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Alexey Kondratov
Дата:
Hi Tomas,

On 14.01.2019 21:23, Tomas Vondra wrote:
> Attached is an updated patch series, merging fixes and changes to TAP
> tests proposed by Alexey. I've merged the fixes into the appropriate
> patches, and I've kept the TAP changes / new tests as separate patches
> towards the end of the series.

I had problems applying this patch along with the 2PC streaming one to the 
current master, but everything applied cleanly on 97c39498e5. Regression 
tests pass. What I personally do not like about the current TAP test set 
is that you have added "WITH (streaming=on)" to all tests, including the 
old non-streaming ones. It is unclear which mechanism is tested there: 
streaming, but those transactions probably do not hit the memory limit, so 
it depends on default server parameters; or non-streaming, but then what 
is the need for (streaming=on)? I would prefer to add (streaming=on) 
only to the new tests, where it is clearly necessary.

> I'm a bit unhappy with two aspects of the current patch series:
>
> 1) We now track schema changes in two ways - using the pre-existing
> schema_sent flag in RelationSyncEntry, and the (newly added) flag in
> ReorderBuffer. While those options are used for regular vs. streamed
> transactions, fundamentally it's the same thing and so having two
> competing ways seems like a bad idea. Not sure what's the best way to
> resolve this, though.

Yes, sure - when I found problems with streaming of extensive DDL, I 
added the new flag in the simplest way, and it worked. Now, the old 
schema_sent flag is per relation, while the new one - is_schema_sent - is 
per top-level transaction. If I get it correctly, the former seems to 
be more economical, since a new schema is sent only if we are streaming a 
change for a relation whose schema is outdated. In contrast, in the 
latter case we will send a new schema even if there are no new changes 
belonging to that relation.

I guess it would be better to stick to the old behavior. I will try to 
investigate how to use it properly in streaming mode as well.

> 2) We've removed quite a few asserts, particularly ensuring sanity of
> cmin/cmax values. To some extent that's expected, because by allowing
> decoding of in-progress transactions relaxes some of those rules. But
> I'd be much happier if some of those asserts could be reinstated, even
> if only in a weaker form.


Asserts have been removed from two places: (1) 
HeapTupleSatisfiesHistoricMVCC, which seems inevitable, since we are 
touching the essence of the MVCC visibility rules when trying to decode 
an in-progress transaction, and (2) ReorderBufferBuildTupleCidHash, 
which is probably not directly related to the topic of this patch, 
since Arseny Sher recently faced the same issue with simple repetitive 
DDL decoding [1].

Not many, but I agree that replacing them with some weaker asserts 
would be better than just removing them, especially for point (1).


[1] https://www.postgresql.org/message-id/flat/874l9p8hyw.fsf%40ars-thinkpad


Regards

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Alexey Kondratov
Дата:
Hi Tomas,

>>>> Interesting. Any idea where does the extra overhead in this particular
>>>> case come from? It's hard to deduce that from the single flame graph,
>>>> when I don't have anything to compare it with (i.e. the flame graph 
>>>> for
>>>> the "normal" case).
>>> I guess that bottleneck is in disk operations. You can check
>>> logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
>>> writes (~26%) take around 35% of CPU time in summary. To compare,
>>> please, see attached flame graph for the following transaction:
>>>
>>> INSERT INTO large_text
>>> SELECT (SELECT string_agg('x', ',')
>>> FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
>>>
>>> Execution Time: 44519.816 ms
>>> Time: 98333,642 ms (01:38,334)
>>>
>>> where disk IO is only ~7-8% in total. So we get very roughly the same
>>> ~x4-5 performance drop here. JFYI, I am using a machine with SSD for 
>>> tests.
>>>
>>> Therefore, probably you may write changes on receiver in bigger chunks,
>>> not each change separately.
>>>
>> Possibly, I/O is certainly a possible culprit, although we should be
>> using buffered I/O and there certainly are not any fsyncs here. So I'm
>> not sure why would it be cheaper to do the writes in batches.
>>
>> BTW does this mean you see the overhead on the apply side? Or are you
>> running this on a single machine, and it's difficult to decide?
>
> I run this on a single machine, but walsender and worker are utilizing 
> almost 100% of CPU per each process all the time, and at apply side 
> I/O syscalls take about 1/3 of CPU time. Though I am still not sure, 
> but for me this result somehow links performance drop with problems at 
> receiver side.
>
> Writing in batches was just a hypothesis and to validate it I have 
> performed test with large txn, but consisting of a smaller number of 
> wide rows. This test does not exhibit any significant performance 
> drop, while it was streamed too. So it seems to be valid. Anyway, I do 
> not have other reasonable ideas beside that right now.

I've recently checked this patch again and tried to improve it in 
terms of performance. As a result I've implemented a new POC version of 
the applier (attached). Almost everything in the streaming logic stayed 
intact, but the apply worker is significantly different.

As I wrote earlier, I still claim that spilling changes to disk on the 
applier side adds extra overhead, but it is possible to get rid of it. 
In my additional patch I do the following:

1) Maintain a pool of additional background workers (bgworkers) that 
are connected to the main logical apply worker via shm_mq's. Each worker 
is dedicated to processing a specific streamed transaction.

2) When we receive a streamed change for some transaction, we check 
whether there is an existing dedicated bgworker in the HTAB (xid -> 
bgworker), or whether there is one in the idle list; otherwise we spawn 
a new one.

3) We pass all changes (between STREAM START/STOP) to that bgworker via 
shm_mq_send without intermediate waiting. However, we wait for the 
bgworker to apply the entire chunk of changes at STREAM STOP, since we 
don't want transaction reordering.

4) When a transaction is committed/aborted, the worker is added to the 
idle list and waits for a reassignment message.

5) I have reused the same apply_dispatch machinery in the bgworkers, 
since most of the actions are practically identical.

Thus, we do not spill anything on the applier side, so transaction 
changes are processed by bgworkers just as normal backends do. At the 
same time, change processing is strictly serial, which prevents 
transaction reordering and possible conflicts/anomalies. Even though we 
trade off performance in favor of stability, the result is rather 
impressive. I have used a similar query for testing as before:

EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3)
     SELECT round(random()*10), random(), random()*142
     FROM generate_series(1, 1000000) s(i);

with 1kk (1000000), 3kk and 5kk rows; logical_work_mem = 64MB and 
synchronous_standby_names = 'FIRST 1 (large_sub)'. The table schema is 
as follows:

CREATE TABLE large_test (
     id serial primary key,
     num1 bigint,
     num2 double precision,
     num3 double precision
);
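
To make the measurement setup explicit, the relevant publisher-side
settings are roughly the following (the publication name is illustrative;
the subscription is the large_sub one referenced by
synchronous_standby_names, created with streaming = on). Because the
commit waits for the synchronous subscriber, the "Total xact time" column
includes the apply on the replica:

-- publisher
ALTER SYSTEM SET logical_work_mem = '64MB';
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (large_sub)';
SELECT pg_reload_conf();
CREATE PUBLICATION large_pub FOR TABLE large_test;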

Here are the results:

-------------------------------------------------------------
| Rows | Time on master, sec | Total xact time, sec | Ratio |
-------------------------------------------------------------
|                  On commit (master, v13)                  |
-------------------------------------------------------------
| 1kk  | 6.5                 | 17.6                 | x2.74 |
-------------------------------------------------------------
| 3kk  | 21                  | 55.4                 | x2.64 |
-------------------------------------------------------------
| 5kk  | 38.3                | 91.5                 | x2.39 |
-------------------------------------------------------------
|                       Stream + spill                      |
-------------------------------------------------------------
| 1kk  | 5.9                 | 18                   | x3    |
-------------------------------------------------------------
| 3kk  | 19.5                | 52.4                 | x2.7  |
-------------------------------------------------------------
| 5kk  | 33.3                | 86.7                 | x2.86 |
-------------------------------------------------------------
|                     Stream + BGW pool                     |
-------------------------------------------------------------
| 1kk  | 6                   | 12                   | x2    |
-------------------------------------------------------------
| 3kk  | 18.5                | 30.5                 | x1.65 |
-------------------------------------------------------------
| 5kk  | 35.6                | 53.9                 | x1.51 |
-------------------------------------------------------------

It seems that the overhead added by the synchronous replica is 2-3 times 
lower compared with Postgres master and with streaming + spilling. 
Therefore, the original patch eliminates the delay before the sender 
starts processing a large transaction, while this additional patch 
speeds up the applier side.

Although the overall speed-up is surely measurable, there is still room 
for improvement:

1) Currently bgworkers are only spawned on demand, without any initial 
pool, and are never stopped. Maybe we should create a small pool at 
replication start and release some of the idle bgworkers if they exceed 
some limit?

2) Probably we can somehow track whether an incoming change conflicts 
with some of the xacts being processed, so we only wait for specific 
bgworkers in that case?

3) Since the communication between the main logical apply worker and each 
bgworker from the pool is a 'single producer --- single consumer' 
problem, it is probably possible to wait and set/check flags 
without locks, using just atomics.

What do you think about this concept in general? Any concerns and 
criticism are welcome!


Regards

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

P.S. This patch should apply on top of your last patch set. I would rebase it against master, but it depends on the 2PC
patch, which I don't know well enough.


Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Tomas Vondra
Дата:
On Wed, Aug 28, 2019 at 08:17:47PM +0300, Alexey Kondratov wrote:
>Hi Tomas,
>
>>>>>Interesting. Any idea where does the extra overhead in this particular
>>>>>case come from? It's hard to deduce that from the single flame graph,
>>>>>when I don't have anything to compare it with (i.e. the flame 
>>>>>graph for
>>>>>the "normal" case).
>>>>I guess that bottleneck is in disk operations. You can check
>>>>logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
>>>>writes (~26%) take around 35% of CPU time in summary. To compare,
>>>>please, see attached flame graph for the following transaction:
>>>>
>>>>INSERT INTO large_text
>>>>SELECT (SELECT string_agg('x', ',')
>>>>FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
>>>>
>>>>Execution Time: 44519.816 ms
>>>>Time: 98333,642 ms (01:38,334)
>>>>
>>>>where disk IO is only ~7-8% in total. So we get very roughly the same
>>>>~x4-5 performance drop here. JFYI, I am using a machine with SSD 
>>>>for tests.
>>>>
>>>>Therefore, probably you may write changes on receiver in bigger chunks,
>>>>not each change separately.
>>>>
>>>Possibly, I/O is certainly a possible culprit, although we should be
>>>using buffered I/O and there certainly are not any fsyncs here. So I'm
>>>not sure why would it be cheaper to do the writes in batches.
>>>
>>>BTW does this mean you see the overhead on the apply side? Or are you
>>>running this on a single machine, and it's difficult to decide?
>>
>>I run this on a single machine, but walsender and worker are 
>>utilizing almost 100% of CPU per each process all the time, and at 
>>apply side I/O syscalls take about 1/3 of CPU time. Though I am 
>>still not sure, but for me this result somehow links performance 
>>drop with problems at receiver side.
>>
>>Writing in batches was just a hypothesis and to validate it I have 
>>performed test with large txn, but consisting of a smaller number of 
>>wide rows. This test does not exhibit any significant performance 
>>drop, while it was streamed too. So it seems to be valid. Anyway, I 
>>do not have other reasonable ideas beside that right now.
>
>I've checked recently this patch again and tried to elaborate it in 
>terms of performance. As a result I've implemented a new POC version 
>of the applier (attached). Almost everything in streaming logic stayed 
>intact, but apply worker is significantly different.
>
>As I wrote earlier I still claim, that spilling changes on disk at the 
>applier side adds additional overhead, but it is possible to get rid 
>of it. In my additional patch I do the following:
>
>1) Maintain a pool of additional background workers (bgworkers), that 
>are connected with main logical apply worker via shm_mq's. Each worker 
>is dedicated to the processing of specific streamed transaction.
>
>2) When we receive a streamed change for some transaction, we check 
>whether there is an existing dedicated bgworker in HTAB (xid -> 
>bgworker), or there are some in the idle list, or spawn a new one.
>
>3) We pass all changes (between STREAM START/STOP) to that bgworker 
>via shm_mq_send without intermediate waiting. However, we wait for 
>bgworker to apply the entire changes chunk at STREAM STOP, since we 
>don't want transactions reordering.
>
>4) When transaction is commited/aborted worker is being added to the 
>idle list and is waiting for reassigning message.
>
>5) I have used the same machinery with apply_dispatch in bgworkers, 
>since most of actions are practically very similar.
>
>Thus, we do not spill anything at the applier side, so transaction 
>changes are processed by bgworkers as normal backends do. In the same 
>time, changes processing is strictly serial, which prevents 
>transactions reordering and possible conflicts/anomalies. Even though 
>we trade off performance in favor of stability the result is rather 
>impressive. I have used a similar query for testing as before:
>
>EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3)
>    SELECT round(random()*10), random(), random()*142
>    FROM generate_series(1, 1000000) s(i);
>
>with 1kk (1000000), 3kk and 5kk rows; logical_work_mem = 64MB and 
>synchronous_standby_names = 'FIRST 1 (large_sub)'. Table schema is 
>following:
>
>CREATE TABLE large_test (
>    id serial primary key,
>    num1 bigint,
>    num2 double precision,
>    num3 double precision
>);
>
>Here are the results:
>
>-------------------------------------------------------------------
>| N | Time on master, sec | Total xact time, sec |     Ratio      |
>-------------------------------------------------------------------
>|                        On commit (master, v13)                  |
>-------------------------------------------------------------------
>| 1kk | 6.5               | 17.6                 | x2.74          |
>-------------------------------------------------------------------
>| 3kk | 21                | 55.4                 | x2.64          |
>-------------------------------------------------------------------
>| 5kk | 38.3              | 91.5                 | x2.39          |
>-------------------------------------------------------------------
>|                        Stream + spill                           |
>-------------------------------------------------------------------
>| 1kk | 5.9               | 18                   | x3             |
>-------------------------------------------------------------------
>| 3kk | 19.5              | 52.4                 | x2.7           |
>-------------------------------------------------------------------
>| 5kk | 33.3              | 86.7                 | x2.86          |
>-------------------------------------------------------------------
>|                        Stream + BGW pool                        |
>-------------------------------------------------------------------
>| 1kk | 6                 | 12                   | x2             |
>-------------------------------------------------------------------
>| 3kk | 18.5              | 30.5                 | x1.65          |
>-------------------------------------------------------------------
>| 5kk | 35.6              | 53.9                 | x1.51          |
>-------------------------------------------------------------------
>
>It seems that overhead added by synchronous replica is lower by 2-3 
>times compared with Postgres master and streaming with spilling. 
>Therefore, the original patch eliminated delay before large 
>transaction processing start by sender, while this additional patch 
>speeds up the applier side.
>
>Although the overall speed up is surely measurable, there is a room 
>for improvements yet:
>
>1) Currently bgworkers are only spawned on demand without some initial 
>pool and never stopped. Maybe we should create a small pool on 
>replication start and offload some of idle bgworkers if they exceed 
>some limit?
>
>2) Probably we can track somehow that incoming change has conflicts 
>with some of being processed xacts, so we can wait for specific 
>bgworkers only in that case?
>
>3) Since the communication between main logical apply worker and each 
>bgworker from the pool is a 'single producer --- single consumer' 
>problem, then probably it is possible to wait and set/check flags 
>without locks, but using just atomics.
>
>What do you think about this concept in general? Any concerns and 
>criticism are welcome!
>

Hi Alexey,

I'm unable to do any in-depth review of the patch over the next two weeks
or so, but I think the idea of having a pool of apply workers is sound and
can be quite beneficial for some workloads.

I don't think it matters very much whether the workers are started at the
beginning or allocated ad hoc, that's IMO a minor implementation detail.

There's one huge challenge that I however don't see mentioned in your
message or in the patch (after a cursory reading) - ensuring the same
commit order, and the risk of introducing deadlocks that would not exist
in single-process apply.

Surely, we want to end up with the same commit order as on the upstream,
otherwise we might easily get different data on the subscriber. So when we
pass the large transaction to a separate process, that process has
to wait for the other processes applying transactions that committed
first. And similarly, other processes have to wait for this process,
depending on the commit order. I might have missed something, but I don't
see anything like that in your patch.

Essentially, this means there needs to be some sort of wait between those
apply processes, enforcing the commit order.

That however means we can easily introduce deadlocks into workloads where
serial apply would not have that issue - imagine multiple large
transactions touching the same set of rows. We may ship them to different
bgworkers, and those processes may deadlock.

Of course, the deadlock detector will come around (assuming the wait is
done in a way visible to the detector) and will abort one of the
processes. But we don't know it'll abort the right one - it may easily
abort the apply process that needs to commit first, while everyone else is
waiting for it. Which stalls the apply forever.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Alexey Kondratov
Дата:
On 28.08.2019 22:06, Tomas Vondra wrote:
>
>>
>>>>>> Interesting. Any idea where does the extra overhead in this 
>>>>>> particular
>>>>>> case come from? It's hard to deduce that from the single flame 
>>>>>> graph,
>>>>>> when I don't have anything to compare it with (i.e. the flame 
>>>>>> graph for
>>>>>> the "normal" case).
>>>>> I guess that bottleneck is in disk operations. You can check
>>>>> logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
>>>>> writes (~26%) take around 35% of CPU time in summary. To compare,
>>>>> please, see attached flame graph for the following transaction:
>>>>>
>>>>> INSERT INTO large_text
>>>>> SELECT (SELECT string_agg('x', ',')
>>>>> FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
>>>>>
>>>>> Execution Time: 44519.816 ms
>>>>> Time: 98333,642 ms (01:38,334)
>>>>>
>>>>> where disk IO is only ~7-8% in total. So we get very roughly the same
>>>>> ~x4-5 performance drop here. JFYI, I am using a machine with SSD 
>>>>> for tests.
>>>>>
>>>>> Therefore, probably you may write changes on receiver in bigger 
>>>>> chunks,
>>>>> not each change separately.
>>>>>
>>>> Possibly, I/O is certainly a possible culprit, although we should be
>>>> using buffered I/O and there certainly are not any fsyncs here. So I'm
>>>> not sure why would it be cheaper to do the writes in batches.
>>>>
>>>> BTW does this mean you see the overhead on the apply side? Or are you
>>>> running this on a single machine, and it's difficult to decide?
>>>
>>> I run this on a single machine, but walsender and worker are 
>>> utilizing almost 100% of CPU per each process all the time, and at 
>>> apply side I/O syscalls take about 1/3 of CPU time. Though I am 
>>> still not sure, but for me this result somehow links performance 
>>> drop with problems at receiver side.
>>>
>>> Writing in batches was just a hypothesis and to validate it I have 
>>> performed test with large txn, but consisting of a smaller number of 
>>> wide rows. This test does not exhibit any significant performance 
>>> drop, while it was streamed too. So it seems to be valid. Anyway, I 
>>> do not have other reasonable ideas beside that right now.
>>
>> It seems that overhead added by synchronous replica is lower by 2-3 
>> times compared with Postgres master and streaming with spilling. 
>> Therefore, the original patch eliminated delay before large 
>> transaction processing start by sender, while this additional patch 
>> speeds up the applier side.
>>
>> Although the overall speed up is surely measurable, there is a room 
>> for improvements yet:
>>
>> 1) Currently bgworkers are only spawned on demand without some 
>> initial pool and never stopped. Maybe we should create a small pool 
>> on replication start and offload some of idle bgworkers if they 
>> exceed some limit?
>>
>> 2) Probably we can track somehow that incoming change has conflicts 
>> with some of being processed xacts, so we can wait for specific 
>> bgworkers only in that case?
>>
>> 3) Since the communication between main logical apply worker and each 
>> bgworker from the pool is a 'single producer --- single consumer' 
>> problem, then probably it is possible to wait and set/check flags 
>> without locks, but using just atomics.
>>
>> What do you think about this concept in general? Any concerns and 
>> criticism are welcome!
>>
>

Hi Tomas,

Thank you for a quick response.

> I don't think it matters very much whether the workers are started at the
> beginning or allocated ad hoc, that's IMO a minor implementation detail.

OK, I had the same vision about this point. Any minor differences here 
will be negligible for a sufficiently large transaction.

>
> There's one huge challenge that I however don't see mentioned in your
> message or in the patch (after cursory reading) - ensuring the same 
> commit
> order, and introducing deadlocks that would not exist in single-process
> apply.

Probably I haven't explained this part well, sorry for that. In my patch 
I don't use the worker pool for concurrent transaction apply, but rather 
for fast context switching between long-lived streamed transactions. In 
other words, we apply all changes that arrive from the sender in a 
completely serial manner. Written out step by step it looks like this:

1) Read the STREAM START message and figure out the target worker by xid.

2) Pass all changes that belong to this xact to the selected worker 
one by one via shm_mq_send.

3) Read the STREAM STOP message and wait until the worker has applied all 
changes in the queue.

4) Process all other chunks of streamed xacts in the same manner.

5) Process all non-streamed xacts immediately in the main apply worker loop.

6) If we read STREAM COMMIT/ABORT, we again wait until the selected worker 
either commits or aborts.

Thus, it automatically guarantees the same commit order on the replica as 
on the master. Yes, we lose some performance here, since we don't apply 
transactions concurrently, but doing so would bring all those problems 
you have described.

However, you helped me figure out another point I had forgotten. 
Although we ensure commit order automatically, the start of streamed 
xacts may be reordered. It happens if some small xacts have been committed 
on the master since the streamed one started, because we do not start 
streaming immediately, but only after the logical_work_mem limit is hit. 
I have performed some tests with conflicting xacts and it seems that 
it's not a problem, since the locking mechanism in Postgres guarantees 
that if there were any deadlocks, they would have happened earlier on the 
master. So if some records hit the WAL, it is safe to apply them 
sequentially. Am I wrong?

Anyway, I'm going to double check the safety of this part later.


Regards

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Tomas Vondra
Дата:
On Thu, Aug 29, 2019 at 05:37:45PM +0300, Alexey Kondratov wrote:
>On 28.08.2019 22:06, Tomas Vondra wrote:
>>
>>>
>>>>>>>Interesting. Any idea where does the extra overhead in 
>>>>>>>this particular
>>>>>>>case come from? It's hard to deduce that from the single 
>>>>>>>flame graph,
>>>>>>>when I don't have anything to compare it with (i.e. the 
>>>>>>>flame graph for
>>>>>>>the "normal" case).
>>>>>>I guess that bottleneck is in disk operations. You can check
>>>>>>logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
>>>>>>writes (~26%) take around 35% of CPU time in summary. To compare,
>>>>>>please, see attached flame graph for the following transaction:
>>>>>>
>>>>>>INSERT INTO large_text
>>>>>>SELECT (SELECT string_agg('x', ',')
>>>>>>FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
>>>>>>
>>>>>>Execution Time: 44519.816 ms
>>>>>>Time: 98333,642 ms (01:38,334)
>>>>>>
>>>>>>where disk IO is only ~7-8% in total. So we get very roughly the same
>>>>>>~x4-5 performance drop here. JFYI, I am using a machine with 
>>>>>>SSD for tests.
>>>>>>
>>>>>>Therefore, probably you may write changes on receiver in 
>>>>>>bigger chunks,
>>>>>>not each change separately.
>>>>>>
>>>>>Possibly, I/O is certainly a possible culprit, although we should be
>>>>>using buffered I/O and there certainly are not any fsyncs here. So I'm
>>>>>not sure why would it be cheaper to do the writes in batches.
>>>>>
>>>>>BTW does this mean you see the overhead on the apply side? Or are you
>>>>>running this on a single machine, and it's difficult to decide?
>>>>
>>>>I run this on a single machine, but walsender and worker are 
>>>>utilizing almost 100% of CPU per each process all the time, and 
>>>>at apply side I/O syscalls take about 1/3 of CPU time. Though I 
>>>>am still not sure, but for me this result somehow links 
>>>>performance drop with problems at receiver side.
>>>>
>>>>Writing in batches was just a hypothesis and to validate it I 
>>>>have performed test with large txn, but consisting of a smaller 
>>>>number of wide rows. This test does not exhibit any significant 
>>>>performance drop, while it was streamed too. So it seems to be 
>>>>valid. Anyway, I do not have other reasonable ideas beside that 
>>>>right now.
>>>
>>>It seems that overhead added by synchronous replica is lower by 
>>>2-3 times compared with Postgres master and streaming with 
>>>spilling. Therefore, the original patch eliminated delay before 
>>>large transaction processing start by sender, while this 
>>>additional patch speeds up the applier side.
>>>
>>>Although the overall speed up is surely measurable, there is a 
>>>room for improvements yet:
>>>
>>>1) Currently bgworkers are only spawned on demand without some 
>>>initial pool and never stopped. Maybe we should create a small 
>>>pool on replication start and offload some of idle bgworkers if 
>>>they exceed some limit?
>>>
>>>2) Probably we can track somehow that incoming change has 
>>>conflicts with some of being processed xacts, so we can wait for 
>>>specific bgworkers only in that case?
>>>
>>>3) Since the communication between main logical apply worker and 
>>>each bgworker from the pool is a 'single producer --- single 
>>>consumer' problem, then probably it is possible to wait and 
>>>set/check flags without locks, but using just atomics.
>>>
>>>What do you think about this concept in general? Any concerns and 
>>>criticism are welcome!
>>>
>>
>
>Hi Tomas,
>
>Thank you for a quick response.
>
>>I don't think it matters very much whether the workers are started at the
>>beginning or allocated ad hoc, that's IMO a minor implementation detail.
>
>OK, I had the same vision about this point. Any minor differences here 
>will be neglectable for a sufficiently large transaction.
>
>>
>>There's one huge challenge that I however don't see mentioned in your
>>message or in the patch (after cursory reading) - ensuring the same 
>>commit
>>order, and introducing deadlocks that would not exist in single-process
>>apply.
>
>Probably I haven't explained well this part, sorry for that. In my 
>patch I don't use workers pool for a concurrent transaction apply, but 
>rather for a fast context switch between long-lived streamed 
>transactions. In other words we apply all changes arrived from the 
>sender in a completely serial manner. Being written step-by-step it 
>looks like:
>
>1) Read STREAM START message and figure out the target worker by xid.
>
>2) Put all changes, which belongs to this xact to the selected worker 
>one by one via shm_mq_send.
>
>3) Read STREAM STOP message and wait until our worker will apply all 
>changes in the queue.
>
>4) Process all other chunks of streamed xacts in the same manner.
>
>5) Process all non-streamed xacts immediately in the main apply worker loop.
>
>6) If we read STREAMED COMMIT/ABORT we again wait until selected 
>worker either commits or aborts.
>
>Thus, it automatically guaranties the same commit order on replica as 
>on master. Yes, we loose some performance here, since we don't apply 
>transactions concurrently, but it would bring all those problems you 
>have described.
>

OK, so it's apply in multiple processes, but at any moment only a single
apply process is active. 

>However, you helped me to figure out another point I have forgotten. 
>Although we ensure commit order automatically, the beginning of 
>streamed xacts may reorder. It happens if some small xacts have been 
>commited on master since the streamed one started, because we do not 
>start streaming immediately, but only after logical_work_mem hit. I 
>have performed some tests with conflicting xacts and it seems that 
>it's not a problem, since locking mechanism in Postgres guarantees 
>that if there would some deadlocks, they will happen earlier on 
>master. So if some records hit the WAL, it is safe to apply the 
>sequentially. Am I wrong?
>

I think you're right that the way you interleave the changes ensures you
can't introduce new deadlocks between transactions in this stream. I don't
think reordering the blocks of streamed transactions matters, as long
as the commit order is ensured in this case.

>Anyway, I'm going to double check the safety of this part later.
>

OK.

FWIW my understanding is that the speedup comes mostly from elimination of
the serialization to a file. That however requires savepoints to handle
aborts of subtransactions - I'm pretty sure it'd be trivial to create a
workload where this will be much slower (with many aborts of large
subtransactions).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Konstantin Knizhnik
Дата:
>
> FWIW my understanding is that the speedup comes mostly from 
> elimination of
> the serialization to a file. That however requires savepoints to handle
> aborts of subtransactions - I'm pretty sure I'd be trivial to create a
> workload where this will be much slower (with many aborts of large
> subtransactions).
>
>

I think that instead of defining savepoints it is simpler and more 
efficient to use

BeginInternalSubTransaction + 
ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction

as it is done in PL/pgSQL (pl_exec.c).
Not sure if it can pr

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Alvaro Herrera
Дата:
In the interest of moving things forward, how far are we from making
0001 committable?  If I understand correctly, the rest of this patchset
depends on https://commitfest.postgresql.org/24/944/ which seems to be
moving at a glacial pace (or, actually, slower, because glaciers do
move, which cannot be said of that other patch.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Tomas Vondra
Дата:
On Mon, Sep 02, 2019 at 06:06:50PM -0400, Alvaro Herrera wrote:
>In the interest of moving things forward, how far are we from making
>0001 committable?  If I understand correctly, the rest of this patchset
>depends on https://commitfest.postgresql.org/24/944/ which seems to be
>moving at a glacial pace (or, actually, slower, because glaciers do
>move, which cannot be said of that other patch.)
>

I think 0001 is mostly there. I think there's one bug in this patch
version, but I need to check and I'll post an updated version shortly if
needed.

FWIW maybe we should stop comparing things to glaciers. 50 years from now
people won't know what a glacier is, and it'll be just like the floppy
icon on the save button.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Alexey Kondratov
Дата:
>>
>> FWIW my understanding is that the speedup comes mostly from 
>> elimination of
>> the serialization to a file. That however requires savepoints to handle
>> aborts of subtransactions - I'm pretty sure I'd be trivial to create a
>> workload where this will be much slower (with many aborts of large
>> subtransactions).
>>

Yes, and it was my main motivation to eliminate that extra serialization 
to a file. I've experimented a bit with large transactions + savepoints + 
aborts and ended up with the following query (the same schema as before, 
with 600k rows):

BEGIN;
SAVEPOINT s1;
UPDATE large_test SET num1 = num1 + 1, num2 = num2 + 1, num3 = num3 + 1;
SAVEPOINT s2;
UPDATE large_test SET num1 = num1 + 1, num2 = num2 + 1, num3 = num3 + 1;
SAVEPOINT s3;
UPDATE large_test SET num1 = num1 + 1, num2 = num2 + 1, num3 = num3 + 1;
ROLLBACK TO SAVEPOINT s3;
ROLLBACK TO SAVEPOINT s2;
ROLLBACK TO SAVEPOINT s1;
END;
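
Here large_test is the same table as in the earlier test; for this run it 
was populated beforehand with roughly the following (600k rows instead of 
1kk):

INSERT INTO large_test (num1, num2, num3)
    SELECT round(random()*10), random(), random()*142
    FROM generate_series(1, 600000) s(i);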

It looks like the worst-case scenario, as we do a lot of work and then 
abort all the subxacts one by one. As expected, it takes much longer (up 
to 30x) to process using the background worker instead of spilling to a 
file. Surely, it is much easier to truncate a file than to apply all 
changes and then abort. However, I guess that this kind of load pattern 
is not the most typical for real-life applications.

Also, this test helped me find a bug in my current savepoint routine, 
so a new patch is attached.

On 30.08.2019 18:59, Konstantin Knizhnik wrote:
>
> I think that instead of defining savepoints it is simpler and more 
> efficient to use
>
> BeginInternalSubTransaction + 
> ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction
>
> as it is done in PL/pgSQL (pl_exec.c).
> Not sure if it can pr
>

Both BeginInternalSubTransaction and DefineSavepoint use 
PushTransaction() internally for a normal subtransaction start. So they 
seem to be identical from a performance perspective, which is also 
stated in the comment:

/*
 * BeginInternalSubTransaction
 *        This is the same as DefineSavepoint except it allows TBLOCK_STARTED,
 *        TBLOCK_IMPLICIT_INPROGRESS, TBLOCK_END, and TBLOCK_PREPARE states,
 *        and therefore it can safely be used in functions that might be called
 *        when not inside a BEGIN block or when running deferred triggers at
 *        COMMIT/PREPARE time.  Also, it automatically does
 *        CommitTransactionCommand/StartTransactionCommand instead of expecting
 *        the caller to do it.
 */

Please, correct me if I'm wrong.

Anyway, I've profiled my apply worker (the flamegraph is attached) and it 
spends the vast majority of its time (>90%) applying changes. So the 
problem is not in the savepoints themselves, but in the fact that we 
first apply all the changes and then abort all the work. I am not sure 
that anything can be done about this case.


Regards

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company


Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Konstantin Knizhnik
Дата:

On 16.09.2019 19:54, Alexey Kondratov wrote:
> On 30.08.2019 18:59, Konstantin Knizhnik wrote:
>>
>> I think that instead of defining savepoints it is simpler and more 
>> efficient to use
>>
>> BeginInternalSubTransaction + 
>> ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction
>>
>> as it is done in PL/pgSQL (pl_exec.c).
>> Not sure if it can pr
>>
>
> Both BeginInternalSubTransaction and DefineSavepoint use 
> PushTransaction() internally for a normal subtransaction start. So 
> they seems to be identical from the performance perspective, which is 
> also stated in the comment section:

Yes, they definitely use the same mechanism and most likely provide 
similar performance.
But BeginInternalSubTransaction does not require generating a savepoint 
name, which seems redundant in this case.


>
> Anyway, I've performed a profiling of my apply worker (flamegraph is 
> attached) and it spends the vast amount of time (>90%) applying 
> changes. So the problem is not in the savepoints their-self, but in 
> the fact that we first apply all changes and then abort all the work. 
> Not sure, that it is possible to do something in this case.
>

Looks like the only way to increase apply speed is to do it in parallel: 
make it possible to concurrently execute non-conflicting transactions.





Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Tomas Vondra
Дата:
On Mon, Sep 16, 2019 at 10:29:18PM +0300, Konstantin Knizhnik wrote:
>
>
>On 16.09.2019 19:54, Alexey Kondratov wrote:
>>On 30.08.2019 18:59, Konstantin Knizhnik wrote:
>>>
>>>I think that instead of defining savepoints it is simpler and more 
>>>efficient to use
>>>
>>>BeginInternalSubTransaction + 
>>>ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction
>>>
>>>as it is done in PL/pgSQL (pl_exec.c).
>>>Not sure if it can pr
>>>
>>
>>Both BeginInternalSubTransaction and DefineSavepoint use 
>>PushTransaction() internally for a normal subtransaction start. So 
>>they seems to be identical from the performance perspective, which 
>>is also stated in the comment section:
>
>Yes, definitely them are using the same mechanism and most likely 
>provides similar performance.
>But BeginInternalSubTransaction does not require to generate some 
>savepoint name which seems to be redundant in this case.
>
>
>>
>>Anyway, I've performed a profiling of my apply worker (flamegraph is 
>>attached) and it spends the vast amount of time (>90%) applying 
>>changes. So the problem is not in the savepoints their-self, but in 
>>the fact that we first apply all changes and then abort all the 
>>work. Not sure, that it is possible to do something in this case.
>>
>
>Looks like the only way to increase apply speed is to do it in 
>parallel: make it possible to concurrently execute non-conflicting 
>transactions.
>

True, although it seems like a massive can of worms to me. I'm not aware
of a way to identify non-conflicting transactions in advance, so it would
have to be implemented as optimistic apply, with detection of and
recovery from conflicts.

I'm not against doing that, and I'm willing to spend some time on reviews
etc., but it seems like a completely separate effort.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, Sep 3, 2019 at 4:30 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> In the interest of moving things forward, how far are we from making
> 0001 committable?  If I understand correctly, the rest of this patchset
> depends on https://commitfest.postgresql.org/24/944/ which seems to be
> moving at a glacial pace (or, actually, slower, because glaciers do
> move, which cannot be said of that other patch.)
>

I am not sure if it is completely correct that the other part of the
patch is dependent on that CF entry.  I have studied both the threads
(not every detail) and it seems to me it is dependent on one of the
patches from that series which handles concurrent aborts.  It is patch
0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.Jan4.patch
from what Nikhil has posted on that thread [1].  Am I wrong?

So IIUC, the problem of concurrent aborts is that if we allow catalog
scans for in-progress transactions, then we might get wrong answers in
cases where somebody has performed Alter-Abort-Alter which is clearly
explained with an example in email [2].  To solve that problem Nikhil
seems to have written a patch [1] which detects these concurrent
aborts during a system table scan and then aborts the decoding of such
a transaction.

Now, the problem is that that patch was written with 2PC
transactions in mind and might not deal with all cases for in-progress
transactions, especially when sub-transactions are involved, as alluded
to by Arseny Sher [3].  So, the problem seems to be for cases when some
sub-transaction aborts, but the main transaction continues and
we try to decode it.  Nikhil's patch won't be able to deal with that
because I think it just checks the top-level xid, whereas for this we need
to check all subxids, which I think is possible now as Tomas seems to
have added WAL logging for each xid assignment.  It might or might not be
the best solution to check the status of all subxids, but I think
first we need to agree that the problem is just about concurrent aborts
and that we can solve it by using some part of the technology being
developed as part of the patch "Logical decoding of two-phase
transactions" (https://commitfest.postgresql.org/24/944/) rather than
the entire patchset.

I hope I am not saying something very obvious here and it helps in
moving this patch forward.

Thoughts?

[1] - https://www.postgresql.org/message-id/CAMGcDxcBmN6jNeQkgWddfhX8HbSjQpW%3DUo70iBY3P_EPdp%2BLTQ%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/EEBD82AA-61EE-46F4-845E-05B94168E8F2%40postgrespro.ru
[3] - https://www.postgresql.org/message-id/87a7py4iwl.fsf%40ars-thinkpad

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, Sep 3, 2019 at 4:16 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Mon, Sep 02, 2019 at 06:06:50PM -0400, Alvaro Herrera wrote:
> >In the interest of moving things forward, how far are we from making
> >0001 committable?  If I understand correctly, the rest of this patchset
> >depends on https://commitfest.postgresql.org/24/944/ which seems to be
> >moving at a glacial pace (or, actually, slower, because glaciers do
> >move, which cannot be said of that other patch.)
> >
>
> I think 0001 is mostly there. I think there's one bug in this patch
> version, but I need to check and I'll post an updated version shortly if
> needed.
>

Did you get a chance to work on 0001?  I have a few comments on that patch:
1.
+ *   To limit the amount of memory used by decoded changes, we track memory
+ *   used at the reorder buffer level (i.e. total amount of memory), and for
+ *   each toplevel transaction. When the total amount of used memory exceeds
+ *   the limit, the toplevel transaction consuming the most memory is either
+ *   serialized or streamed.

Do we need to mention 'streamed' as part of this patch?  It seems to
me that this is an independent patch which can be committed without
patches that stream the changes. So, we can remove it from here and
other places where it is used.

2.
+ *   deserializing and applying very few changes). We probably to give more
+ *   memory to the oldest subtransactions.

/We probably to/
It seems some word is missing after probably.

3.
+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
+ *
+ * XXX With many subtransactions this might be quite slow, because we'll have
+ * to walk through all of them. There are some options how we could improve
+ * that: (a) maintain some secondary structure with transactions sorted by
+ * amount of changes, (b) not looking for the entirely largest transaction,
+ * but e.g. for transaction using at least some fraction of the memory limit,
+ * and (c) evicting multiple transactions at once, e.g. to free a given portion
+ * of the memory limit (e.g. 50%).
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTXN(ReorderBuffer *rb)

What is the guarantee that after evicting the largest transaction, we
won't immediately hit the memory limit again?  Say, all of the
transactions are of almost similar size, which I don't think is that
uncommon a case.  Instead, the strategy mentioned in point (c) or
something like that seems more promising.  With that strategy, there is
some risk that it might lead to many smaller disk writes, which we
might want to control via some threshold (like we should not flush more
than N xacts).  With this, we also need to ensure that the total memory
freed is greater than the current change.

I think we had some discussion around this point but didn't reach any
conclusion, which means some more brainstorming is required.
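
To make the idea in (c) concrete, something like the toy sketch below
(all names are made up for illustration, txn->size refers to the
per-transaction accounting added by this patch, and this is not code
from the patch):

/*
 * Toy sketch of strategy (c): evict the largest transactions one by one
 * until at least half of the memory limit has been freed, but never
 * more than max_evictions transactions in one round.
 */
static void
EvictUntilEnoughFreed(ReorderBuffer *rb, Size limit, int max_evictions)
{
    Size    freed = 0;
    int     nevicted = 0;

    while (freed < limit / 2 && nevicted < max_evictions)
    {
        ReorderBufferTXN *txn = ReorderBufferLargestTXN(rb);

        if (txn == NULL)
            break;

        freed += txn->size;                 /* per-transaction accounting */
        ReorderBufferSerializeTXN(rb, txn); /* spill its changes to disk */
        nevicted++;
    }
}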

4.
+int logical_work_mem; /* 4MB */

What does this 4MB in the comment indicate?

5.
+/*
+ * Check whether the logical_work_mem limit was reached, and if yes pick
+ * the transaction tx should spill its data to disk.

The second part of the sentence "pick the transaction tx should spill"
seems to be incomplete.

Apart from this, I see that Peter E. has raised some other points on
this patch which are not yet addressed; those also need some
discussion, so I will respond to them separately with my opinion.

These comments are based on the last patch posted by you on this
thread [1].  You might have fixed some of these already, so ignore if
that is the case.

[1] - https://www.postgresql.org/message-id/76fc440e-91c3-afe2-b78a-987205b3c758%402ndquadrant.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
Hi,

Attached is an updated patch series, rebased on current master. It does
fix one memory accounting bug in ReorderBufferToastReplace (the code was
not properly updating the amount of memory).

I've also included the patch series with decoding of 2PC transactions,
which this depends on. This way we have a chance of making the cfbot
happy. So parts 0001-0004 and 0009-0014 are "this" patch series, while
0005-0008 are the extra pieces from the other patch.

I've done it like this because the initial parts are independent, and so
might be committed irrespective of the other patch series. In practice
that's only reasonable for 0001, which adds the memory limit - the rest
is infrastructure for the streaming of in-progress transactions.

On Wed, Sep 25, 2019 at 06:55:01PM +0530, Amit Kapila wrote:
>On Tue, Sep 3, 2019 at 4:30 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>
>> In the interest of moving things forward, how far are we from making
>> 0001 committable?  If I understand correctly, the rest of this patchset
>> depends on https://commitfest.postgresql.org/24/944/ which seems to be
>> moving at a glacial pace (or, actually, slower, because glaciers do
>> move, which cannot be said of that other patch.)
>>
>
>I am not sure if it is completely correct that the other part of the
>patch is dependent on that CF entry.  I have studied both the threads
>(not every detail) and it seems to me it is dependent on one of the
>patches from that series which handles concurrent aborts.  It is patch
>0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.Jan4.patch
>from what the Nikhil has posted on that thread [1].  Am, I wrong?
>

You're right - the part handling aborts is the only part required. There
are dependencies on some other changes from the 2PC patch, but those are
mostly refactorings that can be undone (e.g. switch from independent
flags to a single bitmap in reorderbuffer).

>So IIUC, the problem of concurrent aborts is that if we allow catalog
>scans for in-progress transactions, then we might get wrong answers in
>cases where somebody has performed Alter-Abort-Alter which is clearly
>explained with an example in email [2].  To solve that problem Nikhil
>seems to have written a patch [1] which detects these concurrent
>aborts during a system table scan and then aborts the decoding of such
>a transaction.
>
>Now, the problem is that patch has written considering 2PC
>transactions and might not deal with all cases for in-progress
>transactions especially when sub-transactions are involved as alluded
>by Arseny Sher [3].  So, the problem seems to be for cases when some
>sub-transaction aborts, but the main transaction still continued and
>we try to decode it.  Nikhil's patch won't be able to deal with it
>because I think it just checks top-level xid whereas for this we need
>to check all-subxids which I think is possible now as Tomas seems to
>have written WAL for each xid-assignment.  It might or might not be
>the best solution to check the status of all-subxids, but I think
>first we need to agree that the problem is just for concurrent aborts
>and that we can solve it by using some part of the technology being
>developed as part of patch "Logical decoding of two-phase
>transactions" (https://commitfest.postgresql.org/24/944/) rather than
>the entire patchset.
>
>I hope I am not saying something very obvious here and it helps in
>moving this patch forward.
>

No, that's a good question, and I'm not sure what the answer is at the
moment. My understanding was that the infrastructure in the 2PC patch is
enough even for subtransactions, but I might be wrong. I need to think
about that for a while.

Maybe we should focus on the 0001 part for now - it can be committed
independently and does provide a useful feature.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Thu, Sep 26, 2019 at 06:58:17PM +0530, Amit Kapila wrote:
>On Tue, Sep 3, 2019 at 4:16 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Mon, Sep 02, 2019 at 06:06:50PM -0400, Alvaro Herrera wrote:
>> >In the interest of moving things forward, how far are we from making
>> >0001 committable?  If I understand correctly, the rest of this patchset
>> >depends on https://commitfest.postgresql.org/24/944/ which seems to be
>> >moving at a glacial pace (or, actually, slower, because glaciers do
>> >move, which cannot be said of that other patch.)
>> >
>>
>> I think 0001 is mostly there. I think there's one bug in this patch
>> version, but I need to check and I'll post an updated version shortly if
>> needed.
>>
>
>Did you get a chance to work on 0001?  I have a few comments on that patch:
>1.
>+ *   To limit the amount of memory used by decoded changes, we track memory
>+ *   used at the reorder buffer level (i.e. total amount of memory), and for
>+ *   each toplevel transaction. When the total amount of used memory exceeds
>+ *   the limit, the toplevel transaction consuming the most memory is either
>+ *   serialized or streamed.
>
>Do we need to mention 'streamed' as part of this patch?  It seems to
>me that this is an independent patch which can be committed without
>patches that stream the changes. So, we can remove it from here and
>other places where it is used.
>

You're right - this patch should not mention streaming because the parts
adding that capability are later in the series. So it can trigger just
the serialization to disk.

>2.
>+ *   deserializing and applying very few changes). We probably to give more
>+ *   memory to the oldest subtransactions.
>
>/We probably to/
>It seems some word is missing after probably.
>

Yes.

>3.
>+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
>+ *
>+ * XXX With many subtransactions this might be quite slow, because we'll have
>+ * to walk through all of them. There are some options how we could improve
>+ * that: (a) maintain some secondary structure with transactions sorted by
>+ * amount of changes, (b) not looking for the entirely largest transaction,
>+ * but e.g. for transaction using at least some fraction of the memory limit,
>+ * and (c) evicting multiple transactions at once, e.g. to free a given portion
>+ * of the memory limit (e.g. 50%).
>+ */
>+static ReorderBufferTXN *
>+ReorderBufferLargestTXN(ReorderBuffer *rb)
>
>What is the guarantee that after evicting largest transaction, we
>won't immediately hit the memory limit?  Say, all of the transactions
>are of almost similar size which I don't think is that uncommon a
>case.

Not sure I understand - what do you mean 'immediately hit'?

We do check the limit after queueing a change, and we know that this
change is what got us over the limit. We pick the largest transaction
(which has to be larger than the change we just entered) and evict it,
getting below the memory limit again.

The next change can get us over the memory limit again, of course, but
there's not much we could do about that.
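
Roughly, the logic is the following (a simplified sketch, not the exact
code from the patch; rb->size and txn->size are the accounting fields
the patch adds, and this assumes the GUC is in kB):

/*
 * Simplified sketch: after queueing a change, if the total memory used
 * by the reorder buffer exceeds the limit, spill the single largest
 * transaction to disk.
 */
static void
CheckMemoryLimit(ReorderBuffer *rb)
{
    ReorderBufferTXN *txn;

    /* nothing to do while we are below the limit */
    if (rb->size < logical_work_mem * 1024L)
        return;

    /*
     * Pick the largest (sub)transaction and spill it.  It has to be at
     * least as large as the change we just queued, so this gets us back
     * below the limit.
     */
    txn = ReorderBufferLargestTXN(rb);

    ReorderBufferSerializeTXN(rb, txn);
}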

>  Instead, the strategy mentioned in point (c) or something like
>that seems more promising.  In that strategy, there is some risk that
>it might lead to many smaller disk writes which we might want to
>control via some threshold (like we should not flush more than N
>xacts).  In this, we also need to ensure that the total memory freed
>must be greater than the current change.
>
>I think we have some discussion around this point but didn't reach any
>conclusion which means some more brainstorming is required.
>

I agree it's worth investigating, but I'm not sure it's necessary before
committing v1 of the feature. I don't think there's a clear winner
strategy, and the current approach works fairly well I think.

The comment is concerned with the cost of ReorderBufferLargestTXN with
many transactions, but we can only have a certain number of top-level
transactions (max_connections + a certain number of not-yet-assigned
subtransactions). And the 0002 patch essentially gets rid of the
subxacts entirely, further reducing the maximum number of xacts to walk.

>4.
>+int logical_work_mem; /* 4MB */
>
>What this 4MB in comments indicate?
>

Sorry, that's a mistake.

>5.
>+/*
>+ * Check whether the logical_work_mem limit was reached, and if yes pick
>+ * the transaction tx should spill its data to disk.
>
>The second part of the sentence "pick the transaction tx should spill"
>seems to be incomplete.
>

Yeah, that's a poor wording. Will fix.

>Apart from this, I see that Peter E. has raised some other points on
>this patch which are not yet addressed as those also need some
>discussion, so I will respond to those separately with my opinion.
>

OK, thanks.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Alvaro Herrera
Date:
On 2019-Sep-26, Tomas Vondra wrote:

> Hi,
> 
> Attached is an updated patch series, rebased on current master. It does
> fix one memory accounting bug in ReorderBufferToastReplace (the code was
> not properly updating the amount of memory).

Cool.

Can we aim to get 0001 pushed during this commitfest, or is that a lost
cause?

The large new comment in reorderbuffer.c says that a transaction might
get spilled *or streamed*, but surely that second thing is not correct,
since before the subsequent patches it's not possible to stream
transactions that have not yet finished?

How certain are you about the approach to measure memory used by a
reorderbuffer transaction ... does it not cause a measurable performance
drop?  I wonder if it would make more sense to use a separate context
per transaction and use context-level accounting (per the patch Jeff
Davis posted elsewhere for hash joins ... though I see now that that
only works for aset.c, not other memcxt implementations), or something
like that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Alvaro Herrera
Date:
On 2019-Sep-26, Alvaro Herrera wrote:

> How certain are you about the approach to measure memory used by a
> reorderbuffer transaction ... does it not cause a measurable performance
> drop?  I wonder if it would make more sense to use a separate contexts
> per transaction and use context-level accounting (per the patch Jeff
> Davis posted elsewhere for hash joins ... though I see now that that
> only works fot aset.c, not other memcxt implementations), or something
> like that.

Oh, I just noticed that that patch was posted separately in its own
thread, and that that improved version does include support for other
memory context implementations.  Excellent.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Thu, Sep 26, 2019 at 04:36:20PM -0300, Alvaro Herrera wrote:
>On 2019-Sep-26, Alvaro Herrera wrote:
>
>> How certain are you about the approach to measure memory used by a
>> reorderbuffer transaction ... does it not cause a measurable performance
>> drop?  I wonder if it would make more sense to use a separate contexts
>> per transaction and use context-level accounting (per the patch Jeff
>> Davis posted elsewhere for hash joins ... though I see now that that
>> only works fot aset.c, not other memcxt implementations), or something
>> like that.
>
>Oh, I just noticed that that patch was posted separately in its own
>thread, and that that improved version does include support for other
>memory context implementations.  Excellent.
>

Unfortunately, that won't fly, for two simple reasons:

1) The memory accounting patch is known to perform poorly with many
child contexts - this was why array_agg/string_agg were problematic,
before we rewrote them not to create a memory context for each group.

It could be done differently (eager accounting) but then the overhead
for regular/common cases (with just a couple of contexts) is higher. So
that seems like a much inferior option.

2) We can't actually have a single context per transaction. Some parts
(REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID) of a transaction are not
evicted, so we'd have to keep them in a separate context.

It'd also mean higher allocation overhead, because right now we can
reuse chunks across transactions - one transaction commits or gets
serialized, and we reuse its chunks for something else. With
per-transaction contexts we'd lose some of this benefit - we could only
reuse chunks within a transaction (i.e. in large transactions that get
spilled to disk) but not across commits.

I don't have any numbers, of course, but I wouldn't be surprised if it
was significant e.g. for small transactions that don't get spilled. And
creating/destroying the contexts is not free either, I think.
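
For comparison, the change-level accounting used by the patch is
conceptually very simple - roughly the sketch below (simplified, with
illustrative names such as ReorderBufferChangeSize; the corresponding
function in the patch is ReorderBufferChangeMemoryUpdate):

/*
 * Simplified sketch: whenever a change is added to or removed from a
 * transaction, adjust both the per-transaction counter and the
 * buffer-wide total.
 */
static void
UpdateMemoryAccounting(ReorderBuffer *rb, ReorderBufferChange *change,
                       bool addition)
{
    Size        sz = ReorderBufferChangeSize(change);
    ReorderBufferTXN *txn = change->txn;

    if (addition)
    {
        txn->size += sz;
        rb->size += sz;
    }
    else
    {
        Assert(txn->size >= sz && rb->size >= sz);
        txn->size -= sz;
        rb->size -= sz;
    }
}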


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Thu, Sep 26, 2019 at 04:33:59PM -0300, Alvaro Herrera wrote:
>On 2019-Sep-26, Tomas Vondra wrote:
>
>> Hi,
>>
>> Attached is an updated patch series, rebased on current master. It does
>> fix one memory accounting bug in ReorderBufferToastReplace (the code was
>> not properly updating the amount of memory).
>
>Cool.
>
>Can we aim to get 0001 pushed during this commitfest, or is that a lost
>cause?
>

It's tempting. The patch has been in the queue for quite a bit of time,
and I think it's solid (at least 0001). I'll address the comments from
Peter's review about separating the GUC etc. and polish it a bit more.
If I manage to do that by Monday, I'll consider pushing it.

If anyone feels I shouldn't do that, let me know.

The one open question pointed out by Amit is how the patch picks the
transaction for eviction. My feeling is that's fine and can be improved
later if necessary, but I'll try to construct a worst case
(max_connections xacts, each with 64 subxacts) to verify.

>The large new comment in reorderbuffer.c says that a transaction might
>get spilled *or streamed*, but surely that second thing is not correct,
>since before the subsequent patches it's not possible to stream
>transactions that have not yet finished?
>

True. That's a residue of reordering the patch series repeatedly, I
think. I'll fix that while polishing the patch.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Sep 27, 2019 at 12:06 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Thu, Sep 26, 2019 at 06:58:17PM +0530, Amit Kapila wrote:
>
> >3.
> >+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
> >+ *
> >+ * XXX With many subtransactions this might be quite slow, because we'll have
> >+ * to walk through all of them. There are some options how we could improve
> >+ * that: (a) maintain some secondary structure with transactions sorted by
> >+ * amount of changes, (b) not looking for the entirely largest transaction,
> >+ * but e.g. for transaction using at least some fraction of the memory limit,
> >+ * and (c) evicting multiple transactions at once, e.g. to free a given portion
> >+ * of the memory limit (e.g. 50%).
> >+ */
> >+static ReorderBufferTXN *
> >+ReorderBufferLargestTXN(ReorderBuffer *rb)
> >
> >What is the guarantee that after evicting largest transaction, we
> >won't immediately hit the memory limit?  Say, all of the transactions
> >are of almost similar size which I don't think is that uncommon a
> >case.
>
> Not sure I understand - what do you mean 'immediately hit'?
>
> We do check the limit after queueing a change, and we know that this
> change is what got us over the limit. We pick the largest transaction
> (which has to be larger than the change we just entered) and evict it,
> getting below the memory limit again.
>
> The next change can get us over the memory limit again, of course,
>

Yeah, this is what I wanted to say when I wrote that it can immediately hit the limit again.

> but
> there's not much we could do about that.
>
> >  Instead, the strategy mentioned in point (c) or something like
> >that seems more promising.  In that strategy, there is some risk that
> >it might lead to many smaller disk writes which we might want to
> >control via some threshold (like we should not flush more than N
> >xacts).  In this, we also need to ensure that the total memory freed
> >must be greater than the current change.
> >
> >I think we have some discussion around this point but didn't reach any
> >conclusion which means some more brainstorming is required.
> >
>
> I agree it's worth investigating, but I'm not sure it's necessary before
> committing v1 of the feature. I don't think there's a clear winner
> strategy, and the current approach works fairly well I think.
>
> The comment is concerned with the cost of ReorderBufferLargestTXN with
> many transactions, but we can only have certain number of top-level
> transactions (max_connections + certain number of not-yet-assigned
> subtransactions). And 0002 patch essentially gets rid of the subxacts
> entirely, further reducing the maximum number of xacts to walk.
>

That would be good, but I don't understand how.  The second patch will
update the subxacts in the top-level ReorderBufferTXN, but it won't
remove them from the hash table.  It also doesn't seem to account for
the size of subxacts in the top-level xact, so I am not sure how it
will reduce the number of xacts to walk.  I might be missing something
here.  Can you explain a bit how the 0002 patch would help in reducing
the maximum number of xacts to walk?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
>
> On 1/3/18 14:53, Tomas Vondra wrote:
> >> I don't see the need to tie this setting to maintenance_work_mem.
> >> maintenance_work_mem is often set to very large values, which could
> >> then have undesirable side effects on this use.
> >
> > Well, we need to pick some default value, and we can either use a fixed
> > value (not sure what would be a good default) or tie it to an existing
> > GUC. We only really have work_mem and maintenance_work_mem, and the
> > walsender process will never use more than one such buffer. Which seems
> > to be closer to maintenance_work_mem.
> >
> > Pretty much any default value can have undesirable side effects.
>
> Let's just make it an independent setting unless we know any better.  We
> don't have a lot of settings that depend on other settings, and the ones
> we do have a very specific relationship.
>
> >> Moreover, the name logical_work_mem makes it sound like it's a logical
> >> version of work_mem.  Maybe we could think of another name.
> >
> > I won't object to a better name, of course. Any proposals?
>
> logical_decoding_[work_]mem?
>

Having a separate variable for this can give more flexibility, but
OTOH it will add one more knob which the user might not have a good
idea how to set.  What problems do we see if we directly use work_mem
for this case?

If we can't use work_mem, then I think the name proposed by you
(logical_decoding_work_mem) sounds good to me.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Sep 26, 2019 at 11:37 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Wed, Sep 25, 2019 at 06:55:01PM +0530, Amit Kapila wrote:
> >On Tue, Sep 3, 2019 at 4:30 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> >>
> >> In the interest of moving things forward, how far are we from making
> >> 0001 committable?  If I understand correctly, the rest of this patchset
> >> depends on https://commitfest.postgresql.org/24/944/ which seems to be
> >> moving at a glacial pace (or, actually, slower, because glaciers do
> >> move, which cannot be said of that other patch.)
> >>
> >
> >I am not sure if it is completely correct that the other part of the
> >patch is dependent on that CF entry.  I have studied both the threads
> >(not every detail) and it seems to me it is dependent on one of the
> >patches from that series which handles concurrent aborts.  It is patch
> >0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.Jan4.patch
> >from what the Nikhil has posted on that thread [1].  Am, I wrong?
> >
>
> You're right - the part handling aborts is the only part required. There
> are dependencies on some other changes from the 2PC patch, but those are
> mostly refactorings that can be undone (e.g. switch from independent
> flags to a single bitmap in reorderbuffer).
>
> >So IIUC, the problem of concurrent aborts is that if we allow catalog
> >scans for in-progress transactions, then we might get wrong answers in
> >cases where somebody has performed Alter-Abort-Alter which is clearly
> >explained with an example in email [2].  To solve that problem Nikhil
> >seems to have written a patch [1] which detects these concurrent
> >aborts during a system table scan and then aborts the decoding of such
> >a transaction.
> >
> >Now, the problem is that patch has written considering 2PC
> >transactions and might not deal with all cases for in-progress
> >transactions especially when sub-transactions are involved as alluded
> >by Arseny Sher [3].  So, the problem seems to be for cases when some
> >sub-transaction aborts, but the main transaction still continued and
> >we try to decode it.  Nikhil's patch won't be able to deal with it
> >because I think it just checks top-level xid whereas for this we need
> >to check all-subxids which I think is possible now as Tomas seems to
> >have written WAL for each xid-assignment.  It might or might not be
> >the best solution to check the status of all-subxids, but I think
> >first we need to agree that the problem is just for concurrent aborts
> >and that we can solve it by using some part of the technology being
> >developed as part of patch "Logical decoding of two-phase
> >transactions" (https://commitfest.postgresql.org/24/944/) rather than
> >the entire patchset.
> >
> >I hope I am not saying something very obvious here and it helps in
> >moving this patch forward.
> >
>
> No, that's a good question, and I'm not sure what the answer is at the
> moment. My understanding was that the infrastructure in the 2PC patch is
> enough even for subtransactions, but I might be wrong.
>

I also think the patch that handles concurrent aborts should be
sufficient, but it needs to be integrated with your patch.  Earlier,
I thought we needed to check whether any of the subtransactions is
aborted, as mentioned by Arseny Sher, but after thinking again about
that problem, it seems that checking only the status of the current
subtransaction should be sufficient.  Because, if the user concurrently
does Rollback to Savepoint, which aborts multiple subtransactions, the
latest one must be aborted as well, which is what I think we want to
detect.  Once we detect that, we have two options:
(a) restart the decoding of that transaction by removing the changes of
all subxacts, or (b) somehow mark the transaction such that it gets
decoded only at commit time.

>
> Maybe we should focus on the 0001 part for now - it can be committed
> indepently and does provide useful feature.
>

If that can be done sooner, then it is fine, but otherwise, preparing
the patches on top of HEAD would facilitate their review.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Fri, Sep 27, 2019 at 02:33:32PM +0530, Amit Kapila wrote:
>On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut
><peter.eisentraut@2ndquadrant.com> wrote:
>>
>> On 1/3/18 14:53, Tomas Vondra wrote:
>> >> I don't see the need to tie this setting to maintenance_work_mem.
>> >> maintenance_work_mem is often set to very large values, which could
>> >> then have undesirable side effects on this use.
>> >
>> > Well, we need to pick some default value, and we can either use a fixed
>> > value (not sure what would be a good default) or tie it to an existing
>> > GUC. We only really have work_mem and maintenance_work_mem, and the
>> > walsender process will never use more than one such buffer. Which seems
>> > to be closer to maintenance_work_mem.
>> >
>> > Pretty much any default value can have undesirable side effects.
>>
>> Let's just make it an independent setting unless we know any better.  We
>> don't have a lot of settings that depend on other settings, and the ones
>> we do have a very specific relationship.
>>
>> >> Moreover, the name logical_work_mem makes it sound like it's a logical
>> >> version of work_mem.  Maybe we could think of another name.
>> >
>> > I won't object to a better name, of course. Any proposals?
>>
>> logical_decoding_[work_]mem?
>>
>
>Having a separate variable for this can give more flexibility, but
>OTOH it will add one more knob which user might not have a good idea
>to set.  What are the problems we see if directly use work_mem for
>this case?
>

IMHO it's similar to autovacuum_work_mem - we have an independent
setting, but most people leave it at -1, so we use maintenance_work_mem
as the default value. I think it makes sense to do the same thing here.

It does add an extra knob anyway (I don't think we should just use
maintenance_work_mem directly, the user should have an option to
override it when needed). But most users will not notice.

FWIW I don't think we should use work_mem, maintenance_work_mem seems
somewhat more appropriate here (not related to queries, etc.).
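
Just to spell out the -1 semantics I have in mind (a trivial sketch,
mirroring how autovacuum_work_mem falls back to maintenance_work_mem;
the function name is illustrative):

/* Sketch: resolve the effective limit, -1 meaning "use m_w_m". */
static int
LogicalDecodingMemLimit(void)
{
    if (logical_work_mem == -1)
        return maintenance_work_mem;

    return logical_work_mem;
}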

>If we can't use work_mem, then I think the name proposed by you
>(logical_decoding_work_mem) sounds good to me.
>

Yes, that name seems better.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Sep 27, 2019 at 4:55 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Fri, Sep 27, 2019 at 02:33:32PM +0530, Amit Kapila wrote:
> >On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut
> ><peter.eisentraut@2ndquadrant.com> wrote:
> >>
> >> On 1/3/18 14:53, Tomas Vondra wrote:
> >> >> I don't see the need to tie this setting to maintenance_work_mem.
> >> >> maintenance_work_mem is often set to very large values, which could
> >> >> then have undesirable side effects on this use.
> >> >
> >> > Well, we need to pick some default value, and we can either use a fixed
> >> > value (not sure what would be a good default) or tie it to an existing
> >> > GUC. We only really have work_mem and maintenance_work_mem, and the
> >> > walsender process will never use more than one such buffer. Which seems
> >> > to be closer to maintenance_work_mem.
> >> >
> >> > Pretty much any default value can have undesirable side effects.
> >>
> >> Let's just make it an independent setting unless we know any better.  We
> >> don't have a lot of settings that depend on other settings, and the ones
> >> we do have a very specific relationship.
> >>
> >> >> Moreover, the name logical_work_mem makes it sound like it's a logical
> >> >> version of work_mem.  Maybe we could think of another name.
> >> >
> >> > I won't object to a better name, of course. Any proposals?
> >>
> >> logical_decoding_[work_]mem?
> >>
> >
> >Having a separate variable for this can give more flexibility, but
> >OTOH it will add one more knob which user might not have a good idea
> >to set.  What are the problems we see if directly use work_mem for
> >this case?
> >
>
> IMHO it's similar to autovacuum_work_mem - we have an independent
> setting, but most people use it as -1 so we use maintenance_work_mem as
> a default value. I think it makes sense to do the same thing here.
>
> It does ad an extra knob anyway (I don't think we should just use
> maintenance_work_mem directly, the user should have an option to
> override it when needed). But most users will not notice.
>
> FWIW I don't think we should use work_mem, maintenace_work_mem seems
> somewhat more appropriate here (not related to queries, etc.).
>

I have the same concern about using maintenance_work_mem as Peter E.,
which is that the value of maintenance_work_mem will generally be
higher, which is suitable for its current purpose, but not for the
purpose this patch is using it for.  AFAIU, at this stage we want a
better memory accounting system for logical decoding and we are not
sure what a good value for this variable is.  So, I think using
work_mem or maintenance_work_mem should serve the purpose.  Later, if
we have requirements from people to have better control over the
memory required for this purpose, then we can introduce a new variable.

I understand that currently work_mem is primarily tied to memory
used for query workspaces, but it might be okay to extend it for this
purpose.  Another point is that its default sounds more appealing for
this case.  I can see the argument against it, which is that having a
separate variable will make things look cleaner and give better
control.  So, if we can't convince ourselves to use work_mem, we can
introduce a new GUC variable and keep the default as 4MB or work_mem.

I feel it is always tempting to introduce a new GUC for a different
task unless there is an exact match, but OTOH, having fewer GUCs
has its own advantage, which is that people don't have to bother about
a new setting which they need to tune and especially for which they
can't decide with ease.  I am not saying that we should not introduce
a new GUC when it is required, but just that we should give it more
thought before doing so.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Sep 26, 2019 at 11:37 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> Hi,
>
> Attached is an updated patch series, rebased on current master. It does
> fix one memory accounting bug in ReorderBufferToastReplace (the code was
> not properly updating the amount of memory).
>

A few comments on 0001:
1.
I am getting the below linking error in pgoutput when compiling the
patch on my Windows system:
pgoutput.obj : error LNK2001: unresolved external symbol _logical_work_mem

You need to use PGDLLIMPORT for logical_work_mem.
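
For reference, the usual fix is to mark the extern declaration of the
GUC variable with PGDLLIMPORT, something like this (in whichever header
declares it):

extern PGDLLIMPORT int logical_work_mem;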

2. After I fixed the above and tried some basic tests, it failed with
the below call stack:
  postgres.exe!ExceptionalCondition(const char *
conditionName=0x00d92854, const char * errorType=0x00d928bc, const
char * fileName=0x00d92e60,
int lineNumber=2148)  Line 55
  postgres.exe!ReorderBufferChangeMemoryUpdate(ReorderBuffer *
rb=0x02693390, ReorderBufferChange * change=0x0269dd38, bool
addition=true)  Line 2148
  postgres.exe!ReorderBufferQueueChange(ReorderBuffer * rb=0x02693390,
unsigned int xid=525, unsigned __int64 lsn=36083720,
ReorderBufferChange
* change=0x0269dd38)  Line 635
  postgres.exe!DecodeInsert(LogicalDecodingContext * ctx=0x0268ef80,
XLogRecordBuffer * buf=0x012cf718)  Line 716 + 0x24 bytes C
  postgres.exe!DecodeHeapOp(LogicalDecodingContext * ctx=0x0268ef80,
XLogRecordBuffer * buf=0x012cf718)  Line 437 + 0xd bytes C
  postgres.exe!LogicalDecodingProcessRecord(LogicalDecodingContext *
ctx=0x0268ef80, XLogReaderState * record=0x0268f228)  Line 129
  postgres.exe!pg_logical_slot_get_changes_guts(FunctionCallInfoBaseData
* fcinfo=0x02688680, bool confirm=true, bool binary=false)  Line 307
  postgres.exe!pg_logical_slot_get_changes(FunctionCallInfoBaseData *
fcinfo=0x02688680)  Line 376

Basically, the assert added by you in ReorderBufferChangeMemoryUpdate
failed.  Then, I explored a bit and it seems that you have missed
assigning a value to txn, a new field added by this patch to the
structure ReorderBufferChange:
@@ -77,6 +82,9 @@ typedef struct ReorderBufferChange
  /* The type of change. */
  enum ReorderBufferChangeType action;

+ /* Transaction this change belongs to. */
+ struct ReorderBufferTXN *txn;


3.
@@ -206,6 +206,17 @@ CREATE SUBSCRIPTION <replaceable
class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>work_mem</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Limits the amount of memory used to decode changes on the
+          publisher.  If not specified, the publisher will use the default
+          specified by <varname>logical_work_mem</varname>.
+         </para>
+        </listitem>
+       </varlistentry>

I don't see any explanation of how this will be useful.  How can a
subscriber predict the amount of memory required by the publisher for
decoding?  This is even more unpredictable because when the changes are
initially recorded in the ReorderBuffer, they aren't yet filtered for
any particular publication.  Do we really need this?  I think giving
more knobs to the user is helpful when they can somehow know how to use
them.  In this case, it is not clear whether the user can ever use this
one.

4. Can we somehow expose the memory consumed by the ReorderBuffer?  If
so, we might be able to write some tests covering the new functionality.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Sat, Sep 28, 2019 at 01:36:46PM +0530, Amit Kapila wrote:
>On Fri, Sep 27, 2019 at 4:55 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Fri, Sep 27, 2019 at 02:33:32PM +0530, Amit Kapila wrote:
>> >On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut
>> ><peter.eisentraut@2ndquadrant.com> wrote:
>> >>
>> >> On 1/3/18 14:53, Tomas Vondra wrote:
>> >> >> I don't see the need to tie this setting to maintenance_work_mem.
>> >> >> maintenance_work_mem is often set to very large values, which could
>> >> >> then have undesirable side effects on this use.
>> >> >
>> >> > Well, we need to pick some default value, and we can either use a fixed
>> >> > value (not sure what would be a good default) or tie it to an existing
>> >> > GUC. We only really have work_mem and maintenance_work_mem, and the
>> >> > walsender process will never use more than one such buffer. Which seems
>> >> > to be closer to maintenance_work_mem.
>> >> >
>> >> > Pretty much any default value can have undesirable side effects.
>> >>
>> >> Let's just make it an independent setting unless we know any better.  We
>> >> don't have a lot of settings that depend on other settings, and the ones
>> >> we do have a very specific relationship.
>> >>
>> >> >> Moreover, the name logical_work_mem makes it sound like it's a logical
>> >> >> version of work_mem.  Maybe we could think of another name.
>> >> >
>> >> > I won't object to a better name, of course. Any proposals?
>> >>
>> >> logical_decoding_[work_]mem?
>> >>
>> >
>> >Having a separate variable for this can give more flexibility, but
>> >OTOH it will add one more knob which user might not have a good idea
>> >to set.  What are the problems we see if directly use work_mem for
>> >this case?
>> >
>>
>> IMHO it's similar to autovacuum_work_mem - we have an independent
>> setting, but most people use it as -1 so we use maintenance_work_mem as
>> a default value. I think it makes sense to do the same thing here.
>>
>> It does ad an extra knob anyway (I don't think we should just use
>> maintenance_work_mem directly, the user should have an option to
>> override it when needed). But most users will not notice.
>>
>> FWIW I don't think we should use work_mem, maintenace_work_mem seems
>> somewhat more appropriate here (not related to queries, etc.).
>>
>
>I have the same concern for using maintenace_work_mem as Peter E.
>which is that the value of maintenace_work_mem will generally be
>higher which is suitable for its current purpose, but not for the
>purpose this patch is using.  AFAIU, at this stage we want a better
>memory accounting system for logical decoding and we are not sure what
>is a good value for this variable.  So, I think using work_mem or
>maintenace_work_mem should serve the purpose.  Later, if we have
>requirements from people to have better control over the memory
>required for this purpose then we can introduce a new variable.
>
>I understand that currently work_mem is primarily tied with memory
>used for query workspaces, but it might be okay to extend it for this
>purpose.  Another point is that the default for that sound to be more
>appealing for this case.  I can see the argument against it which is
>having a separate variable will make the things look clean and give
>better control.  So, if we can't convince ourselves for using
>work_mem, we can introduce a new guc variable and keep the default as
>4MB or work_mem.
>
>I feel it is always tempting to introduce a new guc for the different
>tasks unless there is an exact match, but OTOH, having lesser guc's
>has its own advantage which is that people don't have to bother about
>a new setting which they need to tune and especially for which they
>can't decide with ease.  I am not telling that we should not introduce
>new guc when it is required, but just to give more thought before
>doing so.
>

I do think having a separate GUC is a must, irrespective of what other
GUC (if any) is used as a default. You're right the maintenance_work_mem
value might be too high (e.g. in cases with many subscriptions), but the
same issue applies to work_mem - there's no guarantee work_mem is lower
than maintenance_work_mem, and in analytics databases it may be set very
high. So work_mem does not really solve the issue.

IMHO we can't really do without a new GUC. It's not difficult to create
examples that would benefit from a small/large memory limit, depending
on the number of subscriptions etc.

I do however agree the GUC does not have to be tied to any existing one,
it was just an attempt to use a more sensible default value. I do think
m_w_m would be fine, but I can live with using an explicit value.

So that's what I did in the attached patch - I've renamed the GUC to
logical_decoding_work_mem, detached it from m_w_m and set the default to
64MB (i.e. the same default as m_w_m). It should also fix all the issues
from the recent reviews (at least I believe so).

I've realized that one of the subsequent patches allows overriding the
limit for individual subscriptions (in the CREATE SUBSCRIPTION command).
I think it'd be good to move this bit forward, but I think it can be
done in a separate patch.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Sep 26, 2019 at 11:38 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> No, that's a good question, and I'm not sure what the answer is at the
> moment. My understanding was that the infrastructure in the 2PC patch is
> enough even for subtransactions, but I might be wrong. I need to think
> about that for a while.
>
IIUC, for 2PC it's enough to check whether the main transaction is
aborted or not, but for an in-progress transaction it's possible that
the current subtransaction has made catalog changes and gets aborted
while we are decoding.  So we need to extend the infrastructure such
that we can check the status of the (sub)transaction whose change we
are decoding.  Also, I think we need to handle
ERRCODE_TRANSACTION_ROLLBACK and ignore it.
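
The kind of handling I mean is roughly the pattern below (an
illustrative sketch only, not the attached patch itself;
DecodeInProgressChanges is a hypothetical placeholder for wherever the
changes of the in-progress transaction get applied):

MemoryContext ccxt = CurrentMemoryContext;

PG_TRY();
{
    /* decode/apply the changes of the in-progress transaction */
    DecodeInProgressChanges(rb, txn);
}
PG_CATCH();
{
    ErrorData  *errdata;

    MemoryContextSwitchTo(ccxt);
    errdata = CopyErrorData();

    if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
    {
        /* concurrent abort - discard what we decoded and move on */
        FlushErrorState();
        FreeErrorData(errdata);
    }
    else
    {
        /* any other error is still fatal */
        PG_RE_THROW();
    }
}
PG_END_TRY();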

I have attached a small patch to handle this which can be applied on
top of your patch set.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Sat, Sep 28, 2019 at 01:36:46PM +0530, Amit Kapila wrote:
> >On Fri, Sep 27, 2019 at 4:55 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
>
> I do think having a separate GUC is a must, irrespectedly of what other
> GUC (if any) is used as a default. You're right the maintenance_work_mem
> value might be too high (e.g. in cases with many subscriptions), but the
> same issue applies to work_mem - there's no guarantee work_mem is lower
> than maintenance_work_mem, and in analytics databases it may be set very
> high. So work_mem does not really solve the issue
>
> IMHO we can't really do without a new GUC. It's not difficult to create
> examples that would benefit from small/large memory limit, depending on
> the number of subscriptions etc.
>
> I do however agree the GUC does not have to be tied to any existing one,
> it was just an attempt to use a more sensible default value. I do think
> m_w_m would be fine, but I can live with using an explicit value.
>
> So that's what I did in the attached patch - I've renamed the GUC to
> logical_decoding_work_mem, detached it from m_w_m and set the default to
> 64MB (i.e. the same default as m_w_m).

Fair enough, let's not argue more on this unless someone else wants to
share his opinion.

> It should also fix all the issues
> from the recent reviews (at least I believe so).
>

Have you given any thought to creating a test case for this patch?  I
think you also said that you would test some worst-case scenarios and
report the numbers so that we are convinced that the current eviction
algorithm is good.

> I've realized that one of the subsequent patches allows overriding the
> limit for individual subscriptions (in the CREATE SUBSCRIPTION command).
> I think it'd be good to move this bit forward, but I think it can be
> done in a separate patch.
>

Yeah, it is better to deal with it separately, as I am also not entirely
convinced about this parameter at this stage.  I have mentioned the
same in the previous email as well.

While glancing through the changes, I noticed a small thing:
+#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use maintenance_work_mem

I guess this needs to be updated.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Alvaro Herrera
Date:
On 2019-Sep-29, Amit Kapila wrote:

> On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

> > So that's what I did in the attached patch - I've renamed the GUC to
> > logical_decoding_work_mem, detached it from m_w_m and set the default to
> > 64MB (i.e. the same default as m_w_m).
> 
> Fair enough, let's not argue more on this unless someone else wants to
> share his opinion.

I just read this part of the conversation and I agree that having a
separate GUC with its own value independent from other GUCs is a good
solution.  Tying it to m_w_m seemed reasonable, but it's true that
people frequently set m_w_m very high, and it would be undesirable to
propagate that value to logical decoding memory usage.


I wonder what would constitute good advice on how to set this value - I
mean, what is the metric that the user needs to be thinking about.  Is
it the total memory required to keep all concurrent write transactions
in memory?  (Quick example: if you do 2048 wTPS, each transaction
lasts 1s, and each transaction produces 1kB of logically decoded
changes, then ~2MB is sufficient for the average case.  Is that
correct?  I *think* that full-page images do not count, correct?  With
these things in mind users could go through pg_waldump output and
figure out what to set the value to.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Sun, Sep 29, 2019 at 02:30:44PM -0300, Alvaro Herrera wrote:
>On 2019-Sep-29, Amit Kapila wrote:
>
>> On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>> > So that's what I did in the attached patch - I've renamed the GUC to
>> > logical_decoding_work_mem, detached it from m_w_m and set the default to
>> > 64MB (i.e. the same default as m_w_m).
>>
>> Fair enough, let's not argue more on this unless someone else wants to
>> share his opinion.
>
>I just read this part of the conversation and I agree that having a
>separate GUC with its own value independent from other GUCs is a good
>solution.  Tying it to m_w_m seemed reasonable, but it's true that
>people frequently set m_w_m very high, and it would be undesirable to
>propagate that value to logical decoding memory usage.
>
>
>I wonder what would constitute good advice on how to set this value, I
>mean what is the metric that the user needs to be thinking about.   Is
>it the total of memory required to keep all concurrent write transactions
>in memory?  (Quick example: if you do 2048 wTPS and each transaction
>lasts 1s, and each transaction does 1kB of logically-decoded changes,
>then ~2MB are sufficient for the average case.  Is that correct? 

Yes, something like that. Essentially we'd like to keep all concurrent
transactions decoded in memory, to eliminate the need to spill to disk.
One of the subsequent patches adds some subscription-level stats, so
maybe we don't need to worry about this too much - the stats seem like a
better source of information for tuning.

>I *think* that full-page images do not count, correct?  With these
>things in mind users could go through pg_waldump output and figure out
>what to set the value to.)
>

Right, FPW do not matter here.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Sun, Sep 29, 2019 at 11:24 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
>
> Yeah, it is better to deal it separately as I am also not entirely
> convinced at this stage about this parameter.  I have mentioned the
> same in the previous email as well.
>
> While glancing through the changes, I noticed a small thing:
> +#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use maintenance_work_mem
>
> I guess this need to be updated.
>

On further testing, I found that the patch seems to have problems with toast.  Consider the below scenario:
Session-1
Create table large_text(t1 text);
INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);

Session-2
SELECT * FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);  --kaboom

The second statement in Session-2 leads to a crash.

Other than that, I am not sure whether the changes that spill to disk once logical_decoding_work_mem is exceeded work for toast tables, as I couldn't hit that code for the toast-table case, but I might be missing something.  As mentioned previously, I feel there should be some way to test whether this patch works for the cases it claims to handle.  As of now, I have to check via debugging.  Let me know if there is any way I can test this.

I am reluctant to say it, but I think this patch still needs some more work (review, test, rework) before we can commit it.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:
>On Sun, Sep 29, 2019 at 11:24 AM Amit Kapila <amit.kapila16@gmail.com>
>wrote:
>> On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>> >
>>
>> Yeah, it is better to deal it separately as I am also not entirely
>> convinced at this stage about this parameter.  I have mentioned the
>> same in the previous email as well.
>>
>> While glancing through the changes, I noticed a small thing:
>> +#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use
>maintenance_work_mem
>>
>> I guess this need to be updated.
>>
>
>On further testing, I found that the patch seems to have problems with
>toast.  Consider below scenario:
>Session-1
>Create table large_text(t1 text);
>INSERT INTO large_text
>SELECT (SELECT string_agg('x', ',')
>FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
>
>Session-2
>SELECT * FROM pg_create_logical_replication_slot('regression_slot',
>'test_decoding');
>SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
>*--kaboom*
>
>The second statement in Session-2 leads to a crash.
>

OK, thanks for the report - will investigate.

>Other than that, I am not sure if the changes related to spill to disk
>after logical_decoding_work_mem works for toast table as I couldn't hit
>that code for toast table case, but I might be missing something.  As
>mentioned previously, I feel there should be some way to test whether this
>patch works for the cases it claims to work.  As of now, I have to check
>via debugging.  Let me know if there is any way, I can test this.
>

That's one of the reasons why I proposed to move the statistics (which
say how many transactions / bytes were spilled to disk) forward from a
later patch in the series. I don't think there's a better way.

>I am reluctant to say, but I think this patch still needs some more work
>(review, test, rework) before we can commit it.
>

I agree.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:
> >
> >On further testing, I found that the patch seems to have problems with
> >toast.  Consider below scenario:
> >Session-1
> >Create table large_text(t1 text);
> >INSERT INTO large_text
> >SELECT (SELECT string_agg('x', ',')
> >FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
> >
> >Session-2
> >SELECT * FROM pg_create_logical_replication_slot('regression_slot',
> >'test_decoding');
> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
> >--kaboom
> >
> >The second statement in Session-2 leads to a crash.
> >
>
> OK, thanks for the report - will investigate.
>

It was an assertion failure in ReorderBufferCleanupTXN at below line:
+ /* Check we're not mixing changes from different transactions. */
+ Assert(change->txn == txn);

> >Other than that, I am not sure if the changes related to spill to disk
> >after logical_decoding_work_mem works for toast table as I couldn't hit
> >that code for toast table case, but I might be missing something.  As
> >mentioned previously, I feel there should be some way to test whether this
> >patch works for the cases it claims to work.  As of now, I have to check
> >via debugging.  Let me know if there is any way, I can test this.
> >
>
> That's one of the reasons why I proposed to move the statistics (which
> say how many transactions / bytes were spilled to disk) from a later
> patch in the series. I don't think there's a better way.
>

I like that idea, but I think you need to split that patch to only get
the stats related to the spill.  It would be easier to review if you
can prepare that atop of
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote:
>On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
>wrote:
>
>> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:
>> >
>> >On further testing, I found that the patch seems to have problems with
>> >toast.  Consider below scenario:
>> >Session-1
>> >Create table large_text(t1 text);
>> >INSERT INTO large_text
>> >SELECT (SELECT string_agg('x', ',')
>> >FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
>> >
>> >Session-2
>> >SELECT * FROM pg_create_logical_replication_slot('regression_slot',
>> >'test_decoding');
>> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
>> >*--kaboom*
>> >
>> >The second statement in Session-2 leads to a crash.
>> >
>>
>> OK, thanks for the report - will investigate.
>>
>
>It was an assertion failure in ReorderBufferCleanupTXN at below line:
>+ /* Check we're not mixing changes from different transactions. */
>+ Assert(change->txn == txn);
>

Can you still reproduce this issue with the patch I sent on 28/9? I have
been unable to trigger the failure, and it seems pretty similar to the
failure you reported (and I fixed) on 28/9.

>> >Other than that, I am not sure if the changes related to spill to disk
>> >after logical_decoding_work_mem works for toast table as I couldn't hit
>> >that code for toast table case, but I might be missing something.  As
>> >mentioned previously, I feel there should be some way to test whether this
>> >patch works for the cases it claims to work.  As of now, I have to check
>> >via debugging.  Let me know if there is any way, I can test this.
>> >
>>
>> That's one of the reasons why I proposed to move the statistics (which
>> say how many transactions / bytes were spilled to disk) from a later
>> patch in the series. I don't think there's a better way.
>>
>>
>I like that idea, but I think you need to split that patch to only get the
>stats related to the spill.  It would be easier to review if you can
>prepare that atop of
>0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.
>

Sure, I wasn't really proposing to add all stats from that patch,
including those related to streaming.  We need to extract just those
related to spilling. And yes, it needs to be moved right after 0001.
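
To be clear, the counters in question are essentially just these three
fields in ReorderBuffer (a sketch; the exact names and types in the
extracted patch may differ):

    /* statistics about spilling transactions to disk */
    int64    spillTxns;     /* transactions spilled to disk at least once */
    int64    spillCount;    /* number of times any transaction was spilled */
    int64    spillBytes;    /* total amount of decoded change data spilled */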

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
I have attempted to test the performance of (Stream + Spill) vs
(Stream + BGW pool), and I can see a gain similar to what Alexey had
shown [1].

In addition to this, I have rebased the latest patchset [2] without
the two-phase logical decoding patch set.

Test results:
I have repeated the same test as Alexey [1] for 1kk and 3kk rows, and
here are my results:

Stream + Spill
N      time on master (sec)   Total xact time (sec)
1kk    6                      21
3kk    18                     55

Stream + BGW pool
N      time on master (sec)   Total xact time (sec)
1kk    6                      13
3kk    19                     35

Patch details:
All the patches are the same as posted in [2], except:
1. 0006-Gracefully-handle-concurrent-aborts-of-uncommitted -> I have
removed the error handling that is specific to 2PC.
2. 0007-Implement-streaming-mode-in-ReorderBuffer -> Rebased without 2PC.
3. 0009-Extend-the-concurrent-abort-handling-for-in-progress -> New
patch to handle the concurrent-abort error for in-progress transactions,
and also to add handling for a subtransaction's abort.
4. v3-0014-BGWorkers-pool-for-streamed-transactions-apply -> Rebased
Alexey's patch.

[1] https://www.postgresql.org/message-id/8eda5118-2dd0-79a1-4fe9-eec7e334de17%40postgrespro.ru
[2] https://www.postgresql.org/message-id/20190928190917.hrpknmq76v3ts3lj%40development

On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote:
> >On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
> >wrote:
> >
> >> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:
> >> >
> >> >On further testing, I found that the patch seems to have problems with
> >> >toast.  Consider below scenario:
> >> >Session-1
> >> >Create table large_text(t1 text);
> >> >INSERT INTO large_text
> >> >SELECT (SELECT string_agg('x', ',')
> >> >FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
> >> >
> >> >Session-2
> >> >SELECT * FROM pg_create_logical_replication_slot('regression_slot',
> >> >'test_decoding');
> >> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
> >> >*--kaboom*
> >> >
> >> >The second statement in Session-2 leads to a crash.
> >> >
> >>
> >> OK, thanks for the report - will investigate.
> >>
> >
> >It was an assertion failure in ReorderBufferCleanupTXN at below line:
> >+ /* Check we're not mixing changes from different transactions. */
> >+ Assert(change->txn == txn);
> >
>
> Can you still reproduce this issue with the patch I sent on 28/9? I have
> been unable to trigger the failure, and it seems pretty similar to the
> failure you reported (and I fixed) on 28/9.
>
> >> >Other than that, I am not sure if the changes related to spill to disk
> >> >after logical_decoding_work_mem works for toast table as I couldn't hit
> >> >that code for toast table case, but I might be missing something.  As
> >> >mentioned previously, I feel there should be some way to test whether this
> >> >patch works for the cases it claims to work.  As of now, I have to check
> >> >via debugging.  Let me know if there is any way, I can test this.
> >> >
> >>
> >> That's one of the reasons why I proposed to move the statistics (which
> >> say how many transactions / bytes were spilled to disk) from a later
> >> patch in the series. I don't think there's a better way.
> >>
> >>
> >I like that idea, but I think you need to split that patch to only get the
> >stats related to the spill.  It would be easier to review if you can
> >prepare that atop of
> >0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.
> >
>
> Sure, I wasn't really proposing to adding all stats from that patch,
> including those related to streaming.  We need to extract just those
> related to spilling. And yes, it needs to be moved right after 0001.
>
> regards
>
> --
> Tomas Vondra                  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote:
>On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
>wrote:
>
>> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:
>> >
>> >On further testing, I found that the patch seems to have problems with
>> >toast.  Consider below scenario:
>> >Session-1
>> >Create table large_text(t1 text);
>> >INSERT INTO large_text
>> >SELECT (SELECT string_agg('x', ',')
>> >FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
>> >
>> >Session-2
>> >SELECT * FROM pg_create_logical_replication_slot('regression_slot',
>> >'test_decoding');
>> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
>> >*--kaboom*
>> >
>> >The second statement in Session-2 leads to a crash.
>> >
>>
>> OK, thanks for the report - will investigate.
>>
>
>It was an assertion failure in ReorderBufferCleanupTXN at below line:
>+ /* Check we're not mixing changes from different transactions. */
>+ Assert(change->txn == txn);
>

Can you still reproduce this issue with the patch I sent on 28/9? I have
been unable to trigger the failure, and it seems pretty similar to the
failure you reported (and I fixed) on 28/9.

Yes, it seems we need a similar change in ReorderBufferAddNewTupleCids.  I think you need to create the replication slot in session-2 before creating the table in session-1 to see this problem.

--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2196,6 +2196,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
        change->data.tuplecid.cmax = cmax;
        change->data.tuplecid.combocid = combocid;
        change->lsn = lsn;
+       change->txn = txn;
        change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
        dlist_push_tail(&txn->tuplecids, &change->node);

Few more comments:
-----------------------------------
1.
+static bool
+check_logical_decoding_work_mem(int *newval, void **extra, GucSource source)
+{
+ /*
+ * -1 indicates fallback.
+ *
+ * If we haven't yet changed the boot_val default of -1, just let it be.
+ * logical decoding will look to maintenance_work_mem instead.
+ */
+ if (*newval == -1)
+ return true;
+
+ /*
+ * We clamp manually-set values to at least 64kB. The maintenance_work_mem
+ * uses a higher minimum value (1MB), so this is OK.
+ */
+ if (*newval < 64)
+ *newval = 64;

I think this needs to be changed now that we don't rely on maintenance_work_mem.  Another thing related to this is that the default value for logical_decoding_work_mem still seems to be -1.  We need to make it 64MB.  I noticed this while debugging the memory accounting changes.  I think this is the reason why I was not seeing toast-related changes being serialized: in that test, I hadn't changed the default value of logical_decoding_work_mem.
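
For concreteness, the guc.c entry could end up looking roughly like this
once the -1 fallback is dropped (just a sketch; the exact default/min
values are still up for discussion):

    {
        {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
            gettext_noop("Sets the maximum memory to be used for logical decoding."),
            gettext_noop("This much memory can be used by each internal "
                         "reorder buffer before spilling to disk or streaming."),
            GUC_UNIT_KB
        },
        &logical_decoding_work_mem,
        65536, 64, MAX_KILOBYTES,    /* default 64MB, minimum 64kB */
        NULL, NULL, NULL             /* no check hook needed without the -1 fallback */
    },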

2.
+ /*
+ * We're going modify the size of the change, so to make sure the
+ * accounting is correct we'll make it look like we're removing the
+ * change now (with the old size), and then re-add it at the end.
+ */


/going modify/going to modify/

3.
+ *
+ * While updating the existing change with detoasted tuple data, we need to
+ * update the memory accounting info, because the change size will differ.
+ * Otherwise the accounting may get out of sync, triggering serialization
+ * at unexpected times.
+ *
+ * We simply subtract size of the change before rejiggering the tuple, and
+ * then adding the new size. This makes it look like the change was removed
+ * and then added back, except it only tweaks the accounting info.
+ *
+ * In particular it can't trigger serialization, which would be pointless
+ * anyway as it happens during commit processing right before handing
+ * the change to the output plugin.
  */
 static void
 ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
@@ -3023,6 +3281,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
  if (txn->toast_hash == NULL)
  return;
 
+ /*
+ * We're going modify the size of the change, so to make sure the
+ * accounting is correct we'll make it look like we're removing the
+ * change now (with the old size), and then re-add it at the end.
+ */
+ ReorderBufferChangeMemoryUpdate(rb, change, false);

It is not very clear why this change is required.  Basically, this is done at commit time, after which we shouldn't actually attempt to spill these changes.  This is mentioned in the comments as well, but if that is the case, it is not clear how and when the accounting can create a problem.  If possible, can you explain it with an example?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have attempted to test the performance of (Stream + Spill) vs
> (Stream + BGW pool) and I can see the similar gain what Alexey had
> shown[1].
>
> In addition to this, I have rebased the latest patchset [2] without
> the two-phase logical decoding patch set.
>
> Test results:
> I have repeated the same test as Alexy[1] for 1kk and 1kk data and
> here is my result
> Stream + Spill
> N           time on master(sec)   Total xact time (sec)
> 1kk               6                               21
> 3kk             18                               55
>
> Stream + BGW pool
> N          time on master(sec)  Total xact time (sec)
> 1kk              6                              13
> 3kk            19                              35
>
> Patch details:
> All the patches are the same as posted on [2] except
> 1. 0006-Gracefully-handle-concurrent-aborts-of-uncommitted -> I have
> removed the handling of error which is specific for 2PC

Here [1], I mentioned that I had removed the 2PC changes from
this [0006] patch, but I mistakenly attached the original patch itself
instead of the modified version. So I am attaching the modified version
of only this patch; the other patches are the same.

> 2. 0007-Implement-streaming-mode-in-ReorderBuffer -> Rebased without 2PC
> 3. 0009-Extend-the-concurrent-abort-handling-for-in-progress -> New
> patch to handle concurrent abort error for the in-progress transaction
> and also add handling for the sub transaction's abort.
> 4. v3-0014-BGWorkers-pool-for-streamed-transactions-apply -> Rebased
> Alexey's patch

[1] https://www.postgresql.org/message-id/CAFiTN-vHoksqvV4BZ0479NhugGe4QHq_ezngNdDd-YRQ_2cwug%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Oct 3, 2019 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote:
>> >On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
>> >wrote:
>> >
>> >> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:
>> >> >
>> >> >On further testing, I found that the patch seems to have problems with
>> >> >toast.  Consider below scenario:
>> >> >Session-1
>> >> >Create table large_text(t1 text);
>> >> >INSERT INTO large_text
>> >> >SELECT (SELECT string_agg('x', ',')
>> >> >FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
>> >> >
>> >> >Session-2
>> >> >SELECT * FROM pg_create_logical_replication_slot('regression_slot',
>> >> >'test_decoding');
>> >> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
>> >> >*--kaboom*
>> >> >
>> >> >The second statement in Session-2 leads to a crash.
>> >> >
>> >>
>> >> OK, thanks for the report - will investigate.
>> >>
>> >
>> >It was an assertion failure in ReorderBufferCleanupTXN at below line:
>> >+ /* Check we're not mixing changes from different transactions. */
>> >+ Assert(change->txn == txn);
>> >
>>
>> Can you still reproduce this issue with the patch I sent on 28/9? I have
>> been unable to trigger the failure, and it seems pretty similar to the
>> failure you reported (and I fixed) on 28/9.
>
>
> Yes, it seems we need a similar change in ReorderBufferAddNewTupleCids.  I think in session-2 you need to create
replication slot before creating table in session-1 to see this problem.
>
> --- a/src/backend/replication/logical/reorderbuffer.c
> +++ b/src/backend/replication/logical/reorderbuffer.c
> @@ -2196,6 +2196,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
>         change->data.tuplecid.cmax = cmax;
>         change->data.tuplecid.combocid = combocid;
>         change->lsn = lsn;
> +       change->txn = txn;
>         change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
>         dlist_push_tail(&txn->tuplecids, &change->node);
>
> Few more comments:
> -----------------------------------
> 1.
> +static bool
> +check_logical_decoding_work_mem(int *newval, void **extra, GucSource source)
> +{
> + /*
> + * -1 indicates fallback.
> + *
> + * If we haven't yet changed the boot_val default of -1, just let it be.
> + * logical decoding will look to maintenance_work_mem instead.
> + */
> + if (*newval == -1)
> + return true;
> +
> + /*
> + * We clamp manually-set values to at least 64kB. The maintenance_work_mem
> + * uses a higher minimum value (1MB), so this is OK.
> + */
> + if (*newval < 64)
> + *newval = 64;
>
> I think this needs to be changed as now we don't rely on maintenance_work_mem.  Another thing related to this is that
I think the default value for logical_decoding_work_mem still seems to be -1.  We need to make it to 64MB.  I have seen
this while debugging memory accounting changes.  I think this is the reason why I was not seeing toast related changes
being serialized because, in that test, I haven't changed the default value of logical_decoding_work_mem.
>
> 2.
> + /*
> + * We're going modify the size of the change, so to make sure the
> + * accounting is correct we'll make it look like we're removing the
> + * change now (with the old size), and then re-add it at the end.
> + */
>
>
> /going modify/going to modify/
>
> 3.
> + *
> + * While updating the existing change with detoasted tuple data, we need to
> + * update the memory accounting info, because the change size will differ.
> + * Otherwise the accounting may get out of sync, triggering serialization
> + * at unexpected times.
> + *
> + * We simply subtract size of the change before rejiggering the tuple, and
> + * then adding the new size. This makes it look like the change was removed
> + * and then added back, except it only tweaks the accounting info.
> + *
> + * In particular it can't trigger serialization, which would be pointless
> + * anyway as it happens during commit processing right before handing
> + * the change to the output plugin.
>   */
>  static void
>  ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
> @@ -3023,6 +3281,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
>   if (txn->toast_hash == NULL)
>   return;
>
> + /*
> + * We're going modify the size of the change, so to make sure the
> + * accounting is correct we'll make it look like we're removing the
> + * change now (with the old size), and then re-add it at the end.
> + */
> + ReorderBufferChangeMemoryUpdate(rb, change, false);
>
> It is not very clear why this change is required.  Basically, this is done at commit time after which actually we
shouldn't attempt to spill these changes.  This is mentioned in comments as well, but it is not clear if that is the
case, then how and when accounting can create a problem.  If possible, can you explain it with an example?
>
IIUC, we are keeping track of the memory in the ReorderBuffer, which is
common across the transactions.  So even if this transaction is
committing and will not spill to disk, we still need to keep the memory
accounting correct for future changes in other transactions.
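
Concretely, the pattern in ReorderBufferToastReplace() amounts to the
following (a sketch based on the hunk quoted above; the matching re-add
with 'true' at the end of the function is implied by the comment):

    /* forget the old (toasted) size of the change */
    ReorderBufferChangeMemoryUpdate(rb, change, false);

    /* ... replace toast pointers with the detoasted data, which changes
     * the size of the change ... */

    /* account for the new (detoasted) size again, per the comment,
     * at the end of the function */
    ReorderBufferChangeMemoryUpdate(rb, change, true);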

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Sun, Oct 13, 2019 at 12:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Oct 3, 2019 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > 3.
> > + *
> > + * While updating the existing change with detoasted tuple data, we need to
> > + * update the memory accounting info, because the change size will differ.
> > + * Otherwise the accounting may get out of sync, triggering serialization
> > + * at unexpected times.
> > + *
> > + * We simply subtract size of the change before rejiggering the tuple, and
> > + * then adding the new size. This makes it look like the change was removed
> > + * and then added back, except it only tweaks the accounting info.
> > + *
> > + * In particular it can't trigger serialization, which would be pointless
> > + * anyway as it happens during commit processing right before handing
> > + * the change to the output plugin.
> >   */
> >  static void
> >  ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > @@ -3023,6 +3281,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
> >   if (txn->toast_hash == NULL)
> >   return;
> >
> > + /*
> > + * We're going modify the size of the change, so to make sure the
> > + * accounting is correct we'll make it look like we're removing the
> > + * change now (with the old size), and then re-add it at the end.
> > + */
> > + ReorderBufferChangeMemoryUpdate(rb, change, false);
> >
> > It is not very clear why this change is required.  Basically, this is done at commit time after which actually we
shouldn't attempt to spill these changes.  This is mentioned in comments as well, but it is not clear if that is the
case, then how and when accounting can create a problem.  If possible, can you explain it with an example?
> >
> IIUC, we are keeping the track of the memory in ReorderBuffer which is
> common across the transactions.  So even if this transaction is
> committing and will not spill to dis but we need to keep the memory
> accounting correct for the future changes in other transactions.
>

You are right.  I somehow missed that we need to keep the size
computation in sync even during commit for other in-progress
transactions in the ReorderBuffer.  You can ignore this point or maybe
slightly adjust the comment to make it explicit.
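
Maybe adjust it along these lines (just a wording sketch):

    /*
     * We're going to modify the size of the change.  To keep the accounting
     * correct (not just for this transaction, but for the whole ReorderBuffer,
     * which is shared by all in-progress transactions), make it look like we
     * remove the change now (with the old size) and re-add it at the end
     * (with the new size).
     */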

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Craig Ringer
Date:
On Sun, 13 Oct 2019 at 19:50, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sun, Oct 13, 2019 at 12:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Oct 3, 2019 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > 3.
> > + *
> > + * While updating the existing change with detoasted tuple data, we need to
> > + * update the memory accounting info, because the change size will differ.
> > + * Otherwise the accounting may get out of sync, triggering serialization
> > + * at unexpected times.
> > + *
> > + * We simply subtract size of the change before rejiggering the tuple, and
> > + * then adding the new size. This makes it look like the change was removed
> > + * and then added back, except it only tweaks the accounting info.
> > + *
> > + * In particular it can't trigger serialization, which would be pointless
> > + * anyway as it happens during commit processing right before handing
> > + * the change to the output plugin.
> >   */
> >  static void
> >  ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > @@ -3023,6 +3281,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
> >   if (txn->toast_hash == NULL)
> >   return;
> >
> > + /*
> > + * We're going modify the size of the change, so to make sure the
> > + * accounting is correct we'll make it look like we're removing the
> > + * change now (with the old size), and then re-add it at the end.
> > + */
> > + ReorderBufferChangeMemoryUpdate(rb, change, false);
> >
> > It is not very clear why this change is required.  Basically, this is done at commit time after which actually we shouldn't attempt to spill these changes.  This is mentioned in comments as well, but it is not clear if that is the case, then how and when accounting can create a problem.  If possible, can you explain it with an example?
> >
> IIUC, we are keeping the track of the memory in ReorderBuffer which is
> common across the transactions.  So even if this transaction is
> committing and will not spill to dis but we need to keep the memory
> accounting correct for the future changes in other transactions.
>

You are right.  I somehow missed that we need to keep the size
computation in sync even during commit for other in-progress
transactions in the ReorderBuffer.  You can ignore this point or maybe
slightly adjust the comment to make it explicit.

Does anyone object if we add the reorder buffer total size & in-memory size to struct WalSnd too, so we can report it in pg_stat_replication? 

I can follow up with a patch to add on top of this one if you think it's reasonable. I'll also take the opportunity to add a number of tracepoints across the walsender and logical decoding, since right now it's very opaque in production systems ... and everyone just LOVES hunting down debug syms and attaching gdb to production DBs.
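
For example, something like this, using the existing probes.d machinery
(the probe name and arguments below are made up, purely to illustrate):

    /* hypothetical probe added to src/backend/utils/probes.d:
     *     probe reorder__buffer__spill(unsigned int, long, long);
     * which the build turns into a TRACE_POSTGRESQL_REORDER_BUFFER_SPILL() macro */

    /* hypothetical call at the spill site in reorderbuffer.c */
    TRACE_POSTGRESQL_REORDER_BUFFER_SPILL(txn->xid, txn->nentries_mem, size);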

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Oct 14, 2019 at 6:51 AM Craig Ringer <craig@2ndquadrant.com> wrote:
>
> On Sun, 13 Oct 2019 at 19:50, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>
>
> Does anyone object if we add the reorder buffer total size & in-memory size to struct WalSnd too, so we can report it
in pg_stat_replication?
>

There is already a patch
(0011-Track-statistics-for-streaming-spilling) in this series posted
by Tomas [1], which tracks important statistics in WalSnd that I think
are good enough.  Have you checked that?  I am not sure if adding the
additional sizes will help, but I might be missing something.

> I can follow up with a patch to add on top of this one if you think it's reasonable. I'll also take the opportunity
to add a number of tracepoints across the walsender and logical decoding, since right now it's very opaque in production
systems ... and everyone just LOVES hunting down debug syms and attaching gdb to production DBs.
>

Sure, adding tracepoints can be helpful, but isn't it better to start
that as a separate thread?

[1] - https://www.postgresql.org/message-id/20190928190917.hrpknmq76v3ts3lj%40development

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote:
> >On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
> >wrote:
> >
> >> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:
> >> >
> >> >On further testing, I found that the patch seems to have problems with
> >> >toast.  Consider below scenario:
> >> >Session-1
> >> >Create table large_text(t1 text);
> >> >INSERT INTO large_text
> >> >SELECT (SELECT string_agg('x', ',')
> >> >FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
> >> >
> >> >Session-2
> >> >SELECT * FROM pg_create_logical_replication_slot('regression_slot',
> >> >'test_decoding');
> >> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
> >> >*--kaboom*
> >> >
> >> >The second statement in Session-2 leads to a crash.
> >> >
> >>
> >> OK, thanks for the report - will investigate.
> >>
> >
> >It was an assertion failure in ReorderBufferCleanupTXN at below line:
> >+ /* Check we're not mixing changes from different transactions. */
> >+ Assert(change->txn == txn);
> >
>
> Can you still reproduce this issue with the patch I sent on 28/9? I have
> been unable to trigger the failure, and it seems pretty similar to the
> failure you reported (and I fixed) on 28/9.
>
> >> >Other than that, I am not sure if the changes related to spill to disk
> >> >after logical_decoding_work_mem works for toast table as I couldn't hit
> >> >that code for toast table case, but I might be missing something.  As
> >> >mentioned previously, I feel there should be some way to test whether this
> >> >patch works for the cases it claims to work.  As of now, I have to check
> >> >via debugging.  Let me know if there is any way, I can test this.
> >> >
> >>
> >> That's one of the reasons why I proposed to move the statistics (which
> >> say how many transactions / bytes were spilled to disk) from a later
> >> patch in the series. I don't think there's a better way.
> >>
> >>
> >I like that idea, but I think you need to split that patch to only get the
> >stats related to the spill.  It would be easier to review if you can
> >prepare that atop of
> >0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.
> >
>
> Sure, I wasn't really proposing to adding all stats from that patch,
> including those related to streaming.  We need to extract just those
> related to spilling. And yes, it needs to be moved right after 0001.
>
I have extracted the spilling-related code into a separate patch on top
of 0001.  I have also fixed some bugs and review comments and attached
those as a separate patch.  Later I can merge it into the main patch if
you agree with the changes.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Oct 14, 2019 at 3:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> >
> > Sure, I wasn't really proposing to adding all stats from that patch,
> > including those related to streaming.  We need to extract just those
> > related to spilling. And yes, it needs to be moved right after 0001.
> >
> I have extracted the spilling related code to a separate patch on top
> of 0001.  I have also fixed some bugs and review comments and attached
> as a separate patch.  Later I can merge it to the main patch if you
> agree with the changes.
>

Few comments
-------------------------
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer
1.
+ {
+ {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
+ gettext_noop("Sets the maximum memory to be used for logical decoding."),
+ gettext_noop("This much memory can be used by each internal "
+ "reorder buffer before spilling to disk or streaming."),
+ GUC_UNIT_KB
+ },

I think we can remove 'or streaming' from the above sentence for now.  We
can add it later with the patch where streaming will be allowed.

2.
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable
class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>work_mem</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Limits the amount of memory used to decode changes on the
+          publisher.  If not specified, the publisher will use the default
+          specified by <varname>logical_decoding_work_mem</varname>. When
+          needed, additional data are spilled to disk.
+         </para>
+        </listitem>
+       </varlistentry>

It is not clear why we need this parameter, at least with this patch.
I have raised this multiple times [1][2].

bugs_and_review_comments_fix
1.
},
  &logical_decoding_work_mem,
- -1, -1, MAX_KILOBYTES,
- check_logical_decoding_work_mem, NULL, NULL
+ 65536, 64, MAX_KILOBYTES,
+ NULL, NULL, NULL

I think the default value should be 1MB similar to
maintenance_work_mem.  The same was true before this change.

2. -#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use
maintenance_work_mem
+i#logical_decoding_work_mem = 64MB # min 64kB

It seems the 'i' is a leftover character in the above change.  Also,
change the default value considering the previous point.

3.
@@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)

  /* update the statistics */
  rb->spillCount += 1;
- rb->spillTxns += txn->serialized ? 1 : 0;
+ rb->spillTxns += txn->serialized ? 0 : 1;
  rb->spillBytes += size;

Why is this change required?  Shouldn't we increase the spillTxns
count only when the txn is serialized?

0002-Track-statistics-for-spilling
1.
+    <row>
+     <entry><structfield>spill_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of transactions spilled to disk after the memory used by
+      logical decoding exceeds <literal>logical_work_mem</literal>. The
+      counter gets incremented both for toplevel transactions and
+      subtransactions.
+      </entry>
+    </row>

The parameter name is wrong here. /logical_work_mem/logical_decoding_work_mem

2.
+    <row>
+     <entry><structfield>spill_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of transactions spilled to disk after the memory used by
+      logical decoding exceeds <literal>logical_work_mem</literal>. The
+      counter gets incremented both for toplevel transactions and
+      subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>spill_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times transactions were spilled to disk. Transactions
+      may get spilled repeatedly, and this counter gets incremented on every
+      such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>spill_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded transaction data spilled to disk.
+      </entry>
+    </row>

In all the above cases, the explanation text starts immediately after
the <entry> tag, but the general coding practice is to start on the next
line; see the explanation of nearby parameters.

It seems these parameters are added to pg-stat-wal-receiver-view in
the docs, but in the code they are present as part of pg_stat_replication.
It seems the doc needs to be updated.  Am I missing something?

3.
ReorderBufferSerializeTXN()
{
..
/* update the statistics */
rb->spillCount += 1;
rb->spillTxns += txn->serialized ? 0 : 1;
rb->spillBytes += size;

Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
txn->serialized = true;
..
}

I am not able to understand the above code.  We are setting the
serialized parameter a few lines after we check it and increment the
spillTxns count. Can you please explain it?

Also, isn't the spillTxns count a bit confusing, because in some cases it
will include subtransactions and in other cases (where the largest picked
transaction is a subtransaction) it won't include them?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have replied to some of your questions inline.  I will work on the
remaining comments and post the patch for the same.

> > >
> > > Sure, I wasn't really proposing to adding all stats from that patch,
> > > including those related to streaming.  We need to extract just those
> > > related to spilling. And yes, it needs to be moved right after 0001.
> > >
> > I have extracted the spilling related code to a separate patch on top
> > of 0001.  I have also fixed some bugs and review comments and attached
> > as a separate patch.  Later I can merge it to the main patch if you
> > agree with the changes.
> >
>
> Few comments
> -------------------------
> 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer
> 1.
> + {
> + {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
> + gettext_noop("Sets the maximum memory to be used for logical decoding."),
> + gettext_noop("This much memory can be used by each internal "
> + "reorder buffer before spilling to disk or streaming."),
> + GUC_UNIT_KB
> + },
>
> I think we can remove 'or streaming' from above sentence for now.  We
> can add it later with later patch where streaming will be allowed.
>
> 2.
> @@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable
> class="parameter">subscription_name</replaceabl
>           </para>
>          </listitem>
>         </varlistentry>
> +
> +       <varlistentry>
> +        <term><literal>work_mem</literal> (<type>integer</type>)</term>
> +        <listitem>
> +         <para>
> +          Limits the amount of memory used to decode changes on the
> +          publisher.  If not specified, the publisher will use the default
> +          specified by <varname>logical_decoding_work_mem</varname>. When
> +          needed, additional data are spilled to disk.
> +         </para>
> +        </listitem>
> +       </varlistentry>
>
> It is not clear why we need this parameter at least with this patch?
> I have raised this multiple times [1][2].
>
> bugs_and_review_comments_fix
> 1.
> },
>   &logical_decoding_work_mem,
> - -1, -1, MAX_KILOBYTES,
> - check_logical_decoding_work_mem, NULL, NULL
> + 65536, 64, MAX_KILOBYTES,
> + NULL, NULL, NULL
>
> I think the default value should be 1MB similar to
> maintenance_work_mem.  The same was true before this change.
>
> 2. -#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use
> maintenance_work_mem
> +i#logical_decoding_work_mem = 64MB # min 64kB
>
> It seems the 'i' is a leftover character in the above change.  Also,
> change the default value considering the previous point.
>
> 3.
> @@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
> ReorderBufferTXN *txn)
>
>   /* update the statistics */
>   rb->spillCount += 1;
> - rb->spillTxns += txn->serialized ? 1 : 0;
> + rb->spillTxns += txn->serialized ? 0 : 1;
>   rb->spillBytes += size;
>
> Why is this change required?  Shouldn't we increase the spillTxns
> count only when the txn is serialized?

Prior to this change, it was increasing rb->spillTxns every time
we tried to serialize the changes of the transaction.  Now, we only
increase it the first time, when the transaction is not yet serialized.

> 0002-Track-statistics-for-spilling
> 1.
> +    <row>
> +     <entry><structfield>spill_txns</structfield></entry>
> +     <entry><type>integer</type></entry>
> +     <entry>Number of transactions spilled to disk after the memory used by
> +      logical decoding exceeds <literal>logical_work_mem</literal>. The
> +      counter gets incremented both for toplevel transactions and
> +      subtransactions.
> +      </entry>
> +    </row>
>
> The parameter name is wrong here. /logical_work_mem/logical_decoding_work_mem
>
> 2.
> +    <row>
> +     <entry><structfield>spill_txns</structfield></entry>
> +     <entry><type>integer</type></entry>
> +     <entry>Number of transactions spilled to disk after the memory used by
> +      logical decoding exceeds <literal>logical_work_mem</literal>. The
> +      counter gets incremented both for toplevel transactions and
> +      subtransactions.
> +      </entry>
> +    </row>
> +    <row>
> +     <entry><structfield>spill_count</structfield></entry>
> +     <entry><type>integer</type></entry>
> +     <entry>Number of times transactions were spilled to disk. Transactions
> +      may get spilled repeatedly, and this counter gets incremented on every
> +      such invocation.
> +      </entry>
> +    </row>
> +    <row>
> +     <entry><structfield>spill_bytes</structfield></entry>
> +     <entry><type>integer</type></entry>
> +     <entry>Amount of decoded transaction data spilled to disk.
> +      </entry>
> +    </row>
>
> In all the above cases, the explanation text starts immediately after
> <entry> tag, but the general coding practice is to start from the next
> line, see the explanation of nearby parameters.
>
> It seems these parameters are added in pg-stat-wal-receiver-view in
> the docs, but in code, it is present as part of pg_stat_replication.
> It seems doc needs to be updated.  Am, I missing something?
>
> 3.
> ReorderBufferSerializeTXN()
> {
> ..
> /* update the statistics */
> rb->spillCount += 1;
> rb->spillTxns += txn->serialized ? 0 : 1;
> rb->spillBytes += size;
>
> Assert(spilled == txn->nentries_mem);
> Assert(dlist_is_empty(&txn->changes));
> txn->nentries_mem = 0;
> txn->serialized = true;
> ..
> }
>
> I am not able to understand the above code.  We are setting the
> serialized parameter a few lines after we check it and increment the
> spillTxns count. Can you please explain it?

Basically, the first time we attempt to serialize a transaction,
txn->serialized will be false; at that point we increment
rb->spillTxns and then set txn->serialized to true.  From then on,
if we try to serialize the same transaction again, we do not
increment rb->spillTxns, so that we count each transaction only
once.
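
In pseudo-code form, the intent of that block is simply (same logic as
the quoted hunk, just spelled out):

    rb->spillCount += 1;            /* counted on every serialization pass */
    if (!txn->serialized)
        rb->spillTxns += 1;         /* count each transaction only once */
    rb->spillBytes += size;
    ...
    txn->serialized = true;         /* later passes won't bump spillTxns */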

>
> Also, isn't spillTxns count bit confusing, because in some cases it
> will include subtransactions and other cases (where the largest picked
> transaction is a subtransaction) it won't include it?

I did not understand your comment completely.  Basically, for every
transaction we serialize, we increase the count the first time, right?
Whether it is the main transaction or a subtransaction.
Am I missing something?


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Oct 21, 2019 at 10:48 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > 3.
> > @@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
> > ReorderBufferTXN *txn)
> >
> >   /* update the statistics */
> >   rb->spillCount += 1;
> > - rb->spillTxns += txn->serialized ? 1 : 0;
> > + rb->spillTxns += txn->serialized ? 0 : 1;
> >   rb->spillBytes += size;
> >
> > Why is this change required?  Shouldn't we increase the spillTxns
> > count only when the txn is serialized?
>
> Prior to this change it was increasing the rb->spillTxns, every time
> we try to serialize the changes of the transaction.  Now, only we
> increase first time when it is not yet serialized.
>
> >
> > 3.
> > ReorderBufferSerializeTXN()
> > {
> > ..
> > /* update the statistics */
> > rb->spillCount += 1;
> > rb->spillTxns += txn->serialized ? 0 : 1;
> > rb->spillBytes += size;
> >
> > Assert(spilled == txn->nentries_mem);
> > Assert(dlist_is_empty(&txn->changes));
> > txn->nentries_mem = 0;
> > txn->serialized = true;
> > ..
> > }
> >
> > I am not able to understand the above code.  We are setting the
> > serialized parameter a few lines after we check it and increment the
> > spillTxns count. Can you please explain it?
>
> Basically, when the first time we attempt to serialize a transaction,
> txn->serialized will be false, that time we will increment the
> rb->spillTxns and after that set txn->serialized to true.  From next
> time onwards if we try to serialize the same transaction we will not
> increment the rb->spillTxns so that we count each transaction only
> once.
>

Your explanation for both the above comments makes sense to me.  Can
you please add some comments along these lines because it is not
apparent why one wants to increase the spillTxns counter when
txn->serialized is false?
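
Maybe something along these lines (just a wording sketch, not the final
comment):

    /*
     * Update the statistics.  spillTxns is incremented only the first time
     * a transaction is serialized (txn->serialized is still false at this
     * point), so each transaction is counted once no matter how many times
     * its changes are spilled.
     */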

> >
> > Also, isn't spillTxns count bit confusing, because in some cases it
> > will include subtransactions and other cases (where the largest picked
> > transaction is a subtransaction) it won't include it?
>
> I did not understand your comment completely.  Basically,  every
> transaction which we are serializing we will increase the count first
> time right? whether it is the main transaction or the sub-transaction.
>

It was not clear to me earlier whether we always increase the
spillTxns counter for subtransactions or not.  But now, looking at
the code carefully, it is clear that it is getting increased in every
case.  In short, you don't need to do anything for this comment.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, Oct 21, 2019 at 2:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Oct 21, 2019 at 10:48 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > 3.
> > > @@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
> > > ReorderBufferTXN *txn)
> > >
> > >   /* update the statistics */
> > >   rb->spillCount += 1;
> > > - rb->spillTxns += txn->serialized ? 1 : 0;
> > > + rb->spillTxns += txn->serialized ? 0 : 1;
> > >   rb->spillBytes += size;
> > >
> > > Why is this change required?  Shouldn't we increase the spillTxns
> > > count only when the txn is serialized?
> >
> > Prior to this change it was increasing the rb->spillTxns, every time
> > we try to serialize the changes of the transaction.  Now, only we
> > increase first time when it is not yet serialized.
> >
> > >
> > > 3.
> > > ReorderBufferSerializeTXN()
> > > {
> > > ..
> > > /* update the statistics */
> > > rb->spillCount += 1;
> > > rb->spillTxns += txn->serialized ? 0 : 1;
> > > rb->spillBytes += size;
> > >
> > > Assert(spilled == txn->nentries_mem);
> > > Assert(dlist_is_empty(&txn->changes));
> > > txn->nentries_mem = 0;
> > > txn->serialized = true;
> > > ..
> > > }
> > >
> > > I am not able to understand the above code.  We are setting the
> > > serialized parameter a few lines after we check it and increment the
> > > spillTxns count. Can you please explain it?
> >
> > Basically, when the first time we attempt to serialize a transaction,
> > txn->serialized will be false, that time we will increment the
> > rb->spillTxns and after that set txn->serialized to true.  From next
> > time onwards if we try to serialize the same transaction we will not
> > increment the rb->spillTxns so that we count each transaction only
> > once.
> >
>
> Your explanation for both the above comments makes sense to me.  Can
> you please add some comments along these lines because it is not
> apparent why one wants to increase the spillTxns counter when
> txn->serialized is false?
Ok, I will add comments in the next patch.
>
> > >
> > > Also, isn't spillTxns count bit confusing, because in some cases it
> > > will include subtransactions and other cases (where the largest picked
> > > transaction is a subtransaction) it won't include it?
> >
> > I did not understand your comment completely.  Basically,  every
> > transaction which we are serializing we will increase the count first
> > time right? whether it is the main transaction or the sub-transaction.
> >
>
> It was not clear to me earlier whether we always increase the
> spillTxns counter for subtransactions or not.  But now, looking at
> code carefully, it is clear that is it is getting increased in every
> case.  In short, you don't need to do anything for this comment.
ok

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Oct 14, 2019 at 3:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra
> > <tomas.vondra@2ndquadrant.com> wrote:
> > >
> > >
> > > Sure, I wasn't really proposing to adding all stats from that patch,
> > > including those related to streaming.  We need to extract just those
> > > related to spilling. And yes, it needs to be moved right after 0001.
> > >
> > I have extracted the spilling related code to a separate patch on top
> > of 0001.  I have also fixed some bugs and review comments and attached
> > as a separate patch.  Later I can merge it to the main patch if you
> > agree with the changes.
> >
>
> Few comments
> -------------------------
> 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer
> 1.
> + {
> + {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
> + gettext_noop("Sets the maximum memory to be used for logical decoding."),
> + gettext_noop("This much memory can be used by each internal "
> + "reorder buffer before spilling to disk or streaming."),
> + GUC_UNIT_KB
> + },
>
> I think we can remove 'or streaming' from above sentence for now.  We
> can add it later with later patch where streaming will be allowed.
Done
>
> 2.
> @@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable
> class="parameter">subscription_name</replaceabl
>           </para>
>          </listitem>
>         </varlistentry>
> +
> +       <varlistentry>
> +        <term><literal>work_mem</literal> (<type>integer</type>)</term>
> +        <listitem>
> +         <para>
> +          Limits the amount of memory used to decode changes on the
> +          publisher.  If not specified, the publisher will use the default
> +          specified by <varname>logical_decoding_work_mem</varname>. When
> +          needed, additional data are spilled to disk.
> +         </para>
> +        </listitem>
> +       </varlistentry>
>
> It is not clear why we need this parameter at least with this patch?
> I have raised this multiple times [1][2].

I have moved it out as a separate patch (0003), so that if we need it
for the streaming transactions we can keep it.
>
> bugs_and_review_comments_fix
> 1.
> },
>   &logical_decoding_work_mem,
> - -1, -1, MAX_KILOBYTES,
> - check_logical_decoding_work_mem, NULL, NULL
> + 65536, 64, MAX_KILOBYTES,
> + NULL, NULL, NULL
>
> I think the default value should be 1MB similar to
> maintenance_work_mem.  The same was true before this change.
The default value for maintenance_work_mem is also 64MB.  Did you mean the min value?
>
> 2. -#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use
> maintenance_work_mem
> +i#logical_decoding_work_mem = 64MB # min 64kB
>
> It seems the 'i' is a leftover character in the above change.  Also,
> change the default value considering the previous point.
oops, fixed.
>
> 3.
> @@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
> ReorderBufferTXN *txn)
>
>   /* update the statistics */
>   rb->spillCount += 1;
> - rb->spillTxns += txn->serialized ? 1 : 0;
> + rb->spillTxns += txn->serialized ? 0 : 1;
>   rb->spillBytes += size;
>
> Why is this change required?  Shouldn't we increase the spillTxns
> count only when the txn is serialized?
Already agreed in the previous mail, so I added comments.
>
> 0002-Track-statistics-for-spilling
> 1.
> +    <row>
> +     <entry><structfield>spill_txns</structfield></entry>
> +     <entry><type>integer</type></entry>
> +     <entry>Number of transactions spilled to disk after the memory used by
> +      logical decoding exceeds <literal>logical_work_mem</literal>. The
> +      counter gets incremented both for toplevel transactions and
> +      subtransactions.
> +      </entry>
> +    </row>
>
> The parameter name is wrong here. /logical_work_mem/logical_decoding_work_mem
done
>
> 2.
> +    <row>
> +     <entry><structfield>spill_txns</structfield></entry>
> +     <entry><type>integer</type></entry>
> +     <entry>Number of transactions spilled to disk after the memory used by
> +      logical decoding exceeds <literal>logical_work_mem</literal>. The
> +      counter gets incremented both for toplevel transactions and
> +      subtransactions.
> +      </entry>
> +    </row>
> +    <row>
> +     <entry><structfield>spill_count</structfield></entry>
> +     <entry><type>integer</type></entry>
> +     <entry>Number of times transactions were spilled to disk. Transactions
> +      may get spilled repeatedly, and this counter gets incremented on every
> +      such invocation.
> +      </entry>
> +    </row>
> +    <row>
> +     <entry><structfield>spill_bytes</structfield></entry>
> +     <entry><type>integer</type></entry>
> +     <entry>Amount of decoded transaction data spilled to disk.
> +      </entry>
> +    </row>
>
> In all the above cases, the explanation text starts immediately after
> <entry> tag, but the general coding practice is to start from the next
> line, see the explanation of nearby parameters.
It seems it's mixed, for example, you can see
   <entry>Timeline number of last write-ahead log location received and
      flushed to disk, the initial value of this field being the timeline
      number of the first log location used when WAL receiver is started
     </entry>

or
    <entry>Timeline number of last write-ahead log location received and
      flushed to disk, the initial value of this field being the timeline
      number of the first log location used when WAL receiver is started
     </entry>

>
> It seems these parameters are added in pg-stat-wal-receiver-view in
> the docs, but in code, it is present as part of pg_stat_replication.
> It seems doc needs to be updated.  Am, I missing something?
Fixed
>
> 3.
> ReorderBufferSerializeTXN()
> {
> ..
> /* update the statistics */
> rb->spillCount += 1;
> rb->spillTxns += txn->serialized ? 0 : 1;
> rb->spillBytes += size;
>
> Assert(spilled == txn->nentries_mem);
> Assert(dlist_is_empty(&txn->changes));
> txn->nentries_mem = 0;
> txn->serialized = true;
> ..
> }
>
> I am not able to understand the above code.  We are setting the
> serialized parameter a few lines after we check it and increment the
> spillTxns count. Can you please explain it?
>
> Also, isn't spillTxns count bit confusing, because in some cases it
> will include subtransactions and other cases (where the largest picked
> transaction is a subtransaction) it won't include it?
>
Already discussed in the last mail.

I have merged the bugs_and_review_comments_fix.patch changes into 0001 and 0002.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have attempted to test the performance of (Stream + Spill) vs
> (Stream + BGW pool) and I can see the similar gain what Alexey had
> shown[1].
>
> In addition to this, I have rebased the latest patchset [2] without
> the two-phase logical decoding patch set.
>
> Test results:
> I have repeated the same test as Alexy[1] for 1kk and 1kk data and
> here is my result
> Stream + Spill
> N           time on master(sec)   Total xact time (sec)
> 1kk               6                               21
> 3kk             18                               55
>
> Stream + BGW pool
> N          time on master(sec)  Total xact time (sec)
> 1kk              6                              13
> 3kk            19                              35
>

I think the test results for the master are missing.  Also, how about
running these tests over a network (meaning master and subscriber are
not on the same machine)?  In general, yours and Alexey's test results
show that there is merit in having workers apply such transactions.
OTOH, as noted above [1], we are also worried about the performance
of rollbacks if we follow that approach.  I am not sure how much we
need to worry about rollbacks if commits are faster, but can we think
of recording the changes in memory and only writing them to a file if
the changes are above a certain threshold?  I think that might help
save I/O in many cases.  I am not very sure how much additional workers
can help if we do that, but they might still help.  I think we need to
do some tests and experiments to figure out the best approach.  What do you think?
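
To illustrate the threshold idea, here is a toy sketch (not PostgreSQL
code; all names and the threshold value are invented for illustration):

    #include <stdio.h>
    #include <string.h>

    #define SPILL_THRESHOLD_BYTES (4 * 1024 * 1024)     /* say, 4MB per transaction */

    typedef struct StreamedXact
    {
        char    buf[SPILL_THRESHOLD_BYTES];     /* in-memory buffer for small xacts */
        size_t  used;                           /* bytes currently buffered */
        FILE   *spill_file;                     /* NULL until we actually spill */
    } StreamedXact;

    static void
    record_change(StreamedXact *sx, const char *data, size_t len)
    {
        if (sx->spill_file == NULL && sx->used + len <= sizeof(sx->buf))
        {
            /* still below the threshold: no I/O at all, cheap to discard on abort */
            memcpy(sx->buf + sx->used, data, len);
            sx->used += len;
            return;
        }

        if (sx->spill_file == NULL)
        {
            /* first time over the threshold: flush what was buffered so far */
            sx->spill_file = tmpfile();
            fwrite(sx->buf, 1, sx->used, sx->spill_file);
            sx->used = 0;
        }
        fwrite(data, 1, len, sx->spill_file);
    }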

Tomas, Alexey, do you have any thoughts on this matter?  I think it is
important that we figure out the way to proceed in this patch.

[1] - https://www.postgresql.org/message-id/b25ce80e-f536-78c8-d5c8-a5df3e230785%40postgrespro.ru

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, Oct 22, 2019 at 10:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have attempted to test the performance of (Stream + Spill) vs
> > (Stream + BGW pool) and I can see the similar gain what Alexey had
> > shown[1].
> >
> > In addition to this, I have rebased the latest patchset [2] without
> > the two-phase logical decoding patch set.
> >
> > Test results:
> > I have repeated the same test as Alexy[1] for 1kk and 1kk data and
> > here is my result
> > Stream + Spill
> > N           time on master(sec)   Total xact time (sec)
> > 1kk               6                               21
> > 3kk             18                               55
> >
> > Stream + BGW pool
> > N          time on master(sec)  Total xact time (sec)
> > 1kk              6                              13
> > 3kk            19                              35
> >
>
> I think the test results for the master are missing.
Yeah, at that time I was planning to compare spill vs. bgworker.
> Also, how about
> running these tests over a network (means master and subscriber are
> not on the same machine)?

Yeah, we should do that; it will show the merit of streaming the
in-progress transactions.

>    In general, yours and Alexey's test results
> show that there is merit by having workers applying such transactions.
>   OTOH, as noted above [1], we are also worried about the performance
> of Rollbacks if we follow that approach.  I am not sure how much we
> need to worry about Rollabcks if commits are faster, but can we think
> of recording the changes in memory and only write to a file if the
> changes are above a certain threshold?  I think that might help saving
> I/O in many cases.  I am not very sure if we do that how much
> additional workers can help, but they might still help.  I think we
> need to do some tests and experiments to figure out what is the best
> approach?  What do you think?
I agree with that point.  I think we might need to make some small
changes and run tests to see what the best method is for handling the
streamed changes at the subscriber end.

>
> Tomas, Alexey, do you have any thoughts on this matter?  I think it is
> important that we figure out the way to proceed in this patch.
>
> [1] - https://www.postgresql.org/message-id/b25ce80e-f536-78c8-d5c8-a5df3e230785%40postgrespro.ru
>


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Tue, Oct 22, 2019 at 10:30:16AM +0530, Dilip Kumar wrote:
>On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Mon, Oct 14, 2019 at 3:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> >
>> > On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra
>> > <tomas.vondra@2ndquadrant.com> wrote:
>> > >
>> > >
>> > > Sure, I wasn't really proposing to adding all stats from that patch,
>> > > including those related to streaming.  We need to extract just those
>> > > related to spilling. And yes, it needs to be moved right after 0001.
>> > >
>> > I have extracted the spilling related code to a separate patch on top
>> > of 0001.  I have also fixed some bugs and review comments and attached
>> > as a separate patch.  Later I can merge it to the main patch if you
>> > agree with the changes.
>> >
>>
>> Few comments
>> -------------------------
>> 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer
>> 1.
>> + {
>> + {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
>> + gettext_noop("Sets the maximum memory to be used for logical decoding."),
>> + gettext_noop("This much memory can be used by each internal "
>> + "reorder buffer before spilling to disk or streaming."),
>> + GUC_UNIT_KB
>> + },
>>
>> I think we can remove 'or streaming' from above sentence for now.  We
>> can add it later with later patch where streaming will be allowed.
>Done
>>
>> 2.
>> @@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable
>> class="parameter">subscription_name</replaceabl
>>           </para>
>>          </listitem>
>>         </varlistentry>
>> +
>> +       <varlistentry>
>> +        <term><literal>work_mem</literal> (<type>integer</type>)</term>
>> +        <listitem>
>> +         <para>
>> +          Limits the amount of memory used to decode changes on the
>> +          publisher.  If not specified, the publisher will use the default
>> +          specified by <varname>logical_decoding_work_mem</varname>. When
>> +          needed, additional data are spilled to disk.
>> +         </para>
>> +        </listitem>
>> +       </varlistentry>
>>
>> It is not clear why we need this parameter at least with this patch?
>> I have raised this multiple times [1][2].
>
>I have moved it out as a separate patch (0003) so that if we need that
>we need this for the streaming transaction then we can keep this.
>>

I'm OK with moving it to a separate patch. That being said, I think the
ability to control memory usage for individual subscriptions is very
useful. Saying "we don't need such a parameter" is essentially equivalent
to saying "one size fits all", and I think we know that's not true.

Imagine a system with multiple subscriptions, some of them mostly
replicating OLTP changes, but one or two replicating tables that are
updated in batches. What we'd want is to allow a higher limit for the
batch subscriptions, but a much lower limit for the OLTP ones (which
should never hit it in practice).

With a single global GUC, you'll either have a high value - risking
OOM when the OLTP subscriptions happen to decode a batch update, or a
low value affecting the batch subscriptions.

It's not strictly necessary (and we already have such limit), so I'm OK
with treating it as an enhancement for the future.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Tomas Vondra
Дата:
On Tue, Oct 22, 2019 at 11:01:48AM +0530, Dilip Kumar wrote:
>On Tue, Oct 22, 2019 at 10:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> >
>> > I have attempted to test the performance of (Stream + Spill) vs
>> > (Stream + BGW pool) and I can see the similar gain what Alexey had
>> > shown[1].
>> >
>> > In addition to this, I have rebased the latest patchset [2] without
>> > the two-phase logical decoding patch set.
>> >
>> > Test results:
> > I have repeated the same test as Alexey[1] for 1kk and 3kk data and
>> > here is my result
>> > Stream + Spill
>> > N           time on master(sec)   Total xact time (sec)
>> > 1kk               6                               21
>> > 3kk             18                               55
>> >
>> > Stream + BGW pool
>> > N          time on master(sec)  Total xact time (sec)
>> > 1kk              6                              13
>> > 3kk            19                              35
>> >
>>
>> I think the test results for the master are missing.
>Yeah, That time, I was planning to compare spill vs bgworker.
>  Also, how about
>> running these tests over a network (means master and subscriber are
>> not on the same machine)?
>
>Yeah, we should do that that will show the merit of streaming the
>in-progress transactions.
>

While I agree it's an interesting feature, I think we need to stop
adding more stuff to this patch series - it's already complex enough, and
adding even more (unnecessary) stuff is a distraction and will make
it harder to get anything committed. Typical "scope creep".

I think the current behavior (spill to file) is sufficient for v0 and
can be improved later - that's fine. I don't think we need to bother
with comparisons to master very much, because while it might be a bit
slower in some cases, you can always disable streaming (so if there's a
regression for your workload, you can undo that).

>>   In general, yours and Alexey's test results
>> show that there is merit by having workers applying such transactions.
>>   OTOH, as noted above [1], we are also worried about the performance
>> of Rollbacks if we follow that approach.  I am not sure how much we
>> need to worry about Rollbacks if commits are faster, but can we think
>> of recording the changes in memory and only write to a file if the
>> changes are above a certain threshold?  I think that might help saving
>> I/O in many cases.  I am not very sure if we do that how much
>> additional workers can help, but they might still help.  I think we
>> need to do some tests and experiments to figure out what is the best
>> approach?  What do you think?
>I agree with the point.  I think we might need to do some small
>changes and test to see what could be the best method to handle the
>streamed changes at the subscriber end.
>
>>
>> Tomas, Alexey, do you have any thoughts on this matter?  I think it is
>> important that we figure out the way to proceed in this patch.
>>
>> [1] - https://www.postgresql.org/message-id/b25ce80e-f536-78c8-d5c8-a5df3e230785%40postgrespro.ru
>>
>

I think the patch should do the simplest thing possible, i.e. what it
does today. Otherwise we'll never get it committed.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Alexey Kondratov
Дата:
On 22.10.2019 20:22, Tomas Vondra wrote:
> On Tue, Oct 22, 2019 at 11:01:48AM +0530, Dilip Kumar wrote:
>> On Tue, Oct 22, 2019 at 10:46 AM Amit Kapila 
>> <amit.kapila16@gmail.com> wrote:
>>>   In general, yours and Alexey's test results
>>> show that there is merit by having workers applying such transactions.
>>>   OTOH, as noted above [1], we are also worried about the performance
>>> of Rollbacks if we follow that approach.  I am not sure how much we
>>> need to worry about Rollbacks if commits are faster, but can we think
>>> of recording the changes in memory and only write to a file if the
>>> changes are above a certain threshold?  I think that might help saving
>>> I/O in many cases.  I am not very sure if we do that how much
>>> additional workers can help, but they might still help.  I think we
>>> need to do some tests and experiments to figure out what is the best
>>> approach?  What do you think?
>> I agree with the point.  I think we might need to do some small
>> changes and test to see what could be the best method to handle the
>> streamed changes at the subscriber end.
>>
>>>
>>> Tomas, Alexey, do you have any thoughts on this matter?  I think it is
>>> important that we figure out the way to proceed in this patch.
>>>
>>> [1] - 
>>> https://www.postgresql.org/message-id/b25ce80e-f536-78c8-d5c8-a5df3e230785%40postgrespro.ru
>>>
>>
>
> I think the patch should do the simplest thing possible, i.e. what it
> does today. Otherwise we'll never get it committed.
>

I have to agree with Tomas that keeping things as simple as possible 
should be the main priority right now. Otherwise, the entire patch set 
will pass another release cycle without being committed even partially. 
At the same time, it resolves an important problem from my perspective: it 
moves I/O overhead from the primary to the replica by streaming large 
transactions, which is a nice-to-have feature I guess.

Later it would be possible to replace the logical apply worker with a 
bgworkers pool in a separate patch, if we decide that it is a viable 
solution. Anyway, regarding Amit's questions:

- I doubt that maintaining a separate buffer on the apply side before 
spilling to disk would help enough. We already have the ReorderBuffer with 
the logical_work_mem limit, and if we exceeded that limit on the sender 
side, then most probably we would exceed it on the applier side as well, 
except in the case where this new buffer size is significantly higher 
than logical_work_mem, so that it can keep multiple open xacts.

- I still think that we should optimize the database for commits, not 
rollbacks. The BGworkers pool is dramatically slower for a rollbacks-only 
load, though it is at least twice as fast for commits-only. I do not 
know how it will perform with a real-life load, but this drawback may be 
inappropriate for a general-purpose database like Postgres.

- Tomas' implementation of streaming with spilling does not have this 
bias between commits/aborts. However, it has a noticeable performance 
drop (~x5 slower compared with master [1]) for large transactions 
consisting of many small rows, although it is not an order of 
magnitude slower.

Another thing is that about a year ago I found some problems 
with MVCC/visibility and fixed them [1]. If I get it correctly, 
Tomas adapted some of those fixes into his patch set, but I think that 
this part should be reviewed carefully again. I would be glad to check 
it, but right now I am a little bit confused by all the patch set variants 
in the thread. Which is the latest one? Is it still dependent on 2PC decoding?

[1] 

https://www.postgresql.org/message-id/flat/40c38758-04b5-74f4-c963-cf300f9e5dff%40postgrespro.ru#98d06fefc88122385dacb2f03f7c30f7


Thanks for moving this patch forward!

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, Oct 22, 2019 at 10:42 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Oct 22, 2019 at 10:30:16AM +0530, Dilip Kumar wrote:
> >
> >I have moved it out as a separate patch (0003) so that if we need that
> >we need this for the streaming transaction then we can keep this.
> >>
>
> I'm OK with moving it to a separate patch. That being said I think
> ability to control memory usage for individual subscriptions is very
> useful. Saying "We don't need such parameter" is essentially equivalent
> to saying "One size fits all" and I think we know that's not true.
>
> Imagine a system with multiple subscriptions, some of them mostly
> replicating OLTP changes, but one or two replicating tables that are
> updated in batches. What we'd have is to allow higher limit for the
> batch subscriptions, but much lower limit for the OLTP ones (which they
> should never hit in practice).
>

This point is not clear to me.  The changes are recorded in the
ReorderBuffer, which doesn't do any filtering, i.e. it will have all
the changes irrespective of the subscriber.  How will it make a
difference to have different limits?

> With a single global GUC, you'll either have a high value - risking
> OOM when the OLTP subscriptions happen to decode a batch update, or a
> low value affecting the batch subscriptions.
>
> It's not strictly necessary (and we already have such limit), so I'm OK
> with treating it as an enhancement for the future.
>

I am fine too if its usage is clear.  I might be missing something here.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Wed, Oct 23, 2019 at 12:32 AM Alexey Kondratov
<a.kondratov@postgrespro.ru> wrote:
>
> On 22.10.2019 20:22, Tomas Vondra wrote:
> >
> > I think the patch should do the simplest thing possible, i.e. what it
> > does today. Otherwise we'll never get it committed.
> >
>
> I have to agree with Tomas, that keeping things as simple as possible
> should be a main priority right now. Otherwise, the entire patch set
> will pass next release cycle without being committed at least partially.
> In the same time, it resolves important problem from my perspective. It
> moves I/O overhead from primary to replica using large transactions
> streaming, which is a nice to have feature I guess.
>
> Later it would be possible to replace logical apply worker with
> bgworkers pool in a separated patch, if we decide that it is a viable
> solution. Anyway, regarding the Amit's questions:
>
> - I doubt that maintaining a separate buffer on the apply side before
> spilling to disk would help enough. We already have ReorderBuffer with
> logical_work_mem limit, and if we exceeded that limit on the sender
> side, then most probably we exceed it on the applier side as well,
>

I think on the sender side the limit is for unfiltered changes (which
means on the ReorderBuffer, which has all the changes), whereas on the
receiver side we will only have the requested changes, which can make
a difference?

> excepting the case when this new buffer size will be significantly
> higher then logical_work_mem to keep multiple open xacts.
>

I am not sure but I think we can have different controlling parameters
on the subscriber-side.

> - I still think that we should optimize database for commits, not
> rollbacks. BGworkers pool is dramatically slower for rollbacks-only
> load, though being at least twice as faster for commits-only. I do not
> know how it will perform with real life load, but this drawback may be
> inappropriate for such a general purpose database like Postgres.
>
> - Tomas' implementation of streaming with spilling does not have this
> bias between commits/aborts. However, it has a noticeable performance
> drop (~x5 slower compared with master [1]) for large transaction
> consisting of many small rows. Although it is not of an order of
> magnitude slower.
>

Did you ever identify the reason why it was slower in that case?  I
can see the numbers shared by you and Dilip, which show that the
BGWorker pool is a really good idea and will work great for
commit-mostly workloads, whereas the numbers without it are not very
encouraging; maybe we have not benchmarked enough.  This is the reason
I am trying to see if we can do something to get benefits similar
to what is shown by your idea.

I am not against doing something simple for the first version and then
enhancing it later, but it won't be good if we commit it with a
regression in some typical cases and depend on the user to enable it
only when it seems favorable to their case.  Also, sometimes it becomes
difficult to generate enthusiasm to enhance a feature once the main
patch is committed.  I am not saying that always happens or will happen
in this case.  It is better if we put in some energy and get things as
good as possible in the first go itself.  I am as much interested as
you, Tomas, or others are; otherwise, I wouldn't have spent a lot of
time on this to disentangle it from the 2PC patch, which seems to have
stalled due to lack of interest.

> Another thing is it that about a year ago I have found some problems
> with MVCC/visibility and fixed them somehow [1]. If I get it correctly
> Tomas adapted some of those fixes into his patch set, but I think that
> this part should be reviewed carefully again.
>

Agreed.  I have read your emails and could see that you have done very
good work on this project along with Tomas.  But unfortunately, it
didn't get committed.  At this stage, we are working on just the first
part of the patch, which is to allow the data to spill once it crosses
logical_decoding_work_mem on the master side.  I think there will be
more problems to discuss and solve once that is done.

> I would be glad to check
> it, but now I am a little bit confused with all the patch set variants
> in the thread. Which is the last one? Is it still dependent on 2pc decoding?
>

I think the latest patches posted by Dilip are not dependent on
two-phase decoding, but I haven't studied them yet.  You can find those
at [1][2].  As per the discussion in this thread, we are also trying to
see if we can get some part of the patch series committed first; the
latest patches corresponding to that are posted at [3].

[1] - https://www.postgresql.org/message-id/CAFiTN-vHoksqvV4BZ0479NhugGe4QHq_ezngNdDd-YRQ_2cwug%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAFiTN-vT%2B42xRbkw%3DhBnp44XkAyZaKZVA5hcvAMsYth3rk7vhg%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAFiTN-vkFB0RBEjVkLWhdgTYShSrSu3kCYObMghgXEwKA1FXRA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, Oct 22, 2019 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have merged bugs_and_review_comments_fix.patch changes to 0001 and 0002.
>

I was wondering whether we have checked the code coverage after this
patch?  Previously, the existing tests seem to be covering most parts
of the function ReorderBufferSerializeTXN [1].  After this patch, the
timing to call ReorderBufferSerializeTXN will change, so that might
impact the testing of the same.  If it is already covered, then I
would like to either add a new test or extend existing test with the
help of new spill counters.  If it is not getting covered, then we
need to think of extending the existing test or write a new test to
cover the function ReorderBufferSerializeTXN.

[1] - https://coverage.postgresql.org/src/backend/replication/logical/reorderbuffer.c.gcov.html

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
vignesh C
Дата:
On Tue, Oct 22, 2019 at 10:52 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> I think the patch should do the simplest thing possible, i.e. what it
> does today. Otherwise we'll never get it committed.
>
I found a couple of crashes while reviewing and testing flushing of
open transaction data:
Issue 1:
#0  0x00007f22c5722337 in raise () from /lib64/libc.so.6
#1  0x00007f22c5723a28 in abort () from /lib64/libc.so.6
#2  0x0000000000ec5390 in ExceptionalCondition
(conditionName=0x10ea814 "!dlist_is_empty(head)", errorType=0x10ea804
"FailedAssertion",
    fileName=0x10ea7e0 "../../../../src/include/lib/ilist.h",
lineNumber=458) at assert.c:54
#3  0x0000000000b4fb91 in dlist_tail_element_off (head=0x19e4db8,
off=64) at ../../../../src/include/lib/ilist.h:458
#4  0x0000000000b546d0 in ReorderBufferAbortOld (rb=0x191b6b0,
oldestRunningXid=3834) at reorderbuffer.c:1966
#5  0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x19af990,
buf=0x7ffcbc26dc50) at decode.c:332
#6  0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x19af990,
record=0x19afc50) at decode.c:121
#7  0x0000000000b7109e in XLogSendLogical () at walsender.c:2845
#8  0x0000000000b6f5e4 in WalSndLoop (send_data=0xb70f77
<XLogSendLogical>) at walsender.c:2199
#9  0x0000000000b6c7e1 in StartLogicalReplication (cmd=0x1983168) at
walsender.c:1128
#10 0x0000000000b6da6f in exec_replication_command
(cmd_string=0x18f70a0 "START_REPLICATION SLOT \"sub1\" LOGICAL 0/0
(proto_version '1', publication_names '\"pub1\"')")
    at walsender.c:1545

Issue 2:
#0  0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
#1  0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
#2  0x0000000000ec4e1d in ExceptionalCondition
(conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
errorType=0x10ea284 "FailedAssertion",
    fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
#3  0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:3052
#4  0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:1318
#5  0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2b9d778) at reorderbuffer.c:1257
#6  0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
oldestRunningXid=3835) at reorderbuffer.c:1973
#7  0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0,
buf=0x7ffcbc74cc00) at decode.c:332
#8  0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0,
record=0x2b67990) at decode.c:121
#9  0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845

These failures come randomly.
I'm not able to reproduce this issue with simple test case.
I have attached the test case which I used to test.
I will further try to find a scenario which could reproduce consistently.
Posting it so that it can help someone in identifying the problem
parallelly through code review by experts.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Wed, Oct 30, 2019 at 9:38 AM vignesh C <vignesh21@gmail.com> wrote:
>
I have noticed one more problem in the logic of setting the logical
decoding work mem from the CREATE SUBSCRIPTION command.  Suppose we
don't give work_mem in the subscription command; then it sends a
garbage value to the walsender, and the walsender overwrites its value
with that garbage value.  After investigating a bit, I have found the
reason for this.

@@ -406,6 +406,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
  appendStringInfo(&cmd, "proto_version '%u'",
  options->proto.logical.proto_version);

+ appendStringInfo(&cmd, ", work_mem '%d'",
+ options->proto.logical.work_mem);

I think the problem is that we are unconditionally sending the work_mem
as part of the CREATE REPLICATION SLOT command, without checking whether
it's valid or not.
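
Just to illustrate the kind of guard being suggested inside
libpqrcv_startstreaming (a rough sketch against the hunk quoted above,
not the actual fix):

    /* sketch: forward work_mem only when the subscription actually set it */
    if (options->proto.logical.work_mem > 0)
        appendStringInfo(&cmd, ", work_mem '%d'",
                         options->proto.logical.work_mem);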

--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -71,6 +71,7 @@ GetSubscription(Oid subid, bool missing_ok)
  sub->name = pstrdup(NameStr(subform->subname));
  sub->owner = subform->subowner;
  sub->enabled = subform->subenabled;
+ sub->workmem = subform->subworkmem;

Another problem is that there is no handling if the subform->subworkmem is NULL.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Kuntal Ghosh
Дата:
Hello hackers,

I've done some performance testing of this feature. Following is my
test case (taken from an earlier thread):

postgres=# CREATE TABLE large_test (num1 bigint, num2 double
precision, num3 double precision);
postgres=# \timing on
postgres=# EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1,
num2, num3) SELECT round(random()*10), random(), random()*142 FROM
generate_series(1, 1000000) s(i);

I've kept the publisher and subscriber in two different systems.

HEAD:
With 1000000 tuples,
Execution Time: 2576.821 ms, Time: 9632.158 ms (00:09.632), Spill count: 245
With 10000000 tuples (10 times more),
Execution Time: 30359.509 ms, Time: 95261.024 ms (01:35.261), Spill count: 2442

With the memory accounting patch, following are the performance results:
With 100000 tuples,
logical_decoding_work_mem=64kB, Execution Time: 2414.371 ms, Time:
9648.223 ms (00:09.648), Spill count: 2315
logical_decoding_work_mem=64MB, Execution Time: 2477.830 ms, Time:
9895.161 ms (00:09.895), Spill count 3
With 1000000 tuples (10 times more),
logical_decoding_work_mem=64kB, Execution Time: 38259.227 ms, Time:
105761.978 ms (01:45.762), Spill count: 23149
logical_decoding_work_mem=64MB, Execution Time: 24624.639 ms, Time:
89985.342 ms (01:29.985), Spill count: 23

With logical decoding of in-progress transactions patch and with
streaming on, following are the performance results:
With 100000 tuples,
logical_decoding_work_mem=64kB, Execution Time: 2674.034 ms, Time:
20779.601 ms (00:20.780)
logical_decoding_work_mem=64MB, Execution Time: 2062.404 ms, Time:
9559.953 ms (00:09.560)
With 1000000 tuples (10 times more),
logical_decoding_work_mem=64kB, Execution Time: 26949.588 ms, Time:
196261.892 ms (03:16.262)
logical_decoding_work_mem=64MB, Execution Time: 27084.403 ms, Time:
90079.286 ms (01:30.079)
-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Nov 4, 2019 at 2:43 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
> Hello hackers,
>
> I've done some performance testing of this feature. Following is my
> test case (taken from an earlier thread):
>
> postgres=# CREATE TABLE large_test (num1 bigint, num2 double
> precision, num3 double precision);
> postgres=# \timing on
> postgres=# EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1,
> num2, num3) SELECT round(random()*10), random(), random()*142 FROM
> generate_series(1, 1000000) s(i);
>
> I've kept the publisher and subscriber in two different system.
>
> HEAD:
> With 1000000 tuples,
> Execution Time: 2576.821 ms, Time: 9632.158 ms (00:09.632), Spill count: 245
> With 10000000 tuples (10 times more),
> Execution Time: 30359.509 ms, Time: 95261.024 ms (01:35.261), Spill count: 2442
>
> With the memory accounting patch, following are the performance results:
> With 100000 tuples,
> logical_decoding_work_mem=64kB, Execution Time: 2414.371 ms, Time:
> 9648.223 ms (00:09.648), Spill count: 2315
> logical_decoding_work_mem=64MB, Execution Time: 2477.830 ms, Time:
> 9895.161 ms (00:09.895), Spill count 3
> With 1000000 tuples (10 times more),
> logical_decoding_work_mem=64kB, Execution Time: 38259.227 ms, Time:
> 105761.978 ms (01:45.762), Spill count: 23149
> logical_decoding_work_mem=64MB, Execution Time: 24624.639 ms, Time:
> 89985.342 ms (01:29.985), Spill count: 23
>
> With logical decoding of in-progress transactions patch and with
> streaming on, following are the performance results:
> With 100000 tuples,
> logical_decoding_work_mem=64kB, Execution Time: 2674.034 ms, Time:
> 20779.601 ms (00:20.780)
> logical_decoding_work_mem=64MB, Execution Time: 2062.404 ms, Time:
> 9559.953 ms (00:09.560)
> With 1000000 tuples (10 times more),
> logical_decoding_work_mem=64kB, Execution Time: 26949.588 ms, Time:
> 196261.892 ms (03:16.262)
> logical_decoding_work_mem=64MB, Execution Time: 27084.403 ms, Time:
> 90079.286 ms (01:30.079)
So your results show that with "streaming on", performance is
degrading?  By any chance, did you try to see where the bottleneck is?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Kuntal Ghosh
Дата:
On Mon, Nov 4, 2019 at 3:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> So your result shows that with "streaming on", performance is
> degrading?  By any chance did you try to see where is the bottleneck?
>
Right. But, as we increase the logical_decoding_work_mem, the
performance improves. I've not analyzed the bottleneck yet. I'm
looking into the same.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
vignesh C
Дата:
On Thu, Oct 24, 2019 at 7:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Oct 22, 2019 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have merged bugs_and_review_comments_fix.patch changes to 0001 and 0002.
> >
>
> I was wondering whether we have checked the code coverage after this
> patch?  Previously, the existing tests seem to be covering most parts
> of the function ReorderBufferSerializeTXN [1].  After this patch, the
> timing to call ReorderBufferSerializeTXN will change, so that might
> impact the testing of the same.  If it is already covered, then I
> would like to either add a new test or extend existing test with the
> help of new spill counters.  If it is not getting covered, then we
> need to think of extending the existing test or write a new test to
> cover the function ReorderBufferSerializeTXN.
>
I have run the tests with coverage and found that
ReorderBufferSerializeTXN is not being hit.
The reason it is not being hit is because of the following check in
ReorderBufferCheckMemoryLimit:
    /* bail out if we haven't exceeded the memory limit */
    if (rb->size < logical_decoding_work_mem * 1024L)
        return;
Previously the tests from contrib/test_decoding could hit
ReorderBufferSerializeTXN function.
I'm checking if we can modify the test or add new test to hit
ReorderBufferSerializeTXN function.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Wed, Oct 30, 2019 at 9:38 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, Oct 22, 2019 at 10:52 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> > I think the patch should do the simplest thing possible, i.e. what it
> > does today. Otherwise we'll never get it committed.
> >
> I found a couple of crashes while reviewing and testing flushing of
> open transaction data:
>

Thanks for doing these tests.  However, I don't think these issues are
anyway related to this patch.  It seems to be base code issues
manifested by this patch.  See my analysis below.

> Issue 1:
> #0  0x00007f22c5722337 in raise () from /lib64/libc.so.6
> #1  0x00007f22c5723a28 in abort () from /lib64/libc.so.6
> #2  0x0000000000ec5390 in ExceptionalCondition
> (conditionName=0x10ea814 "!dlist_is_empty(head)", errorType=0x10ea804
> "FailedAssertion",
>     fileName=0x10ea7e0 "../../../../src/include/lib/ilist.h",
> lineNumber=458) at assert.c:54
> #3  0x0000000000b4fb91 in dlist_tail_element_off (head=0x19e4db8,
> off=64) at ../../../../src/include/lib/ilist.h:458
> #4  0x0000000000b546d0 in ReorderBufferAbortOld (rb=0x191b6b0,
> oldestRunningXid=3834) at reorderbuffer.c:1966
> #5  0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x19af990,
> buf=0x7ffcbc26dc50) at decode.c:332
>

This seems to be a problem in the base code: we abort immediately
after serializing the changes, and in that case the changes list
will be empty.  I think you can try to reproduce it via the debugger,
or by hacking the code such that it serializes after every change;
then, if you abort after one change, it should hit this problem.

>
> Issue 2:
> #0  0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
> #1  0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
> #2  0x0000000000ec4e1d in ExceptionalCondition
> (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
> errorType=0x10ea284 "FailedAssertion",
>     fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
> #3  0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
> txn=0x2bafb08) at reorderbuffer.c:3052
> > #4  0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> txn=0x2bafb08) at reorderbuffer.c:1318
> #5  0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> txn=0x2b9d778) at reorderbuffer.c:1257
> #6  0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
> oldestRunningXid=3835) at reorderbuffer.c:1973
>

This again seems to be a problem with the base code, as we don't update
the final_lsn for subtransactions during ReorderBufferAbortOld.  This
can also be reproduced with some hacking in the code or via the debugger
in a similar way as explained for the previous problem, with the
difference that a subtransaction must be involved in this case.

> #7  0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0,
> buf=0x7ffcbc74cc00) at decode.c:332
> #8  0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0,
> record=0x2b67990) at decode.c:121
> #9  0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845
>
> These failures come randomly.
> I'm not able to reproduce this issue with simple test case.

Yeah, it appears to be difficult to reproduce unless you hack the code
to serialize every change or use a debugger to forcibly flush the
changes every time.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Nov 4, 2019 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Oct 30, 2019 at 9:38 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Tue, Oct 22, 2019 at 10:52 PM Tomas Vondra
> > <tomas.vondra@2ndquadrant.com> wrote:
> > >
> > > I think the patch should do the simplest thing possible, i.e. what it
> > > does today. Otherwise we'll never get it committed.
> > >
> > I found a couple of crashes while reviewing and testing flushing of
> > open transaction data:
> >
>
> Thanks for doing these tests.  However, I don't think these issues are
> anyway related to this patch.  It seems to be base code issues
> manifested by this patch.  See my analysis below.
>
> > Issue 1:
> > #0  0x00007f22c5722337 in raise () from /lib64/libc.so.6
> > #1  0x00007f22c5723a28 in abort () from /lib64/libc.so.6
> > #2  0x0000000000ec5390 in ExceptionalCondition
> > (conditionName=0x10ea814 "!dlist_is_empty(head)", errorType=0x10ea804
> > "FailedAssertion",
> >     fileName=0x10ea7e0 "../../../../src/include/lib/ilist.h",
> > lineNumber=458) at assert.c:54
> > #3  0x0000000000b4fb91 in dlist_tail_element_off (head=0x19e4db8,
> > off=64) at ../../../../src/include/lib/ilist.h:458
> > #4  0x0000000000b546d0 in ReorderBufferAbortOld (rb=0x191b6b0,
> > oldestRunningXid=3834) at reorderbuffer.c:1966
> > #5  0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x19af990,
> > buf=0x7ffcbc26dc50) at decode.c:332
> >
>
> This seems to be the problem of base code where we abort immediately
> after serializing the changes because in that case, the changes list
> will be empty.  I think you can try to reproduce it via the debugger
> or by hacking the code such that it serializes after every change and
> then if you abort after one change, it should hit this problem.
>
I think you might need to kill the server after all the changes are
serialized; otherwise a normal abort will hit ReorderBufferAbort, which
will remove your ReorderBufferTXN entry, and you will never hit this
case.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
vignesh C
Дата:
On Mon, Nov 4, 2019 at 3:46 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Thu, Oct 24, 2019 at 7:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Oct 22, 2019 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > I have merged bugs_and_review_comments_fix.patch changes to 0001 and 0002.
> > >
> >
> > I was wondering whether we have checked the code coverage after this
> > patch?  Previously, the existing tests seem to be covering most parts
> > of the function ReorderBufferSerializeTXN [1].  After this patch, the
> > timing to call ReorderBufferSerializeTXN will change, so that might
> > impact the testing of the same.  If it is already covered, then I
> > would like to either add a new test or extend existing test with the
> > help of new spill counters.  If it is not getting covered, then we
> > need to think of extending the existing test or write a new test to
> > cover the function ReorderBufferSerializeTXN.
> >
> I have run the tests with coverage and found that
> ReorderBufferSerializeTXN is not being hit.
> The reason it is not being hit is because of the following check in
> ReorderBufferCheckMemoryLimit:
>     /* bail out if we haven't exceeded the memory limit */
>     if (rb->size < logical_decoding_work_mem * 1024L)
>         return;
> Previously the tests from contrib/test_decoding could hit
> ReorderBufferSerializeTXN function.
> I'm checking if we can modify the test or add new test to hit
> ReorderBufferSerializeTXN function.

I have made one change to the configuration file in the
contrib/test_decoding directory; with that, the coverage seems to be
fine. I have seen that the coverage is almost the same as before
applying the patch. I have attached the test change and the coverage
report for reference. The coverage report includes the core logical work
memory files for the base code and after applying the
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer and
0002-Track-statistics-for-spilling patches.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
vignesh C
Дата:
On Mon, Nov 4, 2019 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Oct 30, 2019 at 9:38 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Tue, Oct 22, 2019 at 10:52 PM Tomas Vondra
> > <tomas.vondra@2ndquadrant.com> wrote:
> > >
> > > I think the patch should do the simplest thing possible, i.e. what it
> > > does today. Otherwise we'll never get it committed.
> > >
> > I found a couple of crashes while reviewing and testing flushing of
> > open transaction data:
> >
>
> Thanks for doing these tests.  However, I don't think these issues are
> anyway related to this patch.  It seems to be base code issues
> manifested by this patch.  See my analysis below.
>
> > Issue 1:
> > #0  0x00007f22c5722337 in raise () from /lib64/libc.so.6
> > #1  0x00007f22c5723a28 in abort () from /lib64/libc.so.6
> > #2  0x0000000000ec5390 in ExceptionalCondition
> > (conditionName=0x10ea814 "!dlist_is_empty(head)", errorType=0x10ea804
> > "FailedAssertion",
> >     fileName=0x10ea7e0 "../../../../src/include/lib/ilist.h",
> > lineNumber=458) at assert.c:54
> > #3  0x0000000000b4fb91 in dlist_tail_element_off (head=0x19e4db8,
> > off=64) at ../../../../src/include/lib/ilist.h:458
> > #4  0x0000000000b546d0 in ReorderBufferAbortOld (rb=0x191b6b0,
> > oldestRunningXid=3834) at reorderbuffer.c:1966
> > #5  0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x19af990,
> > buf=0x7ffcbc26dc50) at decode.c:332
> >
>
> This seems to be the problem of base code where we abort immediately
> after serializing the changes because in that case, the changes list
> will be empty.  I think you can try to reproduce it via the debugger
> or by hacking the code such that it serializes after every change and
> then if you abort after one change, it should hit this problem.
>
> >
> > Issue 2:
> > #0  0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
> > #1  0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
> > #2  0x0000000000ec4e1d in ExceptionalCondition
> > (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
> > errorType=0x10ea284 "FailedAssertion",
> >     fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
> > #3  0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
> > txn=0x2bafb08) at reorderbuffer.c:3052
> > #4  0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> > txn=0x2bafb08) at reorderbuffer.c:1318
> > #5  0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> > txn=0x2b9d778) at reorderbuffer.c:1257
> > #6  0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
> > oldestRunningXid=3835) at reorderbuffer.c:1973
> >
>
> This seems to be again the problem with base code as we don't update
> the final_lsn for subtransactions during ReorderBufferAbortOld.  This
> can also be reproduced with some hacking in code or via debugger in a
> similar way as explained for the previous problem but with a
> difference that there must be subtransaction involved in this case.
>
> > #7  0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0,
> > buf=0x7ffcbc74cc00) at decode.c:332
> > #8  0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0,
> > record=0x2b67990) at decode.c:121
> > #9  0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845
> >
> > These failures come randomly.
> > I'm not able to reproduce this issue with simple test case.
>
> Yeah, it appears to be difficult to reproduce unless you hack the code
> to serialize every change or use debugger to forcefully flush the
> changes every time.
>

Thanks Amit for your analysis. I was able to reproduce the above issue
consistently by making some code changes and with the help of a
debugger. I made one change so that it flushes every time instead of
flushing only after the buffer size exceeds logical_decoding_work_mem,
attached to one of the transactions, and called abort. When the server
restarts after the abort, this problem occurs consistently. I could
reproduce the issue with the base code as well, so it seems this is not
an issue of the 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer
patch but exists in the base code. I will post the issue on hackers with
details.
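
The change was essentially to disable the early bail-out in
ReorderBufferCheckMemoryLimit, roughly like this (a sketch of the
debugging hack only, never meant for commit), so that every change gets
serialized:

    /* force serialization on every change, for reproducing the crash only */
#if 0
    /* bail out if we haven't exceeded the memory limit */
    if (rb->size < logical_decoding_work_mem * 1024L)
        return;
#endif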

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Wed, Nov 6, 2019 at 11:33 AM vignesh C <vignesh21@gmail.com> wrote:
>
> I have made one change to the configuration file in
> contrib/test_decoding directory, with that the coverage seems to be
> fine. I have seen that the coverage is almost like the code before
> applying the patch. I have attached the test change and the coverage
> report for reference. Coverage report includes the core logical work
> memory files for base code and by applying
> 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer and
> 0002-Track-statistics-for-spilling patches.
>

Thanks,  I have incorporated your test changes and modified the two
patches.  Please see attached.

Changes:
---------------
1. In guc.c, we should include reorderbuffer.h, not logical.h, as
logical_decoding_work_mem is now defined in reorderbuffer.h.

2.
+ *   To limit the amount of memory used by decoded changes, we track memory
+ *   used at the reorder buffer level (i.e. total amount of memory), and for
+ *   each toplevel transaction. When the total amount of used memory exceeds
+ *   the limit, the toplevel transaction consuming the most memory is then
+ *   serialized to disk.

In the above comments, removed 'toplevel' as we track memory usage for
both toplevel and subtransactions.

3. There were still a few mentions of streaming which I have removed.

4. In the docs, the type for stats spill_* was integer whereas it
should be bigint.

5.
+UpdateSpillStats(LogicalDecodingContext *ctx)
+{
+ ReorderBuffer *rb = ctx->reorder;
+
+ SpinLockAcquire(&MyWalSnd->mutex);
+
+ MyWalSnd->spillTxns = rb->spillTxns;
+ MyWalSnd->spillCount = rb->spillCount;
+ MyWalSnd->spillBytes = rb->spillBytes;
+
+ elog(WARNING, "UpdateSpillStats: updating stats %p %ld %ld %ld",
+ rb, rb->spillTxns, rb->spillCount, rb->spillBytes);

Changed the above elog to DEBUG1 as otherwise it was getting printed
very frequently.  I think we can make it DEBUG2 if we want.

6. There was an extra space in rules.out due to which test was
failing.  I have fixed it.

What do you think?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Thu, Nov 7, 2019 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Nov 6, 2019 at 11:33 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > I have made one change to the configuration file in
> > contrib/test_decoding directory, with that the coverage seems to be
> > fine. I have seen that the coverage is almost like the code before
> > applying the patch. I have attached the test change and the coverage
> > report for reference. Coverage report includes the core logical work
> > memory files for base code and by applying
> > 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer and
> > 0002-Track-statistics-for-spilling patches.
> >
>
> Thanks,  I have incorporated your test changes and modified the two
> patches.  Please see attached.
>
> Changes:
> ---------------
> 1. In guc.c, we should include reorderbuffer.h, not logical.h as we
> define logical_decoding_work_mem in earlier.
Yeah Right.
>
> 2.
> + *   To limit the amount of memory used by decoded changes, we track memory
> + *   used at the reorder buffer level (i.e. total amount of memory), and for
> + *   each toplevel transaction. When the total amount of used memory exceeds
> + *   the limit, the toplevel transaction consuming the most memory is then
> + *   serialized to disk.
>
> In the above comments, removed 'toplevel' as we track memory usage for
> both toplevel and subtransactions.
Correct.
>
> 3. There were still a few mentions of streaming which I have removed.
>
ok
> 4. In the docs, the type for stats spill_* was integer whereas it
> should be bigint.
ok
>
> 5.
> +UpdateSpillStats(LogicalDecodingContext *ctx)
> +{
> + ReorderBuffer *rb = ctx->reorder;
> +
> + SpinLockAcquire(&MyWalSnd->mutex);
> +
> + MyWalSnd->spillTxns = rb->spillTxns;
> + MyWalSnd->spillCount = rb->spillCount;
> + MyWalSnd->spillBytes = rb->spillBytes;
> +
> + elog(WARNING, "UpdateSpillStats: updating stats %p %ld %ld %ld",
> + rb, rb->spillTxns, rb->spillCount, rb->spillBytes);
>
> Changed the above elog to DEBUG1 as otherwise it was getting printed
> very frequently.  I think we can make it DEBUG2 if we want.
Yeah, it should not be WARNING.
>
> 6. There was an extra space in rules.out due to which test was
> failing.  I have fixed it.
My bad.  I introduced it while separating out the changes for the spilling.

> What do you think?
I have reviewed your changes and looks fine to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Thu, Nov 7, 2019 at 3:50 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Nov 7, 2019 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > What do you think?
> I have reviewed your changes and looks fine to me.
>

Okay, thanks.  I am also happy with the two patches I have posted in
my last email [1].

Tomas, would you like to take a look at those patches and commit them
if you are happy with them, or would you like me to do it?

Some notes before commit:
--------------------------------------
1.
Commit message need to be changed for the first patch
-------------------------------------------------------------------------
A.
> The memory limit is defined by a new logical_decoding_work_mem GUC, so for example we can do this

    SET logical_decoding_work_mem = '128kB'

> to trigger very aggressive streaming. The minimum value is 64kB.

I think this patch doesn't contain streaming, so we either need to
reword it or remove it.

B.
> The logical_decoding_work_mem may be set either in postgresql.conf, in which case it serves as the default for all
> publishers on that instance, or when creating the
> subscription, using a work_mem parameter in the WITH clause (specifies number of kilobytes).

We need to reword this as we have decided to remove the setting from
the subscription side as of now.

2. I think we can change the message level in UpdateSpillStats() to DEBUG2.

3. I think we need catversion bump for the second patch.

4. I think we can combine both patches and commit as one patch, but it
is okay to commit them separately as well.


[1] - https://www.postgresql.org/message-id/CAA4eK1Kdmi6VVguKEHV6Ho2isCPVFdQtt0WLsK10fiuE59_0Yw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Alexey Kondratov
Дата:
On 04.11.2019 13:05, Kuntal Ghosh wrote:
> On Mon, Nov 4, 2019 at 3:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> So your result shows that with "streaming on", performance is
>> degrading?  By any chance did you try to see where is the bottleneck?
>>
> Right. But, as we increase the logical_decoding_work_mem, the
> performance improves. I've not analyzed the bottleneck yet. I'm
> looking into the same.

My guess is that 64 kB is just too small a value. In the table schema used 
for the tests every row takes at least 24 bytes for storing column values. 
Thus, with this logical_decoding_work_mem value the limit should be hit 
after about 2500+ rows, i.e. about 400 times during a transaction of 
1000000 rows.
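
As a quick sanity check of that estimate, a toy calculation (not patch
code):

#include <stdio.h>

int
main(void)
{
    const long  work_mem = 64 * 1024;   /* logical_decoding_work_mem = 64kB */
    const long  row_size = 24;          /* minimum column payload per row */
    const long  xact_rows = 1000000;
    const long  rows_per_limit = work_mem / row_size;   /* ~2730 rows */

    printf("limit hit after ~%ld rows, i.e. at least ~%ld times per xact\n",
           rows_per_limit, xact_rows / rows_per_limit);
    return 0;
}

The real count is somewhat higher, since each change carries additional
per-change overhead on top of the 24 bytes of column data.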

That is just too frequent, while ReorderBufferStreamTXN includes a whole 
bunch of logic, e.g. it always starts an internal transaction:

/*
  * Decoding needs access to syscaches et al., which in turn use
  * heavyweight locks and such. Thus we need to have enough state around to
  * keep track of those.  The easiest way is to simply use a transaction
  * internally.  That also allows us to easily enforce that nothing writes
  * to the database by checking for xid assignments. ...
  */

Also, it issues separate stream_start/stop messages around each streamed 
transaction chunk. So if streaming starts and stops too frequently, it 
adds additional overhead and may even interfere with the current 
in-progress transaction.

If I get it correctly, then this is rather expected with too small values 
of logical_decoding_work_mem. It could probably be optimized, but I am not 
sure that it is worth doing right now.


Regards

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Kuntal Ghosh
Дата:
On Tue, Nov 12, 2019 at 4:12 PM Alexey Kondratov
<a.kondratov@postgrespro.ru> wrote:
>
> On 04.11.2019 13:05, Kuntal Ghosh wrote:
> > On Mon, Nov 4, 2019 at 3:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >> So your result shows that with "streaming on", performance is
> >> degrading?  By any chance did you try to see where is the bottleneck?
> >>
> > Right. But, as we increase the logical_decoding_work_mem, the
> > performance improves. I've not analyzed the bottleneck yet. I'm
> > looking into the same.
>
> My guess is that 64 kB is just too small value. In the table schema used
> for tests every rows takes at least 24 bytes for storing column values.
> Thus, with this logical_decoding_work_mem value the limit should be hit
> after about 2500+ rows, or about 400 times during transaction of 1000000
> rows size.
>
> It is just too frequent, while ReorderBufferStreamTXN includes a whole
> bunch of logic, e.g. it always starts internal transaction:
>
> /*
>   * Decoding needs access to syscaches et al., which in turn use
>   * heavyweight locks and such. Thus we need to have enough state around to
>   * keep track of those.  The easiest way is to simply use a transaction
>   * internally.  That also allows us to easily enforce that nothing writes
>   * to the database by checking for xid assignments. ...
>   */
>
> Also it issues separated stream_start/stop messages around each streamed
> transaction chunk. So if streaming starts and stops too frequently it
> adds additional overhead and may even interfere with current in-progress
> transaction.
>
Yeah, I've also found the same. With each stream_start/stop message, it
writes 1 byte of checksum and 4 bytes for the number of sub-transactions,
which increases the write amplification significantly.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>

As I mentioned a few days back, the first patch in this series
is ready to go [1] (I am hoping Tomas will pick it up), so I have
started the review of the other patches.

Review/Questions on 0002-Immediately-WAL-log-assignments.patch
-------------------------------------------------------------------------------------------------
1. This patch adds the top_xid to WAL the first time WAL for
a subtransaction XID is written, so that the changes of an
in-progress transaction can be decoded correctly.  This patch also
removes logging and applying of WAL for XLOG_XACT_ASSIGNMENT, which
might have some effect.  On replay of that record, we prune
KnownAssignedXids to prevent overflow of that array.  See comments in
procarray.c (KnownAssignedTransactionIds sub-module).  Can you please
explain how, after removing the WAL for XLOG_XACT_ASSIGNMENT, we will
handle that, or am I missing something and there is no impact?

2.
+#define XLOG_INCLUDE_INVALS 0x08 /* include invalidations */

This doesn't seem to be used in this patch.

[1] - https://www.postgresql.org/message-id/CAA4eK1JM0%3DRwODZQrn8DTQ3dbcb9xwKDdHCmVOryAk_xoKf9Nw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Wed, Nov 13, 2019 at 5:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
>
> As mentioned by me a few days back that the first patch in this series
> is ready to go [1] (I am hoping Tomas will pick it up), so I have
> started the review of other patches
>
> Review/Questions on 0002-Immediately-WAL-log-assignments.patch
> -------------------------------------------------------------------------------------------------
> 1. This patch adds the top_xid in WAL whenever the first time WAL for
> a subtransaction XID is written to correctly decode the changes of
> in-progress transaction.  This patch also removes logging and applying
> WAL for XLOG_XACT_ASSIGNMENT which might have some effect.  As replay
> of that, it prunes KnownAssignedXids to prevent overflow of that
> array.  See comments in procarray.c (KnownAssignedTransactionIds
> sub-module).  Can you please explain how after removing the WAL for
> XLOG_XACT_ASSIGNMENT, we will handle that or I am missing something
> and there is no impact of same?

It seems like a problem to me as well.  One option could be that,
since we now add the top transaction id in the first WAL record of the
subtransaction, we could directly update pg_subtrans and avoid adding
the subtransaction id to KnownAssignedXids, marking it as
lastOverflowedXid instead.  But I don't think we should go in that
direction; otherwise it will impact the performance of visibility
checks on the hot standby.  Let's see what Tomas has in mind.

>
> 2.
> +#define XLOG_INCLUDE_INVALS 0x08 /* include invalidations */
>
> This doesn't seem to be used in this patch.

>
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Thu, Nov 14, 2019 at 9:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Nov 13, 2019 at 5:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> >
> > As mentioned by me a few days back that the first patch in this series
> > is ready to go [1] (I am hoping Tomas will pick it up), so I have
> > started the review of other patches
> >
> > Review/Questions on 0002-Immediately-WAL-log-assignments.patch
> > -------------------------------------------------------------------------------------------------
> > 1. This patch adds the top_xid in WAL whenever the first time WAL for
> > a subtransaction XID is written to correctly decode the changes of
> > in-progress transaction.  This patch also removes logging and applying
> > WAL for XLOG_XACT_ASSIGNMENT which might have some effect.  As replay
> > of that, it prunes KnownAssignedXids to prevent overflow of that
> > array.  See comments in procarray.c (KnownAssignedTransactionIds
> > sub-module).  Can you please explain how after removing the WAL for
> > XLOG_XACT_ASSIGNMENT, we will handle that or I am missing something
> > and there is no impact of same?
>
> It seems like a problem to me as well.   One option could be that
> since now we are adding the top transaction id in the first WAL of the
> subtransaction we can directly update the pg_subtrans and avoid adding
> sub transaction id in the KnownAssignedXids and mark it as
> lastOverflowedXid.
>

Hmm, I am not sure if we can do that easily because I think in
RecordKnownAssignedTransactionIds, we add those based on the gap via
KnownAssignedXidsAdd and only remove them later while applying WAL for
XLOG_XACT_ASSIGNMENT.  I think if we really want to go in this
direction then for each WAL record we need to check if it has
XLR_BLOCK_ID_TOPLEVEL_XID set and then call function
ProcArrayApplyXidAssignment() with the required information.  I think
this line of attack has WAL overhead both on master whenever
subtransactions are involved and also on hot-standby for doing the
work for each subtransaction separately.  The WAL apply needs to
acquire and release PROCArrayLock in exclusive mode for each
subtransaction whereas now it does it once for
PGPROC_MAX_CACHED_SUBXIDS number of subtransactions which can conflict
with queries running on standby.

The other idea could be that we keep the current XLOG_XACT_ASSIGNMENT
mechanism (WAL logging and apply of same on hot-standby) as it is and
additionally log top_xid the first time when WAL is written for a
subtransaction only when wal_level >= WAL_LEVEL_LOGICAL.  Then use the
same for logical decoding.  The advantage of this approach is that we
will incur the overhead of additional transactionid only when required
especially not with default server configuration.
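A minimal sketch of the gating condition that second idea implies (illustrative only; the "not yet logged for this subxact" check and the register helper are hypothetical names, not existing functions):

	/*
	 * Attach the top-level xid to a subtransaction's first WAL record only
	 * when logical decoding can need it (wal_level >= logical).
	 */
	if (XLogLogicalInfoActive() &&
		IsSubTransaction() &&
		SubXactTopXidNotYetLogged())	/* hypothetical check */
	{
		TransactionId top_xid = GetTopTransactionIdIfAny();

		/* include top_xid with the record, e.g. under XLR_BLOCK_ID_TOPLEVEL_XID */
		XLogIncludeTopXid(top_xid);		/* hypothetical helper */
	}

With the default wal_level the condition is never true, so the extra bytes are paid only by setups that actually use logical decoding.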

Thoughts?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Nov 14, 2019 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Nov 14, 2019 at 9:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Nov 13, 2019 at 5:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > >
> > > As mentioned by me a few days back that the first patch in this series
> > > is ready to go [1] (I am hoping Tomas will pick it up), so I have
> > > started the review of other patches
> > >
> > > Review/Questions on 0002-Immediately-WAL-log-assignments.patch
> > > -------------------------------------------------------------------------------------------------
> > > 1. This patch adds the top_xid in WAL whenever the first time WAL for
> > > a subtransaction XID is written to correctly decode the changes of
> > > in-progress transaction.  This patch also removes logging and applying
> > > WAL for XLOG_XACT_ASSIGNMENT which might have some effect.  As replay
> > > of that, it prunes KnownAssignedXids to prevent overflow of that
> > > array.  See comments in procarray.c (KnownAssignedTransactionIds
> > > sub-module).  Can you please explain how after removing the WAL for
> > > XLOG_XACT_ASSIGNMENT, we will handle that or I am missing something
> > > and there is no impact of same?
> >
> > It seems like a problem to me as well.   One option could be that
> > since now we are adding the top transaction id in the first WAL of the
> > subtransaction we can directly update the pg_subtrans and avoid adding
> > sub transaction id in the KnownAssignedXids and mark it as
> > lastOverflowedXid.
> >
>
> Hmm, I am not sure if we can do that easily because I think in
> RecordKnownAssignedTransactionIds, we add those based on the gap via
> KnownAssignedXidsAdd and only remove them later while applying WAL for
> XLOG_XACT_ASSIGNMENT.  I think if we really want to go in this
> direction then for each WAL record we need to check if it has
> XLR_BLOCK_ID_TOPLEVEL_XID set and then call function
> ProcArrayApplyXidAssignment() with the required information.  I think
> this line of attack has WAL overhead both on master whenever
> subtransactions are involved and also on hot-standby for doing the
> work for each subtransaction separately.  The WAL apply needs to
> acquire and release PROCArrayLock in exclusive mode for each
> subtransaction whereas now it does it once for
> PGPROC_MAX_CACHED_SUBXIDS number of subtransactions which can conflict
> with queries running on standby.
Right
>
> The other idea could be that we keep the current XLOG_XACT_ASSIGNMENT
> mechanism (WAL logging and apply of same on hot-standby) as it is and
> additionally log top_xid the first time when WAL is written for a
> subtransaction only when wal_level >= WAL_LEVEL_LOGICAL.  Then use the
> same for logical decoding.  The advantage of this approach is that we
> will incur the overhead of additional transactionid only when required
> especially not with default server configuration.
>
> Thoughts?
The idea seems reasonable to me.

Apart from this, I have another question in
0003-Issue-individual-invalidations-with-wal_level-logical.patch

@@ -543,6 +588,18 @@ RegisterSnapshotInvalidation(Oid dbId, Oid relId)
 {
  AddSnapshotInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
     dbId, relId);
+
+ /* Issue an invalidation WAL record (when wal_level=logical) */
+ if (XLogLogicalInfoActive())
+ {
+ SharedInvalidationMessage msg;
+
+ msg.sn.id = SHAREDINVALSNAPSHOT_ID;
+ msg.sn.dbId = dbId;
+ msg.sn.relId = relId;
+
+ LogLogicalInvalidations(1, &msg, false);
+ }
 }

I am not sure why do we need to explicitly WAL log the snapshot
invalidation? because this is logged for invalidating the catalog
snapshot and for logical decoding we use HistoricSnapshot, not the
catalog snapshot.  I might be missing something?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Nov 14, 2019 at 3:40 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>
> Apart from this, I have another question in
> 0003-Issue-individual-invalidations-with-wal_level-logical.patch
>
> @@ -543,6 +588,18 @@ RegisterSnapshotInvalidation(Oid dbId, Oid relId)
>  {
>   AddSnapshotInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
>      dbId, relId);
> +
> + /* Issue an invalidation WAL record (when wal_level=logical) */
> + if (XLogLogicalInfoActive())
> + {
> + SharedInvalidationMessage msg;
> +
> + msg.sn.id = SHAREDINVALSNAPSHOT_ID;
> + msg.sn.dbId = dbId;
> + msg.sn.relId = relId;
> +
> + LogLogicalInvalidations(1, &msg, false);
> + }
>  }
>
> I am not sure why do we need to explicitly WAL log the snapshot
> invalidation? because this is logged for invalidating the catalog
> snapshot and for logical decoding we use HistoricSnapshot, not the
> catalog snapshot.
>

I think it has been logged because without this patch as well we log
all the invalidation messages at commit time and process them during
decoding.  However, I agree that this particular invalidation message
is not required for logical decoding for the reason you mentioned.  I
think as we are explicitly logging invalidations, so it is better to
avoid this if we can.

Few other comments on this patch:
1.
+ case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+ /*
+ * Execute the invalidation message locally.
+ *
+ * XXX Do we need to care about relcacheInitFileInval and
+ * the other fields added to ReorderBufferChange, or just
+ * about the message itself?
+ */
+ LocalExecuteInvalidationMessage(&change->data.inval.msg);
+ break;

Here, why are we executing messages individually?  Can't we just
follow what we do in DecodeCommit which is to record the invalidations
in ReorderBufferTXN as we encounter them and then allow them to
execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.  Is there a
reason why we don't do ReorderBufferXidSetCatalogChanges when we
receive any invalidation message?

2.
@@ -3025,8 +3073,8 @@ ReorderBufferRestoreChange(ReorderBuffer *rb,
ReorderBufferTXN *txn,
  * although we don't check the memory limit when restoring the changes in
  * this branch (we only do that when initially queueing the changes after
  * decoding), because we will release the changes later, and that will
- * update the accounting too (subtracting the size from the counters).
- * And we don't want to underflow there.
+ * update the accounting too (subtracting the size from the counters). And
+ * we don't want to underflow there.
  */

This seems like an unrelated change.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Nov 14, 2019 at 3:40 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> >
> > Apart from this, I have another question in
> > 0003-Issue-individual-invalidations-with-wal_level-logical.patch
> >
> > @@ -543,6 +588,18 @@ RegisterSnapshotInvalidation(Oid dbId, Oid relId)
> >  {
> >   AddSnapshotInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
> >      dbId, relId);
> > +
> > + /* Issue an invalidation WAL record (when wal_level=logical) */
> > + if (XLogLogicalInfoActive())
> > + {
> > + SharedInvalidationMessage msg;
> > +
> > + msg.sn.id = SHAREDINVALSNAPSHOT_ID;
> > + msg.sn.dbId = dbId;
> > + msg.sn.relId = relId;
> > +
> > + LogLogicalInvalidations(1, &msg, false);
> > + }
> >  }
> >
> > I am not sure why do we need to explicitly WAL log the snapshot
> > invalidation? because this is logged for invalidating the catalog
> > snapshot and for logical decoding we use HistoricSnapshot, not the
> > catalog snapshot.
> >
>
> I think it has been logged because without this patch as well we log
> all the invalidation messages at commit time and process them during
> decoding.  However, I agree that this particular invalidation message
> is not required for logical decoding for the reason you mentioned.  I
> think as we are explicitly logging invalidations, so it is better to
> avoid this if we can.

Ok
>
> Few other comments on this patch:
> 1.
> + case REORDER_BUFFER_CHANGE_INVALIDATION:
> +
> + /*
> + * Execute the invalidation message locally.
> + *
> + * XXX Do we need to care about relcacheInitFileInval and
> + * the other fields added to ReorderBufferChange, or just
> + * about the message itself?
> + */
> + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> + break;
>
> Here, why are we executing messages individually?  Can't we just
> follow what we do in DecodeCommit which is to record the invalidations
> in ReorderBufferTXN as we encounter them and then allow them to
> execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.  Is there a
> reason why we don't do ReorderBufferXidSetCatalogChanges when we
> receive any invalidation message?
IMHO, the reason is that in DecodeCommit, we get all the invalidation
at one time so, at REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID, we don't
know which invalidation message to execute so for being safe we have
to execute all.  But, since we are logging all invalidation
individually, we exactly know at this stage which cache to invalidate.
So it is better to only invalidate required cache not all.

>
> 2.
> @@ -3025,8 +3073,8 @@ ReorderBufferRestoreChange(ReorderBuffer *rb,
> ReorderBufferTXN *txn,
>   * although we don't check the memory limit when restoring the changes in
>   * this branch (we only do that when initially queueing the changes after
>   * decoding), because we will release the changes later, and that will
> - * update the accounting too (subtracting the size from the counters).
> - * And we don't want to underflow there.
> + * update the accounting too (subtracting the size from the counters). And
> + * we don't want to underflow there.
>   */
>
> This seems like an unrelated change.
Indeed.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Few other comments on this patch:
> > 1.
> > + case REORDER_BUFFER_CHANGE_INVALIDATION:
> > +
> > + /*
> > + * Execute the invalidation message locally.
> > + *
> > + * XXX Do we need to care about relcacheInitFileInval and
> > + * the other fields added to ReorderBufferChange, or just
> > + * about the message itself?
> > + */
> > + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> > + break;
> >
> > Here, why are we executing messages individually?  Can't we just
> > follow what we do in DecodeCommit which is to record the invalidations
> > in ReorderBufferTXN as we encounter them and then allow them to
> > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.  Is there a
> > reason why we don't do ReorderBufferXidSetCatalogChanges when we
> > receive any invalidation message?
> IMHO, the reason is that in DecodeCommit, we get all the invalidation
> at one time so, at REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID, we don't
> know which invalidation message to execute so for being safe we have
> to execute all.  But, since we are logging all invalidation
> individually, we exactly know at this stage which cache to invalidate.
> So it is better to only invalidate required cache not all.
>

In that case, invalidations can be processed multiple times, the first
time when these individual WAL logs for invalidation are processed and
then later at commit time when we accumulate all invalidation messages
and then execute them for REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.
Can we avoid to execute invalidations from other places after this
patch which also includes executing them as part of XLOG_INVALIDATIONS
processing?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Nov 7, 2019 at 5:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Some notes before commit:
> --------------------------------------
> 1.
> Commit message need to be changed for the first patch
> -------------------------------------------------------------------------
> A.
> > The memory limit is defined by a new logical_decoding_work_mem GUC, so for example we can do this
>
>     SET logical_decoding_work_mem = '128kB'
>
> > to trigger very aggressive streaming. The minimum value is 64kB.
>
> I think this patch doesn't contain streaming, so we either need to
> reword it or remove it.
>
> B.
> > The logical_decoding_work_mem may be set either in postgresql.conf, in which case it serves as the default for all publishers on that instance, or when creating the subscription, using a work_mem paramemter in the WITH clause (specifies number of kilobytes).
>
> We need to reword this as we have decided to remove the setting from
> the subscription side as of now.
>
> 2. I think we can change the message level in UpdateSpillStats() to DEBUG2.
>

I have made these modifications and additionally ran pgindent.

> 4. I think we can combine both patches and commit as one patch, but it
> is okay to commit them separately as well.
>

I am not sure if this is a good idea, so still kept them as separate.

Tomas, do let me know if you want to commit these or if you have any
comments, otherwise, I will commit these on Tuesday (19-Nov)?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Few other comments on this patch:
> > > 1.
> > > + case REORDER_BUFFER_CHANGE_INVALIDATION:
> > > +
> > > + /*
> > > + * Execute the invalidation message locally.
> > > + *
> > > + * XXX Do we need to care about relcacheInitFileInval and
> > > + * the other fields added to ReorderBufferChange, or just
> > > + * about the message itself?
> > > + */
> > > + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> > > + break;
> > >
> > > Here, why are we executing messages individually?  Can't we just
> > > follow what we do in DecodeCommit which is to record the invalidations
> > > in ReorderBufferTXN as we encounter them and then allow them to
> > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.  Is there a
> > > reason why we don't do ReorderBufferXidSetCatalogChanges when we
> > > receive any invalidation message?

I think it's fine to call ReorderBufferXidSetCatalogChanges, only on
commit.  Because this is required to add any committed transaction to
the snapshot if it has done any catalog changes.  So I think there is
no point in setting that flag every time we get an invalidation
message.


> > IMHO, the reason is that in DecodeCommit, we get all the invalidation
> > at one time so, at REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID, we don't
> > know which invalidation message to execute so for being safe we have
> > to execute all.  But, since we are logging all invalidation
> > individually, we exactly know at this stage which cache to invalidate.
> > So it is better to only invalidate required cache not all.
> >
>
> In that case, invalidations can be processed multiple times, the first
> time when these individual WAL logs for invalidation are processed and
> then later at commit time when we accumulate all invalidation messages
> and then execute them for REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.
> Can we avoid to execute invalidations from other places after this
> patch which also includes executing them as part of XLOG_INVALIDATIONS
> processing?
I think we can avoid invalidation which is done as part of
REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.  I need to further
investigate the invalidation which is done as part of
XLOG_INVALIDATIONS.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > >
> > > > Few other comments on this patch:
> > > > 1.
> > > > + case REORDER_BUFFER_CHANGE_INVALIDATION:
> > > > +
> > > > + /*
> > > > + * Execute the invalidation message locally.
> > > > + *
> > > > + * XXX Do we need to care about relcacheInitFileInval and
> > > > + * the other fields added to ReorderBufferChange, or just
> > > > + * about the message itself?
> > > > + */
> > > > + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> > > > + break;
> > > >
> > > > Here, why are we executing messages individually?  Can't we just
> > > > follow what we do in DecodeCommit which is to record the invalidations
> > > > in ReorderBufferTXN as we encounter them and then allow them to
> > > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.  Is there a
> > > > reason why we don't do ReorderBufferXidSetCatalogChanges when we
> > > > receive any invalidation message?
>
> I think it's fine to call ReorderBufferXidSetCatalogChanges, only on
> commit.  Because this is required to add any committed transaction to
> the snapshot if it has done any catalog changes.
>

Hmm, this is also used to build cid hash map (see
ReorderBufferBuildTupleCidHash) which we need to use while streaming
changes for the in-progress transactions.  So, I think that it would
be required earlier (before commit) as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Sat, Nov 16, 2019 at 6:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Nov 7, 2019 at 5:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Some notes before commit:
> > --------------------------------------
> > 1.
> > Commit message need to be changed for the first patch
> > -------------------------------------------------------------------------
> > A.
> > > The memory limit is defined by a new logical_decoding_work_mem GUC, so for example we can do this
> >
> >     SET logical_decoding_work_mem = '128kB'
> >
> > > to trigger very aggressive streaming. The minimum value is 64kB.
> >
> > I think this patch doesn't contain streaming, so we either need to
> > reword it or remove it.
> >
> > B.
> > > The logical_decoding_work_mem may be set either in postgresql.conf, in which case it serves as the default for all publishers on that instance, or when creating the subscription, using a work_mem paramemter in the WITH clause (specifies number of kilobytes).
> >
> > We need to reword this as we have decided to remove the setting from
> > the subscription side as of now.
> >
> > 2. I think we can change the message level in UpdateSpillStats() to DEBUG2.
> >
>
> I have made these modifications and additionally ran pgindent.
>
> > 4. I think we can combine both patches and commit as one patch, but it
> > is okay to commit them separately as well.
> >
>
> I am not sure if this is a good idea, so still kept them as separate.
>

I have committed the first patch.  I will commit the second one
related to stats of spilled xacts on Thursday.  The second patch needs
catalog version bump as well because we are modifying the catalog
contents in that patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, Nov 19, 2019 at 5:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > >
> > > > > Few other comments on this patch:
> > > > > 1.
> > > > > + case REORDER_BUFFER_CHANGE_INVALIDATION:
> > > > > +
> > > > > + /*
> > > > > + * Execute the invalidation message locally.
> > > > > + *
> > > > > + * XXX Do we need to care about relcacheInitFileInval and
> > > > > + * the other fields added to ReorderBufferChange, or just
> > > > > + * about the message itself?
> > > > > + */
> > > > > + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> > > > > + break;
> > > > >
> > > > > Here, why are we executing messages individually?  Can't we just
> > > > > follow what we do in DecodeCommit which is to record the invalidations
> > > > > in ReorderBufferTXN as we encounter them and then allow them to
> > > > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.  Is there a
> > > > > reason why we don't do ReorderBufferXidSetCatalogChanges when we
> > > > > receive any invalidation message?
> >
> > I think it's fine to call ReorderBufferXidSetCatalogChanges, only on
> > commit.  Because this is required to add any committed transaction to
> > the snapshot if it has done any catalog changes.
> >
>
> Hmm, this is also used to build cid hash map (see
> ReorderBufferBuildTupleCidHash) which we need to use while streaming
> changes for the in-progress transactions.  So, I think that it would
> be required earlier (before commit) as well.
>
Oh right,  I guess I missed that part.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Nov 20, 2019 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Nov 19, 2019 at 5:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > >
> > > > > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > >
> > > > > >
> > > > > > Few other comments on this patch:
> > > > > > 1.
> > > > > > + case REORDER_BUFFER_CHANGE_INVALIDATION:
> > > > > > +
> > > > > > + /*
> > > > > > + * Execute the invalidation message locally.
> > > > > > + *
> > > > > > + * XXX Do we need to care about relcacheInitFileInval and
> > > > > > + * the other fields added to ReorderBufferChange, or just
> > > > > > + * about the message itself?
> > > > > > + */
> > > > > > + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> > > > > > + break;
> > > > > >
> > > > > > Here, why are we executing messages individually?  Can't we just
> > > > > > follow what we do in DecodeCommit which is to record the invalidations
> > > > > > in ReorderBufferTXN as we encounter them and then allow them to
> > > > > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.  Is there a
> > > > > > reason why we don't do ReorderBufferXidSetCatalogChanges when we
> > > > > > receive any invalidation message?
> > >
> > > I think it's fine to call ReorderBufferXidSetCatalogChanges, only on
> > > commit.  Because this is required to add any committed transaction to
> > > the snapshot if it has done any catalog changes.
> > >
> >
> > Hmm, this is also used to build cid hash map (see
> > ReorderBufferBuildTupleCidHash) which we need to use while streaming
> > changes for the in-progress transactions.  So, I think that it would
> > be required earlier (before commit) as well.
> >
> Oh right,  I guess I missed that part.

Attached a new rebased version of the patch set.   I have fixed all
the issues discussed up-thread and agreed upon.

Pending Issues:
1. The default value of the logical_decoding_work_mem is set to 64kb
in test_decoding/logical.conf.  So we need to change the expected
output files for the test decoding module.
2. Need to complete the patch for concurrent abort handling of the
(sub)transaction.  There are some pending issues with the existing
patch[1].

[1] https://www.postgresql.org/message-id/CAFiTN-ud98kWHCo2YKS55H8rGw3_A7ESyssHwU0xPU6KJsoy6A%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Nov 20, 2019 at 8:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Nov 20, 2019 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Nov 19, 2019 at 5:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > > >
> > > > > > >
> > > > > > > Few other comments on this patch:
> > > > > > > 1.
> > > > > > > + case REORDER_BUFFER_CHANGE_INVALIDATION:
> > > > > > > +
> > > > > > > + /*
> > > > > > > + * Execute the invalidation message locally.
> > > > > > > + *
> > > > > > > + * XXX Do we need to care about relcacheInitFileInval and
> > > > > > > + * the other fields added to ReorderBufferChange, or just
> > > > > > > + * about the message itself?
> > > > > > > + */
> > > > > > > + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> > > > > > > + break;
> > > > > > >
> > > > > > > Here, why are we executing messages individually?  Can't we just
> > > > > > > follow what we do in DecodeCommit which is to record the invalidations
> > > > > > > in ReorderBufferTXN as we encounter them and then allow them to
> > > > > > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.  Is there a
> > > > > > > reason why we don't do ReorderBufferXidSetCatalogChanges when we
> > > > > > > receive any invalidation message?
> > > >
> > > > I think it's fine to call ReorderBufferXidSetCatalogChanges, only on
> > > > commit.  Because this is required to add any committed transaction to
> > > > the snapshot if it has done any catalog changes.
> > > >
> > >
> > > Hmm, this is also used to build cid hash map (see
> > > ReorderBufferBuildTupleCidHash) which we need to use while streaming
> > > changes for the in-progress transactions.  So, I think that it would
> > > be required earlier (before commit) as well.
> > >
> > Oh right,  I guess I missed that part.
>
> Attached a new rebased version of the patch set.   I have fixed all
> the issues discussed up-thread and agreed upon.
>
> Pending Issues:
> 1. The default value of the logical_decoding_work_mem is set to 64kb
> in test_decoding/logical.conf.  So we need to change the expected
> output files for the test decoding module.
> 2. Need to complete the patch for concurrent abort handling of the
> (sub)transaction.  There are some pending issues with the existing
> patch[1].
> [1] https://www.postgresql.org/message-id/CAFiTN-ud98kWHCo2YKS55H8rGw3_A7ESyssHwU0xPU6KJsoy6A%40mail.gmail.com
Apart from these there is one more issue reported upthread[2]
[2] https://www.postgresql.org/message-id/CAFiTN-vrSNkAfRVrWKe2R1dqFBTubjt%3DDYS%3DjhH%2BjiCoBODdaw%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Nov 19, 2019 at 5:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Nov 16, 2019 at 6:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Nov 7, 2019 at 5:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Some notes before commit:
> > > --------------------------------------
> > > 1.
> > > Commit message need to be changed for the first patch
> > > -------------------------------------------------------------------------
> > > A.
> > > > The memory limit is defined by a new logical_decoding_work_mem GUC, so for example we can do this
> > >
> > >     SET logical_decoding_work_mem = '128kB'
> > >
> > > > to trigger very aggressive streaming. The minimum value is 64kB.
> > >
> > > I think this patch doesn't contain streaming, so we either need to
> > > reword it or remove it.
> > >
> > > B.
> > > > The logical_decoding_work_mem may be set either in postgresql.conf, in which case it serves as the default for all publishers on that instance, or when creating the subscription, using a work_mem paramemter in the WITH clause (specifies number of kilobytes).
> > >
> > > We need to reword this as we have decided to remove the setting from
> > > the subscription side as of now.
> > >
> > > 2. I think we can change the message level in UpdateSpillStats() to DEBUG2.
> > >
> >
> > I have made these modifications and additionally ran pgindent.
> >
> > > 4. I think we can combine both patches and commit as one patch, but it
> > > is okay to commit them separately as well.
> > >
> >
> > I am not sure if this is a good idea, so still kept them as separate.
> >
>
> I have committed the first patch.  I will commit the second one
> related to stats of spilled xacts on Thursday.  The second patch needs
> catalog version bump as well because we are modifying the catalog
> contents in that patch.
>

Committed the second one as well.  Now, we can move to a review of
patches for "streaming of in-progress transactions".

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Nov 21, 2019 at 9:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Nov 20, 2019 at 8:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Nov 20, 2019 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Tue, Nov 19, 2019 at 5:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > >
> > > > > On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > > > >
> > > > > > > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > Few other comments on this patch:
> > > > > > > > 1.
> > > > > > > > + case REORDER_BUFFER_CHANGE_INVALIDATION:
> > > > > > > > +
> > > > > > > > + /*
> > > > > > > > + * Execute the invalidation message locally.
> > > > > > > > + *
> > > > > > > > + * XXX Do we need to care about relcacheInitFileInval and
> > > > > > > > + * the other fields added to ReorderBufferChange, or just
> > > > > > > > + * about the message itself?
> > > > > > > > + */
> > > > > > > > + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> > > > > > > > + break;
> > > > > > > >
> > > > > > > > Here, why are we executing messages individually?  Can't we just
> > > > > > > > follow what we do in DecodeCommit which is to record the invalidations
> > > > > > > > in ReorderBufferTXN as we encounter them and then allow them to
> > > > > > > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.  Is there a
> > > > > > > > reason why we don't do ReorderBufferXidSetCatalogChanges when we
> > > > > > > > receive any invalidation message?
> > > > >
> > > > > I think it's fine to call ReorderBufferXidSetCatalogChanges, only on
> > > > > commit.  Because this is required to add any committed transaction to
> > > > > the snapshot if it has done any catalog changes.
> > > > >
> > > >
> > > > Hmm, this is also used to build cid hash map (see
> > > > ReorderBufferBuildTupleCidHash) which we need to use while streaming
> > > > changes for the in-progress transactions.  So, I think that it would
> > > > be required earlier (before commit) as well.
> > > >
> > > Oh right,  I guess I missed that part.
> >
> > Attached a new rebased version of the patch set.   I have fixed all
> > the issues discussed up-thread and agreed upon.
> >
> > Pending Issues:
> > 1. The default value of the logical_decoding_work_mem is set to 64kb
> > in test_decoding/logical.conf.  So we need to change the expected
> > output files for the test decoding module.
> > 2. Need to complete the patch for concurrent abort handling of the
> > (sub)transaction.  There are some pending issues with the existing
> > patch[1].
> > [1] https://www.postgresql.org/message-id/CAFiTN-ud98kWHCo2YKS55H8rGw3_A7ESyssHwU0xPU6KJsoy6A%40mail.gmail.com
> Apart from these there is one more issue reported upthread[2]
> [2] https://www.postgresql.org/message-id/CAFiTN-vrSNkAfRVrWKe2R1dqFBTubjt%3DDYS%3DjhH%2BjiCoBODdaw%40mail.gmail.com
>
I have rebased the patch on the latest head and also fix the issue of
"concurrent abort handling of the (sub)transaction." and attached as
(v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
the complete patch set.  I have added the version number so that we
can track the changes.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Michael Paquier
Date:
On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> I have rebased the patch on the latest head and also fix the issue of
> "concurrent abort handling of the (sub)transaction." and attached as
> (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> the complete patch set.  I have added the version number so that we
> can track the changes.

The patch has rotten a bit and does not apply anymore.  Could you
please send a rebased version?  I have moved it to next CF, waiting on
author.
--
Michael

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > I have rebased the patch on the latest head and also fix the issue of
> > "concurrent abort handling of the (sub)transaction." and attached as
> > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > the complete patch set.  I have added the version number so that we
> > can track the changes.
>
> The patch has rotten a bit and does not apply anymore.  Could you
> please send a rebased version?  I have moved it to next CF, waiting on
> author.

I have rebased the patch set on the latest head.

Apart from this, there is one issue reported by my colleague Vignesh.
The issue is that if we use more than two relations in a transaction
then there is an error on standby (no relation map entry for remote
relation ID 16390).  After analyzing I have found that for the
streaming transaction an "is_schema_sent" flag is kept in
ReorderBufferTXN.  And, I think that is done so that we can send the
schema for each transaction stream so that if any subtransaction gets
aborted we don't lose the logical WAL for that schema.  But, this
solution has induced a very basic issue that if a transaction operate
on more than 1 relation then after sending the schema for the first
relation it will mark the flag true and the schema for the subsequent
relations will never be sent.  I am still working on finding a better
solution for this if anyone has any opinion/solution about this feel
free to suggest.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, Dec 2, 2019 at 2:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > > I have rebased the patch on the latest head and also fix the issue of
> > > "concurrent abort handling of the (sub)transaction." and attached as
> > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > > the complete patch set.  I have added the version number so that we
> > > can track the changes.
> >
> > The patch has rotten a bit and does not apply anymore.  Could you
> > please send a rebased version?  I have moved it to next CF, waiting on
> > author.
>
> I have rebased the patch set on the latest head.
>
I have reviewed the patch set and here are a few comments/questions

1.
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ Relation relation,
+ ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}

Should we show the tuple in the streamed change like we do for the
pg_decode_change?

2. pg_logical_slot_get_changes_guts
It recreate the decoding slot [ctx =
CreateDecodingContext(InvalidXLogRecPtr] but doesn't set the streaming
to false, should we pass a parameter to
pg_logical_slot_get_changes_guts saying whether we want streamed results or not

3.
+ XLogRecPtr prev_lsn = InvalidXLogRecPtr;
  ReorderBufferChange *change;
  ReorderBufferChange *specinsert = NULL;

@@ -1565,6 +1965,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
  Relation relation = NULL;
  Oid reloid;

+ /*
+ * Enforce correct ordering of changes, merged from multiple
+ * subtransactions. The changes may have the same LSN due to
+ * MULTI_INSERT xlog records.
+ */
+ if (prev_lsn != InvalidXLogRecPtr)
+ Assert(prev_lsn <= change->lsn);
+
+ prev_lsn = change->lsn;
I did not understand, how this change is relavent to this patch

4.
+ /*
+ * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
+ * information about subtransactions, which could arrive after streaming start.
+ */
+ if (!txn->is_schema_sent)
+ snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+ txn, command_id);

In which case, txn->is_schema_sent will be true, because at the end of
the stream in ReorderBufferExecuteInvalidations we are always setting
it false,
so while sending next stream it will always be false.  That means we
never required snapshot_now variable in ReorderBufferTXN.

5.
@@ -2299,6 +2746,23 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
*rb, TransactionId xid,
  txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

  txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+ /*
+ * We read catalog changes from WAL, which are not yet sent, so
+ * invalidate current schema in order output plugin can resend
+ * schema again.
+ */
+ txn->is_schema_sent = false;

Same as point 4, during decode time it will never be true.

6.
+ /* send fields */
+ pq_sendint64(out, commit_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);

Commit_time and end_lsn is used in standby_feedback


7.
+ /* FIXME optimize the search by bsearch on sorted data */
+ for (i = nsubxacts; i > 0; i--)
+ {
+ if (subxacts[i - 1].xid == subxid)
+ {
+ subidx = (i - 1);
+ found = true;
+ break;
+ }
+ }
We can not rollback intermediate subtransaction without rollbacking
latest sub-transaction, so why do we need
to search in the array?  It will always be the the last subxact no?

8.
+ /*
+ * send feedback to upstream
+ *
+ * XXX Probably should send a valid LSN. But which one?
+ */
+ send_feedback(InvalidXLogRecPtr, false, false);

Why feedback is sent for every change?


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > > I have rebased the patch on the latest head and also fix the issue of
> > > "concurrent abort handling of the (sub)transaction." and attached as
> > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > > the complete patch set.  I have added the version number so that we
> > > can track the changes.
> >
> > The patch has rotten a bit and does not apply anymore.  Could you
> > please send a rebased version?  I have moved it to next CF, waiting on
> > author.
>
> I have rebased the patch set on the latest head.
>
> Apart from this, there is one issue reported by my colleague Vignesh.
> The issue is that if we use more than two relations in a transaction
> then there is an error on standby (no relation map entry for remote
> relation ID 16390).  After analyzing I have found that for the
> streaming transaction an "is_schema_sent" flag is kept in
> ReorderBufferTXN.  And, I think that is done so that we can send the
> schema for each transaction stream so that if any subtransaction gets
> aborted we don't lose the logical WAL for that schema.  But, this
> solution has induced a very basic issue that if a transaction operate
> on more than 1 relation then after sending the schema for the first
> relation it will mark the flag true and the schema for the subsequent
> relations will never be sent.
>

How about keeping a list of top-level xids in each RelationSyncEntry?
Basically, whenever we send the schema for any transaction, we note
that in RelationSyncEntry and at abort time we can remove xid from the
list.  Now, whenever, we check whether to send schema for any
operation in a transaction, we will check if our xid is present in
that list for a particular RelationSyncEntry and take an action based
on that (if xid is present, then we won't send the schema, otherwise,
send it).
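A minimal sketch of that bookkeeping in pgoutput (field and helper names are assumptions, not patch code; xid-list helpers such as lappend_xid/list_member_xid may need to be added):

/*
 * Track, per relation, which top-level xids have already been sent this
 * relation's schema while streaming; the xid is dropped again on abort.
 */
typedef struct RelationSyncEntry
{
	Oid			relid;			/* relation oid */
	bool		schema_sent;	/* used for regular (committed) decoding */
	List	   *streamed_xids;	/* top-level xids that already got the schema */
} RelationSyncEntry;

static bool
schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId topxid)
{
	return list_member_xid(entry->streamed_xids, topxid);	/* assumed helper */
}

static void
set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId topxid)
{
	/* the list must be allocated in the entry's long-lived memory context */
	entry->streamed_xids = lappend_xid(entry->streamed_xids, topxid);	/* assumed helper */
}

On a stream abort of a top-level transaction, its xid would simply be removed from every entry, so a later retry re-sends the schema.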


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, Dec 10, 2019 at 9:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:
> > >
> > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > > > I have rebased the patch on the latest head and also fix the issue of
> > > > "concurrent abort handling of the (sub)transaction." and attached as
> > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > > > the complete patch set.  I have added the version number so that we
> > > > can track the changes.
> > >
> > > The patch has rotten a bit and does not apply anymore.  Could you
> > > please send a rebased version?  I have moved it to next CF, waiting on
> > > author.
> >
> > I have rebased the patch set on the latest head.
> >
> > Apart from this, there is one issue reported by my colleague Vignesh.
> > The issue is that if we use more than two relations in a transaction
> > then there is an error on standby (no relation map entry for remote
> > relation ID 16390).  After analyzing I have found that for the
> > streaming transaction an "is_schema_sent" flag is kept in
> > ReorderBufferTXN.  And, I think that is done so that we can send the
> > schema for each transaction stream so that if any subtransaction gets
> > aborted we don't lose the logical WAL for that schema.  But, this
> > solution has induced a very basic issue that if a transaction operate
> > on more than 1 relation then after sending the schema for the first
> > relation it will mark the flag true and the schema for the subsequent
> > relations will never be sent.
> >
>
> How about keeping a list of top-level xids in each RelationSyncEntry?
> Basically, whenever we send the schema for any transaction, we note
> that in RelationSyncEntry and at abort time we can remove xid from the
> list.  Now, whenever, we check whether to send schema for any
> operation in a transaction, we will check if our xid is present in
> that list for a particular RelationSyncEntry and take an action based
> on that (if xid is present, then we won't send the schema, otherwise,
> send it).
The idea make sense to me.  I will try to write a patch for this and test.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Dec 9, 2019 at 1:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have reviewed the patch set and here are a few comments/questions
>
> 1.
> +static void
> +pg_decode_stream_change(LogicalDecodingContext *ctx,
> + ReorderBufferTXN *txn,
> + Relation relation,
> + ReorderBufferChange *change)
> +{
> + OutputPluginPrepareWrite(ctx, true);
> + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> + OutputPluginWrite(ctx, true);
> +}
>
> Should we show the tuple in the streamed change like we do for the
> pg_decode_change?
>

I think so.  The patch shows the message in
pg_decode_stream_message(), so why not show the tuple here as well?

> 2. pg_logical_slot_get_changes_guts
> It recreate the decoding slot [ctx =
> CreateDecodingContext(InvalidXLogRecPtr] but doesn't set the streaming
> to false, should we pass a parameter to
> pg_logical_slot_get_changes_guts saying whether we want streamed results or not
>

CreateDecodingContext internally calls StartupDecodingContext, which
sets the value of streaming based on whether the plugin has provided
callbacks for the streaming functions.  Isn't that sufficient?  Why do we
need additional parameters here?
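For reference, the kind of startup check being referred to is roughly this (a sketch only, mirroring the stream_* callback names used elsewhere in this thread):

	/* enable streaming only if the plugin provides all stream callbacks */
	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) &&
		(ctx->callbacks.stream_stop_cb != NULL) &&
		(ctx->callbacks.stream_abort_cb != NULL) &&
		(ctx->callbacks.stream_commit_cb != NULL) &&
		(ctx->callbacks.stream_change_cb != NULL);

That is, whether to stream is derived from what the plugin advertises rather than from an extra parameter to the SQL functions.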

> 3.
> + XLogRecPtr prev_lsn = InvalidXLogRecPtr;
>   ReorderBufferChange *change;
>   ReorderBufferChange *specinsert = NULL;
>
> @@ -1565,6 +1965,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
>   Relation relation = NULL;
>   Oid reloid;
>
> + /*
> + * Enforce correct ordering of changes, merged from multiple
> + * subtransactions. The changes may have the same LSN due to
> + * MULTI_INSERT xlog records.
> + */
> + if (prev_lsn != InvalidXLogRecPtr)
> + Assert(prev_lsn <= change->lsn);
> +
> + prev_lsn = change->lsn;
> I did not understand, how this change is relavent to this patch
>

This is just to ensure that changes are in LSN order.  I think as we
are merging the changes before commit for streaming, it is good to
have such an Assertion for ReorderBufferStreamTXN.   And, if we want
to have it in ReorderBufferStreamTXN, then there is no harm in keeping
it in ReorderBufferCommit() at least to keep the code consistent.  Do
you see any problem with this?

> 4.
> + /*
> + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
> + * information about subtransactions, which could arrive after streaming start.
> + */
> + if (!txn->is_schema_sent)
> + snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
> + txn, command_id);
>
> In which case, txn->is_schema_sent will be true, because at the end of
> the stream in ReorderBufferExecuteInvalidations we are always setting
> it false,
> so while sending next stream it will always be false.  That means we
> never required snapshot_now variable in ReorderBufferTXN.
>

You are probably right, but as discussed we need to change this part
of design/code (when to send schema changes) due to the issues
discovered.  So, I think this part will anyway change when we fix that
problem.

> 5.
> @@ -2299,6 +2746,23 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> *rb, TransactionId xid,
>   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
>
>   txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> +
> + /*
> + * We read catalog changes from WAL, which are not yet sent, so
> + * invalidate current schema in order output plugin can resend
> + * schema again.
> + */
> + txn->is_schema_sent = false;
>
> Same as point 4, during decode time it will never be true.
>

Sure, my previous point's reply applies here as well.

> 6.
> + /* send fields */
> + pq_sendint64(out, commit_lsn);
> + pq_sendint64(out, txn->end_lsn);
> + pq_sendint64(out, txn->commit_time);
>
> Commit_time and end_lsn is used in standby_feedback
>

I don't understand what you mean by this.  Can you be a bit more clear?

>
> 7.
> + /* FIXME optimize the search by bsearch on sorted data */
> + for (i = nsubxacts; i > 0; i--)
> + {
> + if (subxacts[i - 1].xid == subxid)
> + {
> + subidx = (i - 1);
> + found = true;
> + break;
> + }
> + }
> We can not rollback intermediate subtransaction without rollbacking
> latest sub-transaction, so why do we need
> to search in the array?  It will always be the the last subxact no?
>

The same thing is already mentioned in the comments above this code
("XXX Or perhaps we can rely on the aborts to arrive in the reverse
order, i.e. from the inner-most subxact (when nested)? In which case
we could simply check the last element.").  I think what you are
saying is probably right, but we can leave this as it is for now
because this is a minor optimization which can be done later as well
if required.  However, if you see any correctness issue, then we can
discuss.
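For reference, a minimal sketch of the last-element check being discussed (assuming abort records really do arrive innermost-first; variable names as in the quoted snippet):

	/* the aborted subxact, if tracked at all, must be the most recent entry */
	if (nsubxacts > 0 && subxacts[nsubxacts - 1].xid == subxid)
	{
		subidx = nsubxacts - 1;
		found = true;
	}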


> 8.
> + /*
> + * send feedback to upstream
> + *
> + * XXX Probably should send a valid LSN. But which one?
> + */
> + send_feedback(InvalidXLogRecPtr, false, false);
>
> Why feedback is sent for every change?
>

I will study this part of the patch and let you know my opinion.

Few comments on this patch series:

0001-Immediately-WAL-log-assignments:
------------------------------------------------------------

The commit message still refers to the old design for this patch.  I
think you need to modify the commit message as per the latest patch.

0002-Issue-individual-invalidations-with-wal_level-log
----------------------------------------------------------------------------
1.
xact_desc_invalidations(StringInfo buf,
{
..
+ else if (msg->id == SHAREDINVALSNAPSHOT_ID)
+ appendStringInfo(buf, " snapshot %u", msg->sn.relId);

You have removed logging for the above cache but forgot to remove its
reference from one of the places.  Also, I think you need to add a
comment somewhere in inval.c to say why you are writing WAL for
some types of invalidations and not for others.

0003-Extend-the-output-plugin-API-with-stream-methods
--------------------------------------------------------------------------------
1.
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_message_cb</function> are optional.

stream_message_cb is mentioned twice.  It seems the second one is for truncate.

2.
size of the transaction size and network bandwidth, the transfer time
+    may significantly increase the apply lag.

/size of the transaction size/size of the transaction

no need to mention size twice.

3.
+    Similarly to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress
transactions)
+    exceeds limit defined by <varname>logical_work_mem</varname> setting.

The guc name used is wrong.  /Similarly to/Similar to/

4.
stream_start_cb_wrapper()
{
..
+ /* state.report_location = apply_lsn; */
..
+ /* FIXME ctx->write_location = apply_lsn; */
..
}

See if we can fix these, and similar ones, in the callback for the stop.  I
think we don't have final_lsn till we commit/abort.  Can we compute it
before calling these APIs?


0005-Gracefully-handle-concurrent-aborts-of-uncommitte
----------------------------------------------------------------------------------
1.
@@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
  PG_CATCH();
  {
  /* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
+
  if (iterstate)
  ReorderBufferIterTXNFinish(rb, iterstate);

Spurious line change.

2. The commit message of this patch refers to Prepared transactions.
I think that needs to be changed.

0006-Implement-streaming-mode-in-ReorderBuffer
-------------------------------------------------------------------------
1.
+
+/* iterator for streaming (only get data from memory) */
+static ReorderBufferStreamIterTXNState *ReorderBufferStreamIterTXNInit(ReorderBuffer *rb,
+                                                                       ReorderBufferTXN *txn);
+
+static ReorderBufferChange *ReorderBufferStreamIterTXNNext(ReorderBuffer *rb,
+                                                           ReorderBufferStreamIterTXNState *state);
+
+static void ReorderBufferStreamIterTXNFinish(ReorderBuffer *rb,
+                                             ReorderBufferStreamIterTXNState *state);

Do we really need to introduce new APIs for iterating over changes
from streamed transactions?  Why can't we reuse the same APIs as we
use for committed xacts?

2.
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)

Please write some comments atop ReorderBufferStreamCommit.

3.
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
..
+ if (txn->snapshot_now == NULL)
+ {
+     dlist_iter subxact_i;
+
+     /* make sure this transaction is streamed for the first time */
+     Assert(!rbtxn_is_streamed(txn));
+
+     /* at the beginning we should have invalid command ID */
+     Assert(txn->command_id == InvalidCommandId);
+
+     dlist_foreach(subxact_i, &txn->subtxns)
+     {
+         ReorderBufferTXN *subtxn;
+
+         subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+
+         if (subtxn->base_snapshot != NULL &&
+             (txn->base_snapshot == NULL ||
+              txn->base_snapshot_lsn > subtxn->base_snapshot_lsn))
+         {
+             txn->base_snapshot = subtxn->base_snapshot;

The logic here seems correct, but I am not sure why it does not consider
purging the base snapshot before assigning the subtxn's snapshot, and
similarly why we do not purge the subtxn's snapshot once we are done
with it.  I think we can use ReorderBufferTransferSnapToParent to
replace part of the logic here.  Do you see any reason for doing things
differently here?

4. In ReorderBufferStreamTXN, why do you need to use
ReorderBufferCopySnap to assign txn->base_snapshot to snapshot_now.

5. I see a lot of code similarity in ReorderBufferStreamTXN and
existing ReorderBufferCommit. I understand that there are some subtle
differences due to which we need to write this new function but can't
we encapsulate the specific parts of code in functions and then call
from both places.  I am talking about code in different cases for
change->action.

6. + * Note: We never stream and serialize a transaction at the same time (e
/(e/(we

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Robert Haas
Date:
On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have rebased the patch set on the latest head.

0001 looks like a clever approach, but are you sure it doesn't hurt
performance when many small XLOG records are being inserted? I think
XLogRecordAssemble() can get pretty hot in some workloads.

With regard to 0002, logging a separate WAL record for each
invalidation seems painful; I think most operations that generate
invalidations generate a bunch of them all at once. Perhaps you could
just queue up invalidations as they happen, and then force anything
that's been queued up to be emitted into WAL just before you emit any
WAL record that might need to be decoded.

Regarding 0005, it seems to me that this is no good:

+ errmsg("improper heap_getnext call")));

I think we should be using elog() rather than ereport() here, because
this should only happen if there's a bug in a logical decoding plugin.
At first, I thought maybe this should just be an Assert(), but since
there are third-party logical decoding plugins available, checking
this even in non-assert builds seems like a good idea. However, I
think making it translatable is overkill; users should never see this,
only developers.

I also think that the message is really bad, because it just tells you
that you did something bad. It gives no inkling as to why it was bad.

0006 contains lots of XXX comments that look like real issues. I guess
those need to be fixed. Also, why don't we do the thing that the
commit message for 0006 says we could "theoretically" do? I don't
understand why we need the k-way merge at all.

+ if (prev_lsn != InvalidXLogRecPtr)
+ Assert(prev_lsn <= change->lsn);

There is no reason to ever write an if statement that contains only an
Assert, and it's bad style. Write Assert(prev_lsn == InvalidXLogRecPtr
|| prev_lsn <= change->lsn), or better yet, use XLogRecPtrIsInvalid.
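
Just to spell out that suggestion, a minimal sketch of the rewritten
check, using the existing XLogRecPtrIsInvalid macro (variable names
follow the quoted hunk):

    /* enforce LSN ordering without an if-wrapped Assert */
    Assert(XLogRecPtrIsInvalid(prev_lsn) || prev_lsn <= change->lsn);
    prev_lsn = change->lsn;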

The purpose and mechanism of the is_schema_sent flag is not clear to
me. The word "schema" here seems to be being used to mean "snapshot,"
which is rather confusing.

I'm also somewhat unclear on what's happening here with invalidations.
Perhaps that's as much a defect in my understanding as it is
reflective of any problem with the patch, but I also don't see any
comments either in 0002 or later patches explaining the theory of
operation. If I've missed some, please point me in the right
direction. Hypothetically speaking, it seems to me that if you just
did InvalidateSystemCaches() every time the snapshot changed, you
wouldn't need anything else (unless we're concerned with
non-transactional invalidation messages like smgr and relmapper
invalidations; not quite sure how those are handled). And, on the
other hand, if we don't do InvalidateSystemCaches() every time the
snapshot changes, then I don't understand why this works now, even
without streaming.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Dec 11, 2019 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Dec 9, 2019 at 1:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have review the patch set and here are few comments/questions
> >
> > 1.
> > +static void
> > +pg_decode_stream_change(LogicalDecodingContext *ctx,
> > + ReorderBufferTXN *txn,
> > + Relation relation,
> > + ReorderBufferChange *change)
> > +{
> > + OutputPluginPrepareWrite(ctx, true);
> > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> > + OutputPluginWrite(ctx, true);
> > +}
> >
> > Should we show the tuple in the streamed change like we do for the
> > pg_decode_change?
> >
>
> I think so.  The patch shows the message in
> pg_decode_stream_message(), so why to prohibit showing tuple here?
>
> > 2. pg_logical_slot_get_changes_guts
> > It recreate the decoding slot [ctx =
> > CreateDecodingContext(InvalidXLogRecPtr] but doesn't set the streaming
> > to false, should we pass a parameter to
> > pg_logical_slot_get_changes_guts saying whether we want streamed results or not
> >
>
> CreateDecodingContext internally calls StartupDecodingContext which
> sets the value of streaming based on if the plugin has provided
> callbacks for streaming functions. Isn't that sufficient?  Why do we
> need additional parameters here?

I don't think we should stream just because the plugin provides the
streaming functions.  For example, the pgoutput plugin provides the
streaming functions, but we only stream if streaming is enabled in the
CREATE SUBSCRIPTION command.  So I feel the same should be true for any
plugin.

>
> > 3.
> > + XLogRecPtr prev_lsn = InvalidXLogRecPtr;
> >   ReorderBufferChange *change;
> >   ReorderBufferChange *specinsert = NULL;
> >
> > @@ -1565,6 +1965,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> >   Relation relation = NULL;
> >   Oid reloid;
> >
> > + /*
> > + * Enforce correct ordering of changes, merged from multiple
> > + * subtransactions. The changes may have the same LSN due to
> > + * MULTI_INSERT xlog records.
> > + */
> > + if (prev_lsn != InvalidXLogRecPtr)
> > + Assert(prev_lsn <= change->lsn);
> > +
> > + prev_lsn = change->lsn;
> > I did not understand, how this change is relavent to this patch
> >
>
> This is just to ensure that changes are in LSN order.  I think as we
> are merging the changes before commit for streaming, it is good to
> have such an Assertion for ReorderBufferStreamTXN.   And, if we want
> to have it in ReorderBufferStreamTXN, then there is no harm in keeping
> it in ReorderBufferCommit() at least to keep the code consistent.  Do
> you see any problem with this?
I am fine with this.
>
> > 4.
> > + /*
> > + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
> > + * information about subtransactions, which could arrive after streaming start.
> > + */
> > + if (!txn->is_schema_sent)
> > + snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
> > + txn, command_id);
> >
> > In which case, txn->is_schema_sent will be true, because at the end of
> > the stream in ReorderBufferExecuteInvalidations we are always setting
> > it false,
> > so while sending next stream it will always be false.  That means we
> > never required snapshot_now variable in ReorderBufferTXN.
> >
>
> You are probably right, but as discussed we need to change this part
> of design/code (when to send schema changes) due to the issues
> discovered.  So, I think this part will anyway change when we fix that
> problem.
Makes sense.
>
> > 5.
> > @@ -2299,6 +2746,23 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> > *rb, TransactionId xid,
> >   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> >
> >   txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > +
> > + /*
> > + * We read catalog changes from WAL, which are not yet sent, so
> > + * invalidate current schema in order output plugin can resend
> > + * schema again.
> > + */
> > + txn->is_schema_sent = false;
> >
> > Same as point 4, during decode time it will never be true.
> >
>
> Sure, my previous point's reply applies here as well.
ok
>
> > 6.
> > + /* send fields */
> > + pq_sendint64(out, commit_lsn);
> > + pq_sendint64(out, txn->end_lsn);
> > + pq_sendint64(out, txn->commit_time);
> >
> > Commit_time and end_lsn is used in standby_feedback
> >
>
> I don't understand what you mean by this.  Can you be a bit more clear?
I think I pasted it here by mistake.  Just ignore it.
>
> >
> > 7.
> > + /* FIXME optimize the search by bsearch on sorted data */
> > + for (i = nsubxacts; i > 0; i--)
> > + {
> > + if (subxacts[i - 1].xid == subxid)
> > + {
> > + subidx = (i - 1);
> > + found = true;
> > + break;
> > + }
> > + }
> > We can not rollback intermediate subtransaction without rollbacking
> > latest sub-transaction, so why do we need
> > to search in the array?  It will always be the the last subxact no?
> >
>
> The same thing is already mentioned in the comments above this code
> ("XXX Or perhaps we can rely on the aborts to arrive in the reverse
> order, i.e. from the inner-most subxact (when nested)? In which case
> we could simply check the last element.").  I think what you are
> saying is probably right, but we can leave this as it is for now
> because this is a minor optimization which can be done later as well
> if required.  However, if you see any correctness issue, then we can
> discuss.
I think more than an optimization, the question here is whether this
loop is required at all.  By optimizing it we would not be adding
complexity; in fact, it would become simpler.  I think we need more
analysis of whether we need to traverse the array at all, so maybe for
the time being we can leave this as it is.
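
For illustration, a rough sketch of that simplification, assuming the
aborts always arrive for the inner-most subtransaction first so that
only the last array entry needs to be checked (names are taken from the
quoted hunk; whether the assumption holds is exactly the open question):

    /* Sketch: assume aborts arrive inner-most first; check only the last entry. */
    if (nsubxacts > 0 && subxacts[nsubxacts - 1].xid == subxid)
    {
        subidx = nsubxacts - 1;
        found = true;
    }

    /* We should not receive aborts for unknown subtransactions. */
    Assert(found);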
>
> > 8.
> > + /*
> > + * send feedback to upstream
> > + *
> > + * XXX Probably should send a valid LSN. But which one?
> > + */
> > + send_feedback(InvalidXLogRecPtr, false, false);
> >
> > Why feedback is sent for every change?
> >
>
> I will study this part of the patch and let you know my opinion.
Sure.
>
> Few comments on this patch series:
>
> 0001-Immediately-WAL-log-assignments:
> ------------------------------------------------------------
>
> The commit message still refers to the old design for this patch.  I
> think you need to modify the commit message as per the latest patch.
>
> 0002-Issue-individual-invalidations-with-wal_level-log
> ----------------------------------------------------------------------------
> 1.
> xact_desc_invalidations(StringInfo buf,
> {
> ..
> + else if (msg->id == SHAREDINVALSNAPSHOT_ID)
> + appendStringInfo(buf, " snapshot %u", msg->sn.relId);
>
> You have removed logging for the above cache but forgot to remove its
> reference from one of the places.  Also, I think you need to add a
> comment somewhere in inval.c to say why you are writing for WAL for
> some types of invalidations and not for others?
>
> 0003-Extend-the-output-plugin-API-with-stream-methods
> --------------------------------------------------------------------------------
> 1.
> +     are required, while <function>stream_message_cb</function> and
> +     <function>stream_message_cb</function> are optional.
>
> stream_message_cb is mentioned twice.  It seems the second one is for truncate.
>
> 2.
> size of the transaction size and network bandwidth, the transfer time
> +    may significantly increase the apply lag.
>
> /size of the transaction size/size of the transaction
>
> no need to mention size twice.
>
> 3.
> +    Similarly to spill-to-disk behavior, streaming is triggered when the total
> +    amount of changes decoded from the WAL (for all in-progress
> transactions)
> +    exceeds limit defined by <varname>logical_work_mem</varname> setting.
>
> The guc name used is wrong.  /Similarly to/Similar to/
>
> 4.
> stream_start_cb_wrapper()
> {
> ..
> + /* state.report_location = apply_lsn; */
> ..
> + /* FIXME ctx->write_location = apply_lsn; */
> ..
> }
>
> See, if we can fix these and similar in the callback for the stop.  I
> think we don't have final_lsn till we commit/abort.  Can we compute
> before calling these API's?
>
>
> 0005-Gracefully-handle-concurrent-aborts-of-uncommitte
> ----------------------------------------------------------------------------------
> 1.
> @@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
>   PG_CATCH();
>   {
>   /* TODO: Encapsulate cleanup
> from the PG_TRY and PG_CATCH blocks */
> +
>   if (iterstate)
>   ReorderBufferIterTXNFinish(rb, iterstate);
>
> Spurious line change.
>
> 2. The commit message of this patch refers to Prepared transactions.
> I think that needs to be changed.
>
> 0006-Implement-streaming-mode-in-ReorderBuffer
> -------------------------------------------------------------------------
> 1.
> +
> +/* iterator for streaming (only get data from memory) */
> +static ReorderBufferStreamIterTXNState * ReorderBufferStreamIterTXNInit(
> +
> ReorderBuffer *rb,
> +
> ReorderBufferTXN
> *txn);
> +
> +static ReorderBufferChange *ReorderBufferStreamIterTXNNext(
> +    ReorderBuffer *rb,
> +
>    ReorderBufferStreamIterTXNState * state);
> +
> +static void ReorderBufferStreamIterTXNFinish(
> +
> ReorderBuffer *rb,
> +
> ReorderBufferStreamIterTXNState * state);
>
> Do we really need to introduce new APIs for iterating over changes
> from streamed transactions?  Why can't we reuse the same API's as we
> use for committed xacts?
>
> 2.
> +static void
> +ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
>
> Please write some comments atop ReorderBufferStreamCommit.
>
> 3.
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> ..
> + if (txn->snapshot_now
> == NULL)
> + {
> + dlist_iter subxact_i;
> +
> + /* make sure this transaction is streamed for the first time */
> +
> Assert(!rbtxn_is_streamed(txn));
> +
> + /* at the beginning we should have invalid command ID */
> + Assert(txn->command_id ==
> InvalidCommandId);
> +
> + dlist_foreach(subxact_i, &txn->subtxns)
> + {
> + ReorderBufferTXN *subtxn;
> +
> +
> subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
> +
> + if (subtxn->base_snapshot != NULL &&
> +
> (txn->base_snapshot == NULL ||
> + txn->base_snapshot_lsn > subtxn->base_snapshot_lsn))
> + {
> +
> txn->base_snapshot = subtxn->base_snapshot;
>
> The logic here seems to be correct, but I am not sure why it is not
> considered to purge the base snapshot before assigning the subtxn's
> snapshot and similarly, we have not purged snapshot for subtxn once we
> are done with it.  I think we can use
> ReorderBufferTransferSnapToParent to replace part of the logic here.
> Do you see any reason for doing things differently here?
>
> 4. In ReorderBufferStreamTXN, why do you need to use
> ReorderBufferCopySnap to assign txn->base_snapshot to snapshot_now.
>
> 5. I see a lot of code similarity in ReorderBufferStreamTXN and
> existing ReorderBufferCommit. I understand that there are some subtle
> differences due to which we need to write this new function but can't
> we encapsulate the specific parts of code in functions and then call
> from both places.  I am talking about code in different cases for
> change->action.
>
> 6. + * Note: We never stream and serialize a transaction at the same time (e
> /(e/(we
>
I will look into these comments and reply separately.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Dec 11, 2019 at 11:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have rebased the patch set on the latest head.
>
> 0001 looks like a clever approach, but are you sure it doesn't hurt
> performance when many small XLOG records are being inserted? I think
> XLogRecordAssemble() can get pretty hot in some workloads.
>

I don't think we have evaluated it yet, but we should do it.  The
point to note is that it is only for the case when wal_level is
'logical' (see IsSubTransactionAssignmentPending) in which case we
already log more WAL, so this might not impact much.  I guess that it
might be better to have that check in XLogRecordAssemble for the sake
of clarity.
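
For context, a sketch of the kind of cheap fast-path guards being
discussed; the function is introduced by 0001, so the exact body in the
patch may differ, and the 'assigned' flag is part of that patch rather
than existing code:

    /* Sketch: return quickly unless an assignment record could be needed at all. */
    bool
    IsSubTransactionAssignmentPending(void)
    {
        /* wal_level has to be logical */
        if (!XLogLogicalInfoActive())
            return false;

        /* we need to be in a transaction state */
        if (!IsTransactionState())
            return false;

        /* it has to be a subtransaction */
        if (!IsSubTransaction())
            return false;

        /* the subtransaction has to have an XID assigned */
        if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
            return false;

        /* and the assignment must not have been WAL-logged yet (patch-added flag) */
        return !CurrentTransactionState->assigned;
    }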

>
> Regarding 0005, it seems to me that this is no good:
>
> + errmsg("improper heap_getnext call")));
>
> I think we should be using elog() rather than ereport() here, because
> this should only happen if there's a bug in a logical decoding plugin.
> At first, I thought maybe this should just be an Assert(), but since
> there are third-party logical decoding plugins available, checking
> this even in non-assert builds seems like a good idea. However, I
> think making it translatable is overkill; users should never see this,
> only developers.
>

Makes sense.  I think we should change it.

>
> + if (prev_lsn != InvalidXLogRecPtr)
> + Assert(prev_lsn <= change->lsn);
>
> There is no reason to ever write an if statement that contains only an
> Assert, and it's bad style. Write Assert(prev_lsn == InvalidXLogRecPtr
> || prev_lsn <= change->lsn), or better yet, use XLogRecPtrIsInvalid.
>

Agreed.

> The purpose and mechanism of the is_schema_sent flag is not clear to
> me. The word "schema" here seems to be being used to mean "snapshot,"
> which is rather confusing.
>

I have explained this flag below along with invalidations as both are
slightly related.

> I'm also somewhat unclear on what's happening here with invalidations.
> Perhaps that's as much a defect in my understanding as it is
> reflective of any problem with the patch, but I also don't see any
> comments either in 0002 or later patches explaining the theory of
> operation. If I've missed some, please point me in the right
> direction. Hypothetically speaking, it seems to me that if you just
> did InvalidateSystemCaches() every time the snapshot changed, you
> wouldn't need anything else (unless we're concerned with
> non-transactional invalidation messages like smgr and relmapper
> invalidations; not quite sure how those are handled). And, on the
> other hand, if we don't do InvalidateSystemCaches() every time the
> snapshot changes, then I don't understand why this works now, even
> without streaming.
>

I think the way invalidations work for logical replication is that
normally, we always start a new transaction before decoding each
commit, which allows us to accept the invalidations (via
AtStart_Cache).  However, if there are catalog changes within the
transaction being decoded, we need to reflect those before trying to
decode the WAL of the operation that happened after that catalog change.
As we are not logging the WAL for each invalidation, we need to
execute all the invalidation messages for this transaction at each
catalog change. We are able to do that now because we decode the entire
WAL for a transaction only once we get the commit's WAL, which contains
all the invalidation messages.  So, we queue them up and execute them at
each catalog change, which we identify by the WAL record
XLOG_HEAP2_NEW_CID.

The second related concept is that before sending each change to the
downstream (via pgoutput), we check whether we need to send the
schema.  We decide this based on the local map entry
(RelationSyncEntry), which indicates whether the schema for the
relation has already been sent or not. Once the schema of the relation
is sent, the entry for that relation in the map will indicate it. At the
time of invalidation processing we also blow away this map, so it always
reflects the correct state.

Now, to decode an in-progress transaction, we need to ensure that we
have received the WAL for all the invalidations before decoding the
WAL of action that happened immediately after that catalog change.
This is the reason we started WAL logging individual Invalidations.
So, with this change we don't need to execute all the invalidations
for each catalog change, rather execute them as and when their WAL is
being decoded.

The current mechanism to send schema changes won't work for streaming
transactions because after sending the change, the subtransaction might
abort.  On subtransaction abort, the downstream will simply discard
the changes, in which case we lose the schema change previously sent.
There is no such problem currently because we process all the aborts
before sending any change.  So, the current idea of having a schema_sent
flag in each map entry (RelationSyncEntry) won't work for streaming
transactions.  To solve this problem, the patch initially kept a flag
'is_schema_sent' for each top-level transaction (in ReorderBufferTXN)
so that we can always send the schema for each (sub)transaction of
streaming transactions, but that won't work if we access multiple
relations in the same subtransaction.  To solve this problem, we are
thinking of keeping a list/array of top-level xids in each
RelationSyncEntry.  Basically, whenever we send the schema for any
transaction, we note that in the RelationSyncEntry, and at abort/commit
time we remove the xid from the list.  Now, whenever we check whether
to send the schema for any operation in a transaction, we check if
our xid is present in that list for the particular RelationSyncEntry and
take an action based on that (if the xid is present, then we won't send
the schema, otherwise we send it). I think during decoding we should not
have that many open transactions, so the search in the array should be
cheap enough, but we can consider some other data structure like a hash
table as well.  (A rough sketch of this idea follows.)
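
To make that idea a bit more concrete, here is a rough sketch of how
pgoutput could track it; the field and helper names (streamed_txns,
get/set_schema_sent_in_streamed_txn) are placeholders for illustration,
not settled API:

    typedef struct RelationSyncEntry
    {
        Oid         relid;          /* relation oid */
        bool        schema_sent;    /* schema sent for non-streamed work */
        List       *streamed_txns;  /* top-level xids for which schema was streamed */
        /* ... other existing fields ... */
    } RelationSyncEntry;

    /* Was the schema already sent within the stream of this top-level xid? */
    static bool
    get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
    {
        ListCell   *lc;

        foreach(lc, entry->streamed_txns)
        {
            if ((TransactionId) lfirst_int(lc) == xid)
                return true;
        }

        return false;
    }

    /* Remember that the schema was sent for this relation in this top-level xid. */
    static void
    set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
    {
        entry->streamed_txns = lappend_int(entry->streamed_txns, (int) xid);
    }

At stream commit/abort time the xid would simply be removed from
streamed_txns, which is the cleanup step described above.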

I will think some more and respond to your remaining comments/suggestions.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Dec 11, 2019 at 11:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have rebased the patch set on the latest head.
>
> 0001 looks like a clever approach, but are you sure it doesn't hurt
> performance when many small XLOG records are being inserted? I think
> XLogRecordAssemble() can get pretty hot in some workloads.
>
> With regard to 0002, logging a separate WAL record for each
> invalidation seems painful; I think most operations that generate
> invalidations generate a bunch of them all at once. Perhaps you could
> just queue up invalidations as they happen, and then force anything
> that's been queued up to be emitted into WAL just before you emit any
> WAL record that might need to be decoded.
>

I feel we can log the invalidations of the entire command in one go if
we log them at CommandEndInvalidationMessages.  We already have all the
invalidations of the current command in
transInvalInfo->CurrentCmdInvalidMsgs.  This can save us the effort of
maintaining a new separate list/queue for invalidations and, to a good
extent, it will ameliorate your concern about logging each invalidation
separately.
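
A rough sketch of how that could look in CommandEndInvalidationMessages
(existing code abridged; LogLogicalInvalidations stands for the
WAL-logging routine added by the patch, and the exact placement and
guard are assumptions):

    void
    CommandEndInvalidationMessages(void)
    {
        if (transInvalInfo == NULL)
            return;

        /* existing behavior: apply the current command's invalidations locally */
        ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
                                    LocalExecuteInvalidationMessage);

        /*
         * Sketch of the proposal: with wal_level = logical, emit one WAL
         * record carrying all invalidations accumulated for this command,
         * instead of a separate record per invalidation message.
         */
        if (XLogLogicalInfoActive())
            LogLogicalInvalidations();

        /* existing behavior: carry them over to the prior-commands list */
        AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
                                   &transInvalInfo->CurrentCmdInvalidMsgs);
    }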

>
> 0006 contains lots of XXX comments that look like real issues. I guess
> those need to be fixed. Also, why don't we do the thing that the
> commit message for 0006 says we could "theoretically" do? I don't
> understand why we need the k-way merge at all,
>

I think we can do what is written in the commit message, but then we
need to maintain two paths (one for streaming contexts and another for
non-streaming contexts), unless we want to entirely get rid of storing
subtransaction changes separately, which seems like a more fundamental
change.  Right now such duplication is there to some extent as well, but
I have already given a comment to minimize it.  Having said that, I
think we can go either way.  I think the original intention was to
avoid doing more stuff unless it is really required, as this is already
a big patch set, but maybe Tomas has a different idea about this.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Dec 12, 2019 at 9:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Dec 11, 2019 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Dec 9, 2019 at 1:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > I have review the patch set and here are few comments/questions
> > >
> > > 1.
> > > +static void
> > > +pg_decode_stream_change(LogicalDecodingContext *ctx,
> > > + ReorderBufferTXN *txn,
> > > + Relation relation,
> > > + ReorderBufferChange *change)
> > > +{
> > > + OutputPluginPrepareWrite(ctx, true);
> > > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> > > + OutputPluginWrite(ctx, true);
> > > +}
> > >
> > > Should we show the tuple in the streamed change like we do for the
> > > pg_decode_change?
> > >
> >
> > I think so.  The patch shows the message in
> > pg_decode_stream_message(), so why to prohibit showing tuple here?
> >
> > > 2. pg_logical_slot_get_changes_guts
> > > It recreate the decoding slot [ctx =
> > > CreateDecodingContext(InvalidXLogRecPtr] but doesn't set the streaming
> > > to false, should we pass a parameter to
> > > pg_logical_slot_get_changes_guts saying whether we want streamed results or not
> > >
> >
> > CreateDecodingContext internally calls StartupDecodingContext which
> > sets the value of streaming based on if the plugin has provided
> > callbacks for streaming functions. Isn't that sufficient?  Why do we
> > need additional parameters here?
>
> I don't think that if plugin provides streaming function then we
> should stream.  Like pgoutput plugin provides streaming function but
> we only stream if streaming is on in create subscription command.  So
> I feel that should be true with any plugin.
>

How about adding a new boolean parameter (streaming) in
pg_create_logical_replication_slot()?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Masahiko Sawada
Date:
On Mon, 2 Dec 2019 at 17:32, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > > I have rebased the patch on the latest head and also fix the issue of
> > > "concurrent abort handling of the (sub)transaction." and attached as
> > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > > the complete patch set.  I have added the version number so that we
> > > can track the changes.
> >
> > The patch has rotten a bit and does not apply anymore.  Could you
> > please send a rebased version?  I have moved it to next CF, waiting on
> > author.
>
> I have rebased the patch set on the latest head.

Thank you for working on this.

This might have already been discussed, but I have a question about the
changes to the logical replication worker. In the current logical
replication there is a problem that the response time is doubled when
using synchronous replication, because walsenders send changes only
after commit. It's worse especially when a transaction makes a lot of
changes. So I expected this feature to reduce the response time by
sending changes even while the transaction is in progress, but it
doesn't seem to. The logical replication worker writes changes to
temporary files and applies these changes when the worker receives the
commit record (STREAM COMMIT). Since the worker sends the LSN of the
commit record as the flush LSN to the publisher after applying all
changes, the publisher must wait until all changes are applied on the
subscriber.  Another problem would be that the worker doesn't receive
changes while applying changes of other transactions. These things
make me think it's better to have a new worker dedicated to applying
changes, like we have the wal receiver process and the startup process.
Maybe we can have 2 workers (receiver and applier) per subscription.
Any thoughts?

Regards,


--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Kyotaro Horiguchi
Date:
Hello.

At Fri, 13 Dec 2019 14:46:20 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in 
> On Wed, Dec 11, 2019 at 11:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > I have rebased the patch set on the latest head.
> >
> > 0001 looks like a clever approach, but are you sure it doesn't hurt
> > performance when many small XLOG records are being inserted? I think
> > XLogRecordAssemble() can get pretty hot in some workloads.
> >
> > With regard to 0002, logging a separate WAL record for each
> > invalidation seems painful; I think most operations that generate
> > invalidations generate a bunch of them all at once. Perhaps you could
> > just queue up invalidations as they happen, and then force anything
> > that's been queued up to be emitted into WAL just before you emit any
> > WAL record that might need to be decoded.
> >
> 
> I feel we can log the invalidations of the entire command at one go if
> we log at CommandEndInvalidationMessages.  We already have all the
> invalidations of current command in
> transInvalInfo->CurrentCmdInvalidMsgs.  This can save us the effort of
> maintaining a new separate list/queue for invalidations and to a good
> extent, it will ameliorate your concern of logging each invalidation
> separately.

I have a question on this. Does that mean that the current logical
decoder (or reorderbuffer) may emit incorrect results if a catalog
change was made during the transaction currently being decoded? If so,
this is not a feature but a bug fix.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Dec 20, 2019 at 11:47 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Mon, 2 Dec 2019 at 17:32, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:
> > >
> > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > > > I have rebased the patch on the latest head and also fix the issue of
> > > > "concurrent abort handling of the (sub)transaction." and attached as
> > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > > > the complete patch set.  I have added the version number so that we
> > > > can track the changes.
> > >
> > > The patch has rotten a bit and does not apply anymore.  Could you
> > > please send a rebased version?  I have moved it to next CF, waiting on
> > > author.
> >
> > I have rebased the patch set on the latest head.
>
> Thank you for working on this.
>
> This might have already been discussed but I have a question about the
> changes of logical replication worker. In the current logical
> replication there is a problem that the response time are doubled when
> using synchronous replication because wal senders send changes after
> commit. It's worse especially when a transaction makes a lot of
> changes. So I expected this feature to reduce the response time by
> sending changes even while the transaction is progressing but it
> doesn't seem to be. The logical replication worker writes changes to
> temporary files and applies these changes when the worker received
> commit record (STREAM COMMIT). Since the worker sends the LSN of
> commit record as flush LSN to the publisher after applying all
> changes, the publisher must wait for all changes are applied to the
> subscriber.
>

The main aim of this feature is to reduce apply lag.  If we send all
the changes together, their apply can be delayed by the network
transfer, whereas if most of the changes have already been sent, then
we save the effort of sending the entire data at commit time.
This in itself gives us decent benefits.  Sure, we can further improve
it by having separate workers (dedicated to applying the changes) as you
are suggesting, and in fact there is a patch for that as well (see the
performance results and bgworker patch at [1]), but if we try to shove
all the things in at one go, then it will be difficult to get this patch
committed (there are already enough things in it, and the patch is big
enough that getting it right takes a lot of energy).  So, the plan is
something like this: first we get the basic feature in, and then try to
improve it by having dedicated workers or things like that.  Does this
make sense to you?

[1] - https://www.postgresql.org/message-id/8eda5118-2dd0-79a1-4fe9-eec7e334de17%40postgrespro.ru

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Dec 20, 2019 at 2:00 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> Hello.
>
> At Fri, 13 Dec 2019 14:46:20 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> > On Wed, Dec 11, 2019 at 11:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > >
> > > On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > I have rebased the patch set on the latest head.
> > >
> > > 0001 looks like a clever approach, but are you sure it doesn't hurt
> > > performance when many small XLOG records are being inserted? I think
> > > XLogRecordAssemble() can get pretty hot in some workloads.
> > >
> > > With regard to 0002, logging a separate WAL record for each
> > > invalidation seems painful; I think most operations that generate
> > > invalidations generate a bunch of them all at once. Perhaps you could
> > > just queue up invalidations as they happen, and then force anything
> > > that's been queued up to be emitted into WAL just before you emit any
> > > WAL record that might need to be decoded.
> > >
> >
> > I feel we can log the invalidations of the entire command at one go if
> > we log at CommandEndInvalidationMessages.  We already have all the
> > invalidations of current command in
> > transInvalInfo->CurrentCmdInvalidMsgs.  This can save us the effort of
> > maintaining a new separate list/queue for invalidations and to a good
> > extent, it will ameliorate your concern of logging each invalidation
> > separately.
>
> I have a question on this. Does that mean that the current logical
> decoder (or reorderbuffer)
>

What does "current" refer to here?  Is it about HEAD or about the
patch?   Without the patch, we decode only at commit time, and by that
time we have all invalidations (logged with the commit WAL record), so
we just execute them at each catalog change (see the actions in
REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID).  The patch has to
WAL-log each invalidation separately because we can decode the
intermediate changes, so we can't wait till commit.  The above is just
an optimization for the patch.  AFAIK, there is no correctness issue
here, but let me know if you see any.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
vignesh C
Date:
On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > > I have rebased the patch on the latest head and also fix the issue of
> > > "concurrent abort handling of the (sub)transaction." and attached as
> > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > > the complete patch set.  I have added the version number so that we
> > > can track the changes.
> >
> > The patch has rotten a bit and does not apply anymore.  Could you
> > please send a rebased version?  I have moved it to next CF, waiting on
> > author.
>
> I have rebased the patch set on the latest head.
>

Few comments:
The assert variable should be within #ifdef USE_ASSERT_CHECKING in patch
v2-0008-Add-support-for-streaming-to-built-in-replication.patch:
+               int64           subidx;
+               bool            found = false;
+               char            path[MAXPGPATH];
+
+               subidx = -1;
+               subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+               /* FIXME optimize the search by bsearch on sorted data */
+               for (i = nsubxacts; i > 0; i--)
+               {
+                       if (subxacts[i - 1].xid == subxid)
+                       {
+                               subidx = (i - 1);
+                               found = true;
+                               break;
+                       }
+               }
+
+               /* We should not receive aborts for unknown subtransactions. */
+               Assert(found);

Add the typedefs like below in typedefs.lst common across the patches:
xl_xact_invalidations, ReorderBufferStreamIterTXNEntry,
ReorderBufferStreamIterTXNState, SubXactInfo

"are written" appears twice in commit message of
v2-0002-Issue-individual-invalidations-with-wal_level-log.patch:
The individual invalidations are written are written using a new
xlog record type XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.

v2-0002-Issue-individual-invalidations-with-wal_level-log.patch patch
does not compile by itself:
reorderbuffer.c:1822:9: error: ‘ReorderBufferTXN’ has no member named
‘is_schema_sent’
+
LocalExecuteInvalidationMessage(&change->data.inval.msg);
+                                       txn->is_schema_sent = false;
+                                       break;

Should we include printing of id here like in earlier cases in
v2-0002-Issue-individual-invalidations-with-wal_level-log.patch:
+                       appendStringInfo(buf, " relcache %u", msg->rc.relId);
+               /* not expected, but print something anyway */
+               else if (msg->id == SHAREDINVALSMGR_ID)
+                       appendStringInfoString(buf, " smgr");
+               /* not expected, but print something anyway */
+               else if (msg->id == SHAREDINVALRELMAP_ID)
+                       appendStringInfo(buf, " relmap db %u", msg->rm.dbId);

There is some code duplication in stream_change_cb_wrapper,
stream_truncate_cb_wrapper, stream_message_cb_wrapper,
stream_abort_cb_wrapper, stream_commit_cb_wrapper,
stream_start_cb_wrapper and stream_stop_cb_wrapper functions in
v2-0003-Extend-the-output-plugin-API-with-stream-methods.patch patch.
Should we have a separate function for common code?

Should we add a function header for AssertChangeLsnOrder in
v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch:
+static void
+AssertChangeLsnOrder(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{

This "Assert(txn->first_lsn != InvalidXLogRecPtr)"can be before the
loop, can be checked only once:
+       dlist_foreach(iter, &txn->changes)
+       {
+               ReorderBufferChange *cur_change;
+
+               cur_change = dlist_container(ReorderBufferChange,
node, iter.cur);
+
+               Assert(txn->first_lsn != InvalidXLogRecPtr);
+               Assert(cur_change->lsn != InvalidXLogRecPtr);
+               Assert(txn->first_lsn <= cur_change->lsn);
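
A sketch of the rearranged checks (same names as the quoted hunk):

    /* first_lsn does not change inside the loop, so check it once up front */
    Assert(txn->first_lsn != InvalidXLogRecPtr);

    dlist_foreach(iter, &txn->changes)
    {
        ReorderBufferChange *cur_change;

        cur_change = dlist_container(ReorderBufferChange, node, iter.cur);

        Assert(cur_change->lsn != InvalidXLogRecPtr);
        Assert(txn->first_lsn <= cur_change->lsn);
    }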

Should we add a function header for ReorderBufferDestroyTupleCidHash in
v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch:
+static void
+ReorderBufferDestroyTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+       if (txn->tuplecid_hash != NULL)
+       {
+               hash_destroy(txn->tuplecid_hash);
+               txn->tuplecid_hash = NULL;
+       }
+}
+

Should we add a function header for ReorderBufferStreamCommit in
v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch:
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+       /* we should only call this for previously streamed transactions */
+       Assert(rbtxn_is_streamed(txn));
+
+       ReorderBufferStreamTXN(rb, txn);
+
+       rb->stream_commit(rb, txn, txn->final_lsn);
+
+       ReorderBufferCleanupTXN(rb, txn);
+}
+

Should we add a function header for ReorderBufferCanStream in
v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch:
+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+       LogicalDecodingContext *ctx = rb->private_data;
+
+       return ctx->streaming;
+}

patch v2-0008-Add-support-for-streaming-to-built-in-replication.patch
does not apply:
Hunk #18 FAILED at 2035.
Hunk #19 succeeded at 2199 (offset -16 lines).
1 out of 19 hunks FAILED -- saving rejects to file
src/backend/replication/logical/worker.c.rej

Header inclusion may not be required in patch
v2-0008-Add-support-for-streaming-to-built-in-replication.patch:
+++ b/src/backend/replication/logical/launcher.c
@@ -14,6 +14,8 @@
  *
  *-------------------------------------------------------------------------
  */
+#include <sys/types.h>
+#include <unistd.h>

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Sun, Dec 22, 2019 at 5:04 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Few comments:
> assert variable should be within #ifdef USE_ASSERT_CHECKING in patch
> v2-0008-Add-support-for-streaming-to-built-in-replication.patch:
> +               int64           subidx;
> +               bool            found = false;
> +               char            path[MAXPGPATH];
> +
> +               subidx = -1;
> +               subxact_info_read(MyLogicalRepWorker->subid, xid);
> +
> +               /* FIXME optimize the search by bsearch on sorted data */
> +               for (i = nsubxacts; i > 0; i--)
> +               {
> +                       if (subxacts[i - 1].xid == subxid)
> +                       {
> +                               subidx = (i - 1);
> +                               found = true;
> +                               break;
> +                       }
> +               }
> +
> +               /* We should not receive aborts for unknown subtransactions. */
> +               Assert(found);
>

We can use PG_USED_FOR_ASSERTS_ONLY for that variable.
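
For reference, a minimal sketch of that, applied to the quoted hunk
(PG_USED_FOR_ASSERTS_ONLY marks the variable so non-assert builds don't
warn about it being set but never used):

    bool        found PG_USED_FOR_ASSERTS_ONLY = false;
    ...
    /* We should not receive aborts for unknown subtransactions. */
    Assert(found);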

>
> Should we include printing of id here like in earlier cases in
> v2-0002-Issue-individual-invalidations-with-wal_level-log.patch:
> +                       appendStringInfo(buf, " relcache %u", msg->rc.relId);
> +               /* not expected, but print something anyway */
> +               else if (msg->id == SHAREDINVALSMGR_ID)
> +                       appendStringInfoString(buf, " smgr");
> +               /* not expected, but print something anyway */
> +               else if (msg->id == SHAREDINVALRELMAP_ID)
> +                       appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
>

I am not sure this patch logs these invalidations, so I am not sure it
makes sense to add more ids in the cases you are referring to.
However, if we change it to logging all invalidations at command end, as
being discussed in this thread, then it might be better to do what you
are suggesting.

>
> Should we can add function header for AssertChangeLsnOrder in
> v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch:
> +static void
> +AssertChangeLsnOrder(ReorderBuffer *rb, ReorderBufferTXN *txn)
> +{
>
> This "Assert(txn->first_lsn != InvalidXLogRecPtr)"can be before the
> loop, can be checked only once:
> +       dlist_foreach(iter, &txn->changes)
> +       {
> +               ReorderBufferChange *cur_change;
> +
> +               cur_change = dlist_container(ReorderBufferChange,
> node, iter.cur);
> +
> +               Assert(txn->first_lsn != InvalidXLogRecPtr);
> +               Assert(cur_change->lsn != InvalidXLogRecPtr);
> +               Assert(txn->first_lsn <= cur_change->lsn);
>

This makes sense to me.  Another thing about this function: do we
really need the "ReorderBuffer *rb" parameter here?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Robert Haas
Date:
On Thu, Dec 12, 2019 at 3:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I don't think we have evaluated it yet, but we should do it.  The
> point to note is that it is only for the case when wal_level is
> 'logical' (see IsSubTransactionAssignmentPending) in which case we
> already log more WAL, so this might not impact much.  I guess that it
> might be better to have that check in XLogRecordAssemble for the sake
> of clarity.

I don't think that this is really a valid argument. Just because we
have some overhead now doesn't mean that adding more won't hurt. Even
testing the wal_level costs a little something.

> I think the way invalidations work for logical replication is that
> normally, we always start a new transaction before decoding each
> commit which allows us to accept the invalidations (via
> AtStart_Cache).  However, if there are catalog changes within the
> transaction being decoded, we need to reflect those before trying to
> decode the WAL of operation which happened after that catalog change.
> As we are not logging the WAL for each invalidation, we need to
> execute all the invalidation messages for this transaction at each
> catalog change. We are able to do that now as we decode the entire WAL
> for a transaction only once we get the commit's WAL which contains all
> the invalidation messages.  So, we queue them up and execute them for
> each catalog change which we identify by WAL record
> XLOG_HEAP2_NEW_CID.

Thanks for the explanation. That makes sense. But, it's still true,
AFAICS, that instead of doing this stuff with logging invalidations
you could just InvalidateSystemCaches() in the cases where you are
currently applying all of the transaction's invalidations. That
approach might be worse than changing the way invalidations are
logged, but the two approaches deserve to be compared. One approach
has more CPU overhead and the other has more WAL overhead, so it's a
little hard to compare them, but it seems worth mulling over.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Masahiko Sawada
Date:
On Fri, 20 Dec 2019 at 22:30, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Dec 20, 2019 at 11:47 AM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > On Mon, 2 Dec 2019 at 17:32, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:
> > > >
> > > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > > > > I have rebased the patch on the latest head and also fix the issue of
> > > > > "concurrent abort handling of the (sub)transaction." and attached as
> > > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > > > > the complete patch set.  I have added the version number so that we
> > > > > can track the changes.
> > > >
> > > > The patch has rotten a bit and does not apply anymore.  Could you
> > > > please send a rebased version?  I have moved it to next CF, waiting on
> > > > author.
> > >
> > > I have rebased the patch set on the latest head.
> >
> > Thank you for working on this.
> >
> > This might have already been discussed but I have a question about the
> > changes of logical replication worker. In the current logical
> > replication there is a problem that the response time are doubled when
> > using synchronous replication because wal senders send changes after
> > commit. It's worse especially when a transaction makes a lot of
> > changes. So I expected this feature to reduce the response time by
> > sending changes even while the transaction is progressing but it
> > doesn't seem to be. The logical replication worker writes changes to
> > temporary files and applies these changes when the worker received
> > commit record (STREAM COMMIT). Since the worker sends the LSN of
> > commit record as flush LSN to the publisher after applying all
> > changes, the publisher must wait for all changes are applied to the
> > subscriber.
> >
>
> The main aim of this feature is to reduce apply lag.  Because if we
> send all the changes together it can delay there apply because of
> network delay, whereas if most of the changes are already sent, then
> we will save the effort on sending the entire data at commit time.
> This in itself gives us decent benefits.  Sure, we can further improve
> it by having separate workers (dedicated to apply the changes) as you
> are suggesting and in fact, there is a patch for that as well(see the
> performance results and bgworker patch at [1]), but if try to shove in
> all the things in one go, then it will be difficult to get this patch
> committed (there are already enough things and the patch is quite big
> that to get it right takes a lot of energy).  So, the plan is
> something like that first we get the basic feature and then try to
> improve by having dedicated workers or things like that.  Does this
> make sense to you?
>

Thank you for the explanation. The plan makes sense. But I think in the
current design it's a problem that the logical replication worker
doesn't receive changes (and doesn't check interrupts) while applying
committed changes, even if we don't have a worker dedicated to
applying. I think the worker should continue to receive changes and
save them to temporary files even while applying changes. Otherwise
the buffer would easily fill up and replication would get stuck.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Dec 24, 2019 at 11:17 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Fri, 20 Dec 2019 at 22:30, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > The main aim of this feature is to reduce apply lag.  Because if we
> > send all the changes together it can delay there apply because of
> > network delay, whereas if most of the changes are already sent, then
> > we will save the effort on sending the entire data at commit time.
> > This in itself gives us decent benefits.  Sure, we can further improve
> > it by having separate workers (dedicated to apply the changes) as you
> > are suggesting and in fact, there is a patch for that as well(see the
> > performance results and bgworker patch at [1]), but if try to shove in
> > all the things in one go, then it will be difficult to get this patch
> > committed (there are already enough things and the patch is quite big
> > that to get it right takes a lot of energy).  So, the plan is
> > something like that first we get the basic feature and then try to
> > improve by having dedicated workers or things like that.  Does this
> > make sense to you?
> >
>
> Thank you for explanation. The plan makes sense. But I think in the
> current design it's a problem that logical replication worker doesn't
> receive changes (and doesn't check interrupts) during applying
> committed changes even if we don't have a worker dedicated for
> applying. I think the worker should continue to receive changes and
> save them to temporary files even during applying changes.
>

Won't it defeat the purpose of this feature, which is to reduce the apply
lag?  Basically, it can happen that while applying a commit, it
constantly gets changes of other transactions, which will delay the
apply of the current transaction.  Also, won't it create some further
work to identify the order of commits?  Say while applying commit-1,
it receives 5 other commits that are written to separate temporary
files.  How will we later identify which transaction's WAL we need to
apply first?  We might deduce it by LSNs, but I think that could be
tricky.  Another thing is that I think it could lead to some design
complications as well, because while applying a commit, you need some
sort of callback or something like that to receive and flush totally
unrelated changes.  It could lead to another kind of failure mode
wherein, while applying a commit, it tries to receive another
transaction's data and some failure happens while writing the data of
that transaction.  I am not sure if it is a good idea to try something
like that.

> Otherwise
> the buffer would be easily full and replication gets stuck.
>

Are you talking about the network buffer?  I think the best way, as
discussed, is to launch new workers for streamed transactions, but we
can do that as an additional feature. Anyway, as proposed, users can
choose the streaming mode per subscription, so there is an option to
turn this on selectively.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Masahiko Sawada
Date:
On Tue, 24 Dec 2019 at 17:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Dec 24, 2019 at 11:17 AM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > On Fri, 20 Dec 2019 at 22:30, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > The main aim of this feature is to reduce apply lag.  Because if we
> > > send all the changes together it can delay there apply because of
> > > network delay, whereas if most of the changes are already sent, then
> > > we will save the effort on sending the entire data at commit time.
> > > This in itself gives us decent benefits.  Sure, we can further improve
> > > it by having separate workers (dedicated to apply the changes) as you
> > > are suggesting and in fact, there is a patch for that as well(see the
> > > performance results and bgworker patch at [1]), but if try to shove in
> > > all the things in one go, then it will be difficult to get this patch
> > > committed (there are already enough things and the patch is quite big
> > > that to get it right takes a lot of energy).  So, the plan is
> > > something like that first we get the basic feature and then try to
> > > improve by having dedicated workers or things like that.  Does this
> > > make sense to you?
> > >
> >
> > Thank you for explanation. The plan makes sense. But I think in the
> > current design it's a problem that logical replication worker doesn't
> > receive changes (and doesn't check interrupts) during applying
> > committed changes even if we don't have a worker dedicated for
> > applying. I think the worker should continue to receive changes and
> > save them to temporary files even during applying changes.
> >
>
> Won't it beat the purpose of this feature which is to reduce the apply
> lag?  Basically, it can so happen that while applying commit, it
> constantly gets changes of other transactions which will delay the
> apply of the current transaction.

You're right. But it seems to me that it optimizes the apply lag only
for a transaction that made many changes. On the other hand, while
applying a transaction that made many changes, the apply of subsequent
changes is delayed.

>  Also, won't it create some further
> work to identify the order of commits?  Say while applying commit-1,
> it receives 5 other commits that are written to separate temporary
> files.  How will we later identify which transaction's WAL we need to
> apply first?  We might deduce by LSN's, but I think that could be
> tricky.  Another thing is that I think it could lead to some design
> complications as well because while applying commit, you need some
> sort of callback or something like that to receive and flush totally
> unrelated changes.  It could lead to another kind of failure mode
> wherein while applying commit if it tries to receive another
> transaction data and some failure happens while writing the data of
> that transaction.  I am not sure if it is a good idea to try something
> like that.

It's just an idea, but we might want to have new workers dedicated to
applying changes first, and then add the streaming option later. That
way we can reduce the flush lag depending on the use case. The commit
order can be determined by the receiver and shared with the applier
in shared memory. Once we have separated the workers, the streaming
option can be introduced without such a downside.

>
> > Otherwise
> > the buffer would be easily full and replication gets stuck.
> >
>
> Are you telling about network buffer?

Yes.

>   I think the best way as
> discussed is to launch new workers for streamed transactions, but we
> can do that as an additional feature. Anyway, as proposed, users can
> choose the streaming mode for subscriptions, so there is an option to
> turn this selectively.

Yes. But a user who wants to use this feature would want to replicate
many changes, and I guess the side effect is quite big. I think that at
least we need to make logical replication tolerate such a situation.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Tue, Dec 10, 2019 at 10:23:19AM +0530, Dilip Kumar wrote:
>On Tue, Dec 10, 2019 at 9:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> >
>> > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:
>> > >
>> > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
>> > > > I have rebased the patch on the latest head and also fix the issue of
>> > > > "concurrent abort handling of the (sub)transaction." and attached as
>> > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
>> > > > the complete patch set.  I have added the version number so that we
>> > > > can track the changes.
>> > >
>> > > The patch has rotten a bit and does not apply anymore.  Could you
>> > > please send a rebased version?  I have moved it to next CF, waiting on
>> > > author.
>> >
>> > I have rebased the patch set on the latest head.
>> >
>> > Apart from this, there is one issue reported by my colleague Vignesh.
>> > The issue is that if we use more than two relations in a transaction
>> > then there is an error on standby (no relation map entry for remote
>> > relation ID 16390).  After analyzing I have found that for the
>> > streaming transaction an "is_schema_sent" flag is kept in
>> > ReorderBufferTXN.  And, I think that is done so that we can send the
>> > schema for each transaction stream so that if any subtransaction gets
>> > aborted we don't lose the logical WAL for that schema.  But, this
>> > solution has induced a very basic issue that if a transaction operate
>> > on more than 1 relation then after sending the schema for the first
>> > relation it will mark the flag true and the schema for the subsequent
>> > relations will never be sent.
>> >
>>
>> How about keeping a list of top-level xids in each RelationSyncEntry?
>> Basically, whenever we send the schema for any transaction, we note
>> that in RelationSyncEntry and at abort time we can remove xid from the
>> list.  Now, whenever, we check whether to send schema for any
>> operation in a transaction, we will check if our xid is present in
>> that list for a particular RelationSyncEntry and take an action based
>> on that (if xid is present, then we won't send the schema, otherwise,
>> send it).
>The idea make sense to me.  I will try to write a patch for this and test.
>

Yeah, the "is_schema_sent" flag in ReorderBufferTXN does not work - it
needs to be in the RelationSyncEntry. In fact, I already have code for
that in my private repository - I thought the patches I sent here do
include this, but apparently I forgot to include this bit :-(

Attached is a rebased patch series, fixing this. It's essentially v2
with a couple of patches (0003, 0008, 0009 and 0012) replacing the
is_schema_sent with correct handling.


0003 - removes an is_schema_sent reference added prematurely (it's added
by a later patch, causing compile failure)

0008 - adds the is_schema_sent back (essentially reverting 0003)

0009 - removes is_schema_sent entirely

0012 - adds the correct handling of schema flags in pgoutput


I don't know what other changes you've made since v2, so this way it
should be possible to just take 0003, 0008, 0009 and 0012 and slip them
in with minimal hassle.

FWIW thanks to everyone (and Amit and Dilip in particular) working on
this patch series.  There's been a lot of great reviews and improvements
since I abandoned this thread for a while. I expect to be able to spend
more time working on this in January.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Sat, Dec 28, 2019 at 9:33 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Dec 10, 2019 at 10:23:19AM +0530, Dilip Kumar wrote:
> >On Tue, Dec 10, 2019 at 9:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>
> >> On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >> >
> >> > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:
> >> > >
> >> > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> >> > > > I have rebased the patch on the latest head and also fix the issue of
> >> > > > "concurrent abort handling of the (sub)transaction." and attached as
> >> > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> >> > > > the complete patch set.  I have added the version number so that we
> >> > > > can track the changes.
> >> > >
> >> > > The patch has rotten a bit and does not apply anymore.  Could you
> >> > > please send a rebased version?  I have moved it to next CF, waiting on
> >> > > author.
> >> >
> >> > I have rebased the patch set on the latest head.
> >> >
> >> > Apart from this, there is one issue reported by my colleague Vignesh.
> >> > The issue is that if we use more than two relations in a transaction
> >> > then there is an error on standby (no relation map entry for remote
> >> > relation ID 16390).  After analyzing I have found that for the
> >> > streaming transaction an "is_schema_sent" flag is kept in
> >> > ReorderBufferTXN.  And, I think that is done so that we can send the
> >> > schema for each transaction stream so that if any subtransaction gets
> >> > aborted we don't lose the logical WAL for that schema.  But, this
> >> > solution has induced a very basic issue that if a transaction operate
> >> > on more than 1 relation then after sending the schema for the first
> >> > relation it will mark the flag true and the schema for the subsequent
> >> > relations will never be sent.
> >> >
> >>
> >> How about keeping a list of top-level xids in each RelationSyncEntry?
> >> Basically, whenever we send the schema for any transaction, we note
> >> that in RelationSyncEntry and at abort time we can remove xid from the
> >> list.  Now, whenever, we check whether to send schema for any
> >> operation in a transaction, we will check if our xid is present in
> >> that list for a particular RelationSyncEntry and take an action based
> >> on that (if xid is present, then we won't send the schema, otherwise,
> >> send it).
> >The idea make sense to me.  I will try to write a patch for this and test.
> >
>
> Yeah, the "is_schema_sent" flag in ReorderBufferTXN does not work - it
> needs to be in the RelationSyncEntry. In fact, I already have code for
> that in my private repository - I thought the patches I sent here do
> include this, but apparently I forgot to include this bit :-(
>
> Attached is a rebased patch series, fixing this. It's essentially v2
> with a couple of patches (0003, 0008, 0009 and 0012) replacing the
> is_schema_sent with correct handling.
>
>
> 0003 - removes an is_schema_sent reference added prematurely (it's added
> by a later patch, causing compile failure)
>
> 0008 - adds the is_schema_sent back (essentially reverting 0003)
>
> 0009 - removes is_schema_sent entirely
>
> 0012 - adds the correct handling of schema flags in pgoutput
>
>
> I don't know what other changes you've made since v2, so this way it
> should be possible to just take 0003, 0008, 0009 and 0012 and slip them
> in with minimal hassle.
>
> FWIW thanks to everyone (and Amit and Dilip in particular) working on
> this patch series.  There's been a lot of great reviews and improvements
> since I abandoned this thread for a while. I expect to be able to spend
> more time working on this in January.
>
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+ MemoryContext oldctx;
+
+ oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+ entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+ MemoryContextSwitchTo(oldctx);
+}
I was looking into the schema tracking solution and I have one
question: shouldn't we remove the topxid from the list if the
(sub)transaction is aborted? Because once it is aborted we need to
resend the schema.  I think we can remove the xid from the list in the
cleanup_rel_sync_cache function?


I have observed some more issues

1. Currently, in ReorderBufferCommit, it is always expected that
whenever we get REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we must
have already got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT, and in
SPEC_CONFIRM we send the tuple we got in SPEC_INSERT.  But now those
two messages can be in different streams.  So we need to find a way to
handle this.  Maybe once we get SPEC_INSERT we can remember the
tuple, and then if we get the SPEC_CONFIRM in the next stream we can
send that tuple?

2. At commit time, in DecodeCommit, we check whether we need to skip
the changes of the transaction or not by calling
SnapBuildXactNeedsSkip.  But since we now support streaming, it's
possible that before we decode the commit WAL, we might have already
sent the changes to the output plugin even though we could have
skipped those changes.  So my question is: instead of checking at
commit time, can't we check before adding to the ReorderBuffer itself,
or we can truncate the changes if SnapBuildXactNeedsSkip is true
whenever the logical_decoding_work_mem limit is reached.  Am I missing
something here?


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Dec 12, 2019 at 9:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
Yesterday, Tomas posted the latest version of the patch set, which
contains the fix for the schema-send part.  Meanwhile, I was working on a
few review comments/bugfixes and refactoring.  I have tried to merge those
changes with the latest patch set except the refactoring related to the
"0006-Implement-streaming-mode-in-ReorderBuffer" patch, because Tomas
has also made some changes in the same patch.  I have created a
separate patch for the same so that we can review the changes and then
we can merge them into the main patch.

> On Wed, Dec 11, 2019 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Dec 9, 2019 at 1:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > I have review the patch set and here are few comments/questions
> > >
> > > 1.
> > > +static void
> > > +pg_decode_stream_change(LogicalDecodingContext *ctx,
> > > + ReorderBufferTXN *txn,
> > > + Relation relation,
> > > + ReorderBufferChange *change)
> > > +{
> > > + OutputPluginPrepareWrite(ctx, true);
> > > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> > > + OutputPluginWrite(ctx, true);
> > > +}
> > >
> > > Should we show the tuple in the streamed change like we do for the
> > > pg_decode_change?
> > >
> >
> > I think so.  The patch shows the message in
> > pg_decode_stream_message(), so why to prohibit showing tuple here?

Yeah, we can do that.  One option is that we can directly register the
"pg_decode_change" function as the stream_change_cb callback, and that
will show the tuple.  Another option is that we can write a function
similar to pg_decode_change and change the message to include the text
"STREAM", so that the user can distinguish between a tuple from a
committed transaction and one from an in-progress transaction.
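
A rough sketch of that second option (following test_decoding's style;
the tuple-printing part is only indicated by a comment, not written out):

static void
pg_decode_stream_change(LogicalDecodingContext *ctx,
                        ReorderBufferTXN *txn,
                        Relation relation,
                        ReorderBufferChange *change)
{
    OutputPluginPrepareWrite(ctx, true);

    /* prefix with STREAM so in-progress output is distinguishable */
    appendStringInfo(ctx->out, "STREAM change for TXN %u: ", txn->xid);

    /*
     * Here we would reuse the same tuple-serialization logic that
     * pg_decode_change uses for change->data.tp.newtuple/oldtuple
     * (omitted in this sketch).
     */

    OutputPluginWrite(ctx, true);
}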

While analyzing this solution I have encountered one more issue.  The
problem is that currently, at commit time in DecodeCommit, we check
whether we need to skip the changes of the transaction or not by
calling SnapBuildXactNeedsSkip.  But since we now support streaming,
it's possible that before the commit WAL arrives we might have already
sent the changes to the output plugin even though we could have skipped
those changes.  So my question is: instead of checking at commit time,
can't we check before adding to the ReorderBuffer itself, or we can
truncate the changes if SnapBuildXactNeedsSkip is true whenever the
logical_decoding_work_mem limit is reached.

> > Few comments on this patch series:
> >
> > 0001-Immediately-WAL-log-assignments:
> > ------------------------------------------------------------
> >
> > The commit message still refers to the old design for this patch.  I
> > think you need to modify the commit message as per the latest patch.
Done
> >
> > 0002-Issue-individual-invalidations-with-wal_level-log
> > ----------------------------------------------------------------------------
> > 1.
> > xact_desc_invalidations(StringInfo buf,
> > {
> > ..
> > + else if (msg->id == SHAREDINVALSNAPSHOT_ID)
> > + appendStringInfo(buf, " snapshot %u", msg->sn.relId);
> >
> > You have removed logging for the above cache but forgot to remove its
> > reference from one of the places.  Also, I think you need to add a
> > comment somewhere in inval.c to say why you are writing for WAL for
> > some types of invalidations and not for others?
Done
> >
> > 0003-Extend-the-output-plugin-API-with-stream-methods
> > --------------------------------------------------------------------------------
> > 1.
> > +     are required, while <function>stream_message_cb</function> and
> > +     <function>stream_message_cb</function> are optional.
> >
> > stream_message_cb is mentioned twice.  It seems the second one is for truncate.
Done
> >
> > 2.
> > size of the transaction size and network bandwidth, the transfer time
> > +    may significantly increase the apply lag.
> >
> > /size of the transaction size/size of the transaction
> >
> > no need to mention size twice.
Done
> >
> > 3.
> > +    Similarly to spill-to-disk behavior, streaming is triggered when the total
> > +    amount of changes decoded from the WAL (for all in-progress
> > transactions)
> > +    exceeds limit defined by <varname>logical_work_mem</varname> setting.
> >
> > The guc name used is wrong.  /Similarly to/Similar to/
Done
> >
> > 4.
> > stream_start_cb_wrapper()
> > {
> > ..
> > + /* state.report_location = apply_lsn; */
> > ..
> > + /* FIXME ctx->write_location = apply_lsn; */
> > ..
> > }
> >
> > See, if we can fix these and similar in the callback for the stop.  I
> > think we don't have final_lsn till we commit/abort.  Can we compute
> > before calling these API's?
Done
> >
> >
> > 0005-Gracefully-handle-concurrent-aborts-of-uncommitte
> > ----------------------------------------------------------------------------------
> > 1.
> > @@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> >   PG_CATCH();
> >   {
> >   /* TODO: Encapsulate cleanup
> > from the PG_TRY and PG_CATCH blocks */
> > +
> >   if (iterstate)
> >   ReorderBufferIterTXNFinish(rb, iterstate);
> >
> > Spurious line change.
> >
Done
> > 2. The commit message of this patch refers to Prepared transactions.
> > I think that needs to be changed.
> >
> > 0006-Implement-streaming-mode-in-ReorderBuffer
> > -------------------------------------------------------------------------
> > 1.
> > +
> > +/* iterator for streaming (only get data from memory) */
> > +static ReorderBufferStreamIterTXNState * ReorderBufferStreamIterTXNInit(
> > +
> > ReorderBuffer *rb,
> > +
> > ReorderBufferTXN
> > *txn);
> > +
> > +static ReorderBufferChange *ReorderBufferStreamIterTXNNext(
> > +    ReorderBuffer *rb,
> > +
> >    ReorderBufferStreamIterTXNState * state);
> > +
> > +static void ReorderBufferStreamIterTXNFinish(
> > +
> > ReorderBuffer *rb,
> > +
> > ReorderBufferStreamIterTXNState * state);
> >
> > Do we really need to introduce new APIs for iterating over changes
> > from streamed transactions?  Why can't we reuse the same API's as we
> > use for committed xacts?
Done
> >
> > 2.
> > +static void
> > +ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
> >
> > Please write some comments atop ReorderBufferStreamCommit.
Done
> >
> > 3.
> > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > {
> > ..
> > ..
> > + if (txn->snapshot_now
> > == NULL)
> > + {
> > + dlist_iter subxact_i;
> > +
> > + /* make sure this transaction is streamed for the first time */
> > +
> > Assert(!rbtxn_is_streamed(txn));
> > +
> > + /* at the beginning we should have invalid command ID */
> > + Assert(txn->command_id ==
> > InvalidCommandId);
> > +
> > + dlist_foreach(subxact_i, &txn->subtxns)
> > + {
> > + ReorderBufferTXN *subtxn;
> > +
> > +
> > subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
> > +
> > + if (subtxn->base_snapshot != NULL &&
> > +
> > (txn->base_snapshot == NULL ||
> > + txn->base_snapshot_lsn > subtxn->base_snapshot_lsn))
> > + {
> > +
> > txn->base_snapshot = subtxn->base_snapshot;
> >
> > The logic here seems to be correct, but I am not sure why it is not
> > considered to purge the base snapshot before assigning the subtxn's
> > snapshot and similarly, we have not purged snapshot for subtxn once we
> > are done with it.  I think we can use
> > ReorderBufferTransferSnapToParent to replace part of the logic here.
> > Do you see any reason for doing things differently here?
Done
> >
> > 4. In ReorderBufferStreamTXN, why do you need to use
> > ReorderBufferCopySnap to assign txn->base_snapshot to snapshot_now.

IMHO, here we don't assign the base snapshot directly because we
modify it by passing the command id, and that's the reason we make a
copy of it.
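
In other words (the ReorderBufferCopySnap call is the one from the
patch; the Assert is only there to spell out the point):

    /* take a private copy so we can set its command id without touching
     * txn->base_snapshot */
    snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
                                         txn, command_id);
    /* ReorderBufferCopySnap sets the copy's curcid to command_id */
    Assert(snapshot_now->curcid == command_id);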
> >
> > 5. I see a lot of code similarity in ReorderBufferStreamTXN and
> > existing ReorderBufferCommit. I understand that there are some subtle
> > differences due to which we need to write this new function but can't
> > we encapsulate the specific parts of code in functions and then call
> > from both places.  I am talking about code in different cases for
> > change->action.
Done
> >
> > 6. + * Note: We never stream and serialize a transaction at the same time (e
> > /(e/(we
Done

I have also found one bug in
"v3-0012-fixup-add-proper-schema-tracking.patch" due to which some of
the streaming test cases were failing.  I have created a separate patch
to fix it.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have observed some more issues
>
> 1. Currently, In ReorderBufferCommit, it is always expected that
> whenever we get REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we must
> have already got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT and in
> SPEC_CONFIRM we send the tuple we got in SPEC_INSERT.  But, now those
> two messages can be in different streams.  So we need to find a way to
> handle this.  Maybe once we get SPEC_INSERT then we can remember the
> tuple and then if we get the SPEC_CONFIRM in the next stream we can
> send that tuple?
>

Your suggestion makes sense to me.  So, we can try it.

> 2. During commit time in DecodeCommit we check whether we need to skip
> the changes of the transaction or not by calling
> SnapBuildXactNeedsSkip but since now we support streaming so it's
> possible that before we decode the commit WAL, we might have already
> sent the changes to the output plugin even though we could have
> skipped those changes.  So my question is instead of checking at the
> commit time can't we check before adding to ReorderBuffer itself
>

I think if we can do that then the same will be true for current code
irrespective of this patch.  I think it is possible that we can't take
that decision while decoding because we haven't assembled a consistent
snapshot yet.  I think we might be able to do that while we try to
stream the changes.  I think we need to take care of all the
conditions during streaming (when the logical_decoding_work_mem limit
is reached) as we do in DecodeCommit.  This needs a bit more study.
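
To illustrate, the streaming path would need roughly this kind of check
(the wrapper below is hypothetical; SnapBuildXactNeedsSkip and the
fast_forward flag are things DecodeCommit already consults, and it
additionally filters by database and replication origin):

static bool
skip_streaming_txn(LogicalDecodingContext *ctx, XLogRecPtr lsn)
{
    /* same kind of conditions DecodeCommit checks before replay */
    if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, lsn))
        return true;

    /* don't produce output while merely fast-forwarding a slot */
    if (ctx->fast_forward)
        return true;

    return false;
}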

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Dec 26, 2019 at 12:36 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Tue, 24 Dec 2019 at 17:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > >
> > > Thank you for explanation. The plan makes sense. But I think in the
> > > current design it's a problem that logical replication worker doesn't
> > > receive changes (and doesn't check interrupts) during applying
> > > committed changes even if we don't have a worker dedicated for
> > > applying. I think the worker should continue to receive changes and
> > > save them to temporary files even during applying changes.
> > >
> >
> > Won't it beat the purpose of this feature which is to reduce the apply
> > lag?  Basically, it can so happen that while applying commit, it
> > constantly gets changes of other transactions which will delay the
> > apply of the current transaction.
>
> You're right. But it seems to me that it only optimizes the apply lag of
> a transaction that made many changes. On the other hand, while such a
> transaction is being applied, the apply of subsequent changes is
> delayed.
>

Hmm, how would it be worse than the current situation where once
commit is encountered on the publisher, we won't start with other
transactions until the replay of the same is finished on subscriber?

>
> >   I think the best way as
> > discussed is to launch new workers for streamed transactions, but we
> > can do that as an additional feature. Anyway, as proposed, users can
> > choose the streaming mode for subscriptions, so there is an option to
> > turn this selectively.
>
> Yes. But user who wants to use this feature would want to replicate
> many changes but I guess the side effect is quite big. I think that at
> least we need to make the logical replication tolerate such situation.
>

What exactly you mean by "at least we need to make the logical
replication tolerate such situation."?  Do you have something specific
in mind?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, Dec 30, 2019 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have observed some more issues
> >
> > 1. Currently, In ReorderBufferCommit, it is always expected that
> > whenever we get REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we must
> > have already got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT and in
> > SPEC_CONFIRM we send the tuple we got in SPEC_INSERT.  But, now those
> > two messages can be in different streams.  So we need to find a way to
> > handle this.  Maybe once we get SPEC_INSERT then we can remember the
> > tuple and then if we get the SPEC_CONFIRM in the next stream we can
> > send that tuple?
> >
>
> Your suggestion makes sense to me.  So, we can try it.
Sure.
>
> > 2. During commit time in DecodeCommit we check whether we need to skip
> > the changes of the transaction or not by calling
> > SnapBuildXactNeedsSkip but since now we support streaming so it's
> > possible that before we decode the commit WAL, we might have already
> > sent the changes to the output plugin even though we could have
> > skipped those changes.  So my question is instead of checking at the
> > commit time can't we check before adding to ReorderBuffer itself
> >
>
> I think if we can do that then the same will be true for current code
> irrespective of this patch.  I think it is possible that we can't take
> that decision while decoding because we haven't assembled a consistent
> snapshot yet.  I think we might be able to do that while we try to
> stream the changes.  I think we need to take care of all the
> conditions during streaming (when the logical_decoding_work_mem limit
> is reached) as we do in DecodeCommit.  This needs a bit more study.
I agree.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Dec 24, 2019 at 10:58 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Dec 12, 2019 at 3:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > I think the way invalidations work for logical replication is that
> > normally, we always start a new transaction before decoding each
> > commit which allows us to accept the invalidations (via
> > AtStart_Cache).  However, if there are catalog changes within the
> > transaction being decoded, we need to reflect those before trying to
> > decode the WAL of operation which happened after that catalog change.
> > As we are not logging the WAL for each invalidation, we need to
> > execute all the invalidation messages for this transaction at each
> > catalog change. We are able to do that now as we decode the entire WAL
> > for a transaction only once we get the commit's WAL which contains all
> > the invalidation messages.  So, we queue them up and execute them for
> > each catalog change which we identify by WAL record
> > XLOG_HEAP2_NEW_CID.
>
> Thanks for the explanation. That makes sense. But, it's still true,
> AFAICS, that instead of doing this stuff with logging invalidations
> you could just InvalidateSystemCaches() in the cases where you are
> currently applying all of the transaction's invalidations. That
> approach might be worse than changing the way invalidations are
> logged, but the two approaches deserve to be compared. One approach
> has more CPU overhead and the other has more WAL overhead, so it's a
> little hard to compare them, but it seems worth mulling over.
>

I have given this some thought and it seems to me that this will
increase not only CPU usage but also network usage.  The increase in
CPU usage will be for all walsenders that decode a transaction that
has performed DDL.  The increase in network usage comes from the fact
that we need to send the schema of relations again even though they
didn't require invalidation.  That is because the blanket invalidation
blows away our local map that remembers which relation schemas have
already been sent.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Sat, Dec 28, 2019 at 9:33 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> >
> > Yeah, the "is_schema_sent" flag in ReorderBufferTXN does not work - it
> > needs to be in the RelationSyncEntry. In fact, I already have code for
> > that in my private repository - I thought the patches I sent here do
> > include this, but apparently I forgot to include this bit :-(
> >
> > Attached is a rebased patch series, fixing this. It's essentially v2
> > with a couple of patches (0003, 0008, 0009 and 0012) replacing the
> > is_schema_sent with correct handling.
> >
> >
> > 0003 - removes an is_schema_sent reference added prematurely (it's added
> > by a later patch, causing compile failure)
> >
> > 0008 - adds the is_schema_sent back (essentially reverting 0003)
> >
> > 0009 - removes is_schema_sent entirely
> >
> > 0012 - adds the correct handling of schema flags in pgoutput
> >

Thanks for splitting the changes.  They are quite clear.

> >
> > I don't know what other changes you've made since v2, so this way it
> > should be possible to just take 0003, 0008, 0009 and 0012 and slip them
> > in with minimal hassle.
> >
> > FWIW thanks to everyone (and Amit and Dilip in particular) working on
> > this patch series.  There's been a lot of great reviews and improvements
> > since I abandoned this thread for a while. I expect to be able to spend
> > more time working on this in January.
> >
> +static void
> +set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
> +{
> + MemoryContext oldctx;
> +
> + oldctx = MemoryContextSwitchTo(CacheMemoryContext);
> +
> + entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
> +
> + MemoryContextSwitchTo(oldctx);
> +}
> I was looking into the schema tracking solution and I have one
> question, Shouldn't we remove the topxid from the list if the
> (sub)transaction is aborted?  because once it is aborted we need to
> resend the schema.
>

I think you are right because, at abort, the subscriber would remove
the changes (for a subtransaction) including the schema changes sent
and then it won't be able to understand the subsequent changes sent by
the publisher.  Won't we need to remove xid from the list at commit
time as well, otherwise, the list will keep on growing.  One more
thing, we need to search the list of all the relations in the local
map to find xid being aborted/committed, right?  If so, won't it be
costly doing at each transaction abort/commit?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Sat, Jan 4, 2020 at 10:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > On Sat, Dec 28, 2019 at 9:33 PM Tomas Vondra
> > <tomas.vondra@2ndquadrant.com> wrote:
> > +static void
> > +set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
> > +{
> > + MemoryContext oldctx;
> > +
> > + oldctx = MemoryContextSwitchTo(CacheMemoryContext);
> > +
> > + entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
> > +
> > + MemoryContextSwitchTo(oldctx);
> > +}
> > I was looking into the schema tracking solution and I have one
> > question, Shouldn't we remove the topxid from the list if the
> > (sub)transaction is aborted?  because once it is aborted we need to
> > resend the schema.
> >
>
> I think you are right because, at abort, the subscriber would remove
> the changes (for a subtransaction) including the schema changes sent
> and then it won't be able to understand the subsequent changes sent by
> the publisher.  Won't we need to remove xid from the list at commit
> time as well, otherwise, the list will keep on growing.
Yes, we need to remove the xid from the list at the time of commit as well.

 One more
> thing, we need to search the list of all the relations in the local
> map to find xid being aborted/committed, right?  If so, won't it be
> costly doing at each transaction abort/commit?
Yeah, if multiple concurrent transactions operate on common
relations then the list can grow longer.  I am not sure how many
concurrent large transactions are possible; maybe the list won't be so
long that searching becomes very costly.  Otherwise, we can maintain a
sorted array of the xids and do a binary search, or we can maintain a
hash table?
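
For illustration, the simple linear-list version of the cleanup could
look roughly like this (RelationSyncCache, RelationSyncEntry and
streamed_txns are from pgoutput/the patch; the function name and the
idea of calling it at commit/abort time are just a sketch):

static void
cleanup_streamed_txn(TransactionId xid)
{
    HASH_SEQ_STATUS hash_seq;
    RelationSyncEntry *entry;

    /* walk pgoutput's relation sync cache */
    hash_seq_init(&hash_seq, RelationSyncCache);
    while ((entry = hash_seq_search(&hash_seq)) != NULL)
    {
        /* forget that this xid's schema was sent for this relation */
        entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
    }
}

With a hash or sorted array per entry, only the inner removal would
change.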

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Dec 30, 2019 at 3:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Dec 12, 2019 at 9:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> Yesterday, Tomas has posted the latest version of the patch set which
> contain the fix for schema send part.  Meanwhile, I was working on few
> review comments/bugfixes and refactoring.  I have tried to merge those
> changes with the latest patch set except the refactoring related to
> "0006-Implement-streaming-mode-in-ReorderBuffer" patch, because Tomas
> has also made some changes in the same patch.
>

I don't see any changes by Tomas in that particular patch, am I
missing something?

>  I have created a
> separate patch for the same so that we can review the changes and then
> we can merge them to the main patch.
>

It is better to merge it with the main patch for
"Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit
difficult to review.

> > > 0002-Issue-individual-invalidations-with-wal_level-log
> > > ----------------------------------------------------------------------------
> > > 1.
> > > xact_desc_invalidations(StringInfo buf,
> > > {
> > > ..
> > > + else if (msg->id == SHAREDINVALSNAPSHOT_ID)
> > > + appendStringInfo(buf, " snapshot %u", msg->sn.relId);
> > >
> > > You have removed logging for the above cache but forgot to remove its
> > > reference from one of the places.  Also, I think you need to add a
> > > comment somewhere in inval.c to say why you are writing for WAL for
> > > some types of invalidations and not for others?
> Done
>

I don't see any new comments as asked by me.  I think we should also
consider WAL logging at each command end instead of doing it piecemeal
as discussed in another email [1], which will involve fewer code changes
and may be better in performance.  You might want to evaluate the
performance of both approaches.

> > >
> > > 0003-Extend-the-output-plugin-API-with-stream-methods
> > > --------------------------------------------------------------------------------
> > >
> > > 4.
> > > stream_start_cb_wrapper()
> > > {
> > > ..
> > > + /* state.report_location = apply_lsn; */
> > > ..
> > > + /* FIXME ctx->write_location = apply_lsn; */
> > > ..
> > > }
> > >
> > > See, if we can fix these and similar in the callback for the stop.  I
> > > think we don't have final_lsn till we commit/abort.  Can we compute
> > > before calling these API's?
> Done
>

You have just used final_lsn, but I don't see where you have ensured
that it is set before the API stream_stop_cb_wrapper.  I think we need
something similar to what Vignesh has done in one of his bug-fix patches
[2].  See my comment below in this regard.

> > >
> > >
> > > 0005-Gracefully-handle-concurrent-aborts-of-uncommitte
> > > ----------------------------------------------------------------------------------
> > > 1.
> > > @@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> > >   PG_CATCH();
> > >   {
> > >   /* TODO: Encapsulate cleanup
> > > from the PG_TRY and PG_CATCH blocks */
> > > +
> > >   if (iterstate)
> > >   ReorderBufferIterTXNFinish(rb, iterstate);
> > >
> > > Spurious line change.
> > >
> Done

+ /*
+ * We don't expect direct calls to heap_getnext with valid
+ * CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(scan->rs_base.rs_rd) ||
+   RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+ elog(ERROR, "improper heap_getnext call");

Earlier, I thought we don't need to check if it is a regular table in
this check, but it is required because output plugins can try to do
that, and if they do so during decoding (with historic snapshots), the
same should not be allowed.

How about changing the error message to "unexpected heap_getnext call
during logical decoding" or something like that?

> > > 2. The commit message of this patch refers to Prepared transactions.
> > > I think that needs to be changed.
> > >
> > > 0006-Implement-streaming-mode-in-ReorderBuffer
> > > -------------------------------------------------------------------------

Few comments on v4-0018-Review-comment-fix-and-refactoring:
1.
+ if (streaming)
+ {
+ /*
+ * Set the last last of the stream as the final lsn before calling
+ * stream stop.
+ */
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+ }

Shouldn't we try to set final_lsn as is done by Vignesh's patch [2]?

2.
+ if (streaming)
+ {
+ /*
+ * Set the CheckXidAlive to the current (sub)xid for which this
+ * change belongs to so that we can detect the abort while we are
+ * decoding.
+ */
+ CheckXidAlive = change->txn->xid;
+
+ /* Increment the stream count. */
+ streamed++;
+ }

Is the variable 'streamed' used anywhere?

3.
+ /*
+ * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+ * any memory. We could also keep the hash table and update it with
+ * new ctid values, but this seems simpler and good enough for now.
+ */
+ ReorderBufferDestroyTupleCidHash(rb, txn);

Won't this be required only when we are streaming changes?

As per my understanding apart from the above comments, the known
pending work for this patchset is as follows:
a. The two open items agreed to you in the email [3].
b. Complete the handling of schema_sent as discussed above [4].
c. Few comments by Vignesh and the response on the same by me [5][6].
d. WAL overhead and performance testing for additional WAL logging by
this patchset.
e. Some way to see the tuple for streamed transactions by decoding API
as speculated by you [7].

Have I missed anything?

[1] -
https://www.postgresql.org/message-id/CAA4eK1LOa%2B2KqNX%3Dm%3D1qMBDW%2Bo50AuwjAOX6ZqL-rWGiH1F9MQ%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CALDaNm3MDxFnsZsnSqVhPBLS3%3DqzNH6%2BYzB%3DxYuX2vbtsUeFgw%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAFiTN-uT5YZE0egGhKdTteTjcGrPi8hb%3DFMPpr9_hEB7hozQ-Q%40mail.gmail.com
[4] - https://www.postgresql.org/message-id/CAA4eK1KjD9x0mS4JxzCbu3gu-r6K7XJRV%2BZcGb3BH6U3x2uxew%40mail.gmail.com
[5] - https://www.postgresql.org/message-id/CALDaNm0DNUojjt7CV-fa59_kFbQQ3rcMBtauvo44ttea7r9KaA%40mail.gmail.com
[6] - https://www.postgresql.org/message-id/CAA4eK1%2BZvupW00c--dqEg8f3dHZDOGmA9xOQLyQHjRSoDi6AkQ%40mail.gmail.com
[7] - https://www.postgresql.org/message-id/CAFiTN-t8PmKA1X4jEqKmkvs0ggWpy0APWpPuaJwpx2YpfAf97w%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Dec 30, 2019 at 3:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Dec 12, 2019 at 9:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > Yesterday, Tomas has posted the latest version of the patch set which
> > contain the fix for schema send part.  Meanwhile, I was working on few
> > review comments/bugfixes and refactoring.  I have tried to merge those
> > changes with the latest patch set except the refactoring related to
> > "0006-Implement-streaming-mode-in-ReorderBuffer" patch, because Tomas
> > has also made some changes in the same patch.
> >
>
> I don't see any changes by Tomas in that particular patch, am I
> missing something?
He has created some sub-patches from the main patch for handling the
schema-sent issue.  So if I make changes in that patch, all the other
patches will conflict.

>
> >  I have created a
> > separate patch for the same so that we can review the changes and then
> > we can merge them to the main patch.
> >
>
> It is better to merge it with the main patch for
> "Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit
> difficult to review.
Actually, we can merge 0008, 0009, 0012, 0018 to the main patch
(0007).  Basically, if we merge all of them then we don't need to deal
with the conflict.  I think Tomas has kept them separate so that we
can review the solution for the schema sent.  And, I kept 0018 as a
separate patch to avoid conflict and rebasing in 0008, 0009 and 0012.
In the next patch set, I will merge all of them to 0007.

>
> > > > 0002-Issue-individual-invalidations-with-wal_level-log
> > > > ----------------------------------------------------------------------------
> > > > 1.
> > > > xact_desc_invalidations(StringInfo buf,
> > > > {
> > > > ..
> > > > + else if (msg->id == SHAREDINVALSNAPSHOT_ID)
> > > > + appendStringInfo(buf, " snapshot %u", msg->sn.relId);
> > > >
> > > > You have removed logging for the above cache but forgot to remove its
> > > > reference from one of the places.  Also, I think you need to add a
> > > > comment somewhere in inval.c to say why you are writing for WAL for
> > > > some types of invalidations and not for others?
> > Done
> >
>
> I don't see any new comments as asked by me.
Oh, I just fixed one part of the comment and overlooked the rest.  Will fix.
>   I think we should also
> consider WAL logging at each command end instead of doing piecemeal as
> discussed in another email [1], which will have lesser code changes
> and maybe better in performance.  You might want to evaluate the
> performance of both approaches.
Ok
>
> > > >
> > > > 0003-Extend-the-output-plugin-API-with-stream-methods
> > > > --------------------------------------------------------------------------------
> > > >
> > > > 4.
> > > > stream_start_cb_wrapper()
> > > > {
> > > > ..
> > > > + /* state.report_location = apply_lsn; */
> > > > ..
> > > > + /* FIXME ctx->write_location = apply_lsn; */
> > > > ..
> > > > }
> > > >
> > > > See, if we can fix these and similar in the callback for the stop.  I
> > > > think we don't have final_lsn till we commit/abort.  Can we compute
> > > > before calling these API's?
> > Done
> >
>
> You have just used final_lsn, but I don't see where you have ensured
> that it is set before the API stream_stop_cb_wrapper.  I think we need
> something similar to what Vignesh has done in one of his bug-fix patch
> [2].  See my comment below in this regard.

You can refer to the below hunk in 0018.

+ /*
+ * Done with current changes, call stream_stop callback for streaming
+ * transaction, commit callback otherwise.
+ */
+ if (streaming)
+ {
+ /*
+ * Set the last last of the stream as the final lsn before calling
+ * stream stop.
+ */
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+ }

>
> > > >
> > > >
> > > > 0005-Gracefully-handle-concurrent-aborts-of-uncommitte
> > > > ----------------------------------------------------------------------------------
> > > > 1.
> > > > @@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> > > >   PG_CATCH();
> > > >   {
> > > >   /* TODO: Encapsulate cleanup
> > > > from the PG_TRY and PG_CATCH blocks */
> > > > +
> > > >   if (iterstate)
> > > >   ReorderBufferIterTXNFinish(rb, iterstate);
> > > >
> > > > Spurious line change.
> > > >
> > Done
>
> + /*
> + * We don't expect direct calls to heap_getnext with valid
> + * CheckXidAlive for regular tables. Track that below.
> + */
> + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> + !(IsCatalogRelation(scan->rs_base.rs_rd) ||
> +   RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
> + elog(ERROR, "improper heap_getnext call");
>
> Earlier, I thought we don't need to check if it is a regular table in
> this check, but it is required because output plugins can try to do
> that
I did not understand that, can you give some example?
> and if they do so during decoding (with historic snapshots), the
> same should be not allowed.
>
> How about changing the error message to "unexpected heap_getnext call
> during logical decoding" or something like that?
Ok
>
> > > > 2. The commit message of this patch refers to Prepared transactions.
> > > > I think that needs to be changed.
> > > >
> > > > 0006-Implement-streaming-mode-in-ReorderBuffer
> > > > -------------------------------------------------------------------------
>
> Few comments on v4-0018-Review-comment-fix-and-refactoring:
> 1.
> + if (streaming)
> + {
> + /*
> + * Set the last last of the stream as the final lsn before calling
> + * stream stop.
> + */
> + txn->final_lsn = prev_lsn;
> + rb->stream_stop(rb, txn);
> + }
>
> Shouldn't we try to final_lsn as is done by Vignesh's patch [2]?
Isn't it the same, there we are doing while serializing and here we
are doing while streaming?  Basically, the last LSN we streamed.  Am I
missing something?

>
> 2.
> + if (streaming)
> + {
> + /*
> + * Set the CheckXidAlive to the current (sub)xid for which this
> + * change belongs to so that we can detect the abort while we are
> + * decoding.
> + */
> + CheckXidAlive = change->txn->xid;
> +
> + /* Increment the stream count. */
> + streamed++;
> + }
>
> Is the variable 'streamed' used anywhere?
>
> 3.
> + /*
> + * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
> + * any memory. We could also keep the hash table and update it with
> + * new ctid values, but this seems simpler and good enough for now.
> + */
> + ReorderBufferDestroyTupleCidHash(rb, txn);
>
> Won't this be required only when we are streaming changes?

I will work on these review comments and reply to them separately along
with the patch.
>
> As per my understanding apart from the above comments, the known
> pending work for this patchset is as follows:
> a. The two open items agreed to you in the email [3].
> b. Complete the handling of schema_sent as discussed above [4].
> c. Few comments by Vignesh and the response on the same by me [5][6].
> d. WAL overhead and performance testing for additional WAL logging by
> this patchset.
> e. Some way to see the tuple for streamed transactions by decoding API
> as speculated by you [7].
>
> Have I missed anything?
I think this is the list I remember, apart from these few points by
Robert which are still under discussion [8].

>
> [1] -
https://www.postgresql.org/message-id/CAA4eK1LOa%2B2KqNX%3Dm%3D1qMBDW%2Bo50AuwjAOX6ZqL-rWGiH1F9MQ%40mail.gmail.com
> [2] -
https://www.postgresql.org/message-id/CALDaNm3MDxFnsZsnSqVhPBLS3%3DqzNH6%2BYzB%3DxYuX2vbtsUeFgw%40mail.gmail.com
> [3] - https://www.postgresql.org/message-id/CAFiTN-uT5YZE0egGhKdTteTjcGrPi8hb%3DFMPpr9_hEB7hozQ-Q%40mail.gmail.com
> [4] - https://www.postgresql.org/message-id/CAA4eK1KjD9x0mS4JxzCbu3gu-r6K7XJRV%2BZcGb3BH6U3x2uxew%40mail.gmail.com
> [5] - https://www.postgresql.org/message-id/CALDaNm0DNUojjt7CV-fa59_kFbQQ3rcMBtauvo44ttea7r9KaA%40mail.gmail.com
> [6] - https://www.postgresql.org/message-id/CAA4eK1%2BZvupW00c--dqEg8f3dHZDOGmA9xOQLyQHjRSoDi6AkQ%40mail.gmail.com
> [7] - https://www.postgresql.org/message-id/CAFiTN-t8PmKA1X4jEqKmkvs0ggWpy0APWpPuaJwpx2YpfAf97w%40mail.gmail.com

[8] https://www.postgresql.org/message-id/CA%2BTgmoYH6N_YDvKH9AaAJo5ZTHn142K%3DB75VO9yKvjjjHcoZhA%40mail.gmail.com


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Jan 6, 2020 at 9:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > It is better to merge it with the main patch for
> > "Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit
> > difficult to review.
> Actually, we can merge 0008, 0009, 0012, 0018 to the main patch
> (0007).  Basically, if we merge all of them then we don't need to deal
> with the conflict.  I think Tomas has kept them separate so that we
> can review the solution for the schema sent.  And, I kept 0018 as a
> separate patch to avoid conflict and rebasing in 0008, 0009 and 0012.
> In the next patch set, I will merge all of them to 0007.
>

Okay, I think we can merge those patches.

> >
> > + /*
> > + * We don't expect direct calls to heap_getnext with valid
> > + * CheckXidAlive for regular tables. Track that below.
> > + */
> > + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> > + !(IsCatalogRelation(scan->rs_base.rs_rd) ||
> > +   RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
> > + elog(ERROR, "improper heap_getnext call");
> >
> > Earlier, I thought we don't need to check if it is a regular table in
> > this check, but it is required because output plugins can try to do
> > that
> I did not understand that, can you give some example?
>

I think it can lead to the same problem of concurrent aborts as for
catalog scans.

> >
> > > > > 2. The commit message of this patch refers to Prepared transactions.
> > > > > I think that needs to be changed.
> > > > >
> > > > > 0006-Implement-streaming-mode-in-ReorderBuffer
> > > > > -------------------------------------------------------------------------
> >
> > Few comments on v4-0018-Review-comment-fix-and-refactoring:
> > 1.
> > + if (streaming)
> > + {
> > + /*
> > + * Set the last last of the stream as the final lsn before calling
> > + * stream stop.
> > + */
> > + txn->final_lsn = prev_lsn;
> > + rb->stream_stop(rb, txn);
> > + }
> >
> > Shouldn't we try to final_lsn as is done by Vignesh's patch [2]?
> Isn't it the same, there we are doing while serializing and here we
> are doing while streaming?  Basically, the last LSN we streamed.  Am I
> missing something?
>

No, I think you are right.

Few more comments:
--------------------------------
v4-0007-Implement-streaming-mode-in-ReorderBuffer
1.
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
+ /*
+ * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
+ * information about
subtransactions, which could arrive after streaming start.
+ */
+ if (!txn->is_schema_sent)
+ snapshot_now
= ReorderBufferCopySnap(rb, txn->base_snapshot,
+ txn,
command_id);
..
}

Why are we using base snapshot here instead of the snapshot we saved
the first time streaming has happened?  And as mentioned in comments,
won't we need to consider the snapshots for subtransactions that
arrived after the last time we have streamed the changes?

2.
+ /* remember the command ID and snapshot for the streaming run */
+ txn->command_id = command_id;
+ txn-
>snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+
  txn, command_id);

I don't see where the txn->snapshot_now is getting freed.  The
base_snapshot is freed in ReorderBufferCleanupTXN, but I don't see
this getting freed.
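
Just to spell out the kind of cleanup I mean, something along these
lines in ReorderBufferCleanupTXN would do (a sketch only; snapshot_now
is the field added by the patch, ReorderBufferFreeSnap is the existing
helper):

    /* free the snapshot kept around for the next streaming run, if any */
    if (txn->snapshot_now != NULL)
    {
        ReorderBufferFreeSnap(rb, txn->snapshot_now);
        txn->snapshot_now = NULL;
    }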

3.
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
+ /*
+ * If this is a subxact, we need to stream the top-level transaction
+ * instead.
+ */
+ if (txn->toptxn)
+ {
+
ReorderBufferStreamTXN(rb, txn->toptxn);
+ return;
+ }

Is it ever possible that we reach here for subtransaction, if not,
then it should be Assert rather than if condition?

4. In ReorderBufferStreamTXN(), don't we need to set some of the txn
fields like origin_id and origin_lsn as we do in ReorderBufferCommit(),
especially to cover the case when it gets called due to memory
overflow (aka via ReorderBufferCheckMemoryLimit)?

v4-0017-Extend-handling-of-concurrent-aborts-for-streamin
1.
@@ -3712,7 +3727,22 @@ ReorderBufferStreamTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
  if (using_subtxn)

RollbackAndReleaseCurrentSubTransaction();

- PG_RE_THROW();
+ /* re-throw only if it's not an abort */
+ if (errdata-
>sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+ {
+ MemoryContextSwitchTo(ecxt);
+ PG_RE_THROW();
+
}
+ else
+ {
+ /* remember the command ID and snapshot for the streaming run */
+ txn-
>command_id = command_id;
+ txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+
  txn, command_id);
+ rb->stream_stop(rb, txn);
+
+
FlushErrorState();
+ }

Can you update comments either in the above code block or some other
place to explain what is the concurrent abort problem and how we dealt
with it?  Also, please explain how the above error handling is
sufficient to address all the various scenarios (sub-transaction got
aborted when we have already sent some changes, or when we have not
sent any changes yet).
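
For example, a comment along these lines atop the PG_CATCH block would
help (the wording is only a suggestion, based on what the quoted code
does):

	/*
	 * If a concurrent abort of the (sub)transaction being streamed is
	 * detected (the catalog access routines checking CheckXidAlive raise
	 * ERRCODE_TRANSACTION_ROLLBACK), stop streaming cleanly: remember the
	 * command id and snapshot for the next streaming run, call stream_stop,
	 * and swallow the error.  Changes already streamed are discarded on the
	 * subscriber side once it sees the abort for this (sub)transaction.
	 */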

v4-0006-Gracefully-handle-concurrent-aborts-of-uncommitte
1.
+ /*
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out
+ */
+ if (TransactionIdIsValid(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));

Why can't we use TransactionIdDidAbort here?  If we can't use it, then
can you add a comment stating the reason for the same.

2.
/*
+ * An xid value pointing to a possibly ongoing or a prepared transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;

In comments, there is a mention of a prepared transaction.  Do we
allow prepared transactions to be decoded as part of this patch?

3.
+ /*
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out
+ */
+ if (TransactionIdIsValid
(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))

This comment just says what the code below is doing; can you explain the
rationale behind this check?  It would be better if it is clear from
the comments why we are doing this check after fetching the
tuple.  I think this can refer to the comment I suggested to add for
the changes in patch
v4-0017-Extend-handling-of-concurrent-aborts-for-streamin.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jan 6, 2020 at 9:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > It is better to merge it with the main patch for
> > > "Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit
> > > difficult to review.
> > Actually, we can merge 0008, 0009, 0012, 0018 to the main patch
> > (0007).  Basically, if we merge all of them then we don't need to deal
> > with the conflict.  I think Tomas has kept them separate so that we
> > can review the solution for the schema sent.  And, I kept 0018 as a
> > separate patch to avoid conflict and rebasing in 0008, 0009 and 0012.
> > In the next patch set, I will merge all of them to 0007.
> >
>
> Okay, I think we can merge those patches.
ok
>
> > >
> > > + /*
> > > + * We don't expect direct calls to heap_getnext with valid
> > > + * CheckXidAlive for regular tables. Track that below.
> > > + */
> > > + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> > > + !(IsCatalogRelation(scan->rs_base.rs_rd) ||
> > > +   RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
> > > + elog(ERROR, "improper heap_getnext call");
> > >
> > > Earlier, I thought we don't need to check if it is a regular table in
> > > this check, but it is required because output plugins can try to do
> > > that
> > I did not understand that, can you give some example?
> >
>
> I think it can lead to the same problem of concurrent aborts as for
> catalog scans.
Yeah, got it.
>
> > >
> > > > > > 2. The commit message of this patch refers to Prepared transactions.
> > > > > > I think that needs to be changed.
> > > > > >
> > > > > > 0006-Implement-streaming-mode-in-ReorderBuffer
> > > > > > -------------------------------------------------------------------------
> > >
> > > Few comments on v4-0018-Review-comment-fix-and-refactoring:
> > > 1.
> > > + if (streaming)
> > > + {
> > > + /*
> > > + * Set the last last of the stream as the final lsn before calling
> > > + * stream stop.
> > > + */
> > > + txn->final_lsn = prev_lsn;
> > > + rb->stream_stop(rb, txn);
> > > + }
> > >
> > > Shouldn't we try to final_lsn as is done by Vignesh's patch [2]?
> > Isn't it the same, there we are doing while serializing and here we
> > are doing while streaming?  Basically, the last LSN we streamed.  Am I
> > missing something?
> >
>
> No, I think you are right.
>
> Few more comments:
> --------------------------------
> v4-0007-Implement-streaming-mode-in-ReorderBuffer
> 1.
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> + /*
> + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
> + * information about
> subtransactions, which could arrive after streaming start.
> + */
> + if (!txn->is_schema_sent)
> + snapshot_now
> = ReorderBufferCopySnap(rb, txn->base_snapshot,
> + txn,
> command_id);
> ..
> }
>
> Why are we using base snapshot here instead of the snapshot we saved
> the first time streaming has happened?  And as mentioned in comments,
> won't we need to consider the snapshots for subtransactions that
> arrived after the last time we have streamed the changes?
>
> 2.
> + /* remember the command ID and snapshot for the streaming run */
> + txn->command_id = command_id;
> + txn-
> >snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
> +
>   txn, command_id);
>
> I don't see where the txn->snapshot_now is getting freed.  The
> base_snapshot is freed in ReorderBufferCleanupTXN, but I don't see
> this getting freed.
Ok, I will check that and fix.
>
> 3.
> +static void
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> + /*
> + * If this is a subxact, we need to stream the top-level transaction
> + * instead.
> + */
> + if (txn->toptxn)
> + {
> +
> ReorderBufferStreamTXN(rb, txn->toptxn);
> + return;
> + }
>
> Is it ever possible that we reach here for subtransaction, if not,
> then it should be Assert rather than if condition?

ReorderBufferCheckMemoryLimit can call it either for the
subtransaction or for the main transaction, depending on which
ReorderBufferTXN you are adding the current change to.

I will analyze your other comments and fix them in the next version.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Jan 6, 2020 at 3:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > 3.
> > +static void
> > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > {
> > ..
> > + /*
> > + * If this is a subxact, we need to stream the top-level transaction
> > + * instead.
> > + */
> > + if (txn->toptxn)
> > + {
> > +
> > ReorderBufferStreamTXN(rb, txn->toptxn);
> > + return;
> > + }
> >
> > Is it ever possible that we reach here for subtransaction, if not,
> > then it should be Assert rather than if condition?
>
> ReorderBufferCheckMemoryLimit, can call it either for the
> subtransaction or for the main transaction, depends upon in which
> ReorderBufferTXN you are adding the current change.
>

That function has code like below:

ReorderBufferCheckMemoryLimit()
{
..
if (ReorderBufferCanStream(rb))
{
/*
* Pick the largest toplevel transaction and evict it from memory by
* streaming the already decoded part.
*/
txn = ReorderBufferLargestTopTXN(rb);
/* we know there has to be one, because the size is not zero */
Assert(txn && !txn->toptxn);
..
ReorderBufferStreamTXN(rb, txn);
..
}

How can it then pass a ReorderBufferTXN for a subtransaction?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, Jan 6, 2020 at 4:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jan 6, 2020 at 3:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > 3.
> > > +static void
> > > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > {
> > > ..
> > > + /*
> > > + * If this is a subxact, we need to stream the top-level transaction
> > > + * instead.
> > > + */
> > > + if (txn->toptxn)
> > > + {
> > > +
> > > ReorderBufferStreamTXN(rb, txn->toptxn);
> > > + return;
> > > + }
> > >
> > > Is it ever possible that we reach here for subtransaction, if not,
> > > then it should be Assert rather than if condition?
> >
> > ReorderBufferCheckMemoryLimit, can call it either for the
> > subtransaction or for the main transaction, depends upon in which
> > ReorderBufferTXN you are adding the current change.
> >
>
> That function has code like below:
>
> ReorderBufferCheckMemoryLimit()
> {
> ..
> if (ReorderBufferCanStream(rb))
> {
> /*
> * Pick the largest toplevel transaction and evict it from memory by
> * streaming the already decoded part.
> */
> txn = ReorderBufferLargestTopTXN(rb);
> /* we know there has to be one, because the size is not zero */
> Assert(txn && !txn->toptxn);
> ..
> ReorderBufferStreamTXN(rb, txn);
> ..
> }
>
> How can it ReorderBufferTXN pass for subtransaction?
>
Hmm, I missed it. You are right, will fix it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, Jan 6, 2020 at 4:44 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jan 6, 2020 at 4:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jan 6, 2020 at 3:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > 3.
> > > > +static void
> > > > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > > {
> > > > ..
> > > > + /*
> > > > + * If this is a subxact, we need to stream the top-level transaction
> > > > + * instead.
> > > > + */
> > > > + if (txn->toptxn)
> > > > + {
> > > > +
> > > > ReorderBufferStreamTXN(rb, txn->toptxn);
> > > > + return;
> > > > + }
> > > >
> > > > Is it ever possible that we reach here for subtransaction, if not,
> > > > then it should be Assert rather than if condition?
> > >
> > > ReorderBufferCheckMemoryLimit, can call it either for the
> > > subtransaction or for the main transaction, depends upon in which
> > > ReorderBufferTXN you are adding the current change.
> > >
> >
> > That function has code like below:
> >
> > ReorderBufferCheckMemoryLimit()
> > {
> > ..
> > if (ReorderBufferCanStream(rb))
> > {
> > /*
> > * Pick the largest toplevel transaction and evict it from memory by
> > * streaming the already decoded part.
> > */
> > txn = ReorderBufferLargestTopTXN(rb);
> > /* we know there has to be one, because the size is not zero */
> > Assert(txn && !txn->toptxn);
> > ..
> > ReorderBufferStreamTXN(rb, txn);
> > ..
> > }
> >
> > How can it ReorderBufferTXN pass for subtransaction?
> >
> Hmm, I missed it. You are right, will fix it.
>
I have observed one more design issue.  The problem is that when we
get toasted chunks we remember the changes in memory (in a hash table)
but don't stream them until we get the actual change on the main table.
Now, the problem is that we might get the change for the toast table
and for the main table in different streams.  So basically, if in a
stream we have only got the toasted tuples, then even after
ReorderBufferStreamTXN the memory usage will not be reduced.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have observed one more design issue.
>

Good observation.

>  The problem is that when we
> get a toasted chunks we remember the changes in the memory(hash table)
> but don't stream until we get the actual change on the main table.
> Now, the problem is that we might get the change of the toasted table
> and the main table in different streams.  So basically, in a stream,
> if we have only got the toasted tuples then even after
> ReorderBufferStreamTXN the memory usage will not be reduced.
>

I think we can't split such changes across different streams (unless we
design an entirely new solution to send partial changes of toast data),
so we need to send them together.  We can keep a flag like
data_complete in ReorderBufferTXN and mark it complete only when we are
able to assemble the entire tuple.  Then, whenever we try to stream the
changes once we reach the memory threshold, we can check whether the
data_complete flag is true; if so, send the changes, otherwise pick the
next largest transaction.  I think we can retry this a few times, and
if we keep getting incomplete data for multiple transactions, we can
decide to spill the transaction, or maybe directly spill the first
largest transaction which has incomplete data.
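
To make that concrete, here is a minimal sketch of how the selection
could look; the data_complete flag, the helper name and the use of
total_size are illustrative assumptions, not existing reorderbuffer
code:

/*
 * Sketch only: "data_complete" and this helper are hypothetical names
 * used to illustrate the idea above.  Return the largest top-level
 * transaction whose toast data has been fully assembled, or NULL if
 * none qualifies (the caller would then fall back to spilling, e.g.
 * via ReorderBufferSerializeTXN).
 */
static ReorderBufferTXN *
ReorderBufferLargestCompleteTopTXN(ReorderBuffer *rb)
{
    ReorderBufferTXN *largest = NULL;
    dlist_iter  iter;

    dlist_foreach(iter, &rb->toplevel_by_lsn)
    {
        ReorderBufferTXN *txn;

        txn = dlist_container(ReorderBufferTXN, node, iter.cur);

        /* skip transactions whose toast tuple is not yet assembled */
        if (!txn->data_complete)
            continue;

        if (largest == NULL || txn->total_size > largest->total_size)
            largest = txn;
    }

    return largest;
}

ReorderBufferCheckMemoryLimit() could call something like this instead
of ReorderBufferLargestTopTXN() and stream the result, falling back to
serializing the largest (sub)transaction when it returns NULL.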

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have observed one more design issue.
> >
>
> Good observation.
>
> >  The problem is that when we
> > get a toasted chunks we remember the changes in the memory(hash table)
> > but don't stream until we get the actual change on the main table.
> > Now, the problem is that we might get the change of the toasted table
> > and the main table in different streams.  So basically, in a stream,
> > if we have only got the toasted tuples then even after
> > ReorderBufferStreamTXN the memory usage will not be reduced.
> >
>
> I think we can't split such changes in a different stream (unless we
> design an entirely new solution to send partial changes of toast
> data), so we need to send them together. We can keep a flag like
> data_complete in ReorderBufferTxn and mark it complete only when we
> are able to assemble the entire tuple.  Now, whenever, we try to
> stream the changes once we reach the memory threshold, we can check
> whether the data_complete flag is true, if so, then only send the
> changes, otherwise, we can pick the next largest transaction.  I think
> we can retry it for few times and if we get the incomplete data for
> multiple transactions, then we can decide to spill the transaction or
> maybe we can directly spill the first largest transaction which has
> incomplete data.
>
Yeah, we might do something along this line.  Basically, we need to mark
the top-level transaction as data-incomplete if any of its
subtransactions has incomplete data (it will always be the latest
sub-transaction of the top transaction).  Also, for streaming we are
checking the largest top-level transaction, whereas for spilling we just
need the largest (sub)transaction.  So while picking the largest
top-level transaction for streaming, we also need to decide what to do
if a few transactions have incomplete data: do we spill all the
sub-transactions under this top-level transaction, or do we again find
the largest (sub)transaction for spilling?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > I have observed one more design issue.
> > >
> >
> > Good observation.
> >
> > >  The problem is that when we
> > > get a toasted chunks we remember the changes in the memory(hash table)
> > > but don't stream until we get the actual change on the main table.
> > > Now, the problem is that we might get the change of the toasted table
> > > and the main table in different streams.  So basically, in a stream,
> > > if we have only got the toasted tuples then even after
> > > ReorderBufferStreamTXN the memory usage will not be reduced.
> > >
> >
> > I think we can't split such changes in a different stream (unless we
> > design an entirely new solution to send partial changes of toast
> > data), so we need to send them together. We can keep a flag like
> > data_complete in ReorderBufferTxn and mark it complete only when we
> > are able to assemble the entire tuple.  Now, whenever, we try to
> > stream the changes once we reach the memory threshold, we can check
> > whether the data_complete flag is true, if so, then only send the
> > changes, otherwise, we can pick the next largest transaction.  I think
> > we can retry it for few times and if we get the incomplete data for
> > multiple transactions, then we can decide to spill the transaction or
> > maybe we can directly spill the first largest transaction which has
> > incomplete data.
> >
> Yeah, we might do something on this line.  Basically, we need to mark
> the top-transaction as data-incomplete if any of its subtransaction is
> having data-incomplete (it will always be the latest sub-transaction
> of the top transaction).  Also, for streaming, we are checking the
> largest top transaction whereas for spilling we just need the larget
> (sub) transaction.   So we also need to decide while picking the
> largest top transaction for streaming, if we get a few transactions
> with in-complete data then how we will go for the spill.  Do we spill
> all the sub-transactions under this top transaction or we will again
> find the larget (sub) transaction for spilling.
>

I think it is better to do the latter, as that will lead to spilling
only the required changes (the minimum needed to get the memory below
the threshold).

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Jan 9, 2020 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > I have observed one more design issue.
> > > >
> > >
> > > Good observation.
> > >
> > > >  The problem is that when we
> > > > get a toasted chunks we remember the changes in the memory(hash table)
> > > > but don't stream until we get the actual change on the main table.
> > > > Now, the problem is that we might get the change of the toasted table
> > > > and the main table in different streams.  So basically, in a stream,
> > > > if we have only got the toasted tuples then even after
> > > > ReorderBufferStreamTXN the memory usage will not be reduced.
> > > >
> > >
> > > I think we can't split such changes in a different stream (unless we
> > > design an entirely new solution to send partial changes of toast
> > > data), so we need to send them together. We can keep a flag like
> > > data_complete in ReorderBufferTxn and mark it complete only when we
> > > are able to assemble the entire tuple.  Now, whenever, we try to
> > > stream the changes once we reach the memory threshold, we can check
> > > whether the data_complete flag is true, if so, then only send the
> > > changes, otherwise, we can pick the next largest transaction.  I think
> > > we can retry it for few times and if we get the incomplete data for
> > > multiple transactions, then we can decide to spill the transaction or
> > > maybe we can directly spill the first largest transaction which has
> > > incomplete data.
> > >
> > Yeah, we might do something on this line.  Basically, we need to mark
> > the top-transaction as data-incomplete if any of its subtransaction is
> > having data-incomplete (it will always be the latest sub-transaction
> > of the top transaction).  Also, for streaming, we are checking the
> > largest top transaction whereas for spilling we just need the larget
> > (sub) transaction.   So we also need to decide while picking the
> > largest top transaction for streaming, if we get a few transactions
> > with in-complete data then how we will go for the spill.  Do we spill
> > all the sub-transactions under this top transaction or we will again
> > find the larget (sub) transaction for spilling.
> >
>
> I think it is better to do later as that will lead to the spill of
> only required (minimum changes to get the memory below threshold)
> changes.
Makes sense to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jan 6, 2020 at 9:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > It is better to merge it with the main patch for
> > > "Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit
> > > difficult to review.
> > Actually, we can merge 0008, 0009, 0012, 0018 to the main patch
> > (0007).  Basically, if we merge all of them then we don't need to deal
> > with the conflict.  I think Tomas has kept them separate so that we
> > can review the solution for the schema sent.  And, I kept 0018 as a
> > separate patch to avoid conflict and rebasing in 0008, 0009 and 0012.
> > In the next patch set, I will merge all of them to 0007.
> >
>
> Okay, I think we can merge those patches.
Done.
0008, 0009, 0017 and 0018 are merged into 0007; 0012 is merged into 0010.

>
> Few more comments:
> --------------------------------
> v4-0007-Implement-streaming-mode-in-ReorderBuffer
> 1.
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> + /*
> + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
> + * information about
> subtransactions, which could arrive after streaming start.
> + */
> + if (!txn->is_schema_sent)
> + snapshot_now
> = ReorderBufferCopySnap(rb, txn->base_snapshot,
> + txn,
> command_id);
> ..
> }
>
> Why are we using base snapshot here instead of the snapshot we saved
> the first time streaming has happened?  And as mentioned in comments,
> won't we need to consider the snapshots for subtransactions that
> arrived after the last time we have streamed the changes?
Fixed
>
> 2.
> + /* remember the command ID and snapshot for the streaming run */
> + txn->command_id = command_id;
> + txn-
> >snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
> +
>   txn, command_id);
>
> I don't see where the txn->snapshot_now is getting freed.  The
> base_snapshot is freed in ReorderBufferCleanupTXN, but I don't see
> this getting freed.
I have freed this in ReorderBufferCleanupTXN.
>
> 3.
> +static void
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> + /*
> + * If this is a subxact, we need to stream the top-level transaction
> + * instead.
> + */
> + if (txn->toptxn)
> + {
> +
> ReorderBufferStreamTXN(rb, txn->toptxn);
> + return;
> + }
>
> Is it ever possible that we reach here for subtransaction, if not,
> then it should be Assert rather than if condition?
Fixed
>
> 4. In ReorderBufferStreamTXN(), don't we need to set some of the txn
> fields like origin_id, origin_lsn as we do in ReorderBufferCommit()
> especially to cover the case when it gets called due to memory
> overflow (aka via ReorderBufferCheckMemoryLimit).
We only get origin_lsn at commit time, so I am not sure how we can do
that.  I have also noticed that currently we are not using origin_lsn
on the subscriber side.  I think this needs more investigation: if we
want this, do we need to log it earlier?

>
> v4-0017-Extend-handling-of-concurrent-aborts-for-streamin
> 1.
> @@ -3712,7 +3727,22 @@ ReorderBufferStreamTXN(ReorderBuffer *rb,
> ReorderBufferTXN *txn)
>   if (using_subtxn)
>
> RollbackAndReleaseCurrentSubTransaction();
>
> - PG_RE_THROW();
> + /* re-throw only if it's not an abort */
> + if (errdata-
> >sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
> + {
> + MemoryContextSwitchTo(ecxt);
> + PG_RE_THROW();
> +
> }
> + else
> + {
> + /* remember the command ID and snapshot for the streaming run */
> + txn-
> >command_id = command_id;
> + txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
> +
>   txn, command_id);
> + rb->stream_stop(rb, txn);
> +
> +
> FlushErrorState();
> + }
>
> Can you update comments either in the above code block or some other
> place to explain what is the concurrent abort problem and how we dealt
> with it?  Also, please explain how the above error handling is
> sufficient to address all the various scenarios (sub-transaction got
> aborted when we have already sent some changes, or when we have not
> sent any changes yet).

Done
>
> v4-0006-Gracefully-handle-concurrent-aborts-of-uncommitte
> 1.
> + /*
> + * If CheckXidAlive is valid, then we check if it aborted. If it did, we
> + * error out
> + */
> + if (TransactionIdIsValid(CheckXidAlive) &&
> + !TransactionIdIsInProgress(CheckXidAlive) &&
> + !TransactionIdDidCommit(CheckXidAlive))
> + ereport(ERROR,
> + (errcode(ERRCODE_TRANSACTION_ROLLBACK),
> + errmsg("transaction aborted during system catalog scan")));
>
> Why here we can't use TransactionIdDidAbort?  If we can't use it, then
> can you add comments stating the reason of the same.
Done
>
> 2.
> /*
> + * An xid value pointing to a possibly ongoing or a prepared transaction.
> + * Currently used in logical decoding.  It's possible that such transactions
> + * can get aborted while the decoding is ongoing.
> + */
> +TransactionId CheckXidAlive = InvalidTransactionId;
>
> In comments, there is a mention of a prepared transaction.  Do we
> allow prepared transactions to be decoded as part of this patch?
Fixed
>
> 3.
> + /*
> + * If CheckXidAlive is valid, then we check if it aborted. If it did, we
> + * error out
> + */
> + if (TransactionIdIsValid
> (CheckXidAlive) &&
> + !TransactionIdIsInProgress(CheckXidAlive) &&
> + !TransactionIdDidCommit(CheckXidAlive))
>
> This comment just says what code below is doing, can you explain the
> rationale behind this check.  It would be better if it is clear by
> reading comments, why we are doing this check after fetching the
> tuple.  I think this can refer to the comment I suggested to add for
> changes in patch
> v4-0017-Extend-handling-of-concurrent-aborts-for-streamin.
Done


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Dec 30, 2019 at 3:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Dec 12, 2019 at 9:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> > > > 0002-Issue-individual-invalidations-with-wal_level-log
> > > > ----------------------------------------------------------------------------
> > > > 1.
> > > > xact_desc_invalidations(StringInfo buf,
> > > > {
> > > > ..
> > > > + else if (msg->id == SHAREDINVALSNAPSHOT_ID)
> > > > + appendStringInfo(buf, " snapshot %u", msg->sn.relId);
> > > >
> > > > You have removed logging for the above cache but forgot to remove its
> > > > reference from one of the places.  Also, I think you need to add a
> > > > comment somewhere in inval.c to say why you are writing for WAL for
> > > > some types of invalidations and not for others?
> > Done
> >
>
> I don't see any new comments as asked by me.
Done

> I think we should also
> consider WAL logging at each command end instead of doing piecemeal as
> discussed in another email [1], which will have lesser code changes
> and maybe better in performance.  You might want to evaluate the
> performance of both approaches.

Still pending, will work on this.
>
> > > > 0005-Gracefully-handle-concurrent-aborts-of-uncommitte
> > > > ----------------------------------------------------------------------------------
> > > > 1.
> > > > @@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> > > >   PG_CATCH();
> > > >   {
> > > >   /* TODO: Encapsulate cleanup
> > > > from the PG_TRY and PG_CATCH blocks */
> > > > +
> > > >   if (iterstate)
> > > >   ReorderBufferIterTXNFinish(rb, iterstate);
> > > >
> > > > Spurious line change.
> > > >
> > Done
>
> + /*
> + * We don't expect direct calls to heap_getnext with valid
> + * CheckXidAlive for regular tables. Track that below.
> + */
> + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> + !(IsCatalogRelation(scan->rs_base.rs_rd) ||
> +   RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
> + elog(ERROR, "improper heap_getnext call");
>
> Earlier, I thought we don't need to check if it is a regular table in
> this check, but it is required because output plugins can try to do
> that and if they do so during decoding (with historic snapshots), the
> same should be not allowed.
>
> How about changing the error message to "unexpected heap_getnext call
> during logical decoding" or something like that?
Done
>
> > > > 2. The commit message of this patch refers to Prepared transactions.
> > > > I think that needs to be changed.
> > > >
> > > > 0006-Implement-streaming-mode-in-ReorderBuffer
> > > > -------------------------------------------------------------------------
>
> Few comments on v4-0018-Review-comment-fix-and-refactoring:
> 1.
> + if (streaming)
> + {
> + /*
> + * Set the last last of the stream as the final lsn before calling
> + * stream stop.
> + */
> + txn->final_lsn = prev_lsn;
> + rb->stream_stop(rb, txn);
> + }
>
> Shouldn't we try to final_lsn as is done by Vignesh's patch [2]?
We already agreed on the current implementation.
>
> 2.
> + if (streaming)
> + {
> + /*
> + * Set the CheckXidAlive to the current (sub)xid for which this
> + * change belongs to so that we can detect the abort while we are
> + * decoding.
> + */
> + CheckXidAlive = change->txn->xid;
> +
> + /* Increment the stream count. */
> + streamed++;
> + }
>
> Is the variable 'streamed' used anywhere?
Removed
>
> 3.
> + /*
> + * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
> + * any memory. We could also keep the hash table and update it with
> + * new ctid values, but this seems simpler and good enough for now.
> + */
> + ReorderBufferDestroyTupleCidHash(rb, txn);
>
> Won't this be required only when we are streaming changes?
Fixed
>
> As per my understanding apart from the above comments, the known
> pending work for this patchset is as follows:
> a. The two open items agreed to you in the email [3].
> b. Complete the handling of schema_sent as discussed above [4].
> c. Few comments by Vignesh and the response on the same by me [5][6].
> d. WAL overhead and performance testing for additional WAL logging by
> this patchset.
> e. Some way to see the tuple for streamed transactions by decoding API
> as speculated by you [7].
>
> Have I missed anything?
I have worked on most of these items; I will reply to them separately.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, Dec 30, 2019 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have observed some more issues
> >
> > 1. Currently, In ReorderBufferCommit, it is always expected that
> > whenever we get REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we must
> > have already got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT and in
> > SPEC_CONFIRM we send the tuple we got in SPECT_INSERT.  But, now those
> > two messages can be in different streams.  So we need to find a way to
> > handle this.  Maybe once we get SPEC_INSERT then we can remember the
> > tuple and then if we get the SPECT_CONFIRM in the next stream we can
> > send that tuple?
> >
>
> Your suggestion makes sense to me.  So, we can try it.

I have implemented this and attached it as a separate patch in my
latest patch set [1].
>
> > 2. During commit time in DecodeCommit we check whether we need to skip
> > the changes of the transaction or not by calling
> > SnapBuildXactNeedsSkip but since now we support streaming so it's
> > possible that before we decode the commit WAL, we might have already
> > sent the changes to the output plugin even though we could have
> > skipped those changes.  So my question is instead of checking at the
> > commit time can't we check before adding to ReorderBuffer itself
> >
>
> I think if we can do that then the same will be true for current code
> irrespective of this patch.  I think it is possible that we can't take
> that decision while decoding because we haven't assembled a consistent
> snapshot yet.  I think we might be able to do that while we try to
> stream the changes.  I think we need to take care of all the
> conditions during streaming (when the logical_decoding_workmem limit
> is reached) as we do in DecodeCommit.  This needs a bit more study.

I have analyzed this further and I think we cannot decide all the
conditions even while streaming.  IMHO, once we reach
SNAPBUILD_FULL_SNAPSHOT we can add the changes to the reorder buffer,
so that they can be sent if the transaction commits after we reach
SNAPBUILD_CONSISTENT.  However, if the commit happens before we reach
SNAPBUILD_CONSISTENT, then we need to ignore this transaction.  So even
with SNAPBUILD_FULL_SNAPSHOT we might stream changes that later get
dropped, and we cannot decide that while streaming.

[1] https://www.postgresql.org/message-id/CAFiTN-snMb%3D53oqkM8av8Lqfxojjm4OBwCNxmFssgLCceY_zgg%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Alvaro Herrera
Date:
I pushed 0005 (the rbtxn flags thing) after some light editing.
It's been around for long enough ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Alvaro Herrera
Date:
Here's a rebase of this patch series.  I didn't change anything except

1. disregard what was 0005, since I already pushed it.
2. roll 0003 into 0002.
3. rebase 0007 (now 0005) to account for the reorderbuffer changes.

(I did notice that 0005 adds a new boolean any_data_sent, which is
silly -- it should be another txn_flags bit.)
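
For example, something along these lines (the bit value and names are
only illustrative, mirroring the existing rbtxn flag macros):

/* Illustrative only: fold any_data_sent into txn_flags. */
#define RBTXN_ANY_DATA_SENT     0x0010

#define rbtxn_any_data_sent(txn) \
    (((txn)->txn_flags & RBTXN_ANY_DATA_SENT) != 0)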

However, tests don't pass for me; notably, test_decoding crashes.
OTOH I noticed that the streamed transaction support in test_decoding
writes the XID to the output, which is going to make it useless for
regression testing.  It probably should not emit the numerical values.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Alvaro Herrera
Date:

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Alvaro Herrera
Date:
On 2020-Jan-10, Alvaro Herrera wrote:

> From 7d671806584fff71067c8bde38b2f642ba1331a9 Mon Sep 17 00:00:00 2001
> From: Dilip Kumar <dilip.kumar@enterprisedb.com>
> Date: Wed, 20 Nov 2019 16:41:13 +0530
> Subject: [PATCH v6 10/12] Enable streaming for all subscription TAP tests

This patch turns a lot of tests into streamed mode.  While it's
great that streaming mode is tested, we should add new tests for it
rather than failing to keep tests for the non-streamed mode.  I suggest
that we add two versions of each test, one for each mode.  Maybe the way
to do that is to create some subroutine that can be called twice.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Jan 9, 2020 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > I have observed one more design issue.
> > > >
> > >
> > > Good observation.
> > >
> > > >  The problem is that when we
> > > > get a toasted chunks we remember the changes in the memory(hash table)
> > > > but don't stream until we get the actual change on the main table.
> > > > Now, the problem is that we might get the change of the toasted table
> > > > and the main table in different streams.  So basically, in a stream,
> > > > if we have only got the toasted tuples then even after
> > > > ReorderBufferStreamTXN the memory usage will not be reduced.
> > > >
> > >
> > > I think we can't split such changes in a different stream (unless we
> > > design an entirely new solution to send partial changes of toast
> > > data), so we need to send them together. We can keep a flag like
> > > data_complete in ReorderBufferTxn and mark it complete only when we
> > > are able to assemble the entire tuple.  Now, whenever, we try to
> > > stream the changes once we reach the memory threshold, we can check
> > > whether the data_complete flag is true, if so, then only send the
> > > changes, otherwise, we can pick the next largest transaction.  I think
> > > we can retry it for few times and if we get the incomplete data for
> > > multiple transactions, then we can decide to spill the transaction or
> > > maybe we can directly spill the first largest transaction which has
> > > incomplete data.
> > >
> > Yeah, we might do something on this line.  Basically, we need to mark
> > the top-transaction as data-incomplete if any of its subtransaction is
> > having data-incomplete (it will always be the latest sub-transaction
> > of the top transaction).  Also, for streaming, we are checking the
> > largest top transaction whereas for spilling we just need the larget
> > (sub) transaction.   So we also need to decide while picking the
> > largest top transaction for streaming, if we get a few transactions
> > with in-complete data then how we will go for the spill.  Do we spill
> > all the sub-transactions under this top transaction or we will again
> > find the larget (sub) transaction for spilling.
> >
>
> I think it is better to do later as that will lead to the spill of
> only required (minimum changes to get the memory below threshold)
> changes.
Instead of doing this, can't we just spill the changes which are in
the toast_hash?  Basically, at the end of the stream we may have some
toast tuples which we could not stream because we did not get the
insert for the main table; we could spill only those changes which are
in the toast hash.  Then, in a subsequent stream, whenever we get the
insert for the main table we can restore those changes and stream them.
We could also maintain a flag saying the data is not complete (with the
LSN of that change), and after that LSN spill any toast change to disk
until we get the change for the main table; that way we can avoid
building the toast hash until we get the change for the main table.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Jan 13, 2020 at 3:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Jan 9, 2020 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > >  The problem is that when we
> > > > > get a toasted chunks we remember the changes in the memory(hash table)
> > > > > but don't stream until we get the actual change on the main table.
> > > > > Now, the problem is that we might get the change of the toasted table
> > > > > and the main table in different streams.  So basically, in a stream,
> > > > > if we have only got the toasted tuples then even after
> > > > > ReorderBufferStreamTXN the memory usage will not be reduced.
> > > > >
> > > >
> > > > I think we can't split such changes in a different stream (unless we
> > > > design an entirely new solution to send partial changes of toast
> > > > data), so we need to send them together. We can keep a flag like
> > > > data_complete in ReorderBufferTxn and mark it complete only when we
> > > > are able to assemble the entire tuple.  Now, whenever, we try to
> > > > stream the changes once we reach the memory threshold, we can check
> > > > whether the data_complete flag is true

Here, we can also consider streaming the changes when data_complete is
false but some additional changes have been added to the same txn since
the last attempt, as the new changes might complete the tuple.

> > > > , if so, then only send the
> > > > changes, otherwise, we can pick the next largest transaction.  I think
> > > > we can retry it for few times and if we get the incomplete data for
> > > > multiple transactions, then we can decide to spill the transaction or
> > > > maybe we can directly spill the first largest transaction which has
> > > > incomplete data.
> > > >
> > > Yeah, we might do something on this line.  Basically, we need to mark
> > > the top-transaction as data-incomplete if any of its subtransaction is
> > > having data-incomplete (it will always be the latest sub-transaction
> > > of the top transaction).  Also, for streaming, we are checking the
> > > largest top transaction whereas for spilling we just need the larget
> > > (sub) transaction.   So we also need to decide while picking the
> > > largest top transaction for streaming, if we get a few transactions
> > > with in-complete data then how we will go for the spill.  Do we spill
> > > all the sub-transactions under this top transaction or we will again
> > > find the larget (sub) transaction for spilling.
> > >
> >
> > I think it is better to do later as that will lead to the spill of
> > only required (minimum changes to get the memory below threshold)
> > changes.
> I think instead of doing this can't we just spill the changes which
> are in toast_hash.  Basically, at the end of the stream, we have some
> toast tuple which we could not stream because we did not have the
> insert for the main table then we can spill only those changes which
> are in tuple hash.
>

Hmm, I think this can turn out to be inefficient because we can easily
end up spilling the data even when we don't need to do so.  Consider
cases where part of the streamed changes are for toast and the
remaining are changes which we would have streamed and hence can be
removed.  In such cases, we could have easily consumed the remaining
changes for toast without spilling.  Also, I am not sure if spilling
changes from the hash table is a good idea, as they are no longer in
the same order as they were in the ReorderBuffer, which means the order
in which we would normally serialize the changes would change and that
might have some impact; so we would need some more study if we want to
pursue this idea.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Tue, Jan 14, 2020 at 10:56:37AM +0530, Dilip Kumar wrote:
>On Sat, Jan 11, 2020 at 3:07 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>
>> On 2020-Jan-10, Alvaro Herrera wrote:
>>
>> > Here's a rebase of this patch series.  I didn't change anything except
>>
>> ... this time with attachments ...
>The patch set fails to apply on the head so rebased. (Rebased on
>commit cebf9d6e6ee13cbf9f1a91ec633cf96780ffc985)
>

I've noticed the patch was in WoA state since 2019/12/01, but there's
been quite a lot of traffic on this thread and a bunch of new patch
versions. So I've switched it to "needs review" - if that's not the
right status, let me know.

Also, the patch was moved forward mostly by Amit and Dilip, so I've
added them as authors in the CF app (well, what matters is the commit
message, of course, but let's keep this up to date too).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jan 13, 2020 at 3:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Jan 9, 2020 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > >  The problem is that when we
> > > > > > get a toasted chunks we remember the changes in the memory(hash table)
> > > > > > but don't stream until we get the actual change on the main table.
> > > > > > Now, the problem is that we might get the change of the toasted table
> > > > > > and the main table in different streams.  So basically, in a stream,
> > > > > > if we have only got the toasted tuples then even after
> > > > > > ReorderBufferStreamTXN the memory usage will not be reduced.
> > > > > >
> > > > >
> > > > > I think we can't split such changes in a different stream (unless we
> > > > > design an entirely new solution to send partial changes of toast
> > > > > data), so we need to send them together. We can keep a flag like
> > > > > data_complete in ReorderBufferTxn and mark it complete only when we
> > > > > are able to assemble the entire tuple.  Now, whenever, we try to
> > > > > stream the changes once we reach the memory threshold, we can check
> > > > > whether the data_complete flag is true
>
> Here, we can also consider streaming the changes when data_complete is
> false, but some additional changes have been added to the same txn as
> the new changes might complete the tuple.
>
> > > > > , if so, then only send the
> > > > > changes, otherwise, we can pick the next largest transaction.  I think
> > > > > we can retry it for few times and if we get the incomplete data for
> > > > > multiple transactions, then we can decide to spill the transaction or
> > > > > maybe we can directly spill the first largest transaction which has
> > > > > incomplete data.
> > > > >
> > > > Yeah, we might do something on this line.  Basically, we need to mark
> > > > the top-transaction as data-incomplete if any of its subtransaction is
> > > > having data-incomplete (it will always be the latest sub-transaction
> > > > of the top transaction).  Also, for streaming, we are checking the
> > > > largest top transaction whereas for spilling we just need the larget
> > > > (sub) transaction.   So we also need to decide while picking the
> > > > largest top transaction for streaming, if we get a few transactions
> > > > with in-complete data then how we will go for the spill.  Do we spill
> > > > all the sub-transactions under this top transaction or we will again
> > > > find the larget (sub) transaction for spilling.
> > > >
> > >
> > > I think it is better to do later as that will lead to the spill of
> > > only required (minimum changes to get the memory below threshold)
> > > changes.
> > I think instead of doing this can't we just spill the changes which
> > are in toast_hash.  Basically, at the end of the stream, we have some
> > toast tuple which we could not stream because we did not have the
> > insert for the main table then we can spill only those changes which
> > are in tuple hash.
> >
>
> Hmm, I think this can turn out to be inefficient because we can easily
> end up spilling the data even when we don't need to so.  Consider
> cases, where part of the streamed changes are for toast, and remaining
> are the changes which we would have streamed and hence can be removed.
> In such cases, we could have easily consumed remaining changes for
> toast without spilling.  Also, I am not sure if spilling changes from
> the hash table is a good idea as they are no more in the same order as
> they were in ReorderBuffer which means the order in which we serialize
> the changes normally would change and that might have some impact, so
> we would need some more study if we want to pursue this idea.
I have fixed this bug and attached it as a separate patch.  I will
merge it into the main patch once we agree on the idea and after some
more testing.

The idea is that whenever we get a toasted chunk, instead of directly
inserting it into the toast hash I insert it into a local list, so that
if we don't get the change for the main table we can insert these
changes back into txn->changes.  Once we get the change for the main
table, at that point I prepare the hash table to merge the chunks.  If
the stream is over and we haven't got the changes for the main table,
we mark the txn as having pending toast changes so that next time we
will not pick the same transaction for streaming.  This flag is cleared
whenever we get any change for the txn (insert or update).  There is
also the possibility that even after we stream the changes, rb->size is
not below logical_decoding_work_mem because we could not stream the
changes; to handle this, after streaming we recheck the size, and if it
is still not under control we pick another transaction.  In some cases
we might not get any transaction to stream because every candidate has
the pending-toast-change flag set; in that case we go for the spill.
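
For clarity, here is a rough sketch of the "give the chunks back" step
described above; the partial_toast_changes list and the
RBTXN_PENDING_TOAST flag are hypothetical names I am using only for
illustration:

/*
 * Sketch only, not the attached patch.  At the end of a stream, if the
 * change for the main table never arrived, push the collected toast
 * chunks back onto txn->changes and remember that this transaction is
 * currently not a good streaming candidate.
 */
static void
ReorderBufferReturnPendingToast(ReorderBufferTXN *txn,
                                dlist_head *partial_toast_changes)
{
    dlist_mutable_iter iter;

    dlist_foreach_modify(iter, partial_toast_changes)
    {
        ReorderBufferChange *change;

        change = dlist_container(ReorderBufferChange, node, iter.cur);

        dlist_delete(&change->node);
        dlist_push_tail(&txn->changes, &change->node);
    }

    /* cleared again once any non-toast change arrives for this txn */
    txn->txn_flags |= RBTXN_PENDING_TOAST;  /* hypothetical flag */
}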

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
Update on the open items
> As per my understanding apart from the above comments, the known
> pending work for this patchset is as follows:
> a. The two open items agreed to you in the email [3].  -> The first part is done; the second part is an
improvement, not a bugfix.  I will try to work on it in the next patch set.

> b. Complete the handling of schema_sent as discussed above [4].  -> Done
> c. Few comments by Vignesh and the response on the same by me [5][6]. -> Done
> d. WAL overhead and performance testing for additional WAL logging by
> this patchset. -> Pending
> e. Some way to see the tuple for streamed transactions by decoding API
> as speculated by you [7]. -> Pending
f. Bug in the toast table handling -> Submitted as a separate POC
patch, which can be merged into the main patch after review and more testing.

> [3] - https://www.postgresql.org/message-id/CAFiTN-uT5YZE0egGhKdTteTjcGrPi8hb%3DFMPpr9_hEB7hozQ-Q%40mail.gmail.com
> [4] - https://www.postgresql.org/message-id/CAA4eK1KjD9x0mS4JxzCbu3gu-r6K7XJRV%2BZcGb3BH6U3x2uxew%40mail.gmail.com
> [5] - https://www.postgresql.org/message-id/CALDaNm0DNUojjt7CV-fa59_kFbQQ3rcMBtauvo44ttea7r9KaA%40mail.gmail.com
> [6] - https://www.postgresql.org/message-id/CAA4eK1%2BZvupW00c--dqEg8f3dHZDOGmA9xOQLyQHjRSoDi6AkQ%40mail.gmail.com
> [7] - https://www.postgresql.org/message-id/CAFiTN-t8PmKA1X4jEqKmkvs0ggWpy0APWpPuaJwpx2YpfAf97w%40mail.gmail.com


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Alvaro Herrera
Date:
I looked at this patchset and it seemed natural to apply 0008 next
(adding work_mem to subscriptions).  Attached is Dilip's latest version,
plus my review changes.  This will break the patch tester's logic; sorry
about that.

What part of this change is what sets the process's
logical_decoding_work_mem to the given value?  I was unable to figure
that out.  Is it missing or am I just stupid?

Changes:
* the patch adds logical_decoding_work_mem SGML, but that has already
  been applied (cec2edfa7859); remove dupe.

* parse_subscription_options() comment says that it will raise an error if a
  caller does not pass the pointer for an option but option list
  specifies that option.  It does not really implement that behavior (an
  existing problem): instead, if the pointer is not passed, the option
  is ignored.  Moreover, this new patch continued to fail to handle
  things as the comment says.  I decided to implement the documented
  behavior instead; it's now inconsistent with how the other options are
  implemented.  I think we should fix the other options to behave as the
  comment says, because it's a more convenient API; if we instead opted
  to update the code comment to match the code, each caller would have
  to be checked to verify that the correct options are passed, which is
  pointless and error prone.

* the parse_subscription_options API is a mess.  I reordered the
  arguments a little bit; also change the argument layout in callers so
  that each caller is grouped more sensibly.  Also added comments to
  simplify reading the argument lists.  I think this could be fixed by
  using an ad-hoc struct to pass in and out.  Didn't get around to doing
  that, seems an unrelated potential improvement.

* trying to do own range checking in pgoutput and subscriptioncmds.c
  seems pointless and likely to get out of sync with guc.c.  Simpler is
  to call set_config_option() to verify that the argument is in range.
  (Note a further problem in the patch series: the range check in
  subscriptioncmds.c is only added in patch 0009).

* parsing integers using scanint8() seemed weird (error messages there
  do not correspond to what we want).  After a couple of false starts, I
  decided to rely on guc.c's set_config_option() followed by parse_int().
  That also has the benefit that you can give it units.  (A small sketch
  of that pattern follows after this list.)

* psql \dRs+ should display the work_mem; patch failed to do that.
  Added.  Unit display is done by pg_size_pretty(), which might be
  different from what guc.c does, but I think it works OK.
  It's the first place where we use pg_size_pretty to show a memory
  limit, however.
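
To illustrate the parse_int() point from the list above, roughly this
pattern (the helper name and error wording are mine, not the attached
patch):

/*
 * Sketch only: validate a work_mem-style option the way guc.c does,
 * so values such as "64kB" are accepted and malformed input gets the
 * usual GUC-style hint.
 */
static int
validate_work_mem_option(const char *valuestr)
{
    int         wm_kb;
    const char *hintmsg;

    if (!parse_int(valuestr, &wm_kb, GUC_UNIT_KB, &hintmsg))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("invalid value for work_mem: \"%s\"", valuestr),
                 hintmsg ? errhint("%s", _(hintmsg)) : 0));

    return wm_kb;
}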

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Jan 22, 2020 at 10:07 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
>
> I looked at this patchset and it seemed natural to apply 0008 next
> (adding work_mem to subscriptions).
>

I am not so sure whether we need this patch, as the exact scenario
where it can help is not very clear to me, nor has anyone explained it.
I have raised this concern earlier as well [1].  The point is that
'logical_decoding_work_mem' applies to the entire ReorderBuffer on the
publisher's side, so how will a parameter from a particular
subscription help with that?

[1] - https://www.postgresql.org/message-id/CAA4eK1J%2B3kab6RSZrgj0YiQV1r%2BH3FWVaNjKhWvpEe5-bpZiBw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Hmm, I think this can turn out to be inefficient because we can easily
> > end up spilling the data even when we don't need to so.  Consider
> > cases, where part of the streamed changes are for toast, and remaining
> > are the changes which we would have streamed and hence can be removed.
> > In such cases, we could have easily consumed remaining changes for
> > toast without spilling.  Also, I am not sure if spilling changes from
> > the hash table is a good idea as they are no more in the same order as
> > they were in ReorderBuffer which means the order in which we serialize
> > the changes normally would change and that might have some impact, so
> > we would need some more study if we want to pursue this idea.
> I have fixed this bug and attached it as a separate patch.  I will
> merge it to the main patch after we agree with the idea and after some
> more testing.
>
> The idea is that whenever we get the toasted chunk instead of directly
> inserting it into the toast hash I am inserting it into some local
> list so that if we don't get the change for the main table then we can
> insert these changes back to the txn->changes.  So once we get the
> change for the main table at that time I am preparing the hash table
> to merge the chunks.
>


I think this idea will work but it appears to be quite costly, because
(a) you might need to serialize/deserialize the changes multiple times
and might attempt streaming multiple times even though you can't do so,
and (b) you need to remove/add the same set of changes from the main
list multiple times.

It seems to me that we need to add all of this new handling because,
while taking the decision whether to stream or not, we don't know
whether the txn has changes that can't be streamed.  One idea to make
it work is to identify this while decoding the WAL.  I think we need to
set a bit in the insert/delete WAL record to identify whether the tuple
belongs to a toast relation.  This won't add any additional overhead to
the WAL, will reduce a lot of complexity in the logical decoding, and
decoding will also be efficient.  If this is feasible, then we can do
the same for speculative insertions.
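
Roughly what I have in mind, as two fragments (the flag name and the
decode-side field are hypothetical, shown only to illustrate the idea;
the actual bit value and placement would be decided in the patch):

/* heapam.c, while filling in the xl_heap_insert record (sketch only) */
if (IsToastRelation(relation))
    xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;    /* hypothetical bit */

/* decode.c, DecodeInsert(): remember the fact on the change (sketch only) */
change->data.tp.on_toast_relation =                 /* hypothetical field */
    (xlrec->flags & XLH_INSERT_ON_TOAST_RELATION) != 0;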

In patch v8-0013-Bugfix-handling-of-incomplete-toast-tuple, why is
below change required?

--- a/contrib/test_decoding/logical.conf
+++ b/contrib/test_decoding/logical.conf
@@ -1,3 +1,4 @@
 wal_level = logical
 max_replication_slots = 4
 logical_decoding_work_mem = 64kB
+logging_collector=on


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, Jan 28, 2020 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Hmm, I think this can turn out to be inefficient because we can easily
> > > end up spilling the data even when we don't need to so.  Consider
> > > cases, where part of the streamed changes are for toast, and remaining
> > > are the changes which we would have streamed and hence can be removed.
> > > In such cases, we could have easily consumed remaining changes for
> > > toast without spilling.  Also, I am not sure if spilling changes from
> > > the hash table is a good idea as they are no more in the same order as
> > > they were in ReorderBuffer which means the order in which we serialize
> > > the changes normally would change and that might have some impact, so
> > > we would need some more study if we want to pursue this idea.
> > I have fixed this bug and attached it as a separate patch.  I will
> > merge it to the main patch after we agree with the idea and after some
> > more testing.
> >
> > The idea is that whenever we get the toasted chunk instead of directly
> > inserting it into the toast hash I am inserting it into some local
> > list so that if we don't get the change for the main table then we can
> > insert these changes back to the txn->changes.  So once we get the
> > change for the main table at that time I am preparing the hash table
> > to merge the chunks.
> >
>
>
> I think this idea will work but appears to be quite costly because (a)
> you might need to serialize/deserialize the changes multiple times and
> might attempt streaming multiple times even though you can't do (b)
> you need to remove/add the same set of changes from the main list
> multiple times.
I agree with this.
>
> It seems to me that we need to add all of this new handling because
> while taking the decision whether to stream or not we don't know
> whether the txn has changes that can't be streamed.  One idea to make
> it work is that we identify it while decoding the WAL.  I think we
> need to set a bit in the insert/delete WAL record to identify if the
> tuple belongs to a toast relation.  This won't add any additional
> overhead in WAL and reduce a lot of complexity in the logical decoding
> and also decoding will be efficient.  If this is feasible, then we can
> do the same for speculative insertions.
The idea looks good to me.  I will work on this.

>
> In patch v8-0013-Bugfix-handling-of-incomplete-toast-tuple, why is
> below change required?
>
> --- a/contrib/test_decoding/logical.conf
> +++ b/contrib/test_decoding/logical.conf
> @@ -1,3 +1,4 @@
>  wal_level = logical
>  max_replication_slots = 4
>  logical_decoding_work_mem = 64kB
> +logging_collector=on
Sorry, these are some local changes which got included in the patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Jan 28, 2020 at 11:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jan 28, 2020 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > >
> > > > Hmm, I think this can turn out to be inefficient because we can easily
> > > > end up spilling the data even when we don't need to so.  Consider
> > > > cases, where part of the streamed changes are for toast, and remaining
> > > > are the changes which we would have streamed and hence can be removed.
> > > > In such cases, we could have easily consumed remaining changes for
> > > > toast without spilling.  Also, I am not sure if spilling changes from
> > > > the hash table is a good idea as they are no more in the same order as
> > > > they were in ReorderBuffer which means the order in which we serialize
> > > > the changes normally would change and that might have some impact, so
> > > > we would need some more study if we want to pursue this idea.
> > > I have fixed this bug and attached it as a separate patch.  I will
> > > merge it to the main patch after we agree with the idea and after some
> > > more testing.
> > >
> > > The idea is that whenever we get the toasted chunk instead of directly
> > > inserting it into the toast hash I am inserting it into some local
> > > list so that if we don't get the change for the main table then we can
> > > insert these changes back to the txn->changes.  So once we get the
> > > change for the main table at that time I am preparing the hash table
> > > to merge the chunks.
> > >
> >
> >
> > I think this idea will work but appears to be quite costly because (a)
> > you might need to serialize/deserialize the changes multiple times and
> > might attempt streaming multiple times even though you can't do (b)
> > you need to remove/add the same set of changes from the main list
> > multiple times.
> I agree with this.
> >
> > It seems to me that we need to add all of this new handling because
> > while taking the decision whether to stream or not we don't know
> > whether the txn has changes that can't be streamed.  One idea to make
> > it work is that we identify it while decoding the WAL.  I think we
> > need to set a bit in the insert/delete WAL record to identify if the
> > tuple belongs to a toast relation.  This won't add any additional
> > overhead in WAL and reduce a lot of complexity in the logical decoding
> > and also decoding will be efficient.  If this is feasible, then we can
> > do the same for speculative insertions.
> The Idea looks good to me.  I will work on this.
>

One more thing we can do is to identify whether the tuple belongs to a
toast relation while decoding it.  However, I think to do that we need
to have access to the relcache at that time, and that might add some
overhead as we need to do it for each tuple.  Can we investigate
what it will take to do that and whether it is better than setting a bit
during WAL logging?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jan 28, 2020 at 11:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Jan 28, 2020 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > >
> > > > > Hmm, I think this can turn out to be inefficient because we can easily
> > > > > end up spilling the data even when we don't need to so.  Consider
> > > > > cases, where part of the streamed changes are for toast, and remaining
> > > > > are the changes which we would have streamed and hence can be removed.
> > > > > In such cases, we could have easily consumed remaining changes for
> > > > > toast without spilling.  Also, I am not sure if spilling changes from
> > > > > the hash table is a good idea as they are no more in the same order as
> > > > > they were in ReorderBuffer which means the order in which we serialize
> > > > > the changes normally would change and that might have some impact, so
> > > > > we would need some more study if we want to pursue this idea.
> > > > I have fixed this bug and attached it as a separate patch.  I will
> > > > merge it to the main patch after we agree with the idea and after some
> > > > more testing.
> > > >
> > > > The idea is that whenever we get the toasted chunk instead of directly
> > > > inserting it into the toast hash I am inserting it into some local
> > > > list so that if we don't get the change for the main table then we can
> > > > insert these changes back to the txn->changes.  So once we get the
> > > > change for the main table at that time I am preparing the hash table
> > > > to merge the chunks.
> > > >
> > >
> > >
> > > I think this idea will work but appears to be quite costly because (a)
> > > you might need to serialize/deserialize the changes multiple times and
> > > might attempt streaming multiple times even though you can't do (b)
> > > you need to remove/add the same set of changes from the main list
> > > multiple times.
> > I agree with this.
> > >
> > > It seems to me that we need to add all of this new handling because
> > > while taking the decision whether to stream or not we don't know
> > > whether the txn has changes that can't be streamed.  One idea to make
> > > it work is that we identify it while decoding the WAL.  I think we
> > > need to set a bit in the insert/delete WAL record to identify if the
> > > tuple belongs to a toast relation.  This won't add any additional
> > > overhead in WAL and reduce a lot of complexity in the logical decoding
> > > and also decoding will be efficient.  If this is feasible, then we can
> > > do the same for speculative insertions.
> > The Idea looks good to me.  I will work on this.
> >
>
> One more thing we can do is to identify whether the tuple belongs to
> toast relation while decoding it.  However, I think to do that we need
> to have access to relcache at that time and that might add some
> overhead as we need to do that for each tuple. Can we investigate
> what it will take to do that and if it is better than setting a bit
> during WAL logging.

IMHO, for the catalog scan we will have to start/stop the transaction
for each change.  So do you want us to evaluate its performance?
Also, at the time we get the change we might not have the complete
historic snapshot ready to fetch the relcache entry.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Jan 28, 2020 at 11:58 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > > It seems to me that we need to add all of this new handling because
> > > > while taking the decision whether to stream or not we don't know
> > > > whether the txn has changes that can't be streamed.  One idea to make
> > > > it work is that we identify it while decoding the WAL.  I think we
> > > > need to set a bit in the insert/delete WAL record to identify if the
> > > > tuple belongs to a toast relation.  This won't add any additional
> > > > overhead in WAL and reduce a lot of complexity in the logical decoding
> > > > and also decoding will be efficient.  If this is feasible, then we can
> > > > do the same for speculative insertions.
> > > The Idea looks good to me.  I will work on this.
> > >
> >
> > One more thing we can do is to identify whether the tuple belongs to
> > toast relation while decoding it.  However, I think to do that we need
> > to have access to relcache at that time and that might add some
> > overhead as we need to do that for each tuple. Can we investigate
> > what it will take to do that and if it is better than setting a bit
> > during WAL logging.
>
> IMHO, for the catalog scan, we will have to start/stop the transaction
> for each change.  So do you want that we should evaluate its
> performance?
>

No, I was not thinking about each change, but at the level of ReorderBufferTXN.

>  Also, during we get the change we might not have the
> complete historic snapshot ready to fetch the rel cache entry.
>

Before decoding each change (say DecodeInsert), we call
SnapBuildProcessChange.  Isn't that sufficient?

Even if the above is possible, I am not sure how good it is to fetch
the relcache entry for each change; that is the point I was worried about.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, Jan 28, 2020 at 1:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jan 28, 2020 at 11:58 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > > > It seems to me that we need to add all of this new handling because
> > > > > while taking the decision whether to stream or not we don't know
> > > > > whether the txn has changes that can't be streamed.  One idea to make
> > > > > it work is that we identify it while decoding the WAL.  I think we
> > > > > need to set a bit in the insert/delete WAL record to identify if the
> > > > > tuple belongs to a toast relation.  This won't add any additional
> > > > > overhead in WAL and reduce a lot of complexity in the logical decoding
> > > > > and also decoding will be efficient.  If this is feasible, then we can
> > > > > do the same for speculative insertions.
> > > > The Idea looks good to me.  I will work on this.
> > > >
> > >
> > > One more thing we can do is to identify whether the tuple belongs to
> > > toast relation while decoding it.  However, I think to do that we need
> > > to have access to relcache at that time and that might add some
> > > overhead as we need to do that for each tuple. Can we investigate
> > > what it will take to do that and if it is better than setting a bit
> > > during WAL logging.
> >
> > IMHO, for the catalog scan, we will have to start/stop the transaction
> > for each change.  So do you want that we should evaluate its
> > performance?
> >
>
> No, I was not thinking about each change, but at the level of ReorderBufferTXN.
That means we will have to keep that transaction open until we decode
the commit WAL for that ReorderBufferTXN, or do you have something else
in mind?

>
> >  Also, during we get the change we might not have the
> > complete historic snapshot ready to fetch the rel cache entry.
> >
>
> Before decoding each change (say DecodeInsert), we call
> SnapBuildProcessChange.  Isn't that sufficient?
Yeah, right, we can get the relcache entry based on the base snapshot,
and that might be sufficient to know whether it's a toast relation or
not.
>
> Even, if the above is possible, I am not sure how good is it for each
> change we fetch rel cache entry, that is the point I was worried.

We might not need to scan the catalog every time; we might get it from
the cache itself.
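
For what it's worth, a rough sketch of the relcache-based check being
discussed, assuming a usable historic snapshot has already been set up
(the helper name is made up; error handling and cache invalidation are
omitted):

    static bool
    ChangeIsForToastRelation(ReorderBufferChange *change)
    {
        Oid         reloid;
        Relation    relation;
        bool        result;

        /* map the change's relfilenode to a relation OID via the catalog */
        reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
                                    change->data.tp.relnode.relNode);
        if (!OidIsValid(reloid))
            return false;       /* relation dropped or rewritten later */

        relation = RelationIdGetRelation(reloid);
        result = (relation->rd_rel->relkind == RELKIND_TOASTVALUE);
        RelationClose(relation);

        return result;
    }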

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Jan 28, 2020 at 1:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jan 28, 2020 at 1:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jan 28, 2020 at 11:58 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > > > It seems to me that we need to add all of this new handling because
> > > > > > while taking the decision whether to stream or not we don't know
> > > > > > whether the txn has changes that can't be streamed.  One idea to make
> > > > > > it work is that we identify it while decoding the WAL.  I think we
> > > > > > need to set a bit in the insert/delete WAL record to identify if the
> > > > > > tuple belongs to a toast relation.  This won't add any additional
> > > > > > overhead in WAL and reduce a lot of complexity in the logical decoding
> > > > > > and also decoding will be efficient.  If this is feasible, then we can
> > > > > > do the same for speculative insertions.
> > > > > The Idea looks good to me.  I will work on this.
> > > > >
> > > >
> > > > One more thing we can do is to identify whether the tuple belongs to
> > > > toast relation while decoding it.  However, I think to do that we need
> > > > to have access to relcache at that time and that might add some
> > > > overhead as we need to do that for each tuple. Can we investigate
> > > > what it will take to do that and if it is better than setting a bit
> > > > during WAL logging.
> > >
> > > IMHO, for the catalog scan, we will have to start/stop the transaction
> > > for each change.  So do you want that we should evaluate its
> > > performance?
> > >
> >
> > No, I was not thinking about each change, but at the level of ReorderBufferTXN.
> That means we will have to keep that transaction open until we decode
> the commit WAL for that ReorderBufferTXN or you have anything else in
> mind?
>

or probably till we start streaming.

> >
> > >  Also, during we get the change we might not have the
> > > complete historic snapshot ready to fetch the rel cache entry.
> > >
> >
> > Before decoding each change (say DecodeInsert), we call
> > SnapBuildProcessChange.  Isn't that sufficient?
> Yeah, Right, we can get some recache entry based on the base snapshot.
> And, that might be sufficient to know whether it's a toast relation or
> not.
> >
> > Even, if the above is possible, I am not sure how good is it for each
> > change we fetch rel cache entry, that is the point I was worried.
>
> We might not need to scan the catalog every time, we might get it from
> the cache itself.
>

Right, but I am not completely sure if that is better than setting a
bit in the WAL record for toast tuples.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Jan 10, 2020 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> >
> > Few more comments:
> > --------------------------------
> > v4-0007-Implement-streaming-mode-in-ReorderBuffer
> > 1.
> > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > {
> > ..
> > + /*
> > + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
> > + * information about
> > subtransactions, which could arrive after streaming start.
> > + */
> > + if (!txn->is_schema_sent)
> > + snapshot_now
> > = ReorderBufferCopySnap(rb, txn->base_snapshot,
> > + txn,
> > command_id);
> > ..
> > }
> >
> > Why are we using base snapshot here instead of the snapshot we saved
> > the first time streaming has happened?  And as mentioned in comments,
> > won't we need to consider the snapshots for subtransactions that
> > arrived after the last time we have streamed the changes?
> Fixed
>

+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
+ /*
+ * We can not use txn->snapshot_now directly because after we there
+ * might be some new sub-transaction which after the last streaming run
+ * so we need to add those sub-xip in the snapshot.
+ */
+ snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+ txn, command_id);

"because after we there", you seem to forget a word between 'we' and
'there'.  So as we are copying it now, does this mean it will consider
the snapshots for subtransactions that arrived after the last time we
have streamed the changes? If so, have you tested it and can we add
the same in comments.

Also, if we need to copy the snapshot here, then do we need to again
copy it in ReorderBufferProcessTXN(in below code and in catch block in
the same function).

{
..
+ /*
+ * Remember the command ID and snapshot if transaction is streaming
+ * otherwise free the snapshot if we have copied it.
+ */
+ if (streaming)
+ {
+ txn->command_id = command_id;
+ txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+   txn, command_id);
+ }
+ else if (snapshot_now->copied)
+ ReorderBufferFreeSnap(rb, snapshot_now);
..
}

> >
> > 4. In ReorderBufferStreamTXN(), don't we need to set some of the txn
> > fields like origin_id, origin_lsn as we do in ReorderBufferCommit()
> > especially to cover the case when it gets called due to memory
> > overflow (aka via ReorderBufferCheckMemoryLimit).
> We get origin_lsn during commit time so I am not sure how can we do
> that.  I have also noticed that currently, we are not using origin_lsn
> on the subscriber side.  I think need more investigation that if we
> want this then do we need to log it early.
>

Have you done any investigation of this point?  You might want to look
at the pg_replication_origin* APIs.  Today, looking at this code again,
I think with the current coding it won't be used even when we encounter
the commit record, because ReorderBufferCommit calls
ReorderBufferStreamCommit, which will make sure that origin_id and
origin_lsn are never sent.  I think at least that should be fixed; if
not, we probably need a comment with the reasoning why we think it is
okay not to do it in this case.

+ /*
+ * If we are streaming the in-progress transaction then Discard the

/Discard/discard

> >
> > v4-0006-Gracefully-handle-concurrent-aborts-of-uncommitte
> > 1.
> > + /*
> > + * If CheckXidAlive is valid, then we check if it aborted. If it did, we
> > + * error out
> > + */
> > + if (TransactionIdIsValid(CheckXidAlive) &&
> > + !TransactionIdIsInProgress(CheckXidAlive) &&
> > + !TransactionIdDidCommit(CheckXidAlive))
> > + ereport(ERROR,
> > + (errcode(ERRCODE_TRANSACTION_ROLLBACK),
> > + errmsg("transaction aborted during system catalog scan")));
> >
> > Why here we can't use TransactionIdDidAbort?  If we can't use it, then
> > can you add comments stating the reason of the same.
> Done

+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out.  Instead of directly checking the abort status we do check
+ * if it is not in progress transaction and no committed. Because if there
+ * were a system crash then status of the the transaction which were running
+ * at that time might not have marked.  So we need to consider them as
+ * aborted.  Refer detailed comments at snapmgr.c where the variable is
+ * declared.


How about replacing the above comment with below one:

If CheckXidAlive is valid, then we check if it aborted. If it did, we
error out.  We can't directly use TransactionIdDidAbort as after crash
such transaction might not have been marked as aborted.  See detailed
comments at snapmgr.c where the variable is declared.


I am not able to understand the change in
v8-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.  Do you have
any explanation for the same?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Jan 30, 2020 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 10, 2020 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > >
> > > Few more comments:
> > > --------------------------------
> > > v4-0007-Implement-streaming-mode-in-ReorderBuffer
> > > 1.
> > > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > {
> > > ..
> > > + /*
> > > + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
> > > + * information about
> > > subtransactions, which could arrive after streaming start.
> > > + */
> > > + if (!txn->is_schema_sent)
> > > + snapshot_now
> > > = ReorderBufferCopySnap(rb, txn->base_snapshot,
> > > + txn,
> > > command_id);
> > > ..
> > > }
> > >
> > > Why are we using base snapshot here instead of the snapshot we saved
> > > the first time streaming has happened?  And as mentioned in comments,
> > > won't we need to consider the snapshots for subtransactions that
> > > arrived after the last time we have streamed the changes?
> > Fixed
> >
>
> +static void
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> + /*
> + * We can not use txn->snapshot_now directly because after we there
> + * might be some new sub-transaction which after the last streaming run
> + * so we need to add those sub-xip in the snapshot.
> + */
> + snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
> + txn, command_id);
>
> "because after we there", you seem to forget a word between 'we' and
> 'there'.  So as we are copying it now, does this mean it will consider
> the snapshots for subtransactions that arrived after the last time we
> have streamed the changes? If so, have you tested it and can we add
> the same in comments.

Ok
> Also, if we need to copy the snapshot here, then do we need to again
> copy it in ReorderBufferProcessTXN(in below code and in catch block in
> the same function).
I think so, because as part of the
"REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT" change we might point directly
to the snapshot, and that will get truncated when we truncate all the
changes of the ReorderBufferTXN.   So I think we can check: if
snapshot_now->copied is true then we can avoid copying, otherwise we
copy?

Other comments look fine to me so I will reply to them along with the
next version of the patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Jan 30, 2020 at 6:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Jan 30, 2020 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Also, if we need to copy the snapshot here, then do we need to again
> > copy it in ReorderBufferProcessTXN(in below code and in catch block in
> > the same function).
> I think so because as part of the
> "REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT" change, we might directly
> point to the snapshot and that will get truncated when we truncate all
> the changes of the ReorderBufferTXN.   So I think we can check if
> snapshot_now->copied is true then we can avoid copying otherwise we
> can copy?
>

Yeah, that makes sense, but I think then we also need to ensure that
ReorderBufferStreamTXN frees the snapshot only when it is copied.  It
seems to me it should always be copied at the place where we are
trying to free it, so probably we should have an Assert there.
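
A sketch of how that could look, with the copy-avoidance in
ReorderBufferProcessTXN and the Assert at the free site (placement is
illustrative only, not the final patch code):

    /* ReorderBufferProcessTXN(), when remembering the snapshot for streaming: */
    if (streaming)
    {
        txn->command_id = command_id;
        txn->snapshot_now = snapshot_now->copied
            ? snapshot_now
            : ReorderBufferCopySnap(rb, snapshot_now, txn, command_id);
    }

    /* ReorderBufferStreamTXN(), where the snapshot is eventually freed: */
    Assert(snapshot_now->copied);
    ReorderBufferFreeSnap(rb, snapshot_now);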

One more thing:
ReorderBufferProcessTXN()
{
..
+ if (streaming)
+ {
+ /*
+ * While streaming an in-progress transaction there is a
+ * possibility that the (sub)transaction might get aborted
+ * concurrently.  In such case if the (sub)transaction has
+ * catalog update then we might decode the tuple using wrong
+ * catalog version.  So for detecting the concurrent abort we
+ * set CheckXidAlive to the current (sub)transaction's xid for
+ * which this change belongs to.  And, during catalog scan we
+ * can check the status of the xid and if it is aborted we will
+ * report an specific error which we can ignore.  We might have
+ * already streamed some of the changes for the aborted
+ * (sub)transaction, but that is fine because when we decode the
+ * abort we will stream abort message to truncate the changes in
+ * the subscriber.
+ */
+ CheckXidAlive = change->txn->xid;
+ }
..
}

I think it is better to move the above code into an inline function
(something like SetXidAlive).  It will make the code in function
ReorderBufferProcessTXN look cleaner and easier to understand.
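
One possible shape for the suggested helper; SetXidAlive is only a
proposed name here, not existing code:

    static inline void
    SetXidAlive(TransactionId xid)
    {
        /*
         * Remember which (sub)transaction's change is being processed so
         * that catalog scans can detect a concurrent abort and raise the
         * specific error the caller knows how to ignore.
         */
        CheckXidAlive = xid;
    }

    /* ReorderBufferProcessTXN(), per change: */
    if (streaming)
        SetXidAlive(change->txn->xid);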

> Other comments look fine to me so I will reply to them along with the
> next version of the patch.
>

Okay, thanks.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Jan 30, 2020 at 6:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Other comments look fine to me so I will reply to them along with the
> next version of the patch.
>

This still needs more work, so I have moved this to the next CF.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Jan 10, 2020 at 10:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Dec 30, 2019 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > > 2. During commit time in DecodeCommit we check whether we need to skip
> > > the changes of the transaction or not by calling
> > > SnapBuildXactNeedsSkip but since now we support streaming so it's
> > > possible that before we decode the commit WAL, we might have already
> > > sent the changes to the output plugin even though we could have
> > > skipped those changes.  So my question is instead of checking at the
> > > commit time can't we check before adding to ReorderBuffer itself
> > >
> >
> > I think if we can do that then the same will be true for current code
> > irrespective of this patch.  I think it is possible that we can't take
> > that decision while decoding because we haven't assembled a consistent
> > snapshot yet.  I think we might be able to do that while we try to
> > stream the changes.  I think we need to take care of all the
> > conditions during streaming (when the logical_decoding_workmem limit
> > is reached) as we do in DecodeCommit.  This needs a bit more study.
>
> I have analyzed this further and I think we can not decide all the
> conditions even while streaming.  Because IMHO once we get the
> SNAPBUILD_FULL_SNAPSHOT we can add the changes to the reorder buffer
> so that if we get the commit of the transaction after we reach to the
> SNAPBUILD_CONSISTENT.  However, if we get the commit before we reach
> to SNAPBUILD_CONSISTENT then we need to ignore this transaction.  Now,
> even if we have SNAPBUILD_FULL_SNAPSHOT we can stream the changes
> which might get dropped later but that we can not decide while
> streaming.
>

This makes sense to me, but when we are streaming we should add a
comment saying that we can't skip here the way we do at commit time,
for the reason you described above.  Also, what about the other
conditions under which we can skip the transaction, basically cases
like (a) when the transaction happened in another database, (b) when
the output plugin is not interested in the origin, and (c) when we are
doing fast-forwarding?
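
For reference, a condensed sketch of the skip checks DecodeCommit
applies today; the open question is which of these can be evaluated at
stream time (the snapshot-state check cannot, for the reason above):

    if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
        (parsed->dbId != InvalidOid &&
         parsed->dbId != ctx->slot->data.database) ||    /* (a) other database */
        FilterByOrigin(ctx, origin_id) ||                /* (b) origin filtered */
        ctx->fast_forward)                               /* (c) fast-forwarding */
    {
        /* skip: don't stream the transaction's changes */
    }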

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, Feb 3, 2020 at 9:51 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 10, 2020 at 10:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Dec 30, 2019 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > > 2. During commit time in DecodeCommit we check whether we need to skip
> > > > the changes of the transaction or not by calling
> > > > SnapBuildXactNeedsSkip but since now we support streaming so it's
> > > > possible that before we decode the commit WAL, we might have already
> > > > sent the changes to the output plugin even though we could have
> > > > skipped those changes.  So my question is instead of checking at the
> > > > commit time can't we check before adding to ReorderBuffer itself
> > > >
> > >
> > > I think if we can do that then the same will be true for current code
> > > irrespective of this patch.  I think it is possible that we can't take
> > > that decision while decoding because we haven't assembled a consistent
> > > snapshot yet.  I think we might be able to do that while we try to
> > > stream the changes.  I think we need to take care of all the
> > > conditions during streaming (when the logical_decoding_workmem limit
> > > is reached) as we do in DecodeCommit.  This needs a bit more study.
> >
> > I have analyzed this further and I think we can not decide all the
> > conditions even while streaming.  Because IMHO once we get the
> > SNAPBUILD_FULL_SNAPSHOT we can add the changes to the reorder buffer
> > so that if we get the commit of the transaction after we reach to the
> > SNAPBUILD_CONSISTENT.  However, if we get the commit before we reach
> > to SNAPBUILD_CONSISTENT then we need to ignore this transaction.  Now,
> > even if we have SNAPBUILD_FULL_SNAPSHOT we can stream the changes
> > which might get dropped later but that we can not decide while
> > streaming.
> >
>
> This makes sense to me, but we should add a comment for the same when
> we are streaming to say we can't skip similar to how we do during
> commit time because of the above reason described by you.  Also, what
> about other conditions where we can skip the transaction, basically
> cases like (a) when the transaction happened in another database,  (b)
> when the output plugin is not interested in the origin and (c) when we
> are doing fast-forwarding
I will analyze those and fix them in my next version of the patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jan 28, 2020 at 11:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Jan 28, 2020 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > >
> > > > > Hmm, I think this can turn out to be inefficient because we can easily
> > > > > end up spilling the data even when we don't need to so.  Consider
> > > > > cases, where part of the streamed changes are for toast, and remaining
> > > > > are the changes which we would have streamed and hence can be removed.
> > > > > In such cases, we could have easily consumed remaining changes for
> > > > > toast without spilling.  Also, I am not sure if spilling changes from
> > > > > the hash table is a good idea as they are no more in the same order as
> > > > > they were in ReorderBuffer which means the order in which we serialize
> > > > > the changes normally would change and that might have some impact, so
> > > > > we would need some more study if we want to pursue this idea.
> > > > I have fixed this bug and attached it as a separate patch.  I will
> > > > merge it to the main patch after we agree with the idea and after some
> > > > more testing.
> > > >
> > > > The idea is that whenever we get the toasted chunk instead of directly
> > > > inserting it into the toast hash I am inserting it into some local
> > > > list so that if we don't get the change for the main table then we can
> > > > insert these changes back to the txn->changes.  So once we get the
> > > > change for the main table at that time I am preparing the hash table
> > > > to merge the chunks.
> > > >
> > >
> > >
> > > I think this idea will work but appears to be quite costly because (a)
> > > you might need to serialize/deserialize the changes multiple times and
> > > might attempt streaming multiple times even though you can't do (b)
> > > you need to remove/add the same set of changes from the main list
> > > multiple times.
> > I agree with this.
> > >
> > > It seems to me that we need to add all of this new handling because
> > > while taking the decision whether to stream or not we don't know
> > > whether the txn has changes that can't be streamed.  One idea to make
> > > it work is that we identify it while decoding the WAL.  I think we
> > > need to set a bit in the insert/delete WAL record to identify if the
> > > tuple belongs to a toast relation.  This won't add any additional
> > > overhead in WAL and reduce a lot of complexity in the logical decoding
> > > and also decoding will be efficient.  If this is feasible, then we can
> > > do the same for speculative insertions.
> > The Idea looks good to me.  I will work on this.
> >
>
> One more thing we can do is to identify whether the tuple belongs to
> toast relation while decoding it.  However, I think to do that we need
> to have access to relcache at that time and that might add some
> overhead as we need to do that for each tuple.  Can we investigate
> what it will take to do that and if it is better than setting a bit
> during WAL logging.
>
I have done some more analysis on this and it appears that there are a
few problems in doing this.  Basically, once we get the confirmed flush
location, we advance the replication_slot_catalog_xmin so that vacuum
can garbage collect the old tuples.  So the problem is that while we are
collecting the changes in the ReorderBuffer, our catalog version might
have been removed, and we might not find any relation entry with that
relfilenodeid (because the relation was dropped or altered later).

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Feb 4, 2020 at 11:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > One more thing we can do is to identify whether the tuple belongs to
> > toast relation while decoding it.  However, I think to do that we need
> > to have access to relcache at that time and that might add some
> > overhead as we need to do that for each tuple.  Can we investigate
> > what it will take to do that and if it is better than setting a bit
> > during WAL logging.
> >
> I have done some more analysis on this and it appears that there are
> few problems in doing this.  Basically, once we get the confirmed
> flush location, we advance the replication_slot_catalog_xmin so that
> vacuum can garbage collect the old tuple.  So the problem is that
> while we are collecting the changes in the ReorderBuffer our catalog
> version might have removed,  and we might not find any relation entry
> with that relfilenodeid (because it is dropped or altered in the
> future).
>

Hmm, this means this can also occur while streaming the changes.  The
main reason, as I understand it, is that before decoding the commit we
don't know whether these changes have already been sent to the
subscriber (based on confirmed_flush_location/start_decoding_at).  I
think it is better to skip streaming such transactions, as we can't
make the right decision about them; and since this can generally happen
only for the first few transactions after a crash, it shouldn't matter
much if we serialize such transactions instead of streaming them.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Jan 30, 2020 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 10, 2020 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > >
> > > Few more comments:
> > > --------------------------------
> > > v4-0007-Implement-streaming-mode-in-ReorderBuffer
> > > 1.
> > > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > {
> > > ..
> > > + /*
> > > + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
> > > + * information about
> > > subtransactions, which could arrive after streaming start.
> > > + */
> > > + if (!txn->is_schema_sent)
> > > + snapshot_now
> > > = ReorderBufferCopySnap(rb, txn->base_snapshot,
> > > + txn,
> > > command_id);
> > > ..
> > > }
> > >
> > > Why are we using base snapshot here instead of the snapshot we saved
> > > the first time streaming has happened?  And as mentioned in comments,
> > > won't we need to consider the snapshots for subtransactions that
> > > arrived after the last time we have streamed the changes?
> > Fixed
> >
>
> +static void
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> + /*
> + * We can not use txn->snapshot_now directly because after we there
> + * might be some new sub-transaction which after the last streaming run
> + * so we need to add those sub-xip in the snapshot.
> + */
> + snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
> + txn, command_id);
>
> "because after we there", you seem to forget a word between 'we' and
> 'there'.
Fixed

> So as we are copying it now, does this mean it will consider
> the snapshots for subtransactions that arrived after the last time we
> have streamed the changes? If so, have you tested it and can we add
> the same in comments.
Yes, I have tested it.  Comment added.
>
> Also, if we need to copy the snapshot here, then do we need to again
> copy it in ReorderBufferProcessTXN(in below code and in catch block in
> the same function).
>
> {
> ..
> + /*
> + * Remember the command ID and snapshot if transaction is streaming
> + * otherwise free the snapshot if we have copied it.
> + */
> + if (streaming)
> + {
> + txn->command_id = command_id;
> + txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
> +   txn, command_id);
> + }
> + else if (snapshot_now->copied)
> + ReorderBufferFreeSnap(rb, snapshot_now);
> ..
> }
>
Fixed
> > >
> > > 4. In ReorderBufferStreamTXN(), don't we need to set some of the txn
> > > fields like origin_id, origin_lsn as we do in ReorderBufferCommit()
> > > especially to cover the case when it gets called due to memory
> > > overflow (aka via ReorderBufferCheckMemoryLimit).
> > We get origin_lsn during commit time so I am not sure how can we do
> > that.  I have also noticed that currently, we are not using origin_lsn
> > on the subscriber side.  I think need more investigation that if we
> > want this then do we need to log it early.
> >
>
> Have you done any investigation of this point?  You might want to look
> at pg_replication_origin* APIs.  Today, again looking at this code, I
> think with current coding, it won't be used even when we encounter
> commit record.  Because ReorderBufferCommit calls
> ReorderBufferStreamCommit which will make sure that origin_id and
> origin_lsn is never sent.  I think at least that should be fixed, if
> not, probably, we need a comment with reasoning why we think it is
> okay not to do in this case.
Still, the problem is the same because, currently, we are sending
origin_lsn as part of the "pgoutput_begin" message.  Now, for a
streaming transaction we have already sent the stream start.  We might
send this during the stream commit instead, but I am not completely
sure, because currently the consumer of this message,
"apply_handle_origin", just ignores it.  I have also looked into the
pg_replication_origin* APIs: they are used for setting the origin id
and tracking the progress, but they will not consume the origin_lsn we
are sending in pgoutput_begin, so this is not directly related.

>
> + /*
> + * If we are streaming the in-progress transaction then Discard the
>
> /Discard/discard
Done
>
> > >
> > > v4-0006-Gracefully-handle-concurrent-aborts-of-uncommitte
> > > 1.
> > > + /*
> > > + * If CheckXidAlive is valid, then we check if it aborted. If it did, we
> > > + * error out
> > > + */
> > > + if (TransactionIdIsValid(CheckXidAlive) &&
> > > + !TransactionIdIsInProgress(CheckXidAlive) &&
> > > + !TransactionIdDidCommit(CheckXidAlive))
> > > + ereport(ERROR,
> > > + (errcode(ERRCODE_TRANSACTION_ROLLBACK),
> > > + errmsg("transaction aborted during system catalog scan")));
> > >
> > > Why here we can't use TransactionIdDidAbort?  If we can't use it, then
> > > can you add comments stating the reason of the same.
> > Done
>
> + * If CheckXidAlive is valid, then we check if it aborted. If it did, we
> + * error out.  Instead of directly checking the abort status we do check
> + * if it is not in progress transaction and no committed. Because if there
> + * were a system crash then status of the the transaction which were running
> + * at that time might not have marked.  So we need to consider them as
> + * aborted.  Refer detailed comments at snapmgr.c where the variable is
> + * declared.
>
>
> How about replacing the above comment with below one:
>
> If CheckXidAlive is valid, then we check if it aborted. If it did, we
> error out.  We can't directly use TransactionIdDidAbort as after crash
> such transaction might not have been marked as aborted.  See detailed
> comments at snapmgr.c where the variable is declared.
Done
>
> I am not able to understand the change in
> v8-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.  Do you have
> any explanation for the same?

It appears that in ReorderBufferCommitChild we always set the
final_lsn of the subxacts, so it should not be invalid.  For testing, I
changed this to an assert and checked, but it never hit.  So maybe we
can remove this change.

Apart from that, I have fixed the toast tuple streaming bug by setting
a flag bit in the WAL (attached as 0012).  I have also extended this
solution to handle the speculative insert bug, so the old patch for the
speculative-insert bug fix is removed.  I am also exploring how we can
do this without setting the flag in the WAL, as we discussed upthread.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Fri, Jan 31, 2020 at 8:08 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 30, 2020 at 6:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Jan 30, 2020 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Also, if we need to copy the snapshot here, then do we need to again
> > > copy it in ReorderBufferProcessTXN(in below code and in catch block in
> > > the same function).
> > I think so because as part of the
> > "REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT" change, we might directly
> > point to the snapshot and that will get truncated when we truncate all
> > the changes of the ReorderBufferTXN.   So I think we can check if
> > snapshot_now->copied is true then we can avoid copying otherwise we
> > can copy?
> >
>
> Yeah, that makes sense, but I think then we also need to ensure that
> ReorderBufferStreamTXN frees the snapshot only when it is copied.  It
> seems to me it should be always copied in the place where we are
> trying to free it, so probably we should have an Assert there.
>
> One more thing:
> ReorderBufferProcessTXN()
> {
> ..
> + if (streaming)
> + {
> + /*
> + * While streaming an in-progress transaction there is a
> + * possibility that the (sub)transaction might get aborted
> + * concurrently.  In such case if the (sub)transaction has
> + * catalog update then we might decode the tuple using wrong
> + * catalog version.  So for detecting the concurrent abort we
> + * set CheckXidAlive to the current (sub)transaction's xid for
> + * which this change belongs to.  And, during catalog scan we
> + * can check the status of the xid and if it is aborted we will
> + * report an specific error which we can ignore.  We might have
> + * already streamed some of the changes for the aborted
> + * (sub)transaction, but that is fine because when we decode the
> + * abort we will stream abort message to truncate the changes in
> + * the subscriber.
> + */
> + CheckXidAlive = change->txn->xid;
> + }
> ..
> }
>
> I think it is better to move the above code into an inline function
> (something like SetXidAlive).  It will make the code in function
> ReorderBufferProcessTXN look cleaner and easier to understand.
>
Fixed in the latest version sent upthread.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Feb 5, 2020 at 9:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 4, 2020 at 11:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > One more thing we can do is to identify whether the tuple belongs to
> > > toast relation while decoding it.  However, I think to do that we need
> > > to have access to relcache at that time and that might add some
> > > overhead as we need to do that for each tuple.  Can we investigate
> > > what it will take to do that and if it is better than setting a bit
> > > during WAL logging.
> > >
> > I have done some more analysis on this and it appears that there are
> > few problems in doing this.  Basically, once we get the confirmed
> > flush location, we advance the replication_slot_catalog_xmin so that
> > vacuum can garbage collect the old tuple.  So the problem is that
> > while we are collecting the changes in the ReorderBuffer our catalog
> > version might have removed,  and we might not find any relation entry
> > with that relfilenodeid (because it is dropped or altered in the
> > future).
> >
>
> Hmm, this means this can also occur while streaming the changes.  The
> main reason as I understand is that it is because before decoding
> commit, we don't know whether these changes are already sent to the
> subscriber (based on confirmed_flush_location/start_decoding_at).
Right.

>I think it is better to skip streaming such transactions as we can't
> make the right decision about these and as this can happen generally
> after the crash for the first few transactions, it shouldn't matter
> much if we serialize such transactions instead of streaming them.

The idea makes sense to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Fixed in the latest version sent upthread.
>

Okay, thanks.  I haven't looked at the latest version of the patch
series, as I was reviewing the previous version, and I think all of
these comments apply to parts of the patch that were not modified.
Here are my comments:

I think we don't need to maintain
v8-0007-Support-logical_decoding_work_mem-set-from-create as per
discussion in one of the above emails [1] as its usage is not clear.

v8-0008-Add-support-for-streaming-to-built-in-replication
1.
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
+      and <literal>streaming</literal>.

As per the discussion above [1], I don't think we need work_mem here.
You might want to remove the other usage from the patch as well.

2.
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool
*connect, bool *enabled_given,
     bool *slot_name_given, char **slot_name,
     bool *copy_data, char **synchronous_commit,
     bool *refresh, int *logical_wm,
-    bool *logical_wm_given)
+    bool *logical_wm_given, bool *streaming,
+    bool *streaming_given)

It is not clear to me why we need two parameters 'streaming' and
'streaming_given' in this API.  Can't we handle it similarly to the
'refresh' parameter?

3.
diff --git a/src/backend/replication/logical/launcher.c
b/src/backend/replication/logical/launcher.c
index aec885e..e80d00c 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,6 +14,8 @@
  *
  *-------------------------------------------------------------------------
  */
+#include <sys/types.h>
+#include <unistd.h>

 #include "postgres.h"

I see only the above change in launcher.c.  Why do we need to include
these headers if there is no other change (at least not in this patch)?

4.
stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
  /* Push callback + info on the error context stack */
  state.ctx = ctx;
  state.callback_name = "stream_start";
- /* state.report_location = apply_lsn; */
+ state.report_location = InvalidXLogRecPtr;
  errcallback.callback = output_plugin_error_callback;
  errcallback.arg = (void *) &state;
  errcallback.previous = error_context_stack;
@@ -1194,7 +1194,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn)
  /* Push callback + info on the error context stack */
  state.ctx = ctx;
  state.callback_name = "stream_stop";
- /* state.report_location = apply_lsn; */
+ state.report_location = InvalidXLogRecPtr;
  errcallback.callback = output_plugin_error_callback;
  errcallback.arg = (void *) &state;
  errcallback.previous = error_context_stack;

Don't we want to set txn->final_lsn as the report location, as we do
at a few other places?

5.
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+ Relation rel, HeapTuple oldtuple)
 {
+ pq_sendbyte(out, 'D'); /* action DELETE */
+
  Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
     rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
     rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);

- pq_sendbyte(out, 'D'); /* action DELETE */

Why does this patch need to change the above code?

6.
+void
+logicalrep_write_stream_start(StringInfo out,
+   TransactionId xid, bool first_segment)
+{
+ pq_sendbyte(out, 'S'); /* action STREAM START */
+
+ Assert(TransactionIdIsValid(xid));
+
+ /* transaction ID (we're starting to stream, so must be valid) */
+ pq_sendint32(out, xid);
+
+ /* 1 if this is the first streaming segment for this xid */
+ pq_sendint32(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+ TransactionId xid;
+
+ Assert(first_segment);
+
+ xid = pq_getmsgint(in, 4);
+ *first_segment = (pq_getmsgint(in, 4) == 1);
+
+ return xid;
+}

In these functions, pq_sendint32 is used to send a bool.  Can't we use
pq_sendbyte, similar to what we do in boolsend?
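
A sketch of how the write/read pair might look with a single byte for
the flag (an illustrative rewrite of the functions quoted above):

    void
    logicalrep_write_stream_start(StringInfo out,
                                  TransactionId xid, bool first_segment)
    {
        pq_sendbyte(out, 'S');  /* action STREAM START */

        Assert(TransactionIdIsValid(xid));

        /* transaction ID (we're starting to stream, so must be valid) */
        pq_sendint32(out, xid);

        /* 1 if this is the first streaming segment for this xid */
        pq_sendbyte(out, first_segment ? 1 : 0);
    }

    TransactionId
    logicalrep_read_stream_start(StringInfo in, bool *first_segment)
    {
        TransactionId xid;

        Assert(first_segment);

        xid = pq_getmsgint(in, 4);
        *first_segment = (pq_getmsgbyte(in) == 1);

        return xid;
    }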

7.
+void
+logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
+{
+ pq_sendbyte(out, 'E'); /* action STREAM END */
+
+ Assert(TransactionIdIsValid(xid));
+
+ /* transaction ID (we're starting to stream, so must be valid) */
+ pq_sendint32(out, xid);
+}

The comment says 'starting to stream', whereas this function is for
stopping the stream.

8.
+void
+logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
+{
+ pq_sendbyte(out, 'E'); /* action STREAM END */
+
+ Assert(TransactionIdIsValid(xid));
+
+ /* transaction ID (we're starting to stream, so must be valid) */
+ pq_sendint32(out, xid);
+}
+
+TransactionId
+logicalrep_read_stream_stop(StringInfo in)
+{
+ TransactionId xid;
+
+ xid = pq_getmsgint(in, 4);
+
+ return xid;
+}

Is there a reason to send the xid when stopping the stream?  I don't
see any use of the function logicalrep_read_stream_stop.

9.
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
{
..
+ pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
..
+ pgstat_report_wait_end();
..
}

I see the calls to pgstat_report_wait_start/pgstat_report_wait_end in
this function, so I am not sure the above comment makes sense.

10.
+ * The files are placed in /tmp by default, and the filenames include both
+ * the XID of the toplevel transaction and OID of the subscription.

Are we keeping the files in /tmp or in pg's temp tablespace directory?
Looking at the code below, it doesn't seem that we place them in /tmp.
If I am correct, can you update the comment?
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+ char tempdirpath[MAXPGPATH];
+
+ TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);

11.
+ * The change is serialied in a simple format, with length (not including
+ * the length), action code (identifying the message type) and message
+ * contents (without the subxact TransactionId value).
+ *
..
+ */
+static void
+stream_write_change(char action, StringInfo s)

The part of the comment which says "with length (not including the
length) .." is not clear to me.  What does "not including the length"
mean?
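
Presumably the framing is: a 4-byte length that covers the action byte
plus the payload, but not the length field itself.  A guess at the
layout (the write() calls are only illustrative; the patch presumably
uses BufFile routines):

    int  len = 1 + s->len;            /* action byte + message contents */

    write(fd, &len, sizeof(len));     /* length field, not counted in 'len' */
    write(fd, &action, 1);            /* action code, e.g. 'I', 'U', 'D' */
    write(fd, s->data, s->len);       /* message contents */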

12.
+ * TODO: Add missing_ok flag to specify in which cases it's OK not to
+ * find the files, and when it's an error.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid)

I think we can implement this TODO.  It is clear that when this
function is called from apply_handle_stream_commit, the file must
exist.  We can analyze the other callers of this API similarly.

13.
+apply_handle_stream_abort(StringInfo s)
{
..
+ /* FIXME optimize the search by bsearch on sorted data */
+ for (i = nsubxacts; i > 0; i--)
..

I am not sure how important this optimization is, so instead of FIXME
it is better to keep it as an XXX comment.  In the future, if we hit
any performance issue due to this, we can revisit our decision.

[1] - https://www.postgresql.org/message-id/CAA4eK1LH7xzF%2B-qHRv9EDXQTFYjPUYZw5B7FSK9QLEg7F603OQ%40mail.gmail.com


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Feb 5, 2020 at 9:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> >
> > I am not able to understand the change in
> > v8-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.  Do you have
> > any explanation for the same?
>
> It appears that in ReorderBufferCommitChild we are always setting the
> final_lsn of the subxacts so it should not be invalid.  For testing, I
> have changed this as an assert and checked but it never hit.  So maybe
> we can remove this change.
>

Tomas, do you remember anything about this change?  We are talking
about below change:

From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:14:45 +0200
Subject: [PATCH v8 11/13] BUGFIX: set final_lsn for subxacts before cleanup

---
 src/backend/replication/logical/reorderbuffer.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/replication/logical/reorderbuffer.c
b/src/backend/replication/logical/reorderbuffer.c
index fe4e57c..beb6cd2 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1327,6 +1327,10 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)

  subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);

+ /* make sure subtxn has final_lsn */
+ if (subtxn->final_lsn == InvalidXLogRecPtr)
+ subtxn->final_lsn = txn->final_lsn;
+

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Feb 5, 2020 at 4:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Fixed in the latest version sent upthread.
> >
>
> Okay, thanks.  I haven't looked at the latest version of patch series
> as I was reviewing the previous version and I think all of these
> comments are in the patch which is not modified.  Here are my
> comments:
>
> I think we don't need to maintain
> v8-0007-Support-logical_decoding_work_mem-set-from-create as per
> discussion in one of the above emails [1] as its usage is not clear.
>
> v8-0008-Add-support-for-streaming-to-built-in-replication
> 1.
> -      information.  The allowed options are <literal>slot_name</literal> and
> -      <literal>synchronous_commit</literal>
> +      information.  The allowed options are <literal>slot_name</literal>,
> +      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
> +      and <literal>streaming</literal>.
>
> As per the discussion above [1], I don't think we need work_mem here.
> You might want to remove the other usage from the patch as well.

After putting more thought into this, it appears that there could be
some use cases for setting the work_mem from the subscription.  Assume
a case where data are coming from two different origins, and based on
the origin ids different slots might collect different types of
changes.  So isn't it good to have a different work_mem for different
slots?  I am not saying that the current way of implementing it is the
best one, but we can improve it.  First, we need to decide whether we
have a use case for this or not.  Please let me know your thoughts on
the same.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Feb 7, 2020 at 4:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Feb 5, 2020 at 4:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > Fixed in the latest version sent upthread.
> > >
> >
> > Okay, thanks.  I haven't looked at the latest version of patch series
> > as I was reviewing the previous version and I think all of these
> > comments are in the patch which is not modified.  Here are my
> > comments:
> >
> > I think we don't need to maintain
> > v8-0007-Support-logical_decoding_work_mem-set-from-create as per
> > discussion in one of the above emails [1] as its usage is not clear.
> >
> > v8-0008-Add-support-for-streaming-to-built-in-replication
> > 1.
> > -      information.  The allowed options are <literal>slot_name</literal> and
> > -      <literal>synchronous_commit</literal>
> > +      information.  The allowed options are <literal>slot_name</literal>,
> > +      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
> > +      and <literal>streaming</literal>.
> >
> > As per the discussion above [1], I don't think we need work_mem here.
> > You might want to remove the other usage from the patch as well.
>
> After putting more thought on this it appears that there could be some
> use cases for setting the work_mem from the subscription,  Assume a
> case where data are coming from two different origins and based on the
> origin ids different slots might collect different type of changes,
> So isn't it good to have different work_mem for different slots?  I am
> not saying that the current way of implementing is the best one but
> that we can improve.  First, we need to decide whether we have a use
> case for this or not.
>

That is the whole point.  I don't see a very clear usage of this, and
nobody has explained clearly how it would be useful.  I am not saying
that what you are describing has no use, but as you said we might need
to invent an entirely new way even if we have such a use case.  I think
it is better to avoid requirements which are not essential for this
patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, Feb 10, 2020 at 1:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Feb 7, 2020 at 4:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Feb 5, 2020 at 4:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > Fixed in the latest version sent upthread.
> > > >
> > >
> > > Okay, thanks.  I haven't looked at the latest version of patch series
> > > as I was reviewing the previous version and I think all of these
> > > comments are in the patch which is not modified.  Here are my
> > > comments:
> > >
> > > I think we don't need to maintain
> > > v8-0007-Support-logical_decoding_work_mem-set-from-create as per
> > > discussion in one of the above emails [1] as its usage is not clear.
> > >
> > > v8-0008-Add-support-for-streaming-to-built-in-replication
> > > 1.
> > > -      information.  The allowed options are <literal>slot_name</literal> and
> > > -      <literal>synchronous_commit</literal>
> > > +      information.  The allowed options are <literal>slot_name</literal>,
> > > +      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
> > > +      and <literal>streaming</literal>.
> > >
> > > As per the discussion above [1], I don't think we need work_mem here.
> > > You might want to remove the other usage from the patch as well.
> >
> > After putting more thought on this it appears that there could be some
> > use cases for setting the work_mem from the subscription,  Assume a
> > case where data are coming from two different origins and based on the
> > origin ids different slots might collect different type of changes,
> > So isn't it good to have different work_mem for different slots?  I am
> > not saying that the current way of implementing is the best one but
> > that we can improve.  First, we need to decide whether we have a use
> > case for this or not.
> >
>
> That is the whole point.  I don't see a very clear usage of this and
> neither did anybody explained clearly how it will be useful.  I am not
> denying that what you are describing has no use, but as you said we
> might need to invent an entirely new way even if we have such a use.
> I think it is better to avoid the requirements which are not essential
> for this patch.

Ok, I will include this change in the next patch set.



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Feb 5, 2020 at 4:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I think we don't need to maintain
> v8-0007-Support-logical_decoding_work_mem-set-from-create as per
> discussion in one of the above emails [1] as its usage is not clear.

Done

> v8-0008-Add-support-for-streaming-to-built-in-replication
> 1.
> -      information.  The allowed options are <literal>slot_name</literal> and
> -      <literal>synchronous_commit</literal>
> +      information.  The allowed options are <literal>slot_name</literal>,
> +      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
> +      and <literal>streaming</literal>.
>
> As per the discussion above [1], I don't think we need work_mem here.
> You might want to remove the other usage from the patch as well.

Done

> 2.
> @@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool
> *connect, bool *enabled_given,
>      bool *slot_name_given, char **slot_name,
>      bool *copy_data, char **synchronous_commit,
>      bool *refresh, int *logical_wm,
> -    bool *logical_wm_given)
> +    bool *logical_wm_given, bool *streaming,
> +    bool *streaming_given)
>
> It is not clear to me why we need two parameters 'streaming' and
> 'streaming_given' in this API.  Can't we handle similar to parameter
> 'refresh'?

We need to update the streaming option in the system table, so if we
don't remember whether the user has given its value, how will we know
whether to update this column or not?  Or are you suggesting that we
should always mark this as updated?  IMHO that is not a good idea.
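
To illustrate why both values are needed, here is a minimal sketch of
the option-parsing pattern being described (the helper name and error
wording are assumptions for illustration; the real
parse_subscription_options() takes many more output parameters):

    #include "postgres.h"
    #include "commands/defrem.h"     /* defGetBoolean() */
    #include "nodes/parsenodes.h"    /* DefElem */
    #include "nodes/pg_list.h"       /* List, foreach */

    /*
     * Sketch: parse the "streaming" option, remembering both its value and
     * whether the user specified it at all, so that ALTER SUBSCRIPTION
     * knows whether the catalog column should be updated.
     */
    static void
    parse_streaming_option(List *options, bool *streaming, bool *streaming_given)
    {
        ListCell   *lc;

        *streaming = false;
        *streaming_given = false;

        foreach(lc, options)
        {
            DefElem    *defel = (DefElem *) lfirst(lc);

            if (strcmp(defel->defname, "streaming") == 0)
            {
                if (*streaming_given)
                    ereport(ERROR,
                            (errcode(ERRCODE_SYNTAX_ERROR),
                             errmsg("conflicting or redundant options")));

                *streaming_given = true;    /* the user did set it */
                *streaming = defGetBoolean(defel);
            }
        }
    }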

> 3.
> diff --git a/src/backend/replication/logical/launcher.c
> b/src/backend/replication/logical/launcher.c
> index aec885e..e80d00c 100644
> --- a/src/backend/replication/logical/launcher.c
> +++ b/src/backend/replication/logical/launcher.c
> @@ -14,6 +14,8 @@
>   *
>   *-------------------------------------------------------------------------
>   */
> +#include <sys/types.h>
> +#include <unistd.h>
>
>  #include "postgres.h"
>
> I see only the above change in launcher.c.  Why we need to include
> these if there is no other change (at least not in this patch).

Removed

> 4.
> stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
>   /* Push callback + info on the error context stack */
>   state.ctx = ctx;
>   state.callback_name = "stream_start";
> - /* state.report_location = apply_lsn; */
> + state.report_location = InvalidXLogRecPtr;
>   errcallback.callback = output_plugin_error_callback;
>   errcallback.arg = (void *) &state;
>   errcallback.previous = error_context_stack;
> @@ -1194,7 +1194,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache,
> ReorderBufferTXN *txn)
>   /* Push callback + info on the error context stack */
>   state.ctx = ctx;
>   state.callback_name = "stream_stop";
> - /* state.report_location = apply_lsn; */
> + state.report_location = InvalidXLogRecPtr;
>   errcallback.callback = output_plugin_error_callback;
>   errcallback.arg = (void *) &state;
>   errcallback.previous = error_context_stack;
>
> Don't we want to set txn->final_lsn in report location as we do at few
> other places?

Fixed

> 5.
> -logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
> +logicalrep_write_delete(StringInfo out, TransactionId xid,
> + Relation rel, HeapTuple oldtuple)
>  {
> + pq_sendbyte(out, 'D'); /* action DELETE */
> +
>   Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
>      rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
>      rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
>
> - pq_sendbyte(out, 'D'); /* action DELETE */
>
> Why this patch need to change the above code?

Fixed

> 6.
> +void
> +logicalrep_write_stream_start(StringInfo out,
> +   TransactionId xid, bool first_segment)
> +{
> + pq_sendbyte(out, 'S'); /* action STREAM START */
> +
> + Assert(TransactionIdIsValid(xid));
> +
> + /* transaction ID (we're starting to stream, so must be valid) */
> + pq_sendint32(out, xid);
> +
> + /* 1 if this is the first streaming segment for this xid */
> + pq_sendint32(out, first_segment ? 1 : 0);
> +}
> +
> +TransactionId
> +logicalrep_read_stream_start(StringInfo in, bool *first_segment)
> +{
> + TransactionId xid;
> +
> + Assert(first_segment);
> +
> + xid = pq_getmsgint(in, 4);
> + *first_segment = (pq_getmsgint(in, 4) == 1);
> +
> + return xid;
> +}
>
> In these functions for sending bool, pq_sendint32 is used.  Can't we
> use pq_sendbyte similar to what we do in boolsend?

Done
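
For reference, a sketch of what the adjusted functions might look like
with the flag sent as a single byte (details may differ from the actual
patch):

    #include "postgres.h"
    #include "access/transam.h"      /* TransactionIdIsValid() */
    #include "libpq/pqformat.h"

    void
    logicalrep_write_stream_start(StringInfo out,
                                  TransactionId xid, bool first_segment)
    {
        pq_sendbyte(out, 'S');      /* action STREAM START */

        Assert(TransactionIdIsValid(xid));

        /* transaction ID (we're starting to stream, so must be valid) */
        pq_sendint32(out, xid);

        /* 1 if this is the first streaming segment for this xid */
        pq_sendbyte(out, first_segment ? 1 : 0);
    }

    TransactionId
    logicalrep_read_stream_start(StringInfo in, bool *first_segment)
    {
        TransactionId xid;

        Assert(first_segment);

        xid = pq_getmsgint(in, 4);
        *first_segment = (pq_getmsgbyte(in) == 1);

        return xid;
    }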

> 7.
> +void
> +logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
> +{
> + pq_sendbyte(out, 'E'); /* action STREAM END */
> +
> + Assert(TransactionIdIsValid(xid));
> +
> + /* transaction ID (we're starting to stream, so must be valid) */
> + pq_sendint32(out, xid);
> +}
>
> In comments, 'starting to stream' is mentioned whereas this function
> is to stop it.

Fixed

> 8.
> +void
> +logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
> +{
> + pq_sendbyte(out, 'E'); /* action STREAM END */
> +
> + Assert(TransactionIdIsValid(xid));
> +
> + /* transaction ID (we're starting to stream, so must be valid) */
> + pq_sendint32(out, xid);
> +}
> +
> +TransactionId
> +logicalrep_read_stream_stop(StringInfo in)
> +{
> + TransactionId xid;
> +
> + xid = pq_getmsgint(in, 4);
> +
> + return xid;
> +}
>
> Is there a reason to send xid on stopping stream?  I don't see any use
> of function logicalrep_read_stream_stop.

Removed

> 9.
> + * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
> + */
> +static void
> +subxact_info_write(Oid subid, TransactionId xid)
> {
> ..
> + pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
> ..
> + pgstat_report_wait_end();
> ..
> }
>
> I see the calls to pgstat_report_wait_start/pgstat_report_wait_end in
> this function, so not sure if the above comment makes sense.

Fixed

> 10.
> + * The files are placed in /tmp by default, and the filenames include both
> + * the XID of the toplevel transaction and OID of the subscription.
>
> Are we keeping files in /tmp or pg's temp tablespace dir.  Seeing
> below code, it doesn't seem that we place them in /tmp.  If I am
> correct, then can you update the comment.
> +static void
> +subxact_filename(char *path, Oid subid, TransactionId xid)
> +{
> + char tempdirpath[MAXPGPATH];
> +
> + TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);

Done
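
For readers following along, the files end up under the database's
temporary-file directory rather than /tmp; roughly like the sketch below
(the exact file-name format here is an assumption for illustration, not
necessarily what the patch uses):

    #include "postgres.h"
    #include "catalog/pg_tablespace_d.h"   /* DEFAULTTABLESPACE_OID */
    #include "storage/fd.h"                /* TempTablespacePath() */

    static void
    subxact_filename(char *path, Oid subid, TransactionId xid)
    {
        char        tempdirpath[MAXPGPATH];

        /* for the default tablespace this yields "base/pgsql_tmp" */
        TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);

        /* illustrative name: subscription OID plus toplevel XID */
        snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
                 tempdirpath, subid, xid);
    }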

> 11.
> + * The change is serialied in a simple format, with length (not including
> + * the length), action code (identifying the message type) and message
> + * contents (without the subxact TransactionId value).
> + *
> ..
> + */
> +static void
> +stream_write_change(char action, StringInfo s)
>
> The part of the comment which says "with length (not including the
> length) .." is not clear to me.  What does "not including the length"
> mean?

Basically, it says that the 4 bytes used for storing the length of the
total data are not themselves included in that length.
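
In other words, the on-disk framing is <length><action><payload>, with
<length> covering the action byte and payload only. A rough sketch (the
file handle and the use of plain fwrite() are assumptions for
illustration; error handling is omitted):

    #include "postgres.h"
    #include "lib/stringinfo.h"

    static FILE *stream_fd;         /* file the streamed changes go to */

    static void
    stream_write_change(char action, StringInfo s)
    {
        int         len;

        /* length of action byte + payload, excluding the length word itself */
        len = sizeof(char) + (s->len - s->cursor);

        fwrite(&len, sizeof(len), 1, stream_fd);
        fwrite(&action, sizeof(action), 1, stream_fd);
        fwrite(s->data + s->cursor, s->len - s->cursor, 1, stream_fd);
    }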

> 12.
> + * TODO: Add missing_ok flag to specify in which cases it's OK not to
> + * find the files, and when it's an error.
> + */
> +static void
> +stream_cleanup_files(Oid subid, TransactionId xid)
>
> I think we can implement this TODO.  It is clear when this function is
> called from apply_handle_stream_commit, the file must exist.  We can
> similarly analyze other callers of this API.

Done

> 13.
> +apply_handle_stream_abort(StringInfo s)
> {
> ..
> + /* FIXME optimize the search by bsearch on sorted data */
> + for (i = nsubxacts; i > 0; i--)
> ..
>
> I am not sure how important this optimization is, so instead of FIXME,
> it is better to keep it as a XXX comment.  In the future, if we hit
> any performance issue due to this, we can revisit our decision.

Done

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Feb 13, 2020 at 8:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Feb 11, 2020 at 8:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> The patch set was not applying on the head so I have rebased it.

I have changed patch 0002 so that instead of logging WAL for each
invalidation, we now log the accumulated invalidations at each command
end, as discussed upthread [1].
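
For illustration, the shape of such a per-command-end invalidation
record might be roughly as below (the record type, struct, and caller
wiring are assumptions here, not necessarily the names used in the
patch):

    #include "postgres.h"
    #include "access/rmgr.h"         /* RM_XACT_ID */
    #include "access/xloginsert.h"
    #include "storage/sinval.h"      /* SharedInvalidationMessage */

    /* hypothetical WAL record header: a batch of invalidations */
    typedef struct xl_xact_invals
    {
        int         nmsgs;          /* number of messages that follow */
    } xl_xact_invals;

    #define XLOG_XACT_INVALIDATIONS 0x60    /* assumed info code */

    static void
    LogCommandEndInvalidations(SharedInvalidationMessage *msgs, int nmsgs)
    {
        xl_xact_invals xlrec;

        if (nmsgs == 0)
            return;                 /* nothing accumulated for this command */

        xlrec.nmsgs = nmsgs;

        XLogBeginInsert();
        XLogRegisterData((char *) &xlrec, sizeof(xl_xact_invals));
        XLogRegisterData((char *) msgs,
                         nmsgs * sizeof(SharedInvalidationMessage));
        (void) XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
    }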

We will evaluate the performance of this change soon and post the results.

[1] https://www.postgresql.org/message-id/CAA4eK1LOa%2B2KqNX%3Dm%3D1qMBDW%2Bo50AuwjAOX6ZqL-rWGiH1F9MQ%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
Hi,

I started looking at this patch series again, hoping to get it moving
for PG13. There's been a tremendous amount of work done since I last
worked on it, and a lot was discussed on this thread, so it'll take a
while to get familiar with the new code ...

The first thing I realized is that WAL-logging of assignments in v12
does both the "old" logging (using a dedicated message) and the "new"
way with the toplevel-XID embedded in the first message. Yes, the
patch was wrong,
because it eliminated all calls to ProcArrayApplyXidAssignment() and so
it was trivial to crash the replica due to KnownAssignedXids overflow.
But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
right fix.

I actually proposed doing this (having both ways to log assignments) so
that there's no regression risk with (wal_level < logical). But IIRC
Andres objected to it, arguing that we should not log the same piece
of information in two very different ways at the same time (IIRC it was
discussed at the FOSDEM dev meeting, so I don't have a link to share).
And I do agree with him ...

The question is, why couldn't the replica use the same assignment info
we already write for logical decoding? The main challenge is that now
the assignment can be sent in many different xlog messages, from a bunch
of resource managers (essentially, any xlog message with a xid can have
embedded XID of the toplevel xact). So the handling would either need to
happen in every rmgr, or we need to move it before we call the rmgr.

For example, we might do this in StartupXLOG() I think, per the
attached patch (FWIW this particular fix was written by Masahiko Sawada,
not me). This does the trick for me - I'm no longer able to reproduce
the KnownAssignedXids overflow.

The one difference is that we used to call ProcArrayApplyXidAssignment
for larger groups of XIDs, as sent in the assignment message. Now we
call it for each individual assignment. I don't know if this is an
issue, but I suppose we might introduce some sort of local caching
(accumulate the assignments into a local array, call the function only
when we have enough of them).
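
As a minimal sketch of such local caching (the array size and flush
points are assumptions; a real version would also need to flush at
suitable WAL boundaries):

    #include "postgres.h"
    #include "access/transam.h"      /* InvalidTransactionId */
    #include "storage/procarray.h"   /* ProcArrayApplyXidAssignment() */

    #define MAX_CACHED_ASSIGNMENTS  64      /* assumed batch size */

    static TransactionId cached_topxid = InvalidTransactionId;
    static TransactionId cached_subxids[MAX_CACHED_ASSIGNMENTS];
    static int  ncached_subxids = 0;

    static void
    FlushXidAssignments(void)
    {
        if (ncached_subxids > 0)
            ProcArrayApplyXidAssignment(cached_topxid,
                                        ncached_subxids, cached_subxids);
        ncached_subxids = 0;
    }

    static void
    CacheXidAssignment(TransactionId topxid, TransactionId subxid)
    {
        /* a different top-level xact, or a full batch, flushes the cache */
        if (cached_topxid != topxid ||
            ncached_subxids == MAX_CACHED_ASSIGNMENTS)
            FlushXidAssignments();

        cached_topxid = topxid;
        cached_subxids[ncached_subxids++] = subxid;
    }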

Aside from that, I think there's a minor bug in xact.c - the patch adds
a "assigned" field to TransactionStateData, but then it fails to add a
default value into TopTransactionStateData. We probably interpret NULL
as false, but then there's nothing for the pointer. I suspect it might
leave some random garbage there, leading to strange things later.

Another thing I noticed is LogicalDecodingProcessRecord() extracts the
toplevel XID using a macro

   txid = XLogRecGetTopXid(record);

but then it just starts accessing the fields directly again in the
ReorderBufferAssignChild call. I think we should do this instead:

     ReorderBufferAssignChild(ctx->reorder,
                              txid,
                 XLogRecGetXid(record),
                              buf.origptr);


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
D'oh! As usual I forgot to actually attach the patch I mentioned. So
here it is ...

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> Hi,
>
> I started looking at this patch series again, hoping to get it moving
> for PG13.

Nice.

 There's been a tremendous amount of work done since I last
> worked on it, and a lot was discussed on this thread, so it'll take a
> while to get familiar with the new code ...
>
> The first thing I realized that WAL-logging of assignments in v12 does
> both the "old" logging (using dedicated message) and "new" with
> toplevel-XID embedded in the first message. Yes, the patch was wrong,
> because it eliminated all calls to ProcArrayApplyXidAssignment() and so
> it was trivial to crash the replica due to KnownAssignedXids overflow.
> But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
> right fix.
>
> I actually proposed doing this (having both ways to log assignments) so
> that there's no regression risk with (wal_level < logical). But IIRC
> Andres objected to it, argumenting that we should not log the same piece
> of information in two very different ways at the same time (IIRC it was
> discussed on the FOSDEM dev meeting, so I don't have a link to share).
> And I do agree with him ...
>
> The question is, why couldn't the replica use the same assignment info
> we already write for logical decoding? The main challenge is that now
> the assignment can be sent in many different xlog messages, from a bunch
> of resource managers (essentially, any xlog message with a xid can have
> embedded XID of the toplevel xact). So the handling would either need to
> happen in every rmgr, or we need to move it before we call the rmgr.
>
> For exampple, we might do this e.g. in StartupXLOG() I think, per the
> attached patch (FWIW this particular fix was written by Masahiko Sawada,
> not me). This does the trick for me - I'm no longer able to reproduce
> the KnownAssignedXids overflow.
>
> The one difference is that we used to call ProcArrayApplyXidAssignment
> for larger groups of XIDs, as sent in the assignment message. Now we
> call it for each individual assignment. I don't know if this is an
> issue, but I suppose we might introduce some sort of local caching
> (accumulate the assignments into a local array, call the function only
> when we have enough of them).

Thanks for the pointers,  I will think over these points.

>
> Aside from that, I think there's a minor bug in xact.c - the patch adds
> a "assigned" field to TransactionStateData, but then it fails to add a
> default value into TopTransactionStateData. We probably interpret NULL
> as false, but then there's nothing for the pointer. I suspect it might
> leave some random garbage there, leading to strange things later.

Actually, we will never access that field for
TopTransactionStateData, right?
See the code below: we check IsSubTransaction() first, and only then
access the "assigned" field.

+bool
+IsSubTransactionAssignmentPending(void)
+{
+ if (!XLogLogicalInfoActive())
+ return false;
+
+ /* we need to be in a transaction state */
+ if (!IsTransactionState())
+ return false;
+
+ /* it has to be a subtransaction */
+ if (!IsSubTransaction())
+ return false;
+
+ /* the subtransaction has to have a XID assigned */
+ if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+ return false;
+
+ /* and it needs to have 'assigned' */
+ return !CurrentTransactionState->assigned;
+
+}

>
> Another thing I noticed is LogicalDecodingProcessRecord() extracts the
> toplevel XID using a macro
>
>    txid = XLogRecGetTopXid(record);
>
> but then it just starts accessing the fields directly again in the
> ReorderBufferAssignChild call. I think we should do this instead:
>
>      ReorderBufferAssignChild(ctx->reorder,
>                               txid,
>                              XLogRecGetXid(record),
>                               buf.origptr);

Makes sense.  I will change this in the patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> Hi,
>
> I started looking at this patch series again, hoping to get it moving
> for PG13.
>

It is good to keep moving this forward, but there are quite a few
problems with the design which need a broader discussion.  Some that
I recall are:
a. Handling of abort of concurrent transactions.  There is some code
in the patch which might work, but there is not much discussion when
it was posted.
b. Handling of partial tuples (while streaming, we came to know that
toast tuple is not complete or speculative insert is incomplete).  For
this also, we have proposed a few solutions which need further
discussion.  One of those is implemented in the patch series.
c. We might also need some handling for replication origins.
d. Try to minimize the performance overhead of WAL logging for
invalidations.  We discussed different solutions for this and
implemented one of those.
e. How to skip already streamed transactions.

There might be a few more which I can't recall now.  Apart from this,
I haven't done any detailed review of subscriber-side implementation
where we write streamed transactions to file.  All of this will need
much more discussion and review before we can say it is ready to
commit, so I thought it might be better to pick it up for PG14 and
focus on other things that have a better chance for PG13 especially
because all the problems were not solved/discussed before last CF.
However, it is a good idea to keep moving this and have a discussion
on some of these issues.

> There's been a tremendous amount of work done since I last
> worked on it, and a lot was discussed on this thread, so it'll take a
> while to get familiar with the new code ...
>
> The first thing I realized that WAL-logging of assignments in v12 does
> both the "old" logging (using dedicated message) and "new" with
> toplevel-XID embedded in the first message. Yes, the patch was wrong,
> because it eliminated all calls to ProcArrayApplyXidAssignment() and so
> it was trivial to crash the replica due to KnownAssignedXids overflow.
> But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
> right fix.
>
> I actually proposed doing this (having both ways to log assignments) so
> that there's no regression risk with (wal_level < logical). But IIRC
> Andres objected to it, argumenting that we should not log the same piece
> of information in two very different ways at the same time (IIRC it was
> discussed on the FOSDEM dev meeting, so I don't have a link to share).
> And I do agree with him ...
>

So, aren't we worried about the overhead of the amount of WAL and
performance impact for the transactions?  We might want to check the
pgbench read-write test to see if that will add any significant
overhead.

> The question is, why couldn't the replica use the same assignment info
> we already write for logical decoding?
>

I haven't thought about it in detail, but we can think on those lines
if the performance overhead is in the acceptable range.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Mar 4, 2020 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> > The first thing I realized that WAL-logging of assignments in v12 does
> > both the "old" logging (using dedicated message) and "new" with
> > toplevel-XID embedded in the first message. Yes, the patch was wrong,
> > because it eliminated all calls to ProcArrayApplyXidAssignment() and so
> > it was trivial to crash the replica due to KnownAssignedXids overflow.
> > But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
> > right fix.
> >
> > I actually proposed doing this (having both ways to log assignments) so
> > that there's no regression risk with (wal_level < logical). But IIRC
> > Andres objected to it, argumenting that we should not log the same piece
> > of information in two very different ways at the same time (IIRC it was
> > discussed on the FOSDEM dev meeting, so I don't have a link to share).
> > And I do agree with him ...
> >
>
> So, aren't we worried about the overhead of the amount of WAL and
> performance impact for the transactions?  We might want to check the
> pgbench read-write test to see if that will add any significant
> overhead.
>

I have briefly looked at the original patch and it seems the
additional overhead occurs only when subtransactions are involved, so
ideally it shouldn't impact default pgbench, but there is no harm in
checking.  It might be that we need to build a custom script with
subtransactions involved to measure the impact, but I think it is
worth doing.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Mar 4, 2020 at 2:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Mar 4, 2020 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
> > <tomas.vondra@2ndquadrant.com> wrote:
> > >
> > > The first thing I realized that WAL-logging of assignments in v12 does
> > > both the "old" logging (using dedicated message) and "new" with
> > > toplevel-XID embedded in the first message. Yes, the patch was wrong,
> > > because it eliminated all calls to ProcArrayApplyXidAssignment() and so
> > > it was trivial to crash the replica due to KnownAssignedXids overflow.
> > > But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
> > > right fix.
> > >
> > > I actually proposed doing this (having both ways to log assignments) so
> > > that there's no regression risk with (wal_level < logical). But IIRC
> > > Andres objected to it, argumenting that we should not log the same piece
> > > of information in two very different ways at the same time (IIRC it was
> > > discussed on the FOSDEM dev meeting, so I don't have a link to share).
> > > And I do agree with him ...
> > >
> >
> > So, aren't we worried about the overhead of the amount of WAL and
> > performance impact for the transactions?  We might want to check the
> > pgbench read-write test to see if that will add any significant
> > overhead.
> >
>
> I have briefly looked at the original patch and it seems the
> additional overhead is only when subtransactions are involved, so
> ideally, it shouldn't impact default pgbench, but there is no harm in
> checking.  It might be that we need to build a custom script with
> subtransactions involved to measure the impact, but I think it is
> worth checking

I agree.  I will test the same and post the results.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Wed, Mar 04, 2020 at 10:28:32AM +0530, Amit Kapila wrote:
>On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> Hi,
>>
>> I started looking at this patch series again, hoping to get it moving
>> for PG13.
>>
>
>It is good to keep moving this forward, but there are quite a few
>problems with the design which need a broader discussion.  Some of
>what I recall are:
>a. Handling of abort of concurrent transactions.  There is some code
>in the patch which might work, but there is not much discussion when
>it was posted.
>b. Handling of partial tuples (while streaming, we came to know that
>toast tuple is not complete or speculative insert is incomplete).  For
>this also, we have proposed a few solutions which need further
>discussion.  One of those is implemented in the patch series.
>c. We might also need some handling for replication origins.
>d. Try to minimize the performance overhead of WAL logging for
>invalidations.  We discussed different solutions for this and
>implemented one of those.
>e. How to skip already streamed transactions.
>
>There might be a few more which I can't recall now.  Apart from this,
>I haven't done any detailed review of subscriber-side implementation
>where we write streamed transactions to file.  All of this will need
>much more discussion and review before we can say it is ready to
>commit, so I thought it might be better to pick it up for PG14 and
>focus on other things that have a better chance for PG13 especially
>because all the problems were not solved/discussed before last CF.
>However, it is a good idea to keep moving this and have a discussion
>on some of these issues.
>

Sure, there's a lot to discuss. And it's possible (likely) it's not
feasible to get this into PG13. But I think it's still worth discussing
it, instead of just punting it into the next CF right away.

>> There's been a tremendous amount of work done since I last
>> worked on it, and a lot was discussed on this thread, so it'll take a
>> while to get familiar with the new code ...
>>
>> The first thing I realized that WAL-logging of assignments in v12 does
>> both the "old" logging (using dedicated message) and "new" with
>> toplevel-XID embedded in the first message. Yes, the patch was wrong,
>> because it eliminated all calls to ProcArrayApplyXidAssignment() and so
>> it was trivial to crash the replica due to KnownAssignedXids overflow.
>> But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
>> right fix.
>>
>> I actually proposed doing this (having both ways to log assignments) so
>> that there's no regression risk with (wal_level < logical). But IIRC
>> Andres objected to it, argumenting that we should not log the same piece
>> of information in two very different ways at the same time (IIRC it was
>> discussed on the FOSDEM dev meeting, so I don't have a link to share).
>> And I do agree with him ...
>>
>
>So, aren't we worried about the overhead of the amount of WAL and
>performance impact for the transactions?  We might want to check the
>pgbench read-write test to see if that will add any significant
>overhead.
>

Well, sure. I agree we need to see how this affects performance, and
I'll do some benchmarks (I think I did that when submitting the patch,
but I don't recall the numbers / details).

Isn't it a bit strange to log stuff twice, though, if we worry about
performance? Surely that's more expensive than logging it just once. Of
course, it might be useful if most systems need just the "old" way.

I know it's going to be a bit hand-wavy, but I think embedding the
assignments into existing WAL messages is about the cheapest way to log
this. I would not expect this to be measurably more expensive than what
we have now, but I might be wrong.

>> The question is, why couldn't the replica use the same assignment info
>> we already write for logical decoding?
>>
>
>I haven't thought about it in detail, but we can think on those lines
>if the performance overhead is in the acceptable range.
>

OK, let me do some measurements ...


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Wed, Mar 04, 2020 at 09:13:49AM +0530, Dilip Kumar wrote:
>On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> Hi,
>>
>> I started looking at this patch series again, hoping to get it moving
>> for PG13.
>
>Nice.
>
> There's been a tremendous amount of work done since I last
>> worked on it, and a lot was discussed on this thread, so it'll take a
>> while to get familiar with the new code ...
>>
>> The first thing I realized that WAL-logging of assignments in v12 does
>> both the "old" logging (using dedicated message) and "new" with
>> toplevel-XID embedded in the first message. Yes, the patch was wrong,
>> because it eliminated all calls to ProcArrayApplyXidAssignment() and so
>> it was trivial to crash the replica due to KnownAssignedXids overflow.
>> But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
>> right fix.
>>
>> I actually proposed doing this (having both ways to log assignments) so
>> that there's no regression risk with (wal_level < logical). But IIRC
>> Andres objected to it, argumenting that we should not log the same piece
>> of information in two very different ways at the same time (IIRC it was
>> discussed on the FOSDEM dev meeting, so I don't have a link to share).
>> And I do agree with him ...
>>
>> The question is, why couldn't the replica use the same assignment info
>> we already write for logical decoding? The main challenge is that now
>> the assignment can be sent in many different xlog messages, from a bunch
>> of resource managers (essentially, any xlog message with a xid can have
>> embedded XID of the toplevel xact). So the handling would either need to
>> happen in every rmgr, or we need to move it before we call the rmgr.
>>
>> For exampple, we might do this e.g. in StartupXLOG() I think, per the
>> attached patch (FWIW this particular fix was written by Masahiko Sawada,
>> not me). This does the trick for me - I'm no longer able to reproduce
>> the KnownAssignedXids overflow.
>>
>> The one difference is that we used to call ProcArrayApplyXidAssignment
>> for larger groups of XIDs, as sent in the assignment message. Now we
>> call it for each individual assignment. I don't know if this is an
>> issue, but I suppose we might introduce some sort of local caching
>> (accumulate the assignments into a local array, call the function only
>> when we have enough of them).
>
>Thanks for the pointers,  I will think over these points.
>
>>
>> Aside from that, I think there's a minor bug in xact.c - the patch adds
>> a "assigned" field to TransactionStateData, but then it fails to add a
>> default value into TopTransactionStateData. We probably interpret NULL
>> as false, but then there's nothing for the pointer. I suspect it might
>> leave some random garbage there, leading to strange things later.
>
>Actually, we will never access that field for the
>TopTransactionStateData, right?
>See below code,  we have a check that only if IsSubTransaction(), then
>we access the "assigned" filed.
>
>+bool
>+IsSubTransactionAssignmentPending(void)
>+{
>+ if (!XLogLogicalInfoActive())
>+ return false;
>+
>+ /* we need to be in a transaction state */
>+ if (!IsTransactionState())
>+ return false;
>+
>+ /* it has to be a subtransaction */
>+ if (!IsSubTransaction())
>+ return false;
>+
>+ /* the subtransaction has to have a XID assigned */
>+ if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
>+ return false;
>+
>+ /* and it needs to have 'assigned' */
>+ return !CurrentTransactionState->assigned;
>+
>+}
>

The problem is not with the "assigned" field, really. AFAICS we probably
initialize it to false because we interpret NULL as false. My concern
was that we essentially leave the last pointer uninitialized. That
seems like a bug, though I'm not sure it breaks anything in practice.

>>
>> Another thing I noticed is LogicalDecodingProcessRecord() extracts the
>> toplevel XID using a macro
>>
>>    txid = XLogRecGetTopXid(record);
>>
>> but then it just starts accessing the fields directly again in the
>> ReorderBufferAssignChild call. I think we should do this instead:
>>
>>      ReorderBufferAssignChild(ctx->reorder,
>>                               txid,
>>                              XLogRecGetXid(record),
>>                               buf.origptr);
>
>Make sense.  I will change this in the patch.
>

+1, thanks


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Mar 5, 2020 at 11:20 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Wed, Mar 04, 2020 at 10:28:32AM +0530, Amit Kapila wrote:
> >
>
> Sure, there's a lot to discuss. And it's possible (likely) it's not
> feasible to get this into PG13. But I think it's still worth discussing
> it, instead of just punting it into the next CF right away.
>

That makes sense to me.

> >> There's been a tremendous amount of work done since I last
> >> worked on it, and a lot was discussed on this thread, so it'll take a
> >> while to get familiar with the new code ...
> >>
> >> The first thing I realized that WAL-logging of assignments in v12 does
> >> both the "old" logging (using dedicated message) and "new" with
> >> toplevel-XID embedded in the first message. Yes, the patch was wrong,
> >> because it eliminated all calls to ProcArrayApplyXidAssignment() and so
> >> it was trivial to crash the replica due to KnownAssignedXids overflow.
> >> But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
> >> right fix.
> >>
> >> I actually proposed doing this (having both ways to log assignments) so
> >> that there's no regression risk with (wal_level < logical). But IIRC
> >> Andres objected to it, argumenting that we should not log the same piece
> >> of information in two very different ways at the same time (IIRC it was
> >> discussed on the FOSDEM dev meeting, so I don't have a link to share).
> >> And I do agree with him ...
> >>
> >
> >So, aren't we worried about the overhead of the amount of WAL and
> >performance impact for the transactions?  We might want to check the
> >pgbench read-write test to see if that will add any significant
> >overhead.
> >
>
> Well, sure. I agree we need to see how this affects performance, and
> I'll do some benchmarks (I think I did that when submitting the patch,
> but I don't recall the numbers / details).
>
> Isn't it a bit strange to log stuff twice, though, if we worry about
> performance? Surely that's more expensive than logging it just once. Of
> course, it might be useful if most systems need just the "old" way.
>
> I know it's going to be a bit hand-wavy, but I think embedding the
> assignments into existing WAL messages is about the cheapest way to log
> this. I would not expect this to be mesurably more expensive than what
> we have now, but I might be wrong.
>

I agree that this shouldn't be very expensive, but it is better to be
sure in that regard.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Mar 4, 2020 at 9:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> >
> > The first thing I realized that WAL-logging of assignments in v12 does
> > both the "old" logging (using dedicated message) and "new" with
> > toplevel-XID embedded in the first message. Yes, the patch was wrong,
> > because it eliminated all calls to ProcArrayApplyXidAssignment() and so
> > it was trivial to crash the replica due to KnownAssignedXids overflow.
> > But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
> > right fix.
> >
> > I actually proposed doing this (having both ways to log assignments) so
> > that there's no regression risk with (wal_level < logical). But IIRC
> > Andres objected to it, argumenting that we should not log the same piece
> > of information in two very different ways at the same time (IIRC it was
> > discussed on the FOSDEM dev meeting, so I don't have a link to share).
> > And I do agree with him ...
> >
> > The question is, why couldn't the replica use the same assignment info
> > we already write for logical decoding? The main challenge is that now
> > the assignment can be sent in many different xlog messages, from a bunch
> > of resource managers (essentially, any xlog message with a xid can have
> > embedded XID of the toplevel xact). So the handling would either need to
> > happen in every rmgr, or we need to move it before we call the rmgr.
> >
> > For exampple, we might do this e.g. in StartupXLOG() I think, per the
> > attached patch (FWIW this particular fix was written by Masahiko Sawada,
> > not me). This does the trick for me - I'm no longer able to reproduce
> > the KnownAssignedXids overflow.
> >
> > The one difference is that we used to call ProcArrayApplyXidAssignment
> > for larger groups of XIDs, as sent in the assignment message. Now we
> > call it for each individual assignment. I don't know if this is an
> > issue, but I suppose we might introduce some sort of local caching
> > (accumulate the assignments into a local array, call the function only
> > when we have enough of them).
>
> Thanks for the pointers,  I will think over these points.
>

I have looked at the proposed solution and I would like to share my
findings.  I think calling ProcArrayApplyXidAssignment for each
subtransaction is not a good idea for a couple of reasons:
(a) It will just defeat the purpose of maintaining the KnownAssignedXids
array, which is to avoid looking at pg_subtrans in
TransactionIdIsInProgress() on the standby.  Basically, if we remove it
for each subXid, it will consider KnownAssignedXids to be
overflowed and check pg_subtrans frequently.
(b) Calling ProcArrayApplyXidAssignment() for each subtransaction can
be costly from the perspective of concurrency because it acquires
ProcArrayLock in Exclusive mode, so concurrently running transactions
might start blocking on this lock.  Also, I see that
SubTransSetParent() makes the page dirty, so it might lead to more
writes if we spread out setting it by calling it separately for each
sub-transaction.

Apart from this, I don't see how the proposed fix is correct, because
as far as I can see it tries to remove the Xid before we even record
it via RecordKnownAssignedTransactionIds().  It seems that after the
patch, RecordKnownAssignedTransactionIds() will be called after
ProcArrayApplyXidAssignment(); how could that be correct?

Thoughts?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Sat, Mar 28, 2020 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Mar 4, 2020 at 9:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
> > <tomas.vondra@2ndquadrant.com> wrote:
> > >
> > >
> > > The first thing I realized that WAL-logging of assignments in v12 does
> > > both the "old" logging (using dedicated message) and "new" with
> > > toplevel-XID embedded in the first message. Yes, the patch was wrong,
> > > because it eliminated all calls to ProcArrayApplyXidAssignment() and so
> > > it was trivial to crash the replica due to KnownAssignedXids overflow.
> > > But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
> > > right fix.
> > >
> > > I actually proposed doing this (having both ways to log assignments) so
> > > that there's no regression risk with (wal_level < logical). But IIRC
> > > Andres objected to it, argumenting that we should not log the same piece
> > > of information in two very different ways at the same time (IIRC it was
> > > discussed on the FOSDEM dev meeting, so I don't have a link to share).
> > > And I do agree with him ...
> > >
> > > The question is, why couldn't the replica use the same assignment info
> > > we already write for logical decoding? The main challenge is that now
> > > the assignment can be sent in many different xlog messages, from a bunch
> > > of resource managers (essentially, any xlog message with a xid can have
> > > embedded XID of the toplevel xact). So the handling would either need to
> > > happen in every rmgr, or we need to move it before we call the rmgr.
> > >
> > > For exampple, we might do this e.g. in StartupXLOG() I think, per the
> > > attached patch (FWIW this particular fix was written by Masahiko Sawada,
> > > not me). This does the trick for me - I'm no longer able to reproduce
> > > the KnownAssignedXids overflow.
> > >
> > > The one difference is that we used to call ProcArrayApplyXidAssignment
> > > for larger groups of XIDs, as sent in the assignment message. Now we
> > > call it for each individual assignment. I don't know if this is an
> > > issue, but I suppose we might introduce some sort of local caching
> > > (accumulate the assignments into a local array, call the function only
> > > when we have enough of them).
> >
> > Thanks for the pointers,  I will think over these points.
> >
>
> I have looked at the solution proposed and I would like to share my
> findings.  I think calling ProcArrayApplyXidAssignment for each
> subtransaction is not a good idea for a couple of reasons:
> (a) It will just beat the purpose of maintaining KnowAssignedXids
> array which is to avoid looking at pg_subtrans in
> TransactionIdIsInProgress() on standby.  Basically, if we remove it
> for each subXid, it will consider the KnowAssignedXids to be
> overflowed and check pg_subtrans frequently.

Right, I also think this is a problem with this solution.  I think we
may try to avoid this by caching the information.  But then we will
have to maintain this in some multi-dimensional array which stores
sub-transaction ids per top transaction, or we can maintain a list of
sub-transactions for each transaction.  I haven't thought about how
much complexity this solution will add.

> (b)  Calling ProcArrayApplyXidAssignment() for each subtransaction can
> be costly from the perspective of concurrency because it acquires
> ProcArrayLock in Exclusive mode, so concurrently running transactions
> might start blocking at this lock.

Right

 Also, I see that
> SubTransSetParent() makes the page dirty, so it might lead to more
> writes if we spread out setting that by calling it separately for each
> sub-transaction.

Right.

>
> Apart from this, I don't see how the proposed fix is correct because
> as far as I can see it tries to remove the Xid before we even record
> it via RecordKnownAssignedTransactionIds().  It seems after patch
> RecordKnownAssignedTransactionIds() will be called after
> ProcArrayApplyXidAssignment(), how could that be correct.

Valid point.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Sat, Mar 28, 2020 at 2:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Mar 28, 2020 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > I have looked at the solution proposed and I would like to share my
> > findings.  I think calling ProcArrayApplyXidAssignment for each
> > subtransaction is not a good idea for a couple of reasons:
> > (a) It will just beat the purpose of maintaining KnowAssignedXids
> > array which is to avoid looking at pg_subtrans in
> > TransactionIdIsInProgress() on standby.  Basically, if we remove it
> > for each subXid, it will consider the KnowAssignedXids to be
> > overflowed and check pg_subtrans frequently.
>
> Right, I also think this is a problem with this solution.  I think we
> may try to avoid this by caching this information.  But, then we will
> have to maintain this in some dimensional array which stores
> sub-transaction ids per top transaction or we can maintain a list of
> sub-transaction for each transaction.  I haven't thought about how
> much complexity this solution will add.
>

How about if, instead of writing an XLOG_XACT_ASSIGNMENT WAL record, we
set a flag in TransactionStateData and then log that as special
information whenever we write the next WAL record for a new
subtransaction?  Then during recovery, we call
ProcArrayApplyXidAssignment only when we find that special flag set in
a WAL record.  One idea could be to use a flag bit in
XLogRecord.xl_info.  If that is feasible, then the solution can work as
it is now, without any overhead or change in the way we maintain
KnownAssignedXids.
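
Roughly, the idea might look like the fragment below (the bit value, its
availability in xl_info, and the surrounding recovery code are
assumptions for illustration):

    /* hypothetical spare generic bit in XLogRecord.xl_info */
    #define XLR_SUBXACT_ASSIGNMENT  0x04

    /* in the recovery loop, before handing the record to the rmgr: */
    if (XLogRecGetInfo(xlogreader) & XLR_SUBXACT_ASSIGNMENT)
    {
        TransactionId subxid = XLogRecGetXid(xlogreader);
        TransactionId topxid = XLogRecGetTopXid(xlogreader);  /* from the patch */

        /* fold this subxact assignment into the procarray */
        ProcArrayApplyXidAssignment(topxid, 1, &subxid);
    }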

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Sat, Mar 28, 2020 at 03:29:34PM +0530, Amit Kapila wrote:
>On Sat, Mar 28, 2020 at 2:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> On Sat, Mar 28, 2020 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >
>> >
>> > I have looked at the solution proposed and I would like to share my
>> > findings.  I think calling ProcArrayApplyXidAssignment for each
>> > subtransaction is not a good idea for a couple of reasons:
>> > (a) It will just beat the purpose of maintaining KnowAssignedXids
>> > array which is to avoid looking at pg_subtrans in
>> > TransactionIdIsInProgress() on standby.  Basically, if we remove it
>> > for each subXid, it will consider the KnowAssignedXids to be
>> > overflowed and check pg_subtrans frequently.
>>
>> Right, I also think this is a problem with this solution.  I think we
>> may try to avoid this by caching this information.  But, then we will
>> have to maintain this in some dimensional array which stores
>> sub-transaction ids per top transaction or we can maintain a list of
>> sub-transaction for each transaction.  I haven't thought about how
>> much complexity this solution will add.
>>
>
>How about if instead of writing an XLOG_XACT_ASSIGNMENT WAL, we set a
>flag in TransactionStateData and then log that as special information
>whenever we write next WAL record for a new subtransaction?  Then
>during recovery, we can only call ProcArrayApplyXidAssignment when we
>find that special flag is set in a WAL record.  One idea could be to
>use a flag bit in XLogRecord.xl_info.  If that is feasible then the
>solution can work as it is now, without any overhead or change in the
>way we maintain KnownAssignedXids.
>

Ummm, how is that different from what the patch is doing now? I mean, we
only write the top-level XID for the first WAL record in each subxact,
right? Or what would be the difference with your approach?

Anyway, I think you're right that the ProcArrayApplyXidAssignment call
was done too early, but I think that can be fixed by moving it to after
the RecordKnownAssignedTransactionIds call, no? Essentially, right
before rm_redo().

You're right that calling ProcArrayApplyXidAssignment() may be an issue,
because it exclusively acquires the ProcArrayLock. I actually hinted
that might be an issue in my original message, suggesting we might add a
local cache of assigned XIDs (a small static array, doing essentially
the same thing we used to do on the upstream node). I haven't done that
in my WIP patch to keep it simple, but AFAICS it'd work.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Sun, Mar 29, 2020 at 6:29 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Sat, Mar 28, 2020 at 03:29:34PM +0530, Amit Kapila wrote:
> >On Sat, Mar 28, 2020 at 2:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> >How about if instead of writing an XLOG_XACT_ASSIGNMENT WAL, we set a
> >flag in TransactionStateData and then log that as special information
> >whenever we write next WAL record for a new subtransaction?  Then
> >during recovery, we can only call ProcArrayApplyXidAssignment when we
> >find that special flag is set in a WAL record.  One idea could be to
> >use a flag bit in XLogRecord.xl_info.  If that is feasible then the
> >solution can work as it is now, without any overhead or change in the
> >way we maintain KnownAssignedXids.
> >
>
> Ummm, how is that different from what the patch is doing now? I mean, we
> only write the top-level XID for the first WAL record in each subxact,
> right? Or what would be the difference with your approach?
>

We have to do what the patch is currently doing, and additionally we
will set this flag after PGPROC_MAX_CACHED_SUBXIDS subxacts, which
would allow us to call ProcArrayApplyXidAssignment during WAL replay
only after PGPROC_MAX_CACHED_SUBXIDS subxacts.  It will help us clear
the KnownAssignedXids at the same time as we do now, so there is no
additional performance overhead.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Sun, Mar 29, 2020 at 11:19:21AM +0530, Amit Kapila wrote:
>On Sun, Mar 29, 2020 at 6:29 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Sat, Mar 28, 2020 at 03:29:34PM +0530, Amit Kapila wrote:
>> >On Sat, Mar 28, 2020 at 2:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> >
>> >How about if instead of writing an XLOG_XACT_ASSIGNMENT WAL, we set a
>> >flag in TransactionStateData and then log that as special information
>> >whenever we write next WAL record for a new subtransaction?  Then
>> >during recovery, we can only call ProcArrayApplyXidAssignment when we
>> >find that special flag is set in a WAL record.  One idea could be to
>> >use a flag bit in XLogRecord.xl_info.  If that is feasible then the
>> >solution can work as it is now, without any overhead or change in the
>> >way we maintain KnownAssignedXids.
>> >
>>
>> Ummm, how is that different from what the patch is doing now? I mean, we
>> only write the top-level XID for the first WAL record in each subxact,
>> right? Or what would be the difference with your approach?
>>
>
>We have to do what the patch is currently doing and additionally, we
>will set this flag after PGPROC_MAX_CACHED_SUBXIDS which would allow
>us to call ProcArrayApplyXidAssignment during WAL replay only after
>PGPROC_MAX_CACHED_SUBXIDS number of subxacts.  It will help us in
>clearing the KnownAssignedXids at the same time as we do now, so no
>additional performance overhead.
>

Hmmm. So we'd still log assignment twice? Or would we keep just the
immediate assignments (embedded into xlog records), and cache the
subxids on the replica somehow?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Sun, Mar 29, 2020 at 9:01 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Sun, Mar 29, 2020 at 11:19:21AM +0530, Amit Kapila wrote:
> >On Sun, Mar 29, 2020 at 6:29 AM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> Ummm, how is that different from what the patch is doing now? I mean, we
> >> only write the top-level XID for the first WAL record in each subxact,
> >> right? Or what would be the difference with your approach?
> >>
> >
> >We have to do what the patch is currently doing and additionally, we
> >will set this flag after PGPROC_MAX_CACHED_SUBXIDS which would allow
> >us to call ProcArrayApplyXidAssignment during WAL replay only after
> >PGPROC_MAX_CACHED_SUBXIDS number of subxacts.  It will help us in
> >clearing the KnownAssignedXids at the same time as we do now, so no
> >additional performance overhead.
> >
>
> Hmmm. So we'd still log assignment twice? Or would we keep just the
> immediate assignments (embedded into xlog records), and cache the
> subxids on the replica somehow?
>

I think we need to cache the subxids on the replica somehow, but I
don't have a very good idea for it.  Basically, there are two ways to
do it: (a) change KnownAssignedXids in some way so that we can
easily find this information without losing its current benefits.
I can't think of a good way to do that, and even if we come up
with something, it could easily be a lot of work; (b) cache the
subxids for a particular transaction in local memory along with
KnownAssignedXids.  This is doable, but then we have two data
structures (one in shared memory and the other in local memory)
managing the same information in different ways.

Do you have any other ideas?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Mon, Mar 30, 2020 at 11:47:57AM +0530, Amit Kapila wrote:
>On Sun, Mar 29, 2020 at 9:01 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Sun, Mar 29, 2020 at 11:19:21AM +0530, Amit Kapila wrote:
>> >On Sun, Mar 29, 2020 at 6:29 AM Tomas Vondra
>> ><tomas.vondra@2ndquadrant.com> wrote:
>> >>
>> >> Ummm, how is that different from what the patch is doing now? I mean, we
>> >> only write the top-level XID for the first WAL record in each subxact,
>> >> right? Or what would be the difference with your approach?
>> >>
>> >
>> >We have to do what the patch is currently doing and additionally, we
>> >will set this flag after PGPROC_MAX_CACHED_SUBXIDS which would allow
>> >us to call ProcArrayApplyXidAssignment during WAL replay only after
>> >PGPROC_MAX_CACHED_SUBXIDS number of subxacts.  It will help us in
>> >clearing the KnownAssignedXids at the same time as we do now, so no
>> >additional performance overhead.
>> >
>>
>> Hmmm. So we'd still log assignment twice? Or would we keep just the
>> immediate assignments (embedded into xlog records), and cache the
>> subxids on the replica somehow?
>>
>
>I think we need to cache the subxids on the replica somehow but I
>don't have a very good idea for it.  Basically, there are two ways to
>do it (a) Change the KnownAssignedXids in some way so that we can
>easily find this information without losing on the current benefits of
>it.  I can't think of a good way to do that and even if we come up
>with something, it could easily be a lot of work, (b) Cache the
>subxids for a particular transaction in local memory along with
>KnownAssignedXids.  This is doable but now we have two data-structures
>(one in shared memory and other in local memory) managing the same
>information in different ways.
>
>Do you have any other ideas?

I don't follow. Why couldn't we have a simple cache on the standby? It
could be either a simple array or a hash table (with the top-level xid
as the hash key).

I think a single array would be sufficient, but the hash table would
allow keeping the apply logic more or less as it is today. See the
attached patch that adds such a cache - I do admit I haven't tested
this, but hopefully it's a sufficient illustration of the idea.

It does not handle cleanup of the cache, but I think that should not be
difficult - we simply need to remove entries for transactions that got
committed or rolled back. And do something about transactions without an
explicit commit/rollback record, but that can be done by also handling
XLOG_RUNNING_XACTS (by removing anything preceding oldestRunningXid).
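
For illustration only, here is a minimal sketch of what such a cache
might look like, using the backend's dynahash API. This is not the
attached patch; names like SubXidCacheEntry and the fixed-size subxid
array are made up just to keep the sketch short:

#include "postgres.h"

#include "storage/proc.h"		/* PGPROC_MAX_CACHED_SUBXIDS */
#include "utils/hsearch.h"

/* one entry per top-level transaction (fixed-size array just for the sketch) */
typedef struct SubXidCacheEntry
{
	TransactionId toplevel_xid;	/* hash key */
	int			nsubxids;
	TransactionId subxids[PGPROC_MAX_CACHED_SUBXIDS];
} SubXidCacheEntry;

static HTAB *SubXidCache = NULL;

static void
SubXidCacheInit(void)
{
	HASHCTL		ctl;

	memset(&ctl, 0, sizeof(ctl));
	ctl.keysize = sizeof(TransactionId);
	ctl.entrysize = sizeof(SubXidCacheEntry);

	SubXidCache = hash_create("standby subxid assignment cache", 64, &ctl,
							  HASH_ELEM | HASH_BLOBS);
}

/* remember that subxid belongs to toplevel_xid, as decoded from a WAL record */
static void
SubXidCacheAdd(TransactionId toplevel_xid, TransactionId subxid)
{
	SubXidCacheEntry *entry;
	bool		found;

	entry = hash_search(SubXidCache, &toplevel_xid, HASH_ENTER, &found);
	if (!found)
		entry->nsubxids = 0;

	if (entry->nsubxids < PGPROC_MAX_CACHED_SUBXIDS)
		entry->subxids[entry->nsubxids++] = subxid;
}

/* drop the entry once the top-level transaction commits or aborts */
static void
SubXidCacheRemove(TransactionId toplevel_xid)
{
	(void) hash_search(SubXidCache, &toplevel_xid, HASH_REMOVE, NULL);
}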

I don't think this is particularly complicated or a lot of code, and I
don't see why it would require data structures in shared memory. Only
the walreceiver on the standby needs to worry about this, no?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Mar 30, 2020 at 8:58 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Mon, Mar 30, 2020 at 11:47:57AM +0530, Amit Kapila wrote:
> >
> >I think we need to cache the subxids on the replica somehow but I
> >don't have a very good idea for it.  Basically, there are two ways to
> >do it (a) Change the KnownAssignedXids in some way so that we can
> >easily find this information without losing on the current benefits of
> >it.  I can't think of a good way to do that and even if we come up
> >with something, it could easily be a lot of work, (b) Cache the
> >subxids for a particular transaction in local memory along with
> >KnownAssignedXids.  This is doable but now we have two data-structures
> >(one in shared memory and other in local memory) managing the same
> >information in different ways.
> >
> >Do you have any other ideas?
>
> I don't follow. Why couldn't we have a simple cache on the standby? It
> could be either a simple array or a hash table (with the top-level xid
> as hash key)?
>

I think having something like we discussed, or what you have in the
patch, won't be sufficient to clean the KnownAssignedXids array. The
point is that we won't write a WAL record for the xid-subxid
association of unlogged relations in the "Immediately WAL-log
assignments" patch; however, KnownAssignedXids would have both kinds
of Xids, as we autofill it with gaps (see
RecordKnownAssignedTransactionIds).  If my understanding is correct,
to make it work we might need major surgery in the code, or we would
have to maintain the KnownAssignedXids array differently.

>
> I don't think this is particularly complicated or a lot of code, and I
> don't see why would it require data structures in shared memory. Only
> the walreceiver on standby needs to worry about this, no?
>

Not a new data structure in shared memory, but we already have the
KnownTransactionId structure in shared memory. So, after having a
local cache, we will have xidAssignmentsHash and KnownTransactionId
maintaining the same information in different ways, and we need to
ensure both are cleaned up properly. That is what I was pointing out
above about maintaining two structures.  However, I think before
discussing this further, we need to think about the above problem.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Tomas Vondra
Дата:
On Tue, Apr 07, 2020 at 12:17:44PM +0530, Amit Kapila wrote:
>On Mon, Mar 30, 2020 at 8:58 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Mon, Mar 30, 2020 at 11:47:57AM +0530, Amit Kapila wrote:
>> >
>> >I think we need to cache the subxids on the replica somehow but I
>> >don't have a very good idea for it.  Basically, there are two ways to
>> >do it (a) Change the KnownAssignedXids in some way so that we can
>> >easily find this information without losing on the current benefits of
>> >it.  I can't think of a good way to do that and even if we come up
>> >with something, it could easily be a lot of work, (b) Cache the
>> >subxids for a particular transaction in local memory along with
>> >KnownAssignedXids.  This is doable but now we have two data-structures
>> >(one in shared memory and other in local memory) managing the same
>> >information in different ways.
>> >
>> >Do you have any other ideas?
>>
>> I don't follow. Why couldn't we have a simple cache on the standby? It
>> could be either a simple array or a hash table (with the top-level xid
>> as hash key)?
>>
>
>I think having something like we discussed or what you have in the
>patch won't be sufficient to clean the KnownAssignedXid array. The
>point is that we won't write a WAL for xid-subxid association for
>unlogged relations in the "Immediately WAL-log assignments" patch,
>however, the KnownAssignedXid would have both kinds of Xids as we
>autofill it with gaps (see RecordKnownAssignedTransactionIds).  I
>think if my understanding is correct to make it work we might need
>major surgery in the code or have to maintain KnownAssignedXid array
>differently.

Hmm, that's a good point. If I understand correctly, the issue is
that if we create a new subxact, write something into an unlogged
table, and then create another subxact, the XID of the first subxact
will be "known assigned" but we won't know that it's a subxact, or to
which parent xact it belongs (because there will be no WAL records
that could encode it).

I wonder if there's a simple solution (e.g. when creating the second
subxact we might notice the xid-subxid assignment was not logged, and
write some "dummy" WAL record). But I admit it seems a bit ugly.

>>
>> I don't think this is particularly complicated or a lot of code, and I
>> don't see why would it require data structures in shared memory. Only
>> the walreceiver on standby needs to worry about this, no?
>>
>
>Not a new data structure in shared memory, but we already have a
>KnownTransactionId structure in shared memory. So, after having a
>local cache, we will have xidAssignmentsHash and KnownTransactionId
>maintaining the same information in different ways.  And, we need to
>ensure both are cleaned up properly. That was what I was pointing
>above related to maintaining two structures.  However, I think before
>discussing more on this, we need to think about the above problem.
>

Sure.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Wed, Apr 8, 2020 at 6:29 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Apr 07, 2020 at 12:17:44PM +0530, Amit Kapila wrote:
> >On Mon, Mar 30, 2020 at 8:58 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> On Mon, Mar 30, 2020 at 11:47:57AM +0530, Amit Kapila wrote:
> >> >
> >> >I think we need to cache the subxids on the replica somehow but I
> >> >don't have a very good idea for it.  Basically, there are two ways to
> >> >do it (a) Change the KnownAssignedXids in some way so that we can
> >> >easily find this information without losing on the current benefits of
> >> >it.  I can't think of a good way to do that and even if we come up
> >> >with something, it could easily be a lot of work, (b) Cache the
> >> >subxids for a particular transaction in local memory along with
> >> >KnownAssignedXids.  This is doable but now we have two data-structures
> >> >(one in shared memory and other in local memory) managing the same
> >> >information in different ways.
> >> >
> >> >Do you have any other ideas?
> >>
> >> I don't follow. Why couldn't we have a simple cache on the standby? It
> >> could be either a simple array or a hash table (with the top-level xid
> >> as hash key)?
> >>
> >
> >I think having something like we discussed or what you have in the
> >patch won't be sufficient to clean the KnownAssignedXid array. The
> >point is that we won't write a WAL for xid-subxid association for
> >unlogged relations in the "Immediately WAL-log assignments" patch,
> >however, the KnownAssignedXid would have both kinds of Xids as we
> >autofill it with gaps (see RecordKnownAssignedTransactionIds).  I
> >think if my understanding is correct to make it work we might need
> >major surgery in the code or have to maintain KnownAssignedXid array
> >differently.
>
> Hmm, that's a good point. If I understand correctly, the issue is
> that if we create new subxact, write something into an unlogged table,
> and then create new subxact, the XID of the first subxact will be "known
> assigned" but we won't know it's a subxact or to which parent xact it
> belongs (because there will be no WAL records that could encode it).
>
> I wonder if there's a simple solution (e.g. when creating the second
> subxact we might notice the xid-subxid assignment was not logged, and
> write some "dummy" WAL record). But I admit it seems a bit ugly.
>
> >>
> >> I don't think this is particularly complicated or a lot of code, and I
> >> don't see why would it require data structures in shared memory. Only
> >> the walreceiver on standby needs to worry about this, no?
> >>
> >
> >Not a new data structure in shared memory, but we already have a
> >KnownTransactionId structure in shared memory. So, after having a
> >local cache, we will have xidAssignmentsHash and KnownTransactionId
> >maintaining the same information in different ways.  And, we need to
> >ensure both are cleaned up properly. That was what I was pointing
> >above related to maintaining two structures.  However, I think before
> >discussing more on this, we need to think about the above problem.

I have rebased the patch on the latest head.  I haven't yet changed
anything for the xid assignment thing because that discussion is not
yet concluded.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Kuntal Ghosh
Дата:
On Thu, Apr 9, 2020 at 2:40 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have rebased the patch on the latest head.  I haven't yet changed
> anything for xid assignment thing because it is not yet concluded.
>
Some review comments from 0001-Immediately-WAL-log-*.patch,

+bool
+IsSubTransactionAssignmentPending(void)
+{
+ if (!XLogLogicalInfoActive())
+ return false;
+
+ /* we need to be in a transaction state */
+ if (!IsTransactionState())
+ return false;
+
+ /* it has to be a subtransaction */
+ if (!IsSubTransaction())
+ return false;
+
+ /* the subtransaction has to have a XID assigned */
+ if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+ return false;
+
+ /* and it needs to have 'assigned' */
+ return !CurrentTransactionState->assigned;
+
+}
IMHO, it's important to reduce the complexity of this function since
it's called for every WAL insertion. During the lifespan of a
transaction, any of these if conditions will only be evaluated if the
previous conditions are true. So, we could maintain some state machine
to avoid evaluating a condition multiple times inside a transaction.
But if the overhead is not much, it's probably not worth it.

+#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
This looks wrong. We should change the name of this macro, or we can
add the 1 byte directly to HEADER_SCRATCH_SIZE with some comments.

@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
  int i;

+ /* reset the subxact assignment flag (if needed) */
+ if (curinsert_flags & XLOG_INCLUDE_XID)
+ MarkSubTransactionAssigned();
The comment looks contradictory.

 XLogSetRecordFlags(uint8 flags)
 {
  Assert(begininsert_called);
- curinsert_flags = flags;
+ curinsert_flags |= flags;
 }
 I didn't understand why we need this change in this patch.

+ txid = XLogRecGetTopXid(record);
+
+ /*
+ * If the toplevel_xid is valid, we need to assign the subxact to the
+ * toplevel transaction. We need to do this for all records, hence we
+ * do it before the switch.
+ */
s/toplevel_xid/toplevel xid or s/toplevel_xid/txid

  if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
- info != XLOG_XACT_ASSIGNMENT)
+ !TransactionIdIsValid(r->toplevel_xid))
Perhaps, XLogRecGetTopXid() can be used.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
> On Thu, Apr 9, 2020 at 2:40 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have rebased the patch on the latest head.  I haven't yet changed
> > anything for xid assignment thing because it is not yet concluded.
> >
> Some review comments from 0001-Immediately-WAL-log-*.patch,
>
> +bool
> +IsSubTransactionAssignmentPending(void)
> +{
> + if (!XLogLogicalInfoActive())
> + return false;
> +
> + /* we need to be in a transaction state */
> + if (!IsTransactionState())
> + return false;
> +
> + /* it has to be a subtransaction */
> + if (!IsSubTransaction())
> + return false;
> +
> + /* the subtransaction has to have a XID assigned */
> + if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
> + return false;
> +
> + /* and it needs to have 'assigned' */
> + return !CurrentTransactionState->assigned;
> +
> +}
> IMHO, it's important to reduce the complexity of this function since
> it's been called for every WAL insertion. During the lifespan of a
> transaction, any of these if conditions will only be evaluated if
> previous conditions are true. So, we can maintain some state machine
> to avoid multiple evaluation of a condition inside a transaction. But,
> if the overhead is not much, it's not worth I guess.

Yeah, maybe in some cases we can avoid checking multiple conditions by
maintaining that state.  But that state would have to be at the
transaction level, and I am not sure it is worth adding one extra
condition just to skip a few if checks; it would also add code
complexity.  And in cases where logical decoding is not enabled, it
may even add one extra check - I mean, we first check the state, and
that only takes us to the first if check anyway.

>
> +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> This looks wrong. We should change the name of this Macro or we can
> add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments.

I think this is in sync with the code below (SizeOfXlogOrigin), so it
doesn't make much sense to use different terminology, no?
#define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char))
+#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))

>
> @@ -195,6 +197,10 @@ XLogResetInsertion(void)
>  {
>   int i;
>
> + /* reset the subxact assignment flag (if needed) */
> + if (curinsert_flags & XLOG_INCLUDE_XID)
> + MarkSubTransactionAssigned();
> The comment looks contradictory.
>
>  XLogSetRecordFlags(uint8 flags)
>  {
>   Assert(begininsert_called);
> - curinsert_flags = flags;
> + curinsert_flags |= flags;
>  }
>  I didn't understand why we need this change in this patch.

I think it was changed so that the code below could use it, but we set
the flag directly there.  I will change it in the next version.

@@ -748,6 +754,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
  scratch += sizeof(replorigin_session_origin);
  }

+ /* followed by toplevel XID, if not already included in previous record */
+ if (IsSubTransactionAssignmentPending())
+ {
+ TransactionId xid = GetTopTransactionIdIfAny();
+
+ /* update the flag (later used by XLogInsertRecord) */
+ curinsert_flags |= XLOG_INCLUDE_XID;

>
> + txid = XLogRecGetTopXid(record);
> +
> + /*
> + * If the toplevel_xid is valid, we need to assign the subxact to the
> + * toplevel transaction. We need to do this for all records, hence we
> + * do it before the switch.
> + */
> s/toplevel_xid/toplevel xid or s/toplevel_xid/txid

Okay, we can change that.

>   if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
> - info != XLOG_XACT_ASSIGNMENT)
> + !TransactionIdIsValid(r->toplevel_xid))
> Perhaps, XLogRecGetTopXid() can be used.

ok

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Kuntal Ghosh
Дата:
On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> >
> > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> > This looks wrong. We should change the name of this Macro or we can
> > add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments.
>
> I think this is in sync with below code (SizeOfXlogOrigin),  SO doen't
> make much sense to add different terminology no?
> #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char))
> +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
>
In that case, we can rename this, for example, SizeOfXLogTransactionId.

Some review comments from 0002-Issue-individual-*.path,

+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr lsn, int nmsgs,
+ SharedInvalidationMessage *msgs)
+{
+ MemoryContext oldcontext;
+ ReorderBufferChange *change;
+
+ /* XXX Should we even write invalidations without valid XID? */
+ if (xid == InvalidTransactionId)
+ return;
+
+ Assert(xid != InvalidTransactionId);

It seems we don't call the function if xid is not valid. In fact,

@@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf)
  }
  case XLOG_XACT_ASSIGNMENT:
  break;
+ case XLOG_XACT_INVALIDATIONS:
+ {
+ TransactionId xid;
+ xl_xact_invalidations *invals;
+
+ xid = XLogRecGetXid(r);
+ invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+ if (!TransactionIdIsValid(xid))
+ break;
+
+ ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+ invals->nmsgs, invals->msgs);

Why should we insert a WAL record in such cases?

+ * When wal_level=logical, write invalidations into WAL at each command end to
+ *  support the decoding of the in-progress transaction.  As of now it was
+ *  enough to log invalidation only at commit because we are only decoding the
+ *  transaction at the commit time.   We only need to log the catalog cache and
+ *  relcache invalidation.  There can not be any active MVCC scan in logical
+ *  decoding so we don't need to log the snapshot invalidation.
The alignment is not right.

 /*
  * CommandEndInvalidationMessages
- * Process queued-up invalidation messages at end of one command
- * in a transaction.
+ *              Process queued-up invalidation messages at end of one command
+ *              in a transaction.
Looks unnecessary changes.

  * Note:
- * This should be called during CommandCounterIncrement(),
- * after we have advanced the command ID.
+ *              This should be called during CommandCounterIncrement(),
+ *              after we have advanced the command ID.
  */
Looks unnecessary changes.

  if (transInvalInfo == NULL)
- return;
+ return;
Looks unnecessary changes.

+ /* prepare record */
+ memset(&xlrec, 0, sizeof(xlrec));
We should use MinSizeOfXactInvalidations, no?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Apr 13, 2020 at 6:12 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
> On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> > >
> > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> > > This looks wrong. We should change the name of this Macro or we can
> > > add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments.
> >
> > I think this is in sync with below code (SizeOfXlogOrigin),  SO doen't
> > make much sense to add different terminology no?
> > #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char))
> > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> >
> In that case, we can rename this, for example, SizeOfXLogTransactionId.

Make sense.

>
> Some review comments from 0002-Issue-individual-*.path,
>
> +void
> +ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
> + XLogRecPtr lsn, int nmsgs,
> + SharedInvalidationMessage *msgs)
> +{
> + MemoryContext oldcontext;
> + ReorderBufferChange *change;
> +
> + /* XXX Should we even write invalidations without valid XID? */
> + if (xid == InvalidTransactionId)
> + return;
> +
> + Assert(xid != InvalidTransactionId);
>
> It seems we don't call the function if xid is not valid. In fact,
>
> @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
> XLogRecordBuffer *buf)
>   }
>   case XLOG_XACT_ASSIGNMENT:
>   break;
> + case XLOG_XACT_INVALIDATIONS:
> + {
> + TransactionId xid;
> + xl_xact_invalidations *invals;
> +
> + xid = XLogRecGetXid(r);
> + invals = (xl_xact_invalidations *) XLogRecGetData(r);
> +
> + if (!TransactionIdIsValid(xid))
> + break;
> +
> + ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
> + invals->nmsgs, invals->msgs);
>
> Why should we insert an WAL record for such cases?

I think we can avoid this.  I will analyze and send update in my next patch.

>
> + * When wal_level=logical, write invalidations into WAL at each command end to
> + *  support the decoding of the in-progress transaction.  As of now it was
> + *  enough to log invalidation only at commit because we are only decoding the
> + *  transaction at the commit time.   We only need to log the catalog cache and
> + *  relcache invalidation.  There can not be any active MVCC scan in logical
> + *  decoding so we don't need to log the snapshot invalidation.
> The alignment is not right.

Will fix.

>  /*
>   * CommandEndInvalidationMessages
> - * Process queued-up invalidation messages at end of one command
> - * in a transaction.
> + *              Process queued-up invalidation messages at end of one command
> + *              in a transaction.
> Looks unnecessary changes.

Will fix.

>
>   * Note:
> - * This should be called during CommandCounterIncrement(),
> - * after we have advanced the command ID.
> + *              This should be called during CommandCounterIncrement(),
> + *              after we have advanced the command ID.
>   */
> Looks unnecessary changes.

Will fix.

>   if (transInvalInfo == NULL)
> - return;
> + return;
> Looks unnecessary changes.
>
> + /* prepare record */
> + memset(&xlrec, 0, sizeof(xlrec));
> We should use MinSizeOfXactInvalidations, no?

Right.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Kuntal Ghosh
Дата:
On Mon, Apr 13, 2020 at 6:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
Skipping 0003 for now. Review comments from 0004-Gracefully-handle-*.patch

@@ -5490,6 +5523,14 @@ heap_finish_speculative(Relation relation,
ItemPointer tid)
  ItemId lp = NULL;
  HeapTupleHeader htup;

+ /*
+ * We don't expect direct calls to heap_hot_search with
+ * valid CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+ elog(ERROR, "unexpected heap_hot_search call during logical decoding");
The call is to heap_finish_speculative.

@@ -481,6 +482,19 @@ systable_getnext(SysScanDesc sysscan)
  }
  }

+ if (TransactionIdIsValid(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));
s/transaction aborted/transaction aborted concurrently perhaps? Also,
can we move this check to the beginning of the function? If the
condition fails, we can skip the sys scan.

Some of the checks look repetitive in the same file. Should we
declare them as inline functions?
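
For instance, a hypothetical inline helper wrapping that repeated
check could look like this (the name and placement are only
illustrative, not taken from the patch, and it assumes the usual
transam/procarray/elog headers are already included):

/* hypothetical helper for the repeated concurrent-abort check */
static inline void
CheckConcurrentAbort(void)
{
	if (TransactionIdIsValid(CheckXidAlive) &&
		!TransactionIdIsInProgress(CheckXidAlive) &&
		!TransactionIdDidCommit(CheckXidAlive))
		ereport(ERROR,
				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
				 errmsg("transaction aborted during system catalog scan")));
}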

Review comments from 0005-Implement-streaming*.patch

+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+ dlist_iter iter;
...
+#endif
+}

We can implement the same as following:
#ifdef USE_ASSERT_CHECKING
static void
AssertChangeLsnOrder(ReorderBufferTXN *txn)
{
dlist_iter iter;
...
}
#else
#define AssertChangeLsnOrder(txn) ((void)true)
#endif

+ * if it is aborted we will report an specific error which we can ignore.  We
s/an specific/a specific

+ * Set the last last of the stream as the final lsn before calling
+ * stream stop.
s/last last/last

  PG_CATCH();
  {
+ MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+ ErrorData  *errdata = CopyErrorData();
When we don't re-throw, the errdata should be freed by calling
FreeErrorData(errdata), right?

+ /*
+ * Set the last last of the stream as the final lsn before
+ * calling stream stop.
+ */
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+
+ FlushErrorState();
+ }
stream_stop() can still throw some error, right? In that case, we
should flush the error state before calling stream_stop().

+ /*
+ * Remember the command ID and snapshot if transaction is streaming
+ * otherwise free the snapshot if we have copied it.
+ */
+ if (streaming)
+ {
+ txn->command_id = command_id;
+
+ /* Avoid copying if it's already copied. */
+ if (snapshot_now->copied)
+ txn->snapshot_now = snapshot_now;
+ else
+ txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+  txn, command_id);
+ }
+ else if (snapshot_now->copied)
+ ReorderBufferFreeSnap(rb, snapshot_now);
Hmm, it seems this part relies on the assumption that after copying
the snapshot, no subsequent step can throw any error. If one does,
then we may again create a copy of the snapshot in the catch block,
which will leak some memory. Is my understanding correct?

+ }
+ else
+ {
+ ReorderBufferCleanupTXN(rb, txn);
+ PG_RE_THROW();
+ }
Shouldn't we switch back to the previously created error memory
context before re-throwing?

+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+ volatile Snapshot snapshot_now;
+ volatile CommandId command_id = FirstCommandId;
In the modified ReorderBufferCommit(), why is it necessary to declare
the above two variables as volatile? There is no try-catch block here.

@@ -1946,6 +2284,13 @@ ReorderBufferAbort(ReorderBuffer *rb,
TransactionId xid, XLogRecPtr lsn)
  if (txn == NULL)
  return;

+ /*
+ * When the (sub)transaction was streamed, notify the remote node
+ * about the abort only if we have sent any data for this transaction.
+ */
+ if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+ rb->stream_abort(rb, txn, lsn);
+
s/When/If

+ /*
+ * When the (sub)transaction was streamed, notify the remote node
+ * about the abort.
+ */
+ if (rbtxn_is_streamed(txn))
+ rb->stream_abort(rb, txn, lsn);
s/When/If. And, in this case, if we've not sent any data, why should
we send the abort message (similar to the previous one)?

+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
Should we put any assert (not necessarily here) to validate the above comment?

+ txn = ReorderBufferLargestTopTXN(rb);
+
+ /* we know there has to be one, because the size is not zero */
+ Assert(txn && !txn->toptxn);
+ Assert(txn->size > 0);
+ Assert(rb->size >= txn->size);
The same three assertions are already there in ReorderBufferLargestTopTXN().

+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+ LogicalDecodingContext *ctx = rb->private_data;
+
+ return ctx->streaming;
+}
Potential inline function.

+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+ volatile Snapshot snapshot_now;
+ volatile CommandId command_id;
Here also, do we need to declare these two variables as volatile?


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Wed, Apr 8, 2020 at 6:29 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Apr 07, 2020 at 12:17:44PM +0530, Amit Kapila wrote:
> >On Mon, Mar 30, 2020 at 8:58 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >
> >I think having something like we discussed or what you have in the
> >patch won't be sufficient to clean the KnownAssignedXid array. The
> >point is that we won't write a WAL for xid-subxid association for
> >unlogged relations in the "Immediately WAL-log assignments" patch,
> >however, the KnownAssignedXid would have both kinds of Xids as we
> >autofill it with gaps (see RecordKnownAssignedTransactionIds).  I
> >think if my understanding is correct to make it work we might need
> >major surgery in the code or have to maintain KnownAssignedXid array
> >differently.
>
> Hmm, that's a good point. If I understand correctly, the issue is
> that if we create new subxact, write something into an unlogged table,
> and then create new subxact, the XID of the first subxact will be "known
> assigned" but we won't know it's a subxact or to which parent xact it
> belongs (because there will be no WAL records that could encode it).
>

Yeah, there could be multiple such missing subxacts.

> I wonder if there's a simple solution (e.g. when creating the second
> subxact we might notice the xid-subxid assignment was not logged, and
> write some "dummy" WAL record).
>

That WAL record can have multiple xids.

> But I admit it seems a bit ugly.
>

Yeah, I guess it could be tricky as well, because while assembling
some WAL record we would need to generate an additional dummy record,
or might need to add additional information to the current record
being formed.  I think the handling of such WAL records during hot
standby and in logical decoding could vary.  During logical decoding,
we currently don't form an association for a subtransaction if it
doesn't have any changes (see ReorderBufferCommitChild), and with this
new type of record, I think we need to ensure that we don't form such
an association either.

I think after quite a few changes, tweaks and a lot of testing, we
might be able to remove XLOG_XACT_ASSIGNMENT, but I am not sure it is
worth doing along with this patch.  It would have been good to do this
if this patch were adding any visible overhead, or if it were easy to
do.  However, neither of those seems to be true, so it might be better
to write good comments in the code indicating what we would need to do
to remove XLOG_XACT_ASSIGNMENT, so that if we feel it is important to
do in the future we can do so.  I am not against spending effort on
this, but I don't see the urgency of doing it along with this patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Apr 13, 2020 at 6:12 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
> On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> > >
> > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> > > This looks wrong. We should change the name of this Macro or we can
> > > add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments.
> >
> > I think this is in sync with below code (SizeOfXlogOrigin),  SO doen't
> > make much sense to add different terminology no?
> > #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char))
> > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> >
> In that case, we can rename this, for example, SizeOfXLogTransactionId.
>
> Some review comments from 0002-Issue-individual-*.path,
>
> +void
> +ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
> + XLogRecPtr lsn, int nmsgs,
> + SharedInvalidationMessage *msgs)
> +{
> + MemoryContext oldcontext;
> + ReorderBufferChange *change;
> +
> + /* XXX Should we even write invalidations without valid XID? */
> + if (xid == InvalidTransactionId)
> + return;
> +
> + Assert(xid != InvalidTransactionId);
>
> It seems we don't call the function if xid is not valid. In fact,
>

You have a valid point.  Also, it is not clear: if we are first
checking (xid == InvalidTransactionId) and returning from the
function, how can the Assert even be hit?

> @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
> XLogRecordBuffer *buf)
>   }
>   case XLOG_XACT_ASSIGNMENT:
>   break;
> + case XLOG_XACT_INVALIDATIONS:
> + {
> + TransactionId xid;
> + xl_xact_invalidations *invals;
> +
> + xid = XLogRecGetXid(r);
> + invals = (xl_xact_invalidations *) XLogRecGetData(r);
> +
> + if (!TransactionIdIsValid(xid))
> + break;
> +
> + ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
> + invals->nmsgs, invals->msgs);
>
> Why should we insert an WAL record for such cases?
>

Right, if there is any such case, we should avoid it.

One more point about this patch: the commit message needs to be updated:

> The new invalidations are written to WAL immediately, without any
such caching. Perhaps it would be possible to add similar caching,
> e.g. at the command level, or something like that?

I think the above part of the commit message is no longer right, as
the patch already does such caching at the command level.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Apr 13, 2020 at 11:43 PM Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
>
> On Mon, Apr 13, 2020 at 6:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> Skipping 0003 for now. Review comments from 0004-Gracefully-handle-*.patch
>
> @@ -5490,6 +5523,14 @@ heap_finish_speculative(Relation relation,
> ItemPointer tid)
>   ItemId lp = NULL;
>   HeapTupleHeader htup;
>
> + /*
> + * We don't expect direct calls to heap_hot_search with
> + * valid CheckXidAlive for regular tables. Track that below.
> + */
> + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
> + elog(ERROR, "unexpected heap_hot_search call during logical decoding");
> The call is to heap_finish_speculative.

Fixed

> @@ -481,6 +482,19 @@ systable_getnext(SysScanDesc sysscan)
>   }
>   }
>
> + if (TransactionIdIsValid(CheckXidAlive) &&
> + !TransactionIdIsInProgress(CheckXidAlive) &&
> + !TransactionIdDidCommit(CheckXidAlive))
> + ereport(ERROR,
> + (errcode(ERRCODE_TRANSACTION_ROLLBACK),
> + errmsg("transaction aborted during system catalog scan")));
> s/transaction aborted/transaction aborted concurrently perhaps? Also,
> can we move this check at the begining of the function? If the
> condition fails, we can skip the sys scan.

We must check this after we get the tuple, because our goal is not to
decode based on a wrong tuple.  If we move the check earlier, what if
the transaction aborts after the check?  Once we have got the tuple,
and the transaction was still alive at that time, it doesn't matter
even if it aborts afterwards, because we already have the right tuple.

>
> Some of the checks looks repetative in the same file. Should we
> declare them as inline functions?
>
> Review comments from 0005-Implement-streaming*.patch
>
> +static void
> +AssertChangeLsnOrder(ReorderBufferTXN *txn)
> +{
> +#ifdef USE_ASSERT_CHECKING
> + dlist_iter iter;
> ...
> +#endif
> +}
>
> We can implement the same as following:
> #ifdef USE_ASSERT_CHECKING
> static void
> AssertChangeLsnOrder(ReorderBufferTXN *txn)
> {
> dlist_iter iter;
> ...
> }
> #else
> #define AssertChangeLsnOrder(txn) ((void)true)
> #endif

I am not sure; this doesn't look clean to me.  Moreover, the other
similar functions are defined in the same way, e.g. AssertTXNLsnOrder.

>
> + * if it is aborted we will report an specific error which we can ignore.  We
> s/an specific/a specific

Done

>
> + * Set the last last of the stream as the final lsn before calling
> + * stream stop.
> s/last last/last
>
>   PG_CATCH();
>   {
> + MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
> + ErrorData  *errdata = CopyErrorData();
> When we don't re-throw, the errdata should be freed by calling
> FreeErrorData(errdata), right?

Done


>
> + /*
> + * Set the last last of the stream as the final lsn before
> + * calling stream stop.
> + */
> + txn->final_lsn = prev_lsn;
> + rb->stream_stop(rb, txn);
> +
> + FlushErrorState();
> + }
> stream_stop() can still throw some error, right? In that case, we
> should flush the error state before calling stream_stop().

Done

>
> + /*
> + * Remember the command ID and snapshot if transaction is streaming
> + * otherwise free the snapshot if we have copied it.
> + */
> + if (streaming)
> + {
> + txn->command_id = command_id;
> +
> + /* Avoid copying if it's already copied. */
> + if (snapshot_now->copied)
> + txn->snapshot_now = snapshot_now;
> + else
> + txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
> +  txn, command_id);
> + }
> + else if (snapshot_now->copied)
> + ReorderBufferFreeSnap(rb, snapshot_now);
> Hmm, it seems this part needs an assumption that after copying the
> snapshot, no subsequent step can throw any error. If they do, then we
> can again create a copy of the snapshot in catch block, which will
> leak some memory. Is my understanding correct?

Actually, in the CATCH block we copy only if the error is
ERRCODE_TRANSACTION_ROLLBACK, and that can only occur during a
systable scan.  Basically, in the TRY block we copy the snapshot after
we have streamed all the changes, i.e. the systable scans are done, so
any error raised after that will not be ERRCODE_TRANSACTION_ROLLBACK
and we will not copy again.

>
> + }
> + else
> + {
> + ReorderBufferCleanupTXN(rb, txn);
> + PG_RE_THROW();
> + }
> Shouldn't we switch back to previously created error memory context
> before re-throwing?

Fixed.

>
> +ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> + XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
> + TimestampTz commit_time,
> + RepOriginId origin_id, XLogRecPtr origin_lsn)
> +{
> + ReorderBufferTXN *txn;
> + volatile Snapshot snapshot_now;
> + volatile CommandId command_id = FirstCommandId;
> In the modified ReorderBufferCommit(), why is it necessary to declare
> the above two variable as volatile? There is no try-catch block here.

Fixed
>
> @@ -1946,6 +2284,13 @@ ReorderBufferAbort(ReorderBuffer *rb,
> TransactionId xid, XLogRecPtr lsn)
>   if (txn == NULL)
>   return;
>
> + /*
> + * When the (sub)transaction was streamed, notify the remote node
> + * about the abort only if we have sent any data for this transaction.
> + */
> + if (rbtxn_is_streamed(txn) && txn->any_data_sent)
> + rb->stream_abort(rb, txn, lsn);
> +
> s/When/If
>
> + /*
> + * When the (sub)transaction was streamed, notify the remote node
> + * about the abort.
> + */
> + if (rbtxn_is_streamed(txn))
> + rb->stream_abort(rb, txn, lsn);
> s/When/If. And, in this case, if we've not sent any data, why should
> we send the abort message (similar to the previous one)?

Fixed

>
> + * Note: We never do both stream and serialize a transaction (we only spill
> + * to disk when streaming is not supported by the plugin), so only one of
> + * those two flags may be set at any given time.
> + */
> +#define rbtxn_is_streamed(txn) \
> +( \
> + ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
> +)
> Should we put any assert (not necessarily here) to validate the above comment?

Because of toast handling, this assumption has changed now, so I will
remove this note in that patch (0010).

>
> + txn = ReorderBufferLargestTopTXN(rb);
> +
> + /* we know there has to be one, because the size is not zero */
> + Assert(txn && !txn->toptxn);
> + Assert(txn->size > 0);
> + Assert(rb->size >= txn->size);
> The same three assertions are already there in ReorderBufferLargestTopTXN().
>
> +static bool
> +ReorderBufferCanStream(ReorderBuffer *rb)
> +{
> + LogicalDecodingContext *ctx = rb->private_data;
> +
> + return ctx->streaming;
> +}
> Potential inline function.

Done

> +static void
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> +{
> + volatile Snapshot snapshot_now;
> + volatile CommandId command_id;
> Here also, do we need to declare these two variables as volatile?

Done

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, Apr 14, 2020 at 2:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Apr 13, 2020 at 6:12 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> >
> > On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> > > >
> > > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> > > > This looks wrong. We should change the name of this Macro or we can
> > > > add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments.
> > >
> > > I think this is in sync with below code (SizeOfXlogOrigin),  SO doen't
> > > make much sense to add different terminology no?
> > > #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char))
> > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> > >
> > In that case, we can rename this, for example, SizeOfXLogTransactionId.
> >
> > Some review comments from 0002-Issue-individual-*.path,
> >
> > +void
> > +ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
> > + XLogRecPtr lsn, int nmsgs,
> > + SharedInvalidationMessage *msgs)
> > +{
> > + MemoryContext oldcontext;
> > + ReorderBufferChange *change;
> > +
> > + /* XXX Should we even write invalidations without valid XID? */
> > + if (xid == InvalidTransactionId)
> > + return;
> > +
> > + Assert(xid != InvalidTransactionId);
> >
> > It seems we don't call the function if xid is not valid. In fact,
> >
>
> You have a valid point.  Also, it is not clear if we are first
> checking (xid == InvalidTransactionId) and returning from the
> function, how can even Assert hit.

I have changed the code; now we only have an assert.

>
> > @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
> > XLogRecordBuffer *buf)
> >   }
> >   case XLOG_XACT_ASSIGNMENT:
> >   break;
> > + case XLOG_XACT_INVALIDATIONS:
> > + {
> > + TransactionId xid;
> > + xl_xact_invalidations *invals;
> > +
> > + xid = XLogRecGetXid(r);
> > + invals = (xl_xact_invalidations *) XLogRecGetData(r);
> > +
> > + if (!TransactionIdIsValid(xid))
> > + break;
> > +
> > + ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
> > + invals->nmsgs, invals->msgs);
> >
> > Why should we insert an WAL record for such cases?
> >
>
> Right, if there is any such case, we should avoid it.

I think we don't have any such case because we are logging at the
command end.  So I have created an assert instead of the check.
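
Just to illustrate what I mean, the revised hunk could look roughly
like this (a sketch only, not the exact code in the new version):

 case XLOG_XACT_INVALIDATIONS:
 	{
 		TransactionId xid = XLogRecGetXid(r);
 		xl_xact_invalidations *invals;

 		invals = (xl_xact_invalidations *) XLogRecGetData(r);

 		/* invalidations are logged at command end, so the xid must be valid */
 		Assert(TransactionIdIsValid(xid));

 		ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
 									 invals->nmsgs, invals->msgs);
 		break;
 	}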

> One more point about this patch, the commit message needs to be updated:
>
> > The new invalidations are written to WAL immediately, without any
> such caching. Perhaps it would be possible to add similar caching,
> > e.g. at the command level, or something like that?
>
> I think the above part of commit message is not right as the patch
> already does such a caching now at the command level.

Right, I have removed that.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>
> >
> > > @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
> > > XLogRecordBuffer *buf)
> > >   }
> > >   case XLOG_XACT_ASSIGNMENT:
> > >   break;
> > > + case XLOG_XACT_INVALIDATIONS:
> > > + {
> > > + TransactionId xid;
> > > + xl_xact_invalidations *invals;
> > > +
> > > + xid = XLogRecGetXid(r);
> > > + invals = (xl_xact_invalidations *) XLogRecGetData(r);
> > > +
> > > + if (!TransactionIdIsValid(xid))
> > > + break;
> > > +
> > > + ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
> > > + invals->nmsgs, invals->msgs);
> > >
> > > Why should we insert an WAL record for such cases?
> > >
> >
> > Right, if there is any such case, we should avoid it.
>
> I think we don't have any such case because we are logging at the
> command end.  So I have created an assert instead of the check.
>

Have you tried to ensure this in some way?  One idea could be to add
an Assert (to check that the transaction id is assigned) in the new
code where you are writing WAL for this action, and then run make
check-world and/or make installcheck-world.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, Apr 14, 2020 at 3:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> >
> > >
> > > > @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
> > > > XLogRecordBuffer *buf)
> > > >   }
> > > >   case XLOG_XACT_ASSIGNMENT:
> > > >   break;
> > > > + case XLOG_XACT_INVALIDATIONS:
> > > > + {
> > > > + TransactionId xid;
> > > > + xl_xact_invalidations *invals;
> > > > +
> > > > + xid = XLogRecGetXid(r);
> > > > + invals = (xl_xact_invalidations *) XLogRecGetData(r);
> > > > +
> > > > + if (!TransactionIdIsValid(xid))
> > > > + break;
> > > > +
> > > > + ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
> > > > + invals->nmsgs, invals->msgs);
> > > >
> > > > Why should we insert an WAL record for such cases?
> > > >
> > >
> > > Right, if there is any such case, we should avoid it.
> >
> > I think we don't have any such case because we are logging at the
> > command end.  So I have created an assert instead of the check.
> >
>
> Have you tried to ensure this in some way?  One idea could be to add
> an Assert (to check if transaction id is assigned) in the new code
> where you are writing WAL for this action and then run make
> check-world and or make installcheck-world.

Yeah, I had already tested that.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Erik Rijkers
Дата:
On 2020-04-14 12:10, Dilip Kumar wrote:

> v14-0001-Immediately-WAL-log-assignments.patch                 +
> v14-0002-Issue-individual-invalidations-with.patch             +
> v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+
> v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+
> v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch       +
> v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+
> v14-0007-Track-statistics-for-streaming.patch                  +
> v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch +
> v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch              +
> v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch

applied on top of 8128b0c (a few hours ago)

Hi,

I haven't followed this thread and maybe this instability is 
known/expected; just thought I'd let you know.

When doing a pgbench run over logical replication (cascading 
down two replicas), I get this segmentation fault.

2020-04-14 17:27:28.135 CEST [8118] DETAIL:  Streaming transactions 
committing after 0/5FA2A38, reading WAL from 0/5FA2A00.
2020-04-14 17:27:28.135 CEST [8118] LOG:  logical decoding found 
consistent point at 0/5FA2A00
2020-04-14 17:27:28.135 CEST [8118] DETAIL:  There are no running 
transactions.
2020-04-14 17:27:28.138 CEST [8006] LOG:  server process (PID 8118) was 
terminated by signal 11: Segmentation fault
2020-04-14 17:27:28.138 CEST [8006] DETAIL:  Failed process was running: 
COMMIT
2020-04-14 17:27:28.138 CEST [8006] LOG:  terminating any other active 
server processes
2020-04-14 17:27:28.138 CEST [8163] WARNING:  terminating connection 
because of crash of another server process
2020-04-14 17:27:28.138 CEST [8163] DETAIL:  The postmaster has 
commanded this server process to roll back the current transaction and 
exit, because another server process exited abnormally and possibly 
corrupted shared memory.
2020-04-14 17:27:28.138 CEST [8163] HINT:  In a moment you should be 
able to reconnect to the database and repeat your command.


This error happens somewhat buried away in my test-stuff; I can dig it 
out and make it into a repeatable test if you need it. (debian 
stretch/gcc 9.3.0)


Erik Rijkers




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, Apr 14, 2020 at 9:14 PM Erik Rijkers <er@xs4all.nl> wrote:
>
> On 2020-04-14 12:10, Dilip Kumar wrote:
>
> > v14-0001-Immediately-WAL-log-assignments.patch                 +
> > v14-0002-Issue-individual-invalidations-with.patch             +
> > v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+
> > v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+
> > v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch       +
> > v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+
> > v14-0007-Track-statistics-for-streaming.patch                  +
> > v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch +
> > v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch              +
> > v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
>
> applied on top of 8128b0c (a few hours ago)
>
> Hi,
>
> I haven't followed this thread and maybe this instabilty is
> known/expected; just thought I'd let you know.
>
> When doing running a pgbench run over logical replication (cascading
> down two replicas), I get this segmentation fault.

Thanks for the testing.  Is it possible to share the call stack?

>
> 2020-04-14 17:27:28.135 CEST [8118] DETAIL:  Streaming transactions
> committing after 0/5FA2A38, reading WAL from 0/5FA2A00.
> 2020-04-14 17:27:28.135 CEST [8118] LOG:  logical decoding found
> consistent point at 0/5FA2A00
> 2020-04-14 17:27:28.135 CEST [8118] DETAIL:  There are no running
> transactions.
> 2020-04-14 17:27:28.138 CEST [8006] LOG:  server process (PID 8118) was
> terminated by signal 11: Segmentation fault
> 2020-04-14 17:27:28.138 CEST [8006] DETAIL:  Failed process was running:
> COMMIT
> 2020-04-14 17:27:28.138 CEST [8006] LOG:  terminating any other active
> server processes
> 2020-04-14 17:27:28.138 CEST [8163] WARNING:  terminating connection
> because of crash of another server process
> 2020-04-14 17:27:28.138 CEST [8163] DETAIL:  The postmaster has
> commanded this server process to roll back the current transaction and
> exit, because another server process exited abnormally and possibly
> corrupted shared memory.
> 2020-04-14 17:27:28.138 CEST [8163] HINT:  In a moment you should be
> able to reconnect to the database and repeat your command.
>
>
> This error happens somewhat buried away in my test-stuff; I can dig it
> out and make it into a repeatable test if you need it. (debian
> stretch/gcc 9.3.0)

Yeah, that will be great.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, Apr 14, 2020 at 9:14 PM Erik Rijkers <er@xs4all.nl> wrote:
>
> On 2020-04-14 12:10, Dilip Kumar wrote:
>
> > v14-0001-Immediately-WAL-log-assignments.patch                 +
> > v14-0002-Issue-individual-invalidations-with.patch             +
> > v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+
> > v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+
> > v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch       +
> > v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+
> > v14-0007-Track-statistics-for-streaming.patch                  +
> > v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch +
> > v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch              +
> > v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
>
> applied on top of 8128b0c (a few hours ago)


Hi Erik,

While setting up the cascading replication I have hit one issue on the
base code[1].  After fixing that, I got one crash with streaming
enabled on the patch.  I am not sure whether you are facing either of
these 2 issues or some other issue.  If your issue is not one of these
then please share the call stack and steps to reproduce.

[1] https://www.postgresql.org/message-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w%40mail.gmail.com


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Erik Rijkers
Дата:
On 2020-04-16 11:33, Dilip Kumar wrote:
> On Tue, Apr 14, 2020 at 9:14 PM Erik Rijkers <er@xs4all.nl> wrote:
>> 
>> On 2020-04-14 12:10, Dilip Kumar wrote:
>> 
>> > v14-0001-Immediately-WAL-log-assignments.patch                 +
>> > v14-0002-Issue-individual-invalidations-with.patch             +
>> > v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+
>> > v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+
>> > v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch       +
>> > v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+
>> > v14-0007-Track-statistics-for-streaming.patch                  +
>> > v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch +
>> > v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch              +
>> > v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
>> 
>> applied on top of 8128b0c (a few hours ago)
> 

I've added your new patch

[bugfix_replica_identity_full_on_subscriber.patch]

on top of all those above but the crash (apparently the same crash) that 
I had earlier still occurs (and pretty soon).

server process (PID 1721) was terminated by signal 11: Segmentation 
fault

I'll try to isolate it better and get a stacktrace


> Hi Erik,
> 
> While setting up the cascading replication I have hit one issue on
> base code[1].  After fixing that I have got one crash with streaming
> on patch.  I am not sure whether you are facing any of these 2 issues
> or any other issue.  If your issue is not any of these then plese
> share the callstack and steps to reproduce.
> 
> [1]
> https://www.postgresql.org/message-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w%40mail.gmail.com
> 
> 
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Kuntal Ghosh
Дата:
On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>

Few review comments from 0006-Add-support-for-streaming*.patch

+ subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
lseek can return a negative value in case of error, right?
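
For example, something along these lines (only a sketch; the error
message wording is made up):

 off_t		endoff;

 endoff = lseek(stream_fd, 0, SEEK_END);
 if (endoff < 0)
 	ereport(ERROR,
 			(errcode_for_file_access(),
 			 errmsg("could not seek to end of streaming temporary file: %m")));

 subxacts[nsubxacts].offset = endoff;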

+ /*
+ * We might need to create the tablespace's tempfile directory, if no
+ * one has yet done so.
+ *
+ * Don't check for error from mkdir; it could fail if the directory
+ * already exists (maybe someone else just did the same thing).  If
+ * it doesn't work then we'll bomb out when opening the file
+ */
+ mkdir(tempdirpath, S_IRWXU);
If that's the only reason, perhaps we can use something like following:

if (mkdir(tempdirpath, S_IRWXU) < 0 && errno != EEXIST)
throw error;

+
+ CloseTransientFile(stream_fd);
This might fail to close the file; we should handle that case.
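
To make these suggestions concrete, here is a minimal sketch of the kind of
error handling meant, in the usual backend style.  stream_fd, subxacts,
nsubxacts and tempdirpath are taken from the quoted patch code; "path" is a
placeholder for whatever filename the patch has at hand, and the exact
placement and wording are only illustrative:

    /* check lseek's return value instead of storing it blindly */
    off_t       endpos = lseek(stream_fd, 0, SEEK_END);

    if (endpos < 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not seek in file \"%s\": %m", path)));
    subxacts[nsubxacts].offset = endpos;

    /* EEXIST is expected if someone else created the directory already */
    if (mkdir(tempdirpath, S_IRWXU) < 0 && errno != EEXIST)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not create directory \"%s\": %m", tempdirpath)));

    /* CloseTransientFile() returns close()'s result, so check it as well */
    if (CloseTransientFile(stream_fd) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not close file \"%s\": %m", path)));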

Also, I think we need some implementations in dumpSubscription() to
dump the (streaming = 'on') option.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Tomas Vondra
Дата:
On Mon, Apr 13, 2020 at 05:20:39PM +0530, Dilip Kumar wrote:
>On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>>
>> On Thu, Apr 9, 2020 at 2:40 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> >
>> > I have rebased the patch on the latest head.  I haven't yet changed
>> > anything for xid assignment thing because it is not yet concluded.
>> >
>> Some review comments from 0001-Immediately-WAL-log-*.patch,
>>
>> +bool
>> +IsSubTransactionAssignmentPending(void)
>> +{
>> + if (!XLogLogicalInfoActive())
>> + return false;
>> +
>> + /* we need to be in a transaction state */
>> + if (!IsTransactionState())
>> + return false;
>> +
>> + /* it has to be a subtransaction */
>> + if (!IsSubTransaction())
>> + return false;
>> +
>> + /* the subtransaction has to have a XID assigned */
>> + if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
>> + return false;
>> +
>> + /* and it needs to have 'assigned' */
>> + return !CurrentTransactionState->assigned;
>> +
>> +}
>> IMHO, it's important to reduce the complexity of this function since
>> it's been called for every WAL insertion. During the lifespan of a
>> transaction, any of these if conditions will only be evaluated if
>> previous conditions are true. So, we can maintain some state machine
>> to avoid multiple evaluation of a condition inside a transaction. But,
>> if the overhead is not much, it's not worth I guess.
>
>Yeah maybe, in some cases we can avoid checking multiple conditions by
>maintaining that state.  But, that state will have to be at the
>transaction level.  But, I am not sure how much worth it will be to
>add one extra condition to skip a few if checks and it will also add
>the code complexity.  And, in some cases where logical decoding is not
>enabled, it may add one extra check? I mean first check the state and
>that will take you to the first if check.
>

Perhaps. I think we should only do that if we can demonstrate it's an
issue in practice. Otherwise it's just unnecessary complexity.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Erik Rijkers
Дата:
On 2020-04-16 11:46, Erik Rijkers wrote:
> On 2020-04-16 11:33, Dilip Kumar wrote:
>> On Tue, Apr 14, 2020 at 9:14 PM Erik Rijkers <er@xs4all.nl> wrote:
>>> 
>>> On 2020-04-14 12:10, Dilip Kumar wrote:
>>> 
>>> > v14-0001-Immediately-WAL-log-assignments.patch                 +
>>> > v14-0002-Issue-individual-invalidations-with.patch             +
>>> > v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+
>>> > v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+
>>> > v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch       +
>>> > v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+
>>> > v14-0007-Track-statistics-for-streaming.patch                  +
>>> > v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch +
>>> > v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch              +
>>> > v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
>>> 
>>> applied on top of 8128b0c (a few hours ago)
>> 
> 
> I've added your new patch
> 
> [bugfix_replica_identity_full_on_subscriber.patch]
> 
> on top of all those above but the crash (apparently the same crash)
> that I had earlier still occurs (and pretty soon).
> 
> server process (PID 1721) was terminated by signal 11: Segmentation 
> fault
> 
> I'll try to isolate it better and get a stacktrace
> 
> 
>> Hi Erik,
>> 
>> While setting up the cascading replication I have hit one issue on
>> base code[1].  After fixing that I have got one crash with streaming
>> on patch.  I am not sure whether you are facing any of these 2 issues
>> or any other issue.  If your issue is not any of these then plese
>> share the callstack and steps to reproduce.

I figured out a few things about this. Attached is a bash script 
test.sh, to reproduce:

There is a variable  CRASH_IT  that determines whether the whole thing 
will fail (with a segmentation fault) or not.  As attached it has 
CRASH_IT=0 and does not crash.  When you change that to CRASH_IT=1, then 
it will crash.  It turns out that this just depends on a short wait 
state (3 seconds, on my machine) between setting up the replication and 
the running of pgbench.  It's possible that on very fast machines it 
does not occur; we've had such differences between hardware before. 
This is an i5-3330S.

It deletes files, so look it over before you run it.  It may also depend 
on some of my local set-up, but I guess that should be easily fixed.

Can you let me know if you can reproduce the problem with this?

thanks,

Erik Rijkers



>> 
>> [1]
>> https://www.postgresql.org/message-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w%40mail.gmail.com
>> 
>> 
>> --
>> Regards,
>> Dilip Kumar
>> EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Erik Rijkers
Дата:
On 2020-04-18 11:07, Erik Rijkers wrote:
>>> Hi Erik,
>>> 
>>> While setting up the cascading replication I have hit one issue on
>>> base code[1].  After fixing that I have got one crash with streaming
>>> on patch.  I am not sure whether you are facing any of these 2 issues
>>> or any other issue.  If your issue is not any of these then plese
>>> share the callstack and steps to reproduce.
> 
> I figured out a few things about this. Attached is a bash script
> test.sh, to reproduce:

And the attached file, test.sh.  (sorry)

> There is a variable  CRASH_IT  that determines whether the whole thing
> will fail (with a segmentation fault) or not.  As attached it has
> CRASH_IT=0 and does not crash.  When you change that to CRASH_IT=1,
> then it will crash.  It turns out that this just depends on a short
> wait state (3 seconds, on my machine) between setting up de
> replication, and the running of pgbench.  It's possible that on very
> fast machines maybe it does not occur; we've had such difference
> between hardware before. This is a i5-3330S.
> 
> It deletes files so look it over before you run it.  It may also
> depend on some of my local set-up but I guess that should be easily
> fixed.
> 
> Can you let me know if you can reproduce the problem with this?
> 
> thanks,
> 
> Erik Rijkers
> 
> 
> 
>>> 
>>> [1]
>>> https://www.postgresql.org/message-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w%40mail.gmail.com
>>> 
>>> 
>>> --
>>> Regards,
>>> Dilip Kumar
>>> EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Erik Rijkers
Дата:
On 2020-04-18 11:10, Erik Rijkers wrote:
> On 2020-04-18 11:07, Erik Rijkers wrote:
>>>> Hi Erik,
>>>> 
>>>> While setting up the cascading replication I have hit one issue on
>>>> base code[1].  After fixing that I have got one crash with streaming
>>>> on patch.  I am not sure whether you are facing any of these 2 
>>>> issues
>>>> or any other issue.  If your issue is not any of these then plese
>>>> share the callstack and steps to reproduce.
>> 
>> I figured out a few things about this. Attached is a bash script
>> test.sh, to reproduce:
> 
> And the attached file, test.sh.  (sorry)

It turns out I must have been mistaken somewhere.  I probably missed 
bugfix_in_schema_sent.patch.

I have just now rebuilt all the instances on top of master with these 
patches:

> [v14-0001-Immediately-WAL-log-assignments.patch]
> [v14-0002-Issue-individual-invalidations-with.patch]
> [v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch]
> [v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch]
> [v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch]
> [v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch]
> [v14-0007-Track-statistics-for-streaming.patch]
> [v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch]
> [v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch]
> [v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch]
> [bugfix_in_schema_sent.patch]

    (by the way: this build's regression tests  'ddl', 'toast', and 
'spill' fail)

I seem now able to run all my test programs on these instances without 
errors.

Sorry, I seem to have raised a false alarm (although there was initially 
certainly a problem).


Erik Rijkers



>> There is a variable  CRASH_IT  that determines whether the whole thing
>> will fail (with a segmentation fault) or not.  As attached it has
>> CRASH_IT=0 and does not crash.  When you change that to CRASH_IT=1,
>> then it will crash.  It turns out that this just depends on a short
>> wait state (3 seconds, on my machine) between setting up de
>> replication, and the running of pgbench.  It's possible that on very
>> fast machines maybe it does not occur; we've had such difference
>> between hardware before. This is a i5-3330S.
>> 
>> It deletes files so look it over before you run it.  It may also
>> depend on some of my local set-up but I guess that should be easily
>> fixed.
>> 
>> Can you let me know if you can reproduce the problem with this?
>> 
>> thanks,
>> 
>> Erik Rijkers
>> 
>> 
>> 
>>>> 
>>>> [1]
>>>> https://www.postgresql.org/message-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w%40mail.gmail.com
>>>> 
>>>> 
>>>> --
>>>> Regards,
>>>> Dilip Kumar
>>>> EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Sat, Apr 18, 2020 at 6:12 PM Erik Rijkers <er@xs4all.nl> wrote:
>
> On 2020-04-18 11:10, Erik Rijkers wrote:
> > On 2020-04-18 11:07, Erik Rijkers wrote:
> >>>> Hi Erik,
> >>>>
> >>>> While setting up the cascading replication I have hit one issue on
> >>>> base code[1].  After fixing that I have got one crash with streaming
> >>>> on patch.  I am not sure whether you are facing any of these 2
> >>>> issues
> >>>> or any other issue.  If your issue is not any of these then plese
> >>>> share the callstack and steps to reproduce.
> >>
> >> I figured out a few things about this. Attached is a bash script
> >> test.sh, to reproduce:
> >
> > And the attached file, test.sh.  (sorry)
>
> It turns out I must have been mistaken somewhere.  I probably missed
> bugfix_in_schema_sent.patch)
>
> I have just now rebuilt all the instances on top of master with these
> patches:
>
> > [v14-0001-Immediately-WAL-log-assignments.patch]
> > [v14-0002-Issue-individual-invalidations-with.patch]
> > [v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch]
> > [v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch]
> > [v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch]
> > [v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch]
> > [v14-0007-Track-statistics-for-streaming.patch]
> > [v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch]
> > [v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch]
> > [v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch]
> > [bugfix_in_schema_sent.patch]
>
>     (by the way: this build's regression tests  'ddl', 'toast', and
> 'spill' fail)
>
> I seem now able to run all my test programs on these instances without
> errors.
>
> Sorry, I seem to have raised a false alarm (although there was initially
> certainly a problem).

No problem.  Thanks for confirming.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Sat, Apr 18, 2020 at 6:12 PM Erik Rijkers <er@xs4all.nl> wrote:
>
> On 2020-04-18 11:10, Erik Rijkers wrote:
> > On 2020-04-18 11:07, Erik Rijkers wrote:
> >>>> Hi Erik,
> >>>>
> >>>> While setting up the cascading replication I have hit one issue on
> >>>> base code[1].  After fixing that I have got one crash with streaming
> >>>> on patch.  I am not sure whether you are facing any of these 2
> >>>> issues
> >>>> or any other issue.  If your issue is not any of these then plese
> >>>> share the callstack and steps to reproduce.
> >>
> >> I figured out a few things about this. Attached is a bash script
> >> test.sh, to reproduce:
> >
> > And the attached file, test.sh.  (sorry)
>
> It turns out I must have been mistaken somewhere.  I probably missed
> bugfix_in_schema_sent.patch)
>
> I have just now rebuilt all the instances on top of master with these
> patches:
>
> > [v14-0001-Immediately-WAL-log-assignments.patch]
> > [v14-0002-Issue-individual-invalidations-with.patch]
> > [v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch]
> > [v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch]
> > [v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch]
> > [v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch]
> > [v14-0007-Track-statistics-for-streaming.patch]
> > [v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch]
> > [v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch]
> > [v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch]
> > [bugfix_in_schema_sent.patch]
>
>     (by the way: this build's regression tests  'ddl', 'toast', and
> 'spill' fail)

Yeah, this is a known issue: while streaming the transaction the output
message is changed.  I have a plan to work on this part.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Tue, Apr 21, 2020 at 5:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Apr 18, 2020 at 6:12 PM Erik Rijkers <er@xs4all.nl> wrote:
> >
> > On 2020-04-18 11:10, Erik Rijkers wrote:
> > > On 2020-04-18 11:07, Erik Rijkers wrote:
> > >>>> Hi Erik,
> > >>>>
> > >>>> While setting up the cascading replication I have hit one issue on
> > >>>> base code[1].  After fixing that I have got one crash with streaming
> > >>>> on patch.  I am not sure whether you are facing any of these 2
> > >>>> issues
> > >>>> or any other issue.  If your issue is not any of these then plese
> > >>>> share the callstack and steps to reproduce.
> > >>
> > >> I figured out a few things about this. Attached is a bash script
> > >> test.sh, to reproduce:
> > >
> > > And the attached file, test.sh.  (sorry)
> >
> > It turns out I must have been mistaken somewhere.  I probably missed
> > bugfix_in_schema_sent.patch)
> >
> > I have just now rebuilt all the instances on top of master with these
> > patches:
> >
> > > [v14-0001-Immediately-WAL-log-assignments.patch]
> > > [v14-0002-Issue-individual-invalidations-with.patch]
> > > [v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch]
> > > [v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch]
> > > [v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch]
> > > [v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch]
> > > [v14-0007-Track-statistics-for-streaming.patch]
> > > [v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch]
> > > [v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch]
> > > [v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch]
> > > [bugfix_in_schema_sent.patch]
> >
> >     (by the way: this build's regression tests  'ddl', 'toast', and
> > 'spill' fail)
>
> Yeah, this is a. known issue, actually, while streaming the
> transaction the output message is changed.  I have a plan to work on
> this part.

I have fixed this part.  Basically, I have now created a separate
function, 'pg_logical_slot_get_streaming_changes', to get the streaming
changes.  The default function pg_logical_slot_get_changes works as
before, so the test_decoding test cases will not fail.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Erik Rijkers
Дата:
On 2020-04-22 16:49, Dilip Kumar wrote:
> On Tue, Apr 21, 2020 at 5:30 PM Dilip Kumar <dilipbalaut@gmail.com> 
> wrote:
>> 
>> >
>> >     (by the way: this build's regression tests  'ddl', 'toast', and
>> > 'spill' fail)
>> 
>> Yeah, this is a. known issue, actually, while streaming the
>> transaction the output message is changed.  I have a plan to work on
>> this part.
> 
> I have fixed this part.  Basically, now, I have created a separate
> function to get the streaming changes
> 'pg_logical_slot_get_streaming_changes'.  So the default function
> pg_logical_slot_get_changes will work as it is and test decoding test
> cases will not fail.

The 'ddl' one is apparently not quite fixed - I get this from '(cd 
contrib; make check)' (in both assert-enabled and non-assert-enabled 
builds).

grep -A7 -B7 make.check_contrib.out:

contrib/make.check_contrib.out-============== initializing database 
system           ==============
contrib/make.check_contrib.out-============== starting postmaster        
             ==============
contrib/make.check_contrib.out-running on port 64464 with PID 9175
contrib/make.check_contrib.out-============== creating database 
"contrib_regression" ==============
contrib/make.check_contrib.out-CREATE DATABASE
contrib/make.check_contrib.out-ALTER DATABASE
contrib/make.check_contrib.out-============== running regression test 
queries        ==============
contrib/make.check_contrib.out:test ddl                          ... 
FAILED      840 ms
contrib/make.check_contrib.out-test xact                         ... ok  
          24 ms
contrib/make.check_contrib.out-test rewrite                      ... ok  
         187 ms
contrib/make.check_contrib.out-test toast                        ... ok  
         851 ms
contrib/make.check_contrib.out-test permissions                  ... ok  
          26 ms
contrib/make.check_contrib.out-test decoding_in_xact             ... ok  
          31 ms
contrib/make.check_contrib.out-test decoding_into_rel            ... ok  
          25 ms
contrib/make.check_contrib.out-test binary                       ... ok  
          12 ms

Otherwise patches apply and build OK so will go run some tests...






Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er@xs4all.nl> wrote:
>
> On 2020-04-22 16:49, Dilip Kumar wrote:
> > On Tue, Apr 21, 2020 at 5:30 PM Dilip Kumar <dilipbalaut@gmail.com>
> > wrote:
> >>
> >> >
> >> >     (by the way: this build's regression tests  'ddl', 'toast', and
> >> > 'spill' fail)
> >>
> >> Yeah, this is a. known issue, actually, while streaming the
> >> transaction the output message is changed.  I have a plan to work on
> >> this part.
> >
> > I have fixed this part.  Basically, now, I have created a separate
> > function to get the streaming changes
> > 'pg_logical_slot_get_streaming_changes'.  So the default function
> > pg_logical_slot_get_changes will work as it is and test decoding test
> > cases will not fail.
>
> The 'ddl' one is apparently not quite fixed  - I get this in (cd
> contrib; make check)' (in both assert-enabled and non-assert-enabled
> build)

Can you send me the contrib/test_decoding/regression.diffs file?

> grep -A7 -B7 make.check_contrib.out:
>
> contrib/make.check_contrib.out-============== initializing database
> system           ==============
> contrib/make.check_contrib.out-============== starting postmaster
>              ==============
> contrib/make.check_contrib.out-running on port 64464 with PID 9175
> contrib/make.check_contrib.out-============== creating database
> "contrib_regression" ==============
> contrib/make.check_contrib.out-CREATE DATABASE
> contrib/make.check_contrib.out-ALTER DATABASE
> contrib/make.check_contrib.out-============== running regression test
> queries        ==============
> contrib/make.check_contrib.out:test ddl                          ...
> FAILED      840 ms
> contrib/make.check_contrib.out-test xact                         ... ok
>           24 ms
> contrib/make.check_contrib.out-test rewrite                      ... ok
>          187 ms
> contrib/make.check_contrib.out-test toast                        ... ok
>          851 ms
> contrib/make.check_contrib.out-test permissions                  ... ok
>           26 ms
> contrib/make.check_contrib.out-test decoding_in_xact             ... ok
>           31 ms
> contrib/make.check_contrib.out-test decoding_into_rel            ... ok
>           25 ms
> contrib/make.check_contrib.out-test binary                       ... ok
>           12 ms
>
> Otherwise patches apply and build OK so will go run some tests...

Thanks


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Erik Rijkers
Дата:
On 2020-04-23 05:24, Dilip Kumar wrote:
> On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er@xs4all.nl> wrote:
>> 
>> The 'ddl' one is apparently not quite fixed  - I get this in (cd
>> contrib; make check)' (in both assert-enabled and non-assert-enabled
>> build)
> 
> Can you send me the contrib/test_decoding/regression.diffs file?

Attached.


Below is the patch list, in case that was unclear

20200422/v15-0001-Immediately-WAL-log-assignments.patch                 +
20200422/v15-0002-Issue-individual-invalidations-with-wal_level-lo.patch+
20200422/v15-0003-Extend-the-output-plugin-API-with-stream-methods.patch+
20200422/v15-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+
20200422/v15-0005-Implement-streaming-mode-in-ReorderBuffer.patch       +
20200422/v15-0006-Add-support-for-streaming-to-built-in-replicatio.patch+
20200422/v15-0007-Track-statistics-for-streaming.patch                  +
20200422/v15-0008-Enable-streaming-for-all-subscription-TAP-tests.patch +
20200422/v15-0009-Add-TAP-test-for-streaming-vs.-DDL.patch              +
20200422/v15-0010-Bugfix-handling-of-incomplete-toast-tuple.patch       +
20200422/v15-0011-Provide-new-api-to-get-the-streaming-changes.patch    +
20200414/bugfix_in_schema_sent.patch



>> grep -A7 -B7 make.check_contrib.out:
>> 
>> contrib/make.check_contrib.out-============== initializing database
>> system           ==============
>> contrib/make.check_contrib.out-============== starting postmaster
>>              ==============
>> contrib/make.check_contrib.out-running on port 64464 with PID 9175
>> contrib/make.check_contrib.out-============== creating database
>> "contrib_regression" ==============
>> contrib/make.check_contrib.out-CREATE DATABASE
>> contrib/make.check_contrib.out-ALTER DATABASE
>> contrib/make.check_contrib.out-============== running regression test
>> queries        ==============
>> contrib/make.check_contrib.out:test ddl                          ...
>> FAILED      840 ms
>> contrib/make.check_contrib.out-test xact                         ... 
>> ok
>>           24 ms
>> contrib/make.check_contrib.out-test rewrite                      ... 
>> ok
>>          187 ms
>> contrib/make.check_contrib.out-test toast                        ... 
>> ok
>>          851 ms
>> contrib/make.check_contrib.out-test permissions                  ... 
>> ok
>>           26 ms
>> contrib/make.check_contrib.out-test decoding_in_xact             ... 
>> ok
>>           31 ms
>> contrib/make.check_contrib.out-test decoding_into_rel            ... 
>> ok
>>           25 ms
>> contrib/make.check_contrib.out-test binary                       ... 
>> ok
>>           12 ms
>> 
>> Otherwise patches apply and build OK so will go run some tests...
> 
> Thanks
> 
> 
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Thu, Apr 23, 2020 at 2:28 PM Erik Rijkers <er@xs4all.nl> wrote:
>
> On 2020-04-23 05:24, Dilip Kumar wrote:
> > On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er@xs4all.nl> wrote:
> >>
> >> The 'ddl' one is apparently not quite fixed  - I get this in (cd
> >> contrib; make check)' (in both assert-enabled and non-assert-enabled
> >> build)
> >
> > Can you send me the contrib/test_decoding/regression.diffs file?
>
> Attached.

So from regression.diffs, it appears that it is failing in a memory
allocation (+ERROR:  invalid memory alloc request size
94119198201896).  My colleague tried to reproduce this in a different
environment but has had no success so far.  One more thing that
surprises me is that after
(v15-0011-Provide-new-api-to-get-the-streaming-changes.patch)
it should never take the streaming path at all.  However, we cannot
ignore the fact that some of the changes might impact the
non-streaming path as well.  Is it possible for you to somehow stop or
break the code and send the stack trace?  One idea: from the log we can
see where the error is raised, i.e. MemoryContextAlloc or palloc or
some other similar function.  Once we know that, we can
convert that error to an assert and find the call stack.
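
(To make that trick concrete: it amounts to a temporary, local hack in
MemoryContextAlloc, src/backend/utils/mmgr/mcxt.c, along these lines; it
needs an assert-enabled build and is only meant to make the backend dump
core at the offending call site so a backtrace can be captured.)

    if (!AllocSizeIsValid(size))
    {
        /* temporary debugging hack: crash here to get a usable backtrace */
        Assert(false);
        elog(ERROR, "invalid memory alloc request size %zu", size);
    }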

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Fri, Apr 17, 2020 at 1:40 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
> On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
>
> Few review comments from 0006-Add-support-for-streaming*.patch
>
> + subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
> lseek can return (-)ve value in case of error, right?
>
> + /*
> + * We might need to create the tablespace's tempfile directory, if no
> + * one has yet done so.
> + *
> + * Don't check for error from mkdir; it could fail if the directory
> + * already exists (maybe someone else just did the same thing).  If
> + * it doesn't work then we'll bomb out when opening the file
> + */
> + mkdir(tempdirpath, S_IRWXU);
> If that's the only reason, perhaps we can use something like following:
>
> if (mkdir(tempdirpath, S_IRWXU) < 0 && errno != EEXIST)
> throw error;

Done

>
> +
> + CloseTransientFile(stream_fd);
> Might failed to close the file. We should handle the case.

Changed

One place is still pending because I don't have the filename there to
report in the error.  One option is to just raise the error without the
filename.  I will think about this part some more.

> Also, I think we need some implementations in dumpSubscription() to
> dump the (streaming = 'on') option.

Right; I have created another patch for that and attached it.

I have also fixed a couple of bugs internally reported by my colleague
Neha Sharma.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Amit Kapila
Дата:
On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have also fixed a couple of bugs internally reported by my colleague
> Neha Sharma.
>

I think it would be good if you could briefly explain what the bugs
were and how you fixed them.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Mon, Apr 27, 2020 at 4:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have also fixed a couple of bugs internally reported by my colleague
> > Neha Sharma.
> >
>
> I think it would be good if you can briefly explain what were the bugs
> and how you fixed those?

Issue 1: If the concurrent transaction was aborted, then in the CATCH
block we were not freeing the memory of the toast hash, and that was
tripping the assertion that txn->size is 0 after the stream is complete.

Issue 2: After streaming is complete we set txn->final_lsn from a value
we remember in a local variable.  But mistakenly it was remembered in a
variable local to the TRY block, so on a concurrent abort the CATCH
block always saw the value as zero.  As a result final_lsn ended up as 0
after streaming, and that was asserting.
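
Roughly, the fix for Issue 2 has the following shape.  This is a
simplified sketch rather than the patch's actual code; the streaming body
is elided and last_streamed_lsn is a placeholder name, but the point is
that the variable feeding txn->final_lsn must live outside the PG_TRY
block (and be volatile, since it is modified inside it) so the PG_CATCH
path still sees the value:

    volatile XLogRecPtr last_streamed_lsn = InvalidXLogRecPtr;

    PG_TRY();
    {
        /* ... decode and stream the changes, advancing last_streamed_lsn ... */
    }
    PG_CATCH();
    {
        /*
         * Concurrent abort: we still know how far we streamed.  A variable
         * declared inside the TRY block would always read as zero here.
         */
        txn->final_lsn = last_streamed_lsn;
        /* ... handle the expected concurrent-abort error, or re-throw ... */
    }
    PG_END_TRY();

    txn->final_lsn = last_streamed_lsn;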

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Amit Kapila
Дата:
On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
[latest patches]

v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
-     Any actions leading to transaction ID assignment are prohibited.
That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the
<literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment
are prohibited. That, among others,
..
@@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
  bool valid;

  /*
+ * We don't expect direct calls to heap_fetch with valid
+ * CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+ elog(ERROR, "unexpected heap_fetch call during logical decoding");
+

I think the comments and code don't match.  In the comment, we say that
access from output plugins to user catalog tables or regular system
catalog tables won't be allowed via the heap_* APIs, but the code
doesn't seem to reflect that.  I feel only the
TransactionIdIsValid(CheckXidAlive) check is sufficient here.  See the
original discussion about this point [1] (refer to "I think it'd also be
good to add assertions to codepaths not going through systable_*
asserting that ...").

Isn't it better to block scans of user catalog tables or regular system
catalog tables at the tableam scan API level rather than at the heap
level?  There might be some APIs like heap_getnext where such a check is
still required, but I guess it is still better to block at the tableam
level.

[1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de
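
To illustrate the suggestion (this is not the patch itself), a guard at
the tableam level could look roughly like the following, reusing the
patch's CheckXidAlive and the existing table_scan_getnextslot() wrapper
in src/include/access/tableam.h, so the check does not have to be
duplicated in every heap-level routine:

    static inline bool
    table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction,
                           TupleTableSlot *slot)
    {
        /* sketch: block non-systable access while decoding in-progress xacts */
        if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
                     !(IsCatalogRelation(sscan->rs_rd) ||
                       RelationIsUsedAsCatalogTable(sscan->rs_rd))))
            elog(ERROR, "unexpected table scan during logical decoding");

        return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
    }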

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> [latest patches]
>
> v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> -     Any actions leading to transaction ID assignment are prohibited.
> That, among others,
> +     Note that access to user catalog tables or regular system catalog tables
> +     in the output plugins has to be done via the
> <literal>systable_*</literal> scan APIs only.
> +     Access via the <literal>heap_*</literal> scan APIs will error out.
> +     Additionally, any actions leading to transaction ID assignment
> are prohibited. That, among others,
> ..
> @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
>   bool valid;
>
>   /*
> + * We don't expect direct calls to heap_fetch with valid
> + * CheckXidAlive for regular tables. Track that below.
> + */
> + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
> + elog(ERROR, "unexpected heap_fetch call during logical decoding");
> +
>
> I think comments and code don't match.  In the comment, we are saying
> that via output plugins access to user catalog tables or regular
> system catalog tables won't be allowed via heap_* APIs but code
> doesn't seem to reflect it.  I feel only
> TransactionIdIsValid(CheckXidAlive) is sufficient here.  See, the
> original discussion about this point [1] (Refer "I think it'd also be
> good to add assertions to codepaths not going through systable_*
> asserting that ...").

Right.  So I think we can just add an assert in these functions:
Assert(!TransactionIdIsValid(CheckXidAlive))?

>
> Isn't it better to block the scan to user catalog tables or regular
> system catalog tables for tableam scan APIs rather than at the heap
> level?  There might be some APIs like heap_getnext where such a check
> might still be required but I guess it is still better to block at
> tableam level.
>
> [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de

Okay, let me analyze this part.  Somewhere we have to keep the check at
the heap level, like heap_getnext, and at other places at the tableam
level, so it seems a bit inconsistent.  Also, I think the number of
checks might increase because some of the heap functions, like
heap_hot_search_buffer, are called from multiple tableam calls, so we
would need to put the check in every such place.

Another point is that I feel some of the checks we have today might not
be required.  For example, heap_finish_speculative is not fetching any
tuple for us, so why do we need to care about that function?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Amit Kapila
Дата:
On Tue, Apr 28, 2020 at 3:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > [latest patches]
> >
> > v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> > -     Any actions leading to transaction ID assignment are prohibited.
> > That, among others,
> > +     Note that access to user catalog tables or regular system catalog tables
> > +     in the output plugins has to be done via the
> > <literal>systable_*</literal> scan APIs only.
> > +     Access via the <literal>heap_*</literal> scan APIs will error out.
> > +     Additionally, any actions leading to transaction ID assignment
> > are prohibited. That, among others,
> > ..
> > @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
> >   bool valid;
> >
> >   /*
> > + * We don't expect direct calls to heap_fetch with valid
> > + * CheckXidAlive for regular tables. Track that below.
> > + */
> > + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> > + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
> > + elog(ERROR, "unexpected heap_fetch call during logical decoding");
> > +
> >
> > I think comments and code don't match.  In the comment, we are saying
> > that via output plugins access to user catalog tables or regular
> > system catalog tables won't be allowed via heap_* APIs but code
> > doesn't seem to reflect it.  I feel only
> > TransactionIdIsValid(CheckXidAlive) is sufficient here.  See, the
> > original discussion about this point [1] (Refer "I think it'd also be
> > good to add assertions to codepaths not going through systable_*
> > asserting that ...").
>
> Right,  So I think we can just add an assert in these function that
> Assert(!TransactionIdIsValid(CheckXidAlive)) ?
>

I am fine with an Assertion, but update the documentation accordingly.
However, I think you should first cross-verify whether any existing
output plugins are already using such APIs.  There is a list of "Logical
Decoding Plugins" on the wiki [1]; just look through those once.

> >
> > Isn't it better to block the scan to user catalog tables or regular
> > system catalog tables for tableam scan APIs rather than at the heap
> > level?  There might be some APIs like heap_getnext where such a check
> > might still be required but I guess it is still better to block at
> > tableam level.
> >
> > [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de
>
> Okay, let me analyze this part.  Because someplace we have to keep at
> heap level like heap_getnext and other places at tableam level so it
> seems a bit inconsistent.  Also, I think the number of checks might
> going to increase because some of the heap functions like
> heap_hot_search_buffer are being called from multiple tableam calls,
> so we need to put check at every place.
>
> Another point is that I feel some of the checks what we have today
> might not be required like heap_finish_speculative, is not fetching
> any tuple for us so why do we need to care about this function?
>

Yeah, I don't see the need for such a check (or Assertion) in
heap_finish_speculative.

One additional comment:
---------------------------------------
-     Any actions leading to transaction ID assignment are prohibited.
That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the
<literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment
are prohibited. That, among others,

The above text doesn't seem to be aligned properly, and you will need to
update it if we change the error to an Assertion for the heap APIs.

[1] - https://wiki.postgresql.org/wiki/Logical_Decoding_Plugins

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Mahendra Singh Thalor
Дата:
On Fri, 24 Apr 2020 at 11:55, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Apr 23, 2020 at 2:28 PM Erik Rijkers <er@xs4all.nl> wrote:
> >
> > On 2020-04-23 05:24, Dilip Kumar wrote:
> > > On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er@xs4all.nl> wrote:
> > >>
> > >> The 'ddl' one is apparently not quite fixed  - I get this in (cd
> > >> contrib; make check)' (in both assert-enabled and non-assert-enabled
> > >> build)
> > >
> > > Can you send me the contrib/test_decoding/regression.diffs file?
> >
> > Attached.
>
> So from regression.diff, it appears that in failing in memory
> allocation (+ERROR:  invalid memory alloc request size
> 94119198201896).  My colleague tried to reproduce this in a different
> environment but there is no success so far.  One more thing surprises
> me is that after
> (v15-0011-Provide-new-api-to-get-the-streaming-changes.patch)
> actually, it should never go for the streaming path. However, we can
> not ignore the fact that some of the changes might impact the
> non-streaming path as well.  Is it possible for you to somehow stop or
> break the code and send the stack trace?  One idea is by seeing the
> log we can see from where the error is raised i.e MemoryContextAlloc
> or palloc or some other similar function.  Once we know that we can
> convert that error to an assert and find the call stack.
>
> --

Thanks, Erik, for reporting this issue.

I am able to reproduce this issue (+ERROR:  invalid memory alloc
request size) on top of the v16 patch set.  I applied all 12 patches of
the v16 series and then ran "make check -i" from the
"contrib/test_decoding" folder.  Below is the stack trace of the error:

#0 0x0000560b1350902d in MemoryContextAlloc (context=0x560b14188d70,
size=94605581787992) at mcxt.c:806
#1 0x0000560b130f0ad5 in ReorderBufferRestoreChange
(rb=0x560b14188e90, txn=0x560b141baf08, data=0x560b1418a5e8 "K") at
reorderbuffer.c:3680
#2 0x0000560b130f0662 in ReorderBufferRestoreChanges
(rb=0x560b14188e90, txn=0x560b141baf08, file=0x560b1418ad10,
segno=0x560b1418ad20) at reorderbuffer.c:3564
#3 0x0000560b130e918a in ReorderBufferIterTXNInit (rb=0x560b14188e90,
txn=0x560b141baf08, iter_state=0x7ffef18b1600) at reorderbuffer.c:1186
#4 0x0000560b130eaee1 in ReorderBufferProcessTXN (rb=0x560b14188e90,
txn=0x560b141baf08, commit_lsn=25986584, snapshot_now=0x560b141b74d8,
command_id=0, streaming=false)
at reorderbuffer.c:1785
#5 0x0000560b130ecae1 in ReorderBufferCommit (rb=0x560b14188e90,
xid=508, commit_lsn=25986584, end_lsn=25989088,
commit_time=641449268431600, origin_id=0, origin_lsn=0)
at reorderbuffer.c:2315
#6 0x0000560b130d14a1 in DecodeCommit (ctx=0x560b1416ea80,
buf=0x7ffef18b19b0, parsed=0x7ffef18b1850, xid=508) at decode.c:654
#7 0x0000560b130cff98 in DecodeXactOp (ctx=0x560b1416ea80,
buf=0x7ffef18b19b0) at decode.c:261
#8 0x0000560b130cf99a in LogicalDecodingProcessRecord
(ctx=0x560b1416ea80, record=0x560b1416ee00) at decode.c:130
#9 0x0000560b130dbbbc in pg_logical_slot_get_changes_guts
(fcinfo=0x560b1417ee50, confirm=true, binary=false, streaming=false)
at logicalfuncs.c:285
#10 0x0000560b130dbe71 in pg_logical_slot_get_changes
(fcinfo=0x560b1417ee50) at logicalfuncs.c:354
#11 0x0000560b12e294d4 in ExecMakeTableFunctionResult
(setexpr=0x560b14177838, econtext=0x560b14177748,
argContext=0x560b1417ed30, expectedDesc=0x560b141814a0,
randomAccess=false) at execSRF.c:234
#12 0x0000560b12e5490f in FunctionNext (node=0x560b14177630) at
nodeFunctionscan.c:94
#13 0x0000560b12e2c108 in ExecScanFetch (node=0x560b14177630,
accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15
<FunctionRecheck>) at execScan.c:133
#14 0x0000560b12e2c227 in ExecScan (node=0x560b14177630,
accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15
<FunctionRecheck>) at execScan.c:199
#15 0x0000560b12e54e9b in ExecFunctionScan (pstate=0x560b14177630) at
nodeFunctionscan.c:270
#16 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14177630) at
execProcnode.c:450
#17 0x0000560b12e3e172 in ExecProcNode (node=0x560b14177630) at
../../../src/include/executor/executor.h:245
#18 0x0000560b12e3e998 in fetch_input_tuple (aggstate=0x560b14176f40)
at nodeAgg.c:566
#19 0x0000560b12e4398f in agg_fill_hash_table
(aggstate=0x560b14176f40) at nodeAgg.c:2518
#20 0x0000560b12e42c9a in ExecAgg (pstate=0x560b14176f40) at nodeAgg.c:2139
#21 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176f40) at
execProcnode.c:450
#22 0x0000560b12e8bb58 in ExecProcNode (node=0x560b14176f40) at
../../../src/include/executor/executor.h:245
#23 0x0000560b12e8bd59 in ExecSort (pstate=0x560b14176d28) at nodeSort.c:108
#24 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176d28) at
execProcnode.c:450
#25 0x0000560b12e10e71 in ExecProcNode (node=0x560b14176d28) at
../../../src/include/executor/executor.h:245
#26 0x0000560b12e15c4c in ExecutePlan (estate=0x560b14176af0,
planstate=0x560b14176d28, use_parallel_mode=false,
operation=CMD_SELECT, sendTuples=true, numberTuples=0,
direction=ForwardScanDirection, dest=0x560b1419d188,
execute_once=true) at execMain.c:1646
#27 0x0000560b12e11a19 in standard_ExecutorRun
(queryDesc=0x560b1412db10, direction=ForwardScanDirection, count=0,
execute_once=true) at execMain.c:364
#28 0x0000560b12e116e1 in ExecutorRun (queryDesc=0x560b1412db10,
direction=ForwardScanDirection, count=0, execute_once=true) at
execMain.c:308
#29 0x0000560b131f2177 in PortalRunSelect (portal=0x560b140db860,
forward=true, count=0, dest=0x560b1419d188) at pquery.c:912
#30 0x0000560b131f1b14 in PortalRun (portal=0x560b140db860,
count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x560b1419d188, altdest=0x560b1419d188,
qc=0x7ffef18b2350) at pquery.c:756
#31 0x0000560b131e550b in exec_simple_query (
query_string=0x560b14076720 "/ display results, but hide most of the
output /\nSELECT count(*), min(data), max(data)\nFROM
pg_logical_slot_get_changes('regression_slot', NULL, NULL,
'include-xids', '0', 'skip-empty-xacts', '1')\nG"...) at
postgres.c:1239
#32 0x0000560b131ee343 in PostgresMain (argc=1, argv=0x560b1409faa0,
dbname=0x560b1409f858 "contrib_regression", username=0x560b1409f830
"mahendrathalor") at postgres.c:4315
#33 0x0000560b130a325b in BackendRun (port=0x560b14096880) at postmaster.c:4510
#34 0x0000560b130a22c3 in BackendStartup (port=0x560b14096880) at
postmaster.c:4202
#35 0x0000560b1309a5cc in ServerLoop () at postmaster.c:1727
#36 0x0000560b130997c9 in PostmasterMain (argc=8, argv=0x560b1406f010)
at postmaster.c:1400
#37 0x0000560b12ee9530 in main (argc=8, argv=0x560b1406f010) at main.c:210

I have an Ubuntu setup.  I think this is reproducing only on Ubuntu.  I
am looking into this issue with Dilip.

-- 
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Mahendra Singh Thalor
Дата:
On Wed, 29 Apr 2020 at 11:15, Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
>
> On Fri, 24 Apr 2020 at 11:55, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Apr 23, 2020 at 2:28 PM Erik Rijkers <er@xs4all.nl> wrote:
> > >
> > > On 2020-04-23 05:24, Dilip Kumar wrote:
> > > > On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er@xs4all.nl> wrote:
> > > >>
> > > >> The 'ddl' one is apparently not quite fixed  - I get this in (cd
> > > >> contrib; make check)' (in both assert-enabled and non-assert-enabled
> > > >> build)
> > > >
> > > > Can you send me the contrib/test_decoding/regression.diffs file?
> > >
> > > Attached.
> >
> > So from regression.diff, it appears that in failing in memory
> > allocation (+ERROR:  invalid memory alloc request size
> > 94119198201896).  My colleague tried to reproduce this in a different
> > environment but there is no success so far.  One more thing surprises
> > me is that after
> > (v15-0011-Provide-new-api-to-get-the-streaming-changes.patch)
> > actually, it should never go for the streaming path. However, we can
> > not ignore the fact that some of the changes might impact the
> > non-streaming path as well.  Is it possible for you to somehow stop or
> > break the code and send the stack trace?  One idea is by seeing the
> > log we can see from where the error is raised i.e MemoryContextAlloc
> > or palloc or some other similar function.  Once we know that we can
> > convert that error to an assert and find the call stack.
> >
> > --
>
> Thanks Erik for reporting this issue.
>
> I am able to reproduce this issue(+ERROR:  invalid memory alloc
> request size) on the top of v16 patch set. I applied all patches(12
> patches) of v16 series and then I fired "make check -i" from
> "contrib/test_decoding" folder. Below is stack trace of error:
>
> #0 0x0000560b1350902d in MemoryContextAlloc (context=0x560b14188d70,
> size=94605581787992) at mcxt.c:806
> #1 0x0000560b130f0ad5 in ReorderBufferRestoreChange
> (rb=0x560b14188e90, txn=0x560b141baf08, data=0x560b1418a5e8 "K") at
> reorderbuffer.c:3680
> #2 0x0000560b130f0662 in ReorderBufferRestoreChanges
> (rb=0x560b14188e90, txn=0x560b141baf08, file=0x560b1418ad10,
> segno=0x560b1418ad20) at reorderbuffer.c:3564
> #3 0x0000560b130e918a in ReorderBufferIterTXNInit (rb=0x560b14188e90,
> txn=0x560b141baf08, iter_state=0x7ffef18b1600) at reorderbuffer.c:1186
> #4 0x0000560b130eaee1 in ReorderBufferProcessTXN (rb=0x560b14188e90,
> txn=0x560b141baf08, commit_lsn=25986584, snapshot_now=0x560b141b74d8,
> command_id=0, streaming=false)
> at reorderbuffer.c:1785
> #5 0x0000560b130ecae1 in ReorderBufferCommit (rb=0x560b14188e90,
> xid=508, commit_lsn=25986584, end_lsn=25989088,
> commit_time=641449268431600, origin_id=0, origin_lsn=0)
> at reorderbuffer.c:2315
> #6 0x0000560b130d14a1 in DecodeCommit (ctx=0x560b1416ea80,
> buf=0x7ffef18b19b0, parsed=0x7ffef18b1850, xid=508) at decode.c:654
> #7 0x0000560b130cff98 in DecodeXactOp (ctx=0x560b1416ea80,
> buf=0x7ffef18b19b0) at decode.c:261
> #8 0x0000560b130cf99a in LogicalDecodingProcessRecord
> (ctx=0x560b1416ea80, record=0x560b1416ee00) at decode.c:130
> #9 0x0000560b130dbbbc in pg_logical_slot_get_changes_guts
> (fcinfo=0x560b1417ee50, confirm=true, binary=false, streaming=false)
> at logicalfuncs.c:285
> #10 0x0000560b130dbe71 in pg_logical_slot_get_changes
> (fcinfo=0x560b1417ee50) at logicalfuncs.c:354
> #11 0x0000560b12e294d4 in ExecMakeTableFunctionResult
> (setexpr=0x560b14177838, econtext=0x560b14177748,
> argContext=0x560b1417ed30, expectedDesc=0x560b141814a0,
> randomAccess=false) at execSRF.c:234
> #12 0x0000560b12e5490f in FunctionNext (node=0x560b14177630) at
> nodeFunctionscan.c:94
> #13 0x0000560b12e2c108 in ExecScanFetch (node=0x560b14177630,
> accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15
> <FunctionRecheck>) at execScan.c:133
> #14 0x0000560b12e2c227 in ExecScan (node=0x560b14177630,
> accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15
> <FunctionRecheck>) at execScan.c:199
> #15 0x0000560b12e54e9b in ExecFunctionScan (pstate=0x560b14177630) at
> nodeFunctionscan.c:270
> #16 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14177630) at
> execProcnode.c:450
> #17 0x0000560b12e3e172 in ExecProcNode (node=0x560b14177630) at
> ../../../src/include/executor/executor.h:245
> #18 0x0000560b12e3e998 in fetch_input_tuple (aggstate=0x560b14176f40)
> at nodeAgg.c:566
> #19 0x0000560b12e4398f in agg_fill_hash_table
> (aggstate=0x560b14176f40) at nodeAgg.c:2518
> #20 0x0000560b12e42c9a in ExecAgg (pstate=0x560b14176f40) at nodeAgg.c:2139
> #21 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176f40) at
> execProcnode.c:450
> #22 0x0000560b12e8bb58 in ExecProcNode (node=0x560b14176f40) at
> ../../../src/include/executor/executor.h:245
> #23 0x0000560b12e8bd59 in ExecSort (pstate=0x560b14176d28) at nodeSort.c:108
> #24 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176d28) at
> execProcnode.c:450
> #25 0x0000560b12e10e71 in ExecProcNode (node=0x560b14176d28) at
> ../../../src/include/executor/executor.h:245
> #26 0x0000560b12e15c4c in ExecutePlan (estate=0x560b14176af0,
> planstate=0x560b14176d28, use_parallel_mode=false,
> operation=CMD_SELECT, sendTuples=true, numberTuples=0,
> direction=ForwardScanDirection, dest=0x560b1419d188,
> execute_once=true) at execMain.c:1646
> #27 0x0000560b12e11a19 in standard_ExecutorRun
> (queryDesc=0x560b1412db10, direction=ForwardScanDirection, count=0,
> execute_once=true) at execMain.c:364
> #28 0x0000560b12e116e1 in ExecutorRun (queryDesc=0x560b1412db10,
> direction=ForwardScanDirection, count=0, execute_once=true) at
> execMain.c:308
> #29 0x0000560b131f2177 in PortalRunSelect (portal=0x560b140db860,
> forward=true, count=0, dest=0x560b1419d188) at pquery.c:912
> #30 0x0000560b131f1b14 in PortalRun (portal=0x560b140db860,
> count=9223372036854775807, isTopLevel=true, run_once=true,
> dest=0x560b1419d188, altdest=0x560b1419d188,
> qc=0x7ffef18b2350) at pquery.c:756
> #31 0x0000560b131e550b in exec_simple_query (
> query_string=0x560b14076720 "/ display results, but hide most of the
> output /\nSELECT count(*), min(data), max(data)\nFROM
> pg_logical_slot_get_changes('regression_slot', NULL, NULL,
> 'include-xids', '0', 'skip-empty-xacts', '1')\nG"...) at
> postgres.c:1239
> #32 0x0000560b131ee343 in PostgresMain (argc=1, argv=0x560b1409faa0,
> dbname=0x560b1409f858 "contrib_regression", username=0x560b1409f830
> "mahendrathalor") at postgres.c:4315
> #33 0x0000560b130a325b in BackendRun (port=0x560b14096880) at postmaster.c:4510
> #34 0x0000560b130a22c3 in BackendStartup (port=0x560b14096880) at
> postmaster.c:4202
> #35 0x0000560b1309a5cc in ServerLoop () at postmaster.c:1727
> #36 0x0000560b130997c9 in PostmasterMain (argc=8, argv=0x560b1406f010)
> at postmaster.c:1400
> #37 0x0000560b12ee9530 in main (argc=8, argv=0x560b1406f010) at main.c:210
>
> I have Ubuntu setup. I think, this is reproducing into Ubuntu only. I
> am looking into this issue with Dilip.

This error is due to an invalid allocation size: when restoring an
invalidation change from disk, the code allocated
change->data.msg.message_size bytes (a field of a different union
member) instead of the inval_size that was just read.

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index eed9a5048b..487c1b4252 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -3678,7 +3678,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 				change->data.inval.invalidations =
 					MemoryContextAlloc(rb->context,
-									   change->data.msg.message_size);
+									   inval_size);
 				/* read the message */
 				memcpy(change->data.inval.invalidations, data, inval_size);
 				data += inval_size;

The above change fixes the error.  Thanks, Dilip, for helping.

-- 
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Wed, Apr 29, 2020 at 12:37 PM Mahendra Singh Thalor
<mahi6run@gmail.com> wrote:
>
> On Wed, 29 Apr 2020 at 11:15, Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
> >
> > On Fri, 24 Apr 2020 at 11:55, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Thu, Apr 23, 2020 at 2:28 PM Erik Rijkers <er@xs4all.nl> wrote:
> > > >
> > > > On 2020-04-23 05:24, Dilip Kumar wrote:
> > > > > On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er@xs4all.nl> wrote:
> > > > >>
> > > > >> The 'ddl' one is apparently not quite fixed  - I get this in (cd
> > > > >> contrib; make check)' (in both assert-enabled and non-assert-enabled
> > > > >> build)
> > > > >
> > > > > Can you send me the contrib/test_decoding/regression.diffs file?
> > > >
> > > > Attached.
> > >
> > > So from regression.diff, it appears that in failing in memory
> > > allocation (+ERROR:  invalid memory alloc request size
> > > 94119198201896).  My colleague tried to reproduce this in a different
> > > environment but there is no success so far.  One more thing surprises
> > > me is that after
> > > (v15-0011-Provide-new-api-to-get-the-streaming-changes.patch)
> > > actually, it should never go for the streaming path. However, we can
> > > not ignore the fact that some of the changes might impact the
> > > non-streaming path as well.  Is it possible for you to somehow stop or
> > > break the code and send the stack trace?  One idea is by seeing the
> > > log we can see from where the error is raised i.e MemoryContextAlloc
> > > or palloc or some other similar function.  Once we know that we can
> > > convert that error to an assert and find the call stack.
> > >
> > > --
> >
> > Thanks Erik for reporting this issue.
> >
> > I am able to reproduce this issue(+ERROR:  invalid memory alloc
> > request size) on the top of v16 patch set. I applied all patches(12
> > patches) of v16 series and then I fired "make check -i" from
> > "contrib/test_decoding" folder. Below is stack trace of error:
> >
> > #0 0x0000560b1350902d in MemoryContextAlloc (context=0x560b14188d70,
> > size=94605581787992) at mcxt.c:806
> > #1 0x0000560b130f0ad5 in ReorderBufferRestoreChange
> > (rb=0x560b14188e90, txn=0x560b141baf08, data=0x560b1418a5e8 "K") at
> > reorderbuffer.c:3680
> > #2 0x0000560b130f0662 in ReorderBufferRestoreChanges
> > (rb=0x560b14188e90, txn=0x560b141baf08, file=0x560b1418ad10,
> > segno=0x560b1418ad20) at reorderbuffer.c:3564
> > #3 0x0000560b130e918a in ReorderBufferIterTXNInit (rb=0x560b14188e90,
> > txn=0x560b141baf08, iter_state=0x7ffef18b1600) at reorderbuffer.c:1186
> > #4 0x0000560b130eaee1 in ReorderBufferProcessTXN (rb=0x560b14188e90,
> > txn=0x560b141baf08, commit_lsn=25986584, snapshot_now=0x560b141b74d8,
> > command_id=0, streaming=false)
> > at reorderbuffer.c:1785
> > #5 0x0000560b130ecae1 in ReorderBufferCommit (rb=0x560b14188e90,
> > xid=508, commit_lsn=25986584, end_lsn=25989088,
> > commit_time=641449268431600, origin_id=0, origin_lsn=0)
> > at reorderbuffer.c:2315
> > #6 0x0000560b130d14a1 in DecodeCommit (ctx=0x560b1416ea80,
> > buf=0x7ffef18b19b0, parsed=0x7ffef18b1850, xid=508) at decode.c:654
> > #7 0x0000560b130cff98 in DecodeXactOp (ctx=0x560b1416ea80,
> > buf=0x7ffef18b19b0) at decode.c:261
> > #8 0x0000560b130cf99a in LogicalDecodingProcessRecord
> > (ctx=0x560b1416ea80, record=0x560b1416ee00) at decode.c:130
> > #9 0x0000560b130dbbbc in pg_logical_slot_get_changes_guts
> > (fcinfo=0x560b1417ee50, confirm=true, binary=false, streaming=false)
> > at logicalfuncs.c:285
> > #10 0x0000560b130dbe71 in pg_logical_slot_get_changes
> > (fcinfo=0x560b1417ee50) at logicalfuncs.c:354
> > #11 0x0000560b12e294d4 in ExecMakeTableFunctionResult
> > (setexpr=0x560b14177838, econtext=0x560b14177748,
> > argContext=0x560b1417ed30, expectedDesc=0x560b141814a0,
> > randomAccess=false) at execSRF.c:234
> > #12 0x0000560b12e5490f in FunctionNext (node=0x560b14177630) at
> > nodeFunctionscan.c:94
> > #13 0x0000560b12e2c108 in ExecScanFetch (node=0x560b14177630,
> > accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15
> > <FunctionRecheck>) at execScan.c:133
> > #14 0x0000560b12e2c227 in ExecScan (node=0x560b14177630,
> > accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15
> > <FunctionRecheck>) at execScan.c:199
> > #15 0x0000560b12e54e9b in ExecFunctionScan (pstate=0x560b14177630) at
> > nodeFunctionscan.c:270
> > #16 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14177630) at
> > execProcnode.c:450
> > #17 0x0000560b12e3e172 in ExecProcNode (node=0x560b14177630) at
> > ../../../src/include/executor/executor.h:245
> > #18 0x0000560b12e3e998 in fetch_input_tuple (aggstate=0x560b14176f40)
> > at nodeAgg.c:566
> > #19 0x0000560b12e4398f in agg_fill_hash_table
> > (aggstate=0x560b14176f40) at nodeAgg.c:2518
> > #20 0x0000560b12e42c9a in ExecAgg (pstate=0x560b14176f40) at nodeAgg.c:2139
> > #21 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176f40) at
> > execProcnode.c:450
> > #22 0x0000560b12e8bb58 in ExecProcNode (node=0x560b14176f40) at
> > ../../../src/include/executor/executor.h:245
> > #23 0x0000560b12e8bd59 in ExecSort (pstate=0x560b14176d28) at nodeSort.c:108
> > #24 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176d28) at
> > execProcnode.c:450
> > #25 0x0000560b12e10e71 in ExecProcNode (node=0x560b14176d28) at
> > ../../../src/include/executor/executor.h:245
> > #26 0x0000560b12e15c4c in ExecutePlan (estate=0x560b14176af0,
> > planstate=0x560b14176d28, use_parallel_mode=false,
> > operation=CMD_SELECT, sendTuples=true, numberTuples=0,
> > direction=ForwardScanDirection, dest=0x560b1419d188,
> > execute_once=true) at execMain.c:1646
> > #27 0x0000560b12e11a19 in standard_ExecutorRun
> > (queryDesc=0x560b1412db10, direction=ForwardScanDirection, count=0,
> > execute_once=true) at execMain.c:364
> > #28 0x0000560b12e116e1 in ExecutorRun (queryDesc=0x560b1412db10,
> > direction=ForwardScanDirection, count=0, execute_once=true) at
> > execMain.c:308
> > #29 0x0000560b131f2177 in PortalRunSelect (portal=0x560b140db860,
> > forward=true, count=0, dest=0x560b1419d188) at pquery.c:912
> > #30 0x0000560b131f1b14 in PortalRun (portal=0x560b140db860,
> > count=9223372036854775807, isTopLevel=true, run_once=true,
> > dest=0x560b1419d188, altdest=0x560b1419d188,
> > qc=0x7ffef18b2350) at pquery.c:756
> > #31 0x0000560b131e550b in exec_simple_query (
> > query_string=0x560b14076720 "/* display results, but hide most of the
> > output */\nSELECT count(*), min(data), max(data)\nFROM
> > pg_logical_slot_get_changes('regression_slot', NULL, NULL,
> > 'include-xids', '0', 'skip-empty-xacts', '1')\nG"...) at
> > postgres.c:1239
> > #32 0x0000560b131ee343 in PostgresMain (argc=1, argv=0x560b1409faa0,
> > dbname=0x560b1409f858 "contrib_regression", username=0x560b1409f830
> > "mahendrathalor") at postgres.c:4315
> > #33 0x0000560b130a325b in BackendRun (port=0x560b14096880) at postmaster.c:4510
> > #34 0x0000560b130a22c3 in BackendStartup (port=0x560b14096880) at
> > postmaster.c:4202
> > #35 0x0000560b1309a5cc in ServerLoop () at postmaster.c:1727
> > #36 0x0000560b130997c9 in PostmasterMain (argc=8, argv=0x560b1406f010)
> > at postmaster.c:1400
> > #37 0x0000560b12ee9530 in main (argc=8, argv=0x560b1406f010) at main.c:210
> >
> > I have an Ubuntu setup.  I think this is reproducible only on Ubuntu.  I
> > am looking into this issue with Dilip.
>
> This error is due to an invalid allocation size.
>
> diff --git a/src/backend/replication/logical/reorderbuffer.c
> b/src/backend/replication/logical/reorderbuffer.c
> index eed9a5048b..487c1b4252 100644
> --- a/src/backend/replication/logical/reorderbuffer.c
> +++ b/src/backend/replication/logical/reorderbuffer.c
> @@ -3678,7 +3678,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
>
>                                 change->data.inval.invalidations =
>                                                 MemoryContextAlloc(rb->context,
> -                                                                  change->data.msg.message_size);
> +                                                                  inval_size);
>                                 /* read the message */
>                                 memcpy(change->data.inval.invalidations, data, inval_size);
>                                 data += inval_size;
> The above change fixes the error.  Thanks Dilip for helping.

Thanks, Mahendra, for reproducing this and helping to fix it.  I will
include this change in my next patch set.

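For anyone following along, after the fix the invalidation branch of
ReorderBufferRestoreChange() reads roughly as below (a paraphrased
sketch, not the exact patch text).  The bug was that the allocation used
the size field of the message union member, while the actual data length
is inval_size, computed from the number of invalidation messages:

    case REORDER_BUFFER_CHANGE_INVALIDATION:
        {
            Size        inval_size = sizeof(SharedInvalidationMessage) *
                change->data.inval.ninvalidations;

            change->data.inval.invalidations =
                MemoryContextAlloc(rb->context, inval_size);

            /* read the invalidation messages */
            memcpy(change->data.inval.invalidations, data, inval_size);
            data += inval_size;
            break;
        }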


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, Apr 28, 2020 at 3:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > [latest patches]
> >
> > v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> > -     Any actions leading to transaction ID assignment are prohibited.
> > That, among others,
> > +     Note that access to user catalog tables or regular system catalog tables
> > +     in the output plugins has to be done via the
> > <literal>systable_*</literal> scan APIs only.
> > +     Access via the <literal>heap_*</literal> scan APIs will error out.
> > +     Additionally, any actions leading to transaction ID assignment
> > are prohibited. That, among others,
> > ..
> > @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
> >   bool valid;
> >
> >   /*
> > + * We don't expect direct calls to heap_fetch with valid
> > + * CheckXidAlive for regular tables. Track that below.
> > + */
> > + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> > + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
> > + elog(ERROR, "unexpected heap_fetch call during logical decoding");
> > +
> >
> > I think comments and code don't match.  In the comment, we are saying
> > that via output plugins access to user catalog tables or regular
> > system catalog tables won't be allowed via heap_* APIs but code
> > doesn't seem to reflect it.  I feel only
> > TransactionIdIsValid(CheckXidAlive) is sufficient here.  See, the
> > original discussion about this point [1] (Refer "I think it'd also be
> > good to add assertions to codepaths not going through systable_*
> > asserting that ...").
>
> Right,  So I think we can just add an assert in these function that
> Assert(!TransactionIdIsValid(CheckXidAlive)) ?
>
> >
> > Isn't it better to block the scan to user catalog tables or regular
> > system catalog tables for tableam scan APIs rather than at the heap
> > level?  There might be some APIs like heap_getnext where such a check
> > might still be required but I guess it is still better to block at
> > tableam level.
> >
> > [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de
>
> Okay, let me analyze this part.  Because someplace we have to keep at
> heap level like heap_getnext and other places at tableam level so it
> seems a bit inconsistent.  Also, I think the number of checks might
> going to increase because some of the heap functions like
> heap_hot_search_buffer are being called from multiple tableam calls,
> so we need to put check at every place.
>
> Another point is that I feel some of the checks what we have today
> might not be required like heap_finish_speculative, is not fetching
> any tuple for us so why do we need to care about this function?

While testing these changes, I have noticed that the systable_* APIs
internally call the tableam APIs, so if we just put
Assert(!TransactionIdIsValid(CheckXidAlive)) then it will always hit
that assert, regardless of whether we put the assert in the heap APIs
or the tableam APIs, because systable_* always accesses the heap
through the tableam APIs.

Refer to the call stack below:
#0  table_index_fetch_tuple (scan=0x2392558, tid=0x2392270,
snapshot=0x2392178, slot=0x2391f60, call_again=0x2392276,
all_dead=0x7fff4b6cc89e)
    at ../../../../src/include/access/tableam.h:1035
#1  0x00000000005100b6 in index_fetch_heap (scan=0x2392210,
slot=0x2391f60) at indexam.c:577
#2  0x00000000005101ea in index_getnext_slot (scan=0x2392210,
direction=ForwardScanDirection, slot=0x2391f60) at indexam.c:637
#3  0x000000000050e8f9 in systable_getnext (sysscan=0x2391f08) at genam.c:474
#4  0x0000000000aa44a2 in RelidByRelfilenode (reltablespace=0,
relfilenode=16593) at relfilenodemap.c:213
#5  0x00000000008a64da in ReorderBufferProcessTXN (rb=0x23734b0,
txn=0x2398e28, commit_lsn=23953600, snapshot_now=0x237b168,
command_id=0, streaming=false)
    at reorderbuffer.c:1823
#6  0x00000000008a7201 in ReorderBufferCommit (rb=0x23734b0, xid=518,
commit_lsn=23953600, end_lsn=23953648, commit_time=641466886013448,
origin_id=0, origin_lsn=0)
    at reorderbuffer.c:2315
#7  0x00000000008985b1 in DecodeCommit (ctx=0x22e16a0,
buf=0x7fff4b6cce30, parsed=0x7fff4b6ccca0, xid=518) at decode.c:654
#8  0x0000000000897a76 in DecodeXactOp (ctx=0x22e16a0,
buf=0x7fff4b6cce30) at decode.c:261
#9  0x0000000000897739 in LogicalDecodingProcessRecord (ctx=0x22e16a0,
record=0x22e19a0) at decode.c:130

So basically, the problem is that we cannot distinguish whether the
tableam/heap routine is called directly or via systable_*.

Now I understand that the current code was actually raising the error
for user tables, not system tables, on the assumption that a system
table only reaches this function via systable_*, and only a user table
can reach it directly.  So if it is not a system table, we must have got
here directly, and we error out.  But if the check does not cover system
tables, then I am not sure what the purpose of throwing that error is.

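To illustrate the problem: a bare assert in the tableam wrapper, as in
the sketch below (this is the naive approach, not what the patch does),
would fire even for perfectly legitimate systable_* access, because
systable_getnext() reaches the heap through the very same wrapper (see
the call stack above):

static inline bool
table_index_fetch_tuple(struct IndexFetchTableData *scan,
                        ItemPointer tid,
                        Snapshot snapshot,
                        TupleTableSlot *slot,
                        bool *call_again, bool *all_dead)
{
    /* too strong: also trips for systable_* callers */
    Assert(!TransactionIdIsValid(CheckXidAlive));

    return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
                                                    slot, call_again,
                                                    all_dead);
}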

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Apr 29, 2020 at 2:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Apr 28, 2020 at 3:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > [latest patches]
> > >
> > > v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> > > -     Any actions leading to transaction ID assignment are prohibited.
> > > That, among others,
> > > +     Note that access to user catalog tables or regular system catalog tables
> > > +     in the output plugins has to be done via the
> > > <literal>systable_*</literal> scan APIs only.
> > > +     Access via the <literal>heap_*</literal> scan APIs will error out.
> > > +     Additionally, any actions leading to transaction ID assignment
> > > are prohibited. That, among others,
> > > ..
> > > @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
> > >   bool valid;
> > >
> > >   /*
> > > + * We don't expect direct calls to heap_fetch with valid
> > > + * CheckXidAlive for regular tables. Track that below.
> > > + */
> > > + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> > > + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
> > > + elog(ERROR, "unexpected heap_fetch call during logical decoding");
> > > +
> > >
> > > I think comments and code don't match.  In the comment, we are saying
> > > that via output plugins access to user catalog tables or regular
> > > system catalog tables won't be allowed via heap_* APIs but code
> > > doesn't seem to reflect it.  I feel only
> > > TransactionIdIsValid(CheckXidAlive) is sufficient here.  See, the
> > > original discussion about this point [1] (Refer "I think it'd also be
> > > good to add assertions to codepaths not going through systable_*
> > > asserting that ...").
> >
> > Right,  So I think we can just add an assert in these function that
> > Assert(!TransactionIdIsValid(CheckXidAlive)) ?
> >
> > >
> > > Isn't it better to block the scan to user catalog tables or regular
> > > system catalog tables for tableam scan APIs rather than at the heap
> > > level?  There might be some APIs like heap_getnext where such a check
> > > might still be required but I guess it is still better to block at
> > > tableam level.
> > >
> > > [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de
> >
> > Okay, let me analyze this part.  Because someplace we have to keep at
> > heap level like heap_getnext and other places at tableam level so it
> > seems a bit inconsistent.  Also, I think the number of checks might
> > going to increase because some of the heap functions like
> > heap_hot_search_buffer are being called from multiple tableam calls,
> > so we need to put check at every place.
> >
> > Another point is that I feel some of the checks what we have today
> > might not be required like heap_finish_speculative, is not fetching
> > any tuple for us so why do we need to care about this function?
>
> While testing these changes, I have noticed that the systable_* APIs
> internally, calls tableam apis and so if we just put assert
> Assert(!TransactionIdIsValid(CheckXidAlive)) then it will always hit
> that assert.  Whether we put these assert in heap APIs or the tableam
> APIs because systable_ always access heap through tableam APIs.
>
> Refer below callstack
> #0  table_index_fetch_tuple (scan=0x2392558, tid=0x2392270,
> snapshot=0x2392178, slot=0x2391f60, call_again=0x2392276,
> all_dead=0x7fff4b6cc89e)
>     at ../../../../src/include/access/tableam.h:1035
> #1  0x00000000005100b6 in index_fetch_heap (scan=0x2392210,
> slot=0x2391f60) at indexam.c:577
> #2  0x00000000005101ea in index_getnext_slot (scan=0x2392210,
> direction=ForwardScanDirection, slot=0x2391f60) at indexam.c:637
> #3  0x000000000050e8f9 in systable_getnext (sysscan=0x2391f08) at genam.c:474
> #4  0x0000000000aa44a2 in RelidByRelfilenode (reltablespace=0,
> relfilenode=16593) at relfilenodemap.c:213
> #5  0x00000000008a64da in ReorderBufferProcessTXN (rb=0x23734b0,
> txn=0x2398e28, commit_lsn=23953600, snapshot_now=0x237b168,
> command_id=0, streaming=false)
>     at reorderbuffer.c:1823
> #6  0x00000000008a7201 in ReorderBufferCommit (rb=0x23734b0, xid=518,
> commit_lsn=23953600, end_lsn=23953648, commit_time=641466886013448,
> origin_id=0, origin_lsn=0)
>     at reorderbuffer.c:2315
> #7  0x00000000008985b1 in DecodeCommit (ctx=0x22e16a0,
> buf=0x7fff4b6cce30, parsed=0x7fff4b6ccca0, xid=518) at decode.c:654
> #8  0x0000000000897a76 in DecodeXactOp (ctx=0x22e16a0,
> buf=0x7fff4b6cce30) at decode.c:261
> #9  0x0000000000897739 in LogicalDecodingProcessRecord (ctx=0x22e16a0,
> record=0x22e19a0) at decode.c:130
>
> So basically, the problem is that we can not distinguish whether the
> tableam/heap routine is called directly or via systable_*.
>
> Now I understand the current code was actually giving error for the
> user table not the system table with the assumption that the system
> table will come to this function only via systable_*.  Only user table
> can come directly.  So if this is not a system table i.e. we reach
> here directly so error out.  Now, I am not sure if it is not for the
> system table then what is the purpose of throwing that error?

Putting some more thought into this, I am wondering whether we really
want any such check at all, because we always get the relation
descriptor from the reorder buffer code, not from the pgoutput plugin.
And our main issue with a concurrent abort is that we must not use a
wrong catalog entry for decoding our tuple.  So if we always get our
relation entry using RelationIdGetRelation, why should we bother about
how the output plugin accesses system/user relations?

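For context, the lookup I am referring to in the reorder buffer code is
roughly the following (a paraphrased sketch of ReorderBufferProcessTXN(),
not an exact copy):

    /* map the relfilenode back to an OID; itself a catalog lookup */
    reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
                                change->data.tp.relnode.relNode);
    if (reloid == InvalidOid)
        elog(ERROR, "could not map filenode to relation OID");

    /* open the relation descriptor through the relcache */
    relation = RelationIdGetRelation(reloid);
    if (!RelationIsValid(relation))
        elog(ERROR, "could not open relation with OID %u", reloid);
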
-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Apr 29, 2020 at 3:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Apr 29, 2020 at 2:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Apr 28, 2020 at 3:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > >
> > > > [latest patches]
> > > >
> > > > v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> > > > -     Any actions leading to transaction ID assignment are prohibited.
> > > > That, among others,
> > > > +     Note that access to user catalog tables or regular system catalog tables
> > > > +     in the output plugins has to be done via the
> > > > <literal>systable_*</literal> scan APIs only.
> > > > +     Access via the <literal>heap_*</literal> scan APIs will error out.
> > > > +     Additionally, any actions leading to transaction ID assignment
> > > > are prohibited. That, among others,
> > > > ..
> > > > @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
> > > >   bool valid;
> > > >
> > > >   /*
> > > > + * We don't expect direct calls to heap_fetch with valid
> > > > + * CheckXidAlive for regular tables. Track that below.
> > > > + */
> > > > + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> > > > + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
> > > > + elog(ERROR, "unexpected heap_fetch call during logical decoding");
> > > > +
> > > >
> > > > I think comments and code don't match.  In the comment, we are saying
> > > > that via output plugins access to user catalog tables or regular
> > > > system catalog tables won't be allowed via heap_* APIs but code
> > > > doesn't seem to reflect it.  I feel only
> > > > TransactionIdIsValid(CheckXidAlive) is sufficient here.  See, the
> > > > original discussion about this point [1] (Refer "I think it'd also be
> > > > good to add assertions to codepaths not going through systable_*
> > > > asserting that ...").
> > >
> > > Right,  So I think we can just add an assert in these function that
> > > Assert(!TransactionIdIsValid(CheckXidAlive)) ?
> > >
> > > >
> > > > Isn't it better to block the scan to user catalog tables or regular
> > > > system catalog tables for tableam scan APIs rather than at the heap
> > > > level?  There might be some APIs like heap_getnext where such a check
> > > > might still be required but I guess it is still better to block at
> > > > tableam level.
> > > >
> > > > [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de
> > >
> > > Okay, let me analyze this part.  Because someplace we have to keep at
> > > heap level like heap_getnext and other places at tableam level so it
> > > seems a bit inconsistent.  Also, I think the number of checks might
> > > going to increase because some of the heap functions like
> > > heap_hot_search_buffer are being called from multiple tableam calls,
> > > so we need to put check at every place.
> > >
> > > Another point is that I feel some of the checks what we have today
> > > might not be required like heap_finish_speculative, is not fetching
> > > any tuple for us so why do we need to care about this function?
> >
> > While testing these changes, I have noticed that the systable_* APIs
> > internally, calls tableam apis and so if we just put assert
> > Assert(!TransactionIdIsValid(CheckXidAlive)) then it will always hit
> > that assert.  Whether we put these assert in heap APIs or the tableam
> > APIs because systable_ always access heap through tableam APIs.
> >
..
..
>
> Putting some more thought upon this, I am just wondering what do we
> really want any such check because, we are always getting relation
> description from the reorder buffer code, not from the pgoutput
> plugin.
>

But can't they access other catalogs like pg_publication*?  I think
the basic thing we want to ensure here is that all historic accesses
always use the systable_* APIs to access catalogs.  We can ensure that
by having Asserts (or elog(ERROR, ...)) in the heap/tableam APIs.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Apr 30, 2020 at 12:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Apr 29, 2020 at 3:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Apr 29, 2020 at 2:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Tue, Apr 28, 2020 at 3:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > > >
> > > > > [latest patches]
> > > > >
> > > > > v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> > > > > -     Any actions leading to transaction ID assignment are prohibited.
> > > > > That, among others,
> > > > > +     Note that access to user catalog tables or regular system catalog tables
> > > > > +     in the output plugins has to be done via the
> > > > > <literal>systable_*</literal> scan APIs only.
> > > > > +     Access via the <literal>heap_*</literal> scan APIs will error out.
> > > > > +     Additionally, any actions leading to transaction ID assignment
> > > > > are prohibited. That, among others,
> > > > > ..
> > > > > @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
> > > > >   bool valid;
> > > > >
> > > > >   /*
> > > > > + * We don't expect direct calls to heap_fetch with valid
> > > > > + * CheckXidAlive for regular tables. Track that below.
> > > > > + */
> > > > > + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> > > > > + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
> > > > > + elog(ERROR, "unexpected heap_fetch call during logical decoding");
> > > > > +
> > > > >
> > > > > I think comments and code don't match.  In the comment, we are saying
> > > > > that via output plugins access to user catalog tables or regular
> > > > > system catalog tables won't be allowed via heap_* APIs but code
> > > > > doesn't seem to reflect it.  I feel only
> > > > > TransactionIdIsValid(CheckXidAlive) is sufficient here.  See, the
> > > > > original discussion about this point [1] (Refer "I think it'd also be
> > > > > good to add assertions to codepaths not going through systable_*
> > > > > asserting that ...").
> > > >
> > > > Right,  So I think we can just add an assert in these function that
> > > > Assert(!TransactionIdIsValid(CheckXidAlive)) ?
> > > >
> > > > >
> > > > > Isn't it better to block the scan to user catalog tables or regular
> > > > > system catalog tables for tableam scan APIs rather than at the heap
> > > > > level?  There might be some APIs like heap_getnext where such a check
> > > > > might still be required but I guess it is still better to block at
> > > > > tableam level.
> > > > >
> > > > > [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de
> > > >
> > > > Okay, let me analyze this part.  Because someplace we have to keep at
> > > > heap level like heap_getnext and other places at tableam level so it
> > > > seems a bit inconsistent.  Also, I think the number of checks might
> > > > going to increase because some of the heap functions like
> > > > heap_hot_search_buffer are being called from multiple tableam calls,
> > > > so we need to put check at every place.
> > > >
> > > > Another point is that I feel some of the checks what we have today
> > > > might not be required like heap_finish_speculative, is not fetching
> > > > any tuple for us so why do we need to care about this function?
> > >
> > > While testing these changes, I have noticed that the systable_* APIs
> > > internally, calls tableam apis and so if we just put assert
> > > Assert(!TransactionIdIsValid(CheckXidAlive)) then it will always hit
> > > that assert.  Whether we put these assert in heap APIs or the tableam
> > > APIs because systable_ always access heap through tableam APIs.
> > >
> ..
> ..
> >
> > Putting some more thought upon this, I am just wondering what do we
> > really want any such check because, we are always getting relation
> > description from the reorder buffer code, not from the pgoutput
> > plugin.
> >
>
> But can't they access other catalogs like pg_publication*?  I think
> the basic thing we want to ensure here is that all historic accesses
> always use systable* APIs to access catalogs.  We can ensure that via
> having Asserts (or elog(ERROR, ..) in heap/tableam APIs.

Yeah, it can.  So I have changed it now: along with CheckXidAlive, I
have kept one more flag, which is set whenever CheckXidAlive is set and
we pass through systable_beginscan.  Then, while accessing the tableam
APIs, we check that if CheckXidAlive is set the other flag must also be
set; otherwise we throw an error.

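In code, the idea is roughly the following (an illustrative sketch, not
the exact patch text):

    /* in systable_beginscan() (genam.c): remember that we are inside a
     * system table scan while CheckXidAlive is set */
    if (TransactionIdIsValid(CheckXidAlive))
        sysbegin_called = true;

    /* in the heap/tableam entry points: direct access (i.e. not via
     * systable_*) while decoding an in-progress transaction is an error */
    if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called))
        elog(ERROR, "unexpected direct table access during logical decoding");
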
Apart from this, I have also fixed one defect raised by my colleague
Neha Sharma.  The issue was that the incomplete-toast-tuple flag was not
reset when the main table tuple was inserted through a speculative
insert, and because of that the data was not streamed even when we later
got the speculative confirm, since the incomplete-toast flag was never
reset.  This patch also includes the fix for the issue raised by Erik.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Apr 30, 2020 at 12:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > But can't they access other catalogs like pg_publication*?  I think
> > the basic thing we want to ensure here is that all historic accesses
> > always use systable* APIs to access catalogs.  We can ensure that via
> > having Asserts (or elog(ERROR, ..) in heap/tableam APIs.
>
> Yeah, it can.  So I have changed it now, actually along with
> CheckXidLive, I have kept one more flag so whenever CheckXidLive is
> set and we pass through systable_beginscan we will set that flag.  So
> while accessing the tableam API we will set if CheckXidLive is set
> then another flag must also be set otherwise we through an error.
>

Okay, I have reviewed these changes and below are my comments:

Review of  v17-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
--------------------------------------------------------------------
1.
+ /*
+ * If CheckXidAlive is set then set a flag that this call is passed through
+ * systable_beginscan.  See detailed  comments at snapmgr.c where these
+ * variables are declared.
+ */
+ if (TransactionIdIsValid(CheckXidAlive))
+ sysbegin_called = true;

a. How about calling this variable as bsysscan or sysscan instead of
sysbegin_called?
b. There is an extra space between detailed and comments.  A similar
change is required at other places where this comment is used.
c. How about writing the first line as "If CheckXidAlive is set then
set a flag to indicate that system table scan is in-progress."

2.
-     Any actions leading to transaction ID assignment are prohibited.
That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system
catalog tables in
+     the output plugins has to be done via the
<literal>systable_*</literal> scan
+     APIs only. The user tables should not be accesed in the output
plugins anyways.
+     Access via the <literal>heap_*</literal> scan APIs will error out.

The line "The user tables should not be accesed in the output plugins
anyways." seems a bit of out of place.  I don't think this is required
here.  If you read the previous paragraph in the same document it is
written: "Read only access to relations is permitted as long as only
relations are accessed that either have been created by
<command>initdb</command> in the <literal>pg_catalog</literal> schema,
or have been marked as user provided catalog tables using ...".  I
think that is sufficient to convey the information that the newly
added line by you is trying to convey.

3.
+ /*
+ * We don't expect direct calls to this routine when CheckXidAlive is a
+ * valid transaction id, this should only come through systable_* call.
+ * CheckXidAlive is set during logical decoding of a transactions.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called))
+ elog(ERROR, "unexpected heap_getnext call during logical decoding");

How about changing this comment as "We don't expect direct calls to
heap_getnext with valid CheckXidAlive for catalog or regular tables.
See detailed comments at snapmgr.c where these variables are
declared."?  Change the similar comment used in other places in the
patch.

For this specific API, we can also say "Normally we have such a check
at tableam level API but this is called from many places so we need to
ensure it here."

4.
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we error
+ * out.  We can't directly use TransactionIdDidAbort as after crash such
+ * transaction might not have been marked as aborted.  See detailed  comments
+ * at snapmgr.c where the variable is declared.
+ */
+static inline void
+HandleConcurrentAbort()

Can we change the comments as "Error out, if CheckXidAlive is aborted.
We can't directly use TransactionIdDidAbort as after crash such
transaction might not have been marked as aborted."

After this add one empty line and then we can say something like:
"This is a special API to check if CheckXidAlive is aborted in system
table scan APIs.  See detailed comments at snapmgr.c where the
variable is declared."

5. Shouldn't we add a check in table_scan_sample_next_block and
table_scan_sample_next_tuple APIs as well?

6.
/*
+ * An xid value pointing to a possibly ongoing (sub)transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.  If CheckXidAlive is set
+ * then we will set sysbegin_called flag when we call systable_beginscan.  This
+ * is to ensure that from the pgoutput plugin we should never directly access
+ * the tableam or heap apis because we are checking for the concurrent abort
+ * only in systable_* apis.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool sysbegin_called = false;

Can we change the above comment as "CheckXidAlive is a xid value
pointing to a possibly ongoing (sub)transaction.  Currently, it is
used in logical decoding.  It's possible that such transactions can
get aborted while the decoding is ongoing in which case we skip
decoding that particular transaction. To ensure that we check whether
the CheckXidAlive is aborted after fetching the tuple from system
tables.  We also ensure that during logical decoding we never directly
access the tableam or heap APIs because we are checking for the
concurrent aborts only in systable_* APIs."

> Apart from this, I have also fixed one defect raised by my colleague
> Neha Sharma.  That issue is the incomplete toast tuple flag was not
> reset when the main table tuple was inserted through speculative
> insert and due to that data was not streamed even if later we were
> getting speculative confirm because incomplete toast flag was never
> reset.  This patch also includes the fix for the issue raised by Erik.
>

It would be better if you can mention which all patches contain the
changes as it will be easier to review the fix.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >

> 5. Shouldn't we add a check in table_scan_sample_next_block and
> table_scan_sample_next_tuple APIs as well?

I am not sure that we need to do that, because generally we want to
avoid getting any wrong system table tuple which we could use for taking
some decision or decoding a tuple.  But I don't think that
table_scan_sample falls under that category.


> > Apart from this, I have also fixed one defect raised by my colleague
> > Neha Sharma.  That issue is the incomplete toast tuple flag was not
> > reset when the main table tuple was inserted through speculative
> > insert and due to that data was not streamed even if later we were
> > getting speculative confirm because incomplete toast flag was never
> > reset.  This patch also includes the fix for the issue raised by Erik.
> >
>
> It would be better if you can mention which all patches contain the
> changes as it will be easier to review the fix.

Fix1: v17-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
Fix2: v17-0002-Issue-individual-invalidations-with-wal_level-lo.patch

I will work on other comments and send the updated patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, May 5, 2020 at 9:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
>
> > 5. Shouldn't we add a check in table_scan_sample_next_block and
> > table_scan_sample_next_tuple APIs as well?
>
> I am not sure that we need to do that,  Because generally, we want to
> avoid getting any wrong system table tuple which we can use for taking
> some decision or decode tuple.  But, I don't think that
> table_scan_sample falls under that category.
>

Hmm, I am asking for a check similar to what you have in the function
table_scan_bitmap_next_block(); can't we have that one?  BTW, I noticed
a spurious line removal below in the patch we are talking about.

+/*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
@@ -2043,7 +2055,6 @@ SetupHistoricSnapshot(Snapshot
historic_snapshot, HTAB *tuplecids)
  tuplecid_data = tuplecids;
 }

-



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 5, 2020 at 10:25 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, May 5, 2020 at 9:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> >
> > > 5. Shouldn't we add a check in table_scan_sample_next_block and
> > > table_scan_sample_next_tuple APIs as well?
> >
> > I am not sure that we need to do that,  Because generally, we want to
> > avoid getting any wrong system table tuple which we can use for taking
> > some decision or decode tuple.  But, I don't think that
> > table_scan_sample falls under that category.
> >
>
> Hmm, I am asking a check similar to what you have in function
> table_scan_bitmap_next_block(), can't we have that one?

Yeah, we can put that, and there is no harm in it.  But my point is that
table_scan_bitmap_next_block and the other functions where I have put
the check are used for fetching tuples which can be used for decoding a
tuple or taking some decision, whereas IMHO table_scan_sample_next_tuple
is only used for analyzing the table.  So do we really need to do that?
Am I missing something here?


>   BTW, I
> noticed a below spurious line removal in the patch we are talking
> about.
>
> +/*
>   * These are updated by GetSnapshotData.  We initialize them this way
>   * for the convenience of TransactionIdIsInProgress: even in bootstrap
>   * mode, we don't want it to say that BootstrapTransactionId is in progress.
> @@ -2043,7 +2055,6 @@ SetupHistoricSnapshot(Snapshot
> historic_snapshot, HTAB *tuplecids)
>   tuplecid_data = tuplecids;
>  }
>
> -


Okay, I will take care of this.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, May 5, 2020 at 10:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, May 5, 2020 at 10:25 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, May 5, 2020 at 9:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > >
> > >
> > > > 5. Shouldn't we add a check in table_scan_sample_next_block and
> > > > table_scan_sample_next_tuple APIs as well?
> > >
> > > I am not sure that we need to do that,  Because generally, we want to
> > > avoid getting any wrong system table tuple which we can use for taking
> > > some decision or decode tuple.  But, I don't think that
> > > table_scan_sample falls under that category.
> > >
> >
> > Hmm, I am asking a check similar to what you have in function
> > table_scan_bitmap_next_block(), can't we have that one?
>
> Yeah we can put that and there is no harm in that,  but my point is
> the table_scan_bitmap_next_block and other functions where I have put
> the check are used for fetching the tuple which can be used for
> decoding tuple or taking some decision, but IMHO,
> table_scan_sample_next_tuple is only used for analyzing the table.
>

These will be used in a TABLESAMPLE scan.  Try something like "select c1
from t1 TABLESAMPLE BERNOULLI(30);".  So, I guess these APIs can also
be used to fetch the tuple.
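
Something along these lines is what I have in mind, i.e. the same guard
used in table_scan_bitmap_next_block() applied to the sample scan
wrapper (just a sketch, not the exact patch text):

static inline bool
table_scan_sample_next_tuple(TableScanDesc scan,
                             struct SampleScanState *scanstate,
                             TupleTableSlot *slot)
{
    /*
     * We don't expect direct calls to this with a valid CheckXidAlive;
     * during logical decoding such access must go through systable_*.
     */
    if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
        elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");

    return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
                                                           slot);
}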

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Apr 30, 2020 at 12:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > But can't they access other catalogs like pg_publication*?  I think
> > > the basic thing we want to ensure here is that all historic accesses
> > > always use systable* APIs to access catalogs.  We can ensure that via
> > > having Asserts (or elog(ERROR, ..) in heap/tableam APIs.
> >
> > Yeah, it can.  So I have changed it now, actually along with
> > CheckXidLive, I have kept one more flag so whenever CheckXidLive is
> > set and we pass through systable_beginscan we will set that flag.  So
> > while accessing the tableam API we will set if CheckXidLive is set
> > then another flag must also be set otherwise we through an error.
> >
>
> Okay, I have reviewed these changes and below are my comments:
>
> Review of  v17-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> --------------------------------------------------------------------
> 1.
> + /*
> + * If CheckXidAlive is set then set a flag that this call is passed through
> + * systable_beginscan.  See detailed  comments at snapmgr.c where these
> + * variables are declared.
> + */
> + if (TransactionIdIsValid(CheckXidAlive))
> + sysbegin_called = true;
>
> a. How about calling this variable as bsysscan or sysscan instead of
> sysbegin_called?

Done

> b. There is an extra space between detailed and comments.  A similar
> change is required at other place where this comment is used.

Done

> c. How about writing the first line as "If CheckXidAlive is set then
> set a flag to indicate that system table scan is in-progress."
>
> 2.
> -     Any actions leading to transaction ID assignment are prohibited.
> That, among others,
> -     includes writing to tables, performing DDL changes, and
> -     calling <literal>pg_current_xact_id()</literal>.
> +     Note that access to user catalog tables or regular system
> catalog tables in
> +     the output plugins has to be done via the
> <literal>systable_*</literal> scan
> +     APIs only. The user tables should not be accesed in the output
> plugins anyways.
> +     Access via the <literal>heap_*</literal> scan APIs will error out.
>
> The line "The user tables should not be accesed in the output plugins
> anyways." seems a bit of out of place.  I don't think this is required
> here.  If you read the previous paragraph in the same document it is
> written: "Read only access to relations is permitted as long as only
> relations are accessed that either have been created by
> <command>initdb</command> in the <literal>pg_catalog</literal> schema,
> or have been marked as user provided catalog tables using ...".  I
> think that is sufficient to convey the information that the newly
> added line by you is trying to convey.

Right.

>
> 3.
> + /*
> + * We don't expect direct calls to this routine when CheckXidAlive is a
> + * valid transaction id, this should only come through systable_* call.
> + * CheckXidAlive is set during logical decoding of a transactions.
> + */
> + if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called))
> + elog(ERROR, "unexpected heap_getnext call during logical decoding");
>
> How about changing this comment as "We don't expect direct calls to
> heap_getnext with valid CheckXidAlive for catalog or regular tables.
> See detailed comments at snapmgr.c where these variables are
> declared."?  Change the similar comment used in other places in the
> patch.
>
> For this specific API, we can also say "Normally we have such a check
> at tableam level API but this is called from many places so we need to
> ensure it here."

Done

>
> 4.
> + * If CheckXidAlive is valid, then we check if it aborted. If it did, we error
> + * out.  We can't directly use TransactionIdDidAbort as after crash such
> + * transaction might not have been marked as aborted.  See detailed  comments
> + * at snapmgr.c where the variable is declared.
> + */
> +static inline void
> +HandleConcurrentAbort()
>
> Can we change the comments as "Error out, if CheckXidAlive is aborted.
> We can't directly use TransactionIdDidAbort as after crash such
> transaction might not have been marked as aborted."
>
> After this add one empty line and then we can say something like:
> "This is a special API to check if CheckXidAlive is aborted in system
> table scan APIs.  See detailed comments at snapmgr.c where the
> variable is declared."
>
> 5. Shouldn't we add a check in table_scan_sample_next_block and
> table_scan_sample_next_tuple APIs as well?

Done

> 6.
> /*
> + * An xid value pointing to a possibly ongoing (sub)transaction.
> + * Currently used in logical decoding.  It's possible that such transactions
> + * can get aborted while the decoding is ongoing.  If CheckXidAlive is set
> + * then we will set sysbegin_called flag when we call systable_beginscan.  This
> + * is to ensure that from the pgoutput plugin we should never directly access
> + * the tableam or heap apis because we are checking for the concurrent abort
> + * only in systable_* apis.
> + */
> +TransactionId CheckXidAlive = InvalidTransactionId;
> +bool sysbegin_called = false;
>
> Can we change the above comment as "CheckXidAlive is a xid value
> pointing to a possibly ongoing (sub)transaction.  Currently, it is
> used in logical decoding.  It's possible that such transactions can
> get aborted while the decoding is ongoing in which case we skip
> decoding that particular transaction. To ensure that we check whether
> the CheckXidAlive is aborted after fetching the tuple from system
> tables.  We also ensure that during logical decoding we never directly
> access the tableam or heap APIs because we are checking for the
> concurrent aborts only in systable_* APIs."

Done

I have also fixed one issue in the patch
v18-0010-Bugfix-handling-of-incomplete-toast-tuple.patch.

Basically, the check in ReorderBufferLargestTopTXN for selecting the
largest top transaction was incorrect, so I have fixed that.

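For reference, the selection is supposed to work roughly like the sketch
below, i.e. pick the top-level transaction with the largest total_size
(which, with the accounting added earlier in the series, includes the
changes of its subtransactions).  This is an illustrative sketch, not
the exact patch text:

static ReorderBufferTXN *
ReorderBufferLargestTopTXN(ReorderBuffer *rb)
{
    dlist_iter  iter;
    Size        largest_size = 0;
    ReorderBufferTXN *largest = NULL;

    /* scan only the top-level transactions */
    dlist_foreach(iter, &rb->toplevel_by_lsn)
    {
        ReorderBufferTXN *txn;

        txn = dlist_container(ReorderBufferTXN, node, iter.cur);

        /* remember the transaction with the largest accumulated size */
        if (txn->total_size > largest_size)
        {
            largest = txn;
            largest_size = txn->total_size;
        }
    }

    return largest;
}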

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 5, 2020 at 4:06 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Thu, Apr 30, 2020 at 12:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > >
> > > > But can't they access other catalogs like pg_publication*?  I think
> > > > the basic thing we want to ensure here is that all historic accesses
> > > > always use systable* APIs to access catalogs.  We can ensure that via
> > > > having Asserts (or elog(ERROR, ..) in heap/tableam APIs.
> > >
> > > Yeah, it can.  So I have changed it now, actually along with
> > > CheckXidLive, I have kept one more flag so whenever CheckXidLive is
> > > set and we pass through systable_beginscan we will set that flag.  So
> > > while accessing the tableam API we will set if CheckXidLive is set
> > > then another flag must also be set otherwise we through an error.
> > >
> >
> > Okay, I have reviewed these changes and below are my comments:
> >
> > Review of  v17-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> > --------------------------------------------------------------------
> > 1.
> > + /*
> > + * If CheckXidAlive is set then set a flag that this call is passed through
> > + * systable_beginscan.  See detailed  comments at snapmgr.c where these
> > + * variables are declared.
> > + */
> > + if (TransactionIdIsValid(CheckXidAlive))
> > + sysbegin_called = true;
> >
> > a. How about calling this variable as bsysscan or sysscan instead of
> > sysbegin_called?
>
> Done
>
> > b. There is an extra space between detailed and comments.  A similar
> > change is required at other place where this comment is used.
>
> Done
>
> > c. How about writing the first line as "If CheckXidAlive is set then
> > set a flag to indicate that system table scan is in-progress."
> >
> > 2.
> > -     Any actions leading to transaction ID assignment are prohibited.
> > That, among others,
> > -     includes writing to tables, performing DDL changes, and
> > -     calling <literal>pg_current_xact_id()</literal>.
> > +     Note that access to user catalog tables or regular system
> > catalog tables in
> > +     the output plugins has to be done via the
> > <literal>systable_*</literal> scan
> > +     APIs only. The user tables should not be accesed in the output
> > plugins anyways.
> > +     Access via the <literal>heap_*</literal> scan APIs will error out.
> >
> > The line "The user tables should not be accesed in the output plugins
> > anyways." seems a bit of out of place.  I don't think this is required
> > here.  If you read the previous paragraph in the same document it is
> > written: "Read only access to relations is permitted as long as only
> > relations are accessed that either have been created by
> > <command>initdb</command> in the <literal>pg_catalog</literal> schema,
> > or have been marked as user provided catalog tables using ...".  I
> > think that is sufficient to convey the information that the newly
> > added line by you is trying to convey.
>
> Right.
>
> >
> > 3.
> > + /*
> > + * We don't expect direct calls to this routine when CheckXidAlive is a
> > + * valid transaction id, this should only come through systable_* call.
> > + * CheckXidAlive is set during logical decoding of a transactions.
> > + */
> > + if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called))
> > + elog(ERROR, "unexpected heap_getnext call during logical decoding");
> >
> > How about changing this comment as "We don't expect direct calls to
> > heap_getnext with valid CheckXidAlive for catalog or regular tables.
> > See detailed comments at snapmgr.c where these variables are
> > declared."?  Change the similar comment used in other places in the
> > patch.
> >
> > For this specific API, we can also say "Normally we have such a check
> > at tableam level API but this is called from many places so we need to
> > ensure it here."
>
> Done
>
> >
> > 4.
> > + * If CheckXidAlive is valid, then we check if it aborted. If it did, we error
> > + * out.  We can't directly use TransactionIdDidAbort as after crash such
> > + * transaction might not have been marked as aborted.  See detailed  comments
> > + * at snapmgr.c where the variable is declared.
> > + */
> > +static inline void
> > +HandleConcurrentAbort()
> >
> > Can we change the comments as "Error out, if CheckXidAlive is aborted.
> > We can't directly use TransactionIdDidAbort as after crash such
> > transaction might not have been marked as aborted."
> >
> > After this add one empty line and then we can say something like:
> > "This is a special API to check if CheckXidAlive is aborted in system
> > table scan APIs.  See detailed comments at snapmgr.c where the
> > variable is declared."
> >
> > 5. Shouldn't we add a check in table_scan_sample_next_block and
> > table_scan_sample_next_tuple APIs as well?
>
> Done
>
> > 6.
> > /*
> > + * An xid value pointing to a possibly ongoing (sub)transaction.
> > + * Currently used in logical decoding.  It's possible that such transactions
> > + * can get aborted while the decoding is ongoing.  If CheckXidAlive is set
> > + * then we will set sysbegin_called flag when we call systable_beginscan.  This
> > + * is to ensure that from the pgoutput plugin we should never directly access
> > + * the tableam or heap apis because we are checking for the concurrent abort
> > + * only in systable_* apis.
> > + */
> > +TransactionId CheckXidAlive = InvalidTransactionId;
> > +bool sysbegin_called = false;
> >
> > Can we change the above comment as "CheckXidAlive is a xid value
> > pointing to a possibly ongoing (sub)transaction.  Currently, it is
> > used in logical decoding.  It's possible that such transactions can
> > get aborted while the decoding is ongoing in which case we skip
> > decoding that particular transaction. To ensure that we check whether
> > the CheckXidAlive is aborted after fetching the tuple from system
> > tables.  We also ensure that during logical decoding we never directly
> > access the tableam or heap APIs because we are checking for the
> > concurrent aborts only in systable_* APIs."
>
> Done
>
> I have also fixed one issue in the patch
> v18-0010-Bugfix-handling-of-incomplete-toast-tuple.patch.
>
> Basically, the check, in ReorderBufferLargestTopTXN for selecting the
> largest top transaction was incorrect so I have fixed that.

There was one unrelated bug fix in the v18-0010 patch, reported by Neha
Sharma off-list, so I am sending the updated version.



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, May 7, 2020 at 6:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, May 5, 2020 at 7:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have fixed one more issue in 0010 patch.  The issue was that once
> the transaction is serialized due to the incomplete toast after
> streaming the serialized store was not cleaned up so it was streaming
> the same tuple multiple times.
>

I have reviewed a few patches (003, 004, and 005) and below are my comments.

v20-0003-Extend-the-output-plugin-API-with-stream-methods
----------------------------------------------------------------------------------------
1.
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ Relation relation,
+ ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+   int nrelations, Relation relations[],
+   ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}

In the above and similar APIs, there are parameters like relation
which are not used.  I think you should add some comments atop these
APIs to explain why that is so.  I guess it is because we want to keep
them similar to the non-stream versions of the APIs, and we can't
display the relation or other information as the transaction is still
in progress.

2.
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds limit defined by
<varname>logical_decoding_work_mem</varname> setting.
+    At that point the largest toplevel transaction (measured by
amount of memory
+    currently used for decoded changes) is selected and streamed.
+   </para>

I think we need to explain here the cases/exceptions where we need to
spill even when streaming is enabled, and check whether this matches the
latest implementation; otherwise, update it.

3.
+ * To support streaming, we require change/commit/abort callbacks. The
+ * message callback is optional, similarly to regular output plugins.

/similarly/similar

4.
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ Assert(!ctx->fast_forward);
+
+ /* We're only supposed to call this when streaming is supported. */
+ Assert(ctx->streaming);
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "stream_start";
+ /* state.report_location = apply_lsn; */

Why can't we supply the report_location here?  I think here we need to
report txn->first_lsn if this is the very first stream and
txn->final_lsn if it is any consecutive one.

5.
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ Assert(!ctx->fast_forward);
+
+ /* We're only supposed to call this when streaming is supported. */
+ Assert(ctx->streaming);
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "stream_stop";
+ /* state.report_location = apply_lsn; */

Can't we report txn->final_lsn here?

6. I think it will be good if we can provide an example of streaming
changes via test_decoding at
https://www.postgresql.org/docs/devel/test-decoding.html. I think we
can also explain there why the user is not expected to see the actual
data in the stream.


v20-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
----------------------------------------------------------------------------------------
7.
+ /*
+ * We don't expect direct calls to table_tuple_get_latest_tid with valid
+ * CheckXidAlive  for catalog or regular tables.

There is an extra space between 'CheckXidAlive' and 'for'.  I can see
similar problems in other places as well where this comment is used,
fix those as well.

8.
+/*
+ * CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing in
+ * which case we skip decoding that particular transaction. To ensure that we
+ * check whether the CheckXidAlive is aborted after fetching the tuple from
+ * system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs because we are checking for the
+ * concurrent aborts only in systable_* APIs.
+ */

In this comment, there is an inconsistency in the space used after
completing the sentence. In the part "transaction. To", single space
is used whereas at other places two spaces are used after a full stop.

v20-0005-Implement-streaming-mode-in-ReorderBuffer
-----------------------------------------------------------------------------
9.
Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke new stream API methods. This
happens in ReorderBufferStreamTXN() using about the same logic as
in ReorderBufferCommit() logic.

I think the above part of the commit message needs to be updated.

10.
Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

I don't think this part of the commit message is correct as we
sometimes need to spill even during streaming.  Please check the
entire commit message and update according to the latest
implementation.

11.
- * HeapTupleSatisfiesHistoricMVCC.
+ * tqual.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples with the CID before actually decoding them. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
  */
 static void
 ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
*rb, ReorderBufferTXN *txn)
  dlist_iter iter;
  HASHCTL hash_ctl;

- if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
- return;
-

I don't understand this change.  Why could "INSERT followed by
TRUNCATE" lead to a tuple that comes up for decoding before its
CID?  The patch has made changes based on this assumption in
HeapTupleSatisfiesHistoricMVCC, which appears to be very risky, as the
behavior could depend on whether we are streaming the changes
for an in-progress xact or at the commit of a transaction.  We might
want to generate a test to validate this behavior once.

Also, the comment refers to tqual.c which is wrong as this API is now
in heapam_visibility.c.

12.
+ * setup CheckXidAlive if it's not committed yet. We don't check if the xid
+ * aborted. That will happen during catalog access.  Also reset the
+ * sysbegin_called flag.
  */
- if (txn->base_snapshot == NULL)
+ if (!TransactionIdDidCommit(xid))
  {
- Assert(txn->ninvalidations == 0);
- ReorderBufferCleanupTXN(rb, txn);
- return;
+ CheckXidAlive = xid;
+ bsysscan = false;
  }

In the comment, the flag name 'sysbegin_called' should be bsysscan.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, May 7, 2020 at 6:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 5, 2020 at 7:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have fixed one more issue in 0010 patch.  The issue was that once
> > the transaction is serialized due to the incomplete toast after
> > streaming the serialized store was not cleaned up so it was streaming
> > the same tuple multiple times.
> >
>
> I have reviewed a few patches (003, 004, and 005) and below are my comments.

Thanks for the review.  I am replying to some of the comments where I
have confusion; the others are fine.

>
> v20-0003-Extend-the-output-plugin-API-with-stream-methods
> ----------------------------------------------------------------------------------------
> 1.
> +static void
> +pg_decode_stream_change(LogicalDecodingContext *ctx,
> + ReorderBufferTXN *txn,
> + Relation relation,
> + ReorderBufferChange *change)
> +{
> + OutputPluginPrepareWrite(ctx, true);
> + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> + OutputPluginWrite(ctx, true);
> +}
> +
> +static void
> +pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
> +   int nrelations, Relation relations[],
> +   ReorderBufferChange *change)
> +{
> + OutputPluginPrepareWrite(ctx, true);
> + appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
> + OutputPluginWrite(ctx, true);
> +}
>
> In the above and similar APIs, there are parameters like relation
> which are not used.  I think you should add some comments atop these
> APIs to explain why it is so? I guess it is because we want to keep
> them similar to non-stream version of APIs and we can't display
> relation or other information as the transaction is still in-progress.

I think the interfaces are designed that way because other decoding
plugins might need those parameters, e.g. in pgoutput we need the change
and relation but not here.  We have other similar examples as well, e.g.
pg_decode_message has the txn parameter but does not use it.  Do you
think we still need to add comments?

> 4.
> +static void
> +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> +{
> + LogicalDecodingContext *ctx = cache->private_data;
> + LogicalErrorCallbackState state;
> + ErrorContextCallback errcallback;
> +
> + Assert(!ctx->fast_forward);
> +
> + /* We're only supposed to call this when streaming is supported. */
> + Assert(ctx->streaming);
> +
> + /* Push callback + info on the error context stack */
> + state.ctx = ctx;
> + state.callback_name = "stream_start";
> + /* state.report_location = apply_lsn; */
>
> Why can't we supply the report_location here?  I think here we need to
> report txn->first_lsn if this is the very first stream and
> txn->final_lsn if it is any consecutive one.

I am not sure about this, because for the very first stream we will
report the location of the first lsn of the stream, whereas for a
consecutive stream we will report the last lsn in the stream.

>
> 11.
> - * HeapTupleSatisfiesHistoricMVCC.
> + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> + *
> + * We do build the hash table even if there are no CIDs. That's
> + * because when streaming in-progress transactions we may run into
> + * tuples with the CID before actually decoding them. Think e.g. about
> + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> + * yet when applying the INSERT. So we build a hash table so that
> + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> + *
> + * XXX We might limit this behavior to streaming mode, and just bail
> + * out when decoding transaction at commit time (at which point it's
> + * guaranteed to see all CIDs).
>   */
>  static void
>  ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> *rb, ReorderBufferTXN *txn)
>   dlist_iter iter;
>   HASHCTL hash_ctl;
>
> - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> - return;
> -
>
> I don't understand this change.  Why would "INSERT followed by
> TRUNCATE" could lead to a tuple which can come for decode before its
> CID?

Actually, even if we haven't decoded the DDL operation yet, the tuple
might already have been deleted from the system table by a later
operation.  E.g. while we are streaming the INSERT, it is possible that
the TRUNCATE has already deleted that tuple and set its cmax.  Before
the streaming patch we only replayed the INSERT at commit time, so by
then we had seen all the operations that did DDL and we would have
already prepared the tuple CID hash.

>  The patch has made changes based on this assumption in
> HeapTupleSatisfiesHistoricMVCC which appears to be very risky as the
> behavior could be dependent on whether we are streaming the changes
> for in-progress xact or at the commit of a transaction.  We might want
> to generate a test to once validate this behavior.

We have already added a test case for this: 011_stream_ddl.pl in
test/subscription.

> Also, the comment refers to tqual.c which is wrong as this API is now
> in heapam_visibility.c.

Ok, will fix.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, May 13, 2020 at 11:35 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > v20-0003-Extend-the-output-plugin-API-with-stream-methods
> > ----------------------------------------------------------------------------------------
> > 1.
> > +static void
> > +pg_decode_stream_change(LogicalDecodingContext *ctx,
> > + ReorderBufferTXN *txn,
> > + Relation relation,
> > + ReorderBufferChange *change)
> > +{
> > + OutputPluginPrepareWrite(ctx, true);
> > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> > + OutputPluginWrite(ctx, true);
> > +}
> > +
> > +static void
> > +pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
> > +   int nrelations, Relation relations[],
> > +   ReorderBufferChange *change)
> > +{
> > + OutputPluginPrepareWrite(ctx, true);
> > + appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
> > + OutputPluginWrite(ctx, true);
> > +}
> >
> > In the above and similar APIs, there are parameters like relation
> > which are not used.  I think you should add some comments atop these
> > APIs to explain why it is so? I guess it is because we want to keep
> > them similar to non-stream version of APIs and we can't display
> > relation or other information as the transaction is still in-progress.
>
> I think because the interfaces are designed that way because other
> decoding plugins might need it e.g. in pgoutput we need change and
> relation but not here.  We have other similar examples also e.g.
> pg_decode_message has the parameter txn but not used.  Do you think we
> still need to add comments?
>

In that case, we can leave it as is, but let's ensure that we are not
exposing any parameter which is not used, and if there is one for some
reason, we should document it.  I will also look into this.

> > 4.
> > +static void
> > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> > +{
> > + LogicalDecodingContext *ctx = cache->private_data;
> > + LogicalErrorCallbackState state;
> > + ErrorContextCallback errcallback;
> > +
> > + Assert(!ctx->fast_forward);
> > +
> > + /* We're only supposed to call this when streaming is supported. */
> > + Assert(ctx->streaming);
> > +
> > + /* Push callback + info on the error context stack */
> > + state.ctx = ctx;
> > + state.callback_name = "stream_start";
> > + /* state.report_location = apply_lsn; */
> >
> > Why can't we supply the report_location here?  I think here we need to
> > report txn->first_lsn if this is the very first stream and
> > txn->final_lsn if it is any consecutive one.
>
> I am not sure about this,  Because for the very first stream we will
> report the location of the first lsn of the stream and for the
> consecutive stream we will report the last lsn in the stream.
>

Yeah, that doesn't seem to be consistent.  How about if we get it as an
additional parameter?  The caller can pass the lsn of the very first
change it is trying to decode in this stream.
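
To make the idea concrete, a rough sketch of such a signature change (the
first_lsn parameter and its use are assumptions, not the actual patch code)
could be:

static void
stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
                        XLogRecPtr first_lsn)
{
    LogicalDecodingContext *ctx = cache->private_data;
    LogicalErrorCallbackState state;
    ErrorContextCallback errcallback;

    Assert(!ctx->fast_forward);

    /* We're only supposed to call this when streaming is supported. */
    Assert(ctx->streaming);

    /* Push callback + info on the error context stack */
    state.ctx = ctx;
    state.callback_name = "stream_start";
    /* report the LSN of the first change decoded in this stream */
    state.report_location = first_lsn;

    /* (error context push, callback invocation and pop as in the other
     * *_cb_wrapper functions) */
}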

> >
> > 11.
> > - * HeapTupleSatisfiesHistoricMVCC.
> > + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> > + *
> > + * We do build the hash table even if there are no CIDs. That's
> > + * because when streaming in-progress transactions we may run into
> > + * tuples with the CID before actually decoding them. Think e.g. about
> > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> > + * yet when applying the INSERT. So we build a hash table so that
> > + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> > + *
> > + * XXX We might limit this behavior to streaming mode, and just bail
> > + * out when decoding transaction at commit time (at which point it's
> > + * guaranteed to see all CIDs).
> >   */
> >  static void
> >  ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> > *rb, ReorderBufferTXN *txn)
> >   dlist_iter iter;
> >   HASHCTL hash_ctl;
> >
> > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> > - return;
> > -
> >
> > I don't understand this change.  Why would "INSERT followed by
> > TRUNCATE" could lead to a tuple which can come for decode before its
> > CID?
>
> Actually, even if we haven't decoded the DDL operation but in the
> actual system table the tuple might have been deleted from the next
> operation.  e.g. while we are streaming the INSERT it is possible that
> the truncate has already deleted that tuple and set the max for the
> tuple.  So before streaming patch, we were only streaming the INSERT
> only on commit so by that time we had got all the operation which has
> done DDL and we would have already prepared tuple CID hash.
>

Okay, but for that case, how good is it that we always allow the CID
hash table to be built even if there are no catalog changes in the TXN
(see changes in ReorderBufferBuildTupleCidHash)?  Can't we detect that
while resolving the cmin/cmax?

Few more comments for v20-0005-Implement-streaming-mode-in-ReorderBuffer:
----------------------------------------------------------------------------------------------------------------
1.
/*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)

It seems to me the above comment change is not required as per the latest patch.

2.
 * For subtransactions, we only mark them as streamed when there are
+ * any changes in them.
+ *
+ * We do it this way because of aborts - we don't want to send aborts
+ * for XIDs the downstream is not aware of. And of course, it always
+ * knows about the toplevel xact (we send the XID in all messages),
+ * but we never stream XIDs of empty subxacts.
+ */
+ if ((!txn->toptxn) || (txn->nentries_mem != 0))
+ txn->txn_flags |= RBTXN_IS_STREAMED;

/when there are any changes in them/when there are changes in them.  I
think we don't need 'any' in the above sentence.

3.
And, during catalog scan we can check the status of the xid and
+ * if it is aborted we will report a specific error that we can ignore.  We
+ * might have already streamed some of the changes for the aborted
+ * (sub)transaction, but that is fine because when we decode the abort we will
+ * stream abort message to truncate the changes in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)

In the above comment, I don't think it is right to say that we ignore
the error raised due to the aborted transaction.  We need to say that
we discard the already streamed changes on such an error.

4.
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
  /*
- * If this transaction has no snapshot, it didn't make any changes to the
- * database, so there's nothing to decode.  Note that
- * ReorderBufferCommitChild will have transferred any snapshots from
- * subtransactions if there were any.
+ * setup CheckXidAlive if it's not committed yet. We don't check if the xid
+ * aborted. That will happen during catalog access.  Also reset the
+ * sysbegin_called flag.
  */
- if (txn->base_snapshot == NULL)
+ if (!TransactionIdDidCommit(xid))
  {
- Assert(txn->ninvalidations == 0);
- ReorderBufferCleanupTXN(rb, txn);
- return;
+ CheckXidAlive = xid;
+ bsysscan = false;
  }

I think this function is inline because it needs to be called for each
change.  If that is the case (and even otherwise), isn't it better to
check whether the passed xid is the same as CheckXidAlive before calling
TransactionIdDidCommit?  TransactionIdDidCommit can be costly, and
calling it for each change might not be a good idea.

5.
setup CheckXidAlive if it's not committed yet. We don't check if the xid
+ * aborted. That will happen during catalog access.  Also reset the
+ * sysbegin_called flag.

/if the xid aborted/if the xid is aborted.  missing comma after Also.

6.
ReorderBufferProcessTXN()
{
..
- /* build data to be able to lookup the CommandIds of catalog tuples */
+ /*
+ * build data to be able to lookup the CommandIds of catalog tuples
+ */
  ReorderBufferBuildTupleCidHash(rb, txn);
..
}

Is there a need to change the formatting of the comment?

7.
ReorderBufferProcessTXN()
{
..
  if (using_subtxn)
- BeginInternalSubTransaction("replay");
+ BeginInternalSubTransaction("stream");
  else
  StartTransactionCommand();
..
}

I am not sure unconditionally changing "replay" to "stream" is a good
idea.  How about something like BeginInternalSubTransaction(streaming
? "stream" : "replay");?

8.
@@ -1588,8 +1766,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
  * use as a normal record. It'll be cleaned up at the end
  * of INSERT processing.
  */
- if (specinsert == NULL)
- elog(ERROR, "invalid ordering of speculative insertion changes");

You have removed this check, but all other handling of specinsert is
the same as far as this patch is concerned.  Why so?

9.
@@ -1676,8 +1860,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
  * freed/reused while restoring spooled data from
  * disk.
  */
- Assert(change->data.tp.newtuple != NULL);
-
  dlist_delete(&change->node);

Why is this Assert removed?

10.
@@ -1753,7 +1935,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
  relations[nrelations++] = relation;
  }

- rb->apply_truncate(rb, txn, nrelations, relations, change);
+ if (streaming)
+ {
+ rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+ /* Remember that we have sent some data. */
+ change->txn->any_data_sent = true;
+ }
+ else
+ rb->apply_truncate(rb, txn, nrelations, relations, change);

Can we encapsulate this in a separate function like
ReorderBufferApplyTruncate or something like that?  Basically, rather
than having the streaming check in this function, let's do it in some
other internal function.  And we can likewise do it for all the
streaming checks in this function, or at least wherever it is feasible.
That will make this function look clean.
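
For illustration, such a helper (the name and the streaming parameter are
assumptions) could look like:

static void
ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
                           int nrelations, Relation *relations,
                           ReorderBufferChange *change, bool streaming)
{
    if (streaming)
    {
        rb->stream_truncate(rb, txn, nrelations, relations, change);

        /* Remember that we have sent some data. */
        change->txn->any_data_sent = true;
    }
    else
        rb->apply_truncate(rb, txn, nrelations, relations, change);
}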

11.
+ * We currently can only decode a transaction's contents when its commit
+ * record is read because that's the only place where we know about cache
+ * invalidations. Thus, once a toplevel commit is read, we iterate over the top
+ * and subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
{
..

I think the above comment needs to be updated after this patch.  This
API can now be used during the decode of both an in-progress and a
committed transaction.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, May 13, 2020 at 11:35 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > v20-0003-Extend-the-output-plugin-API-with-stream-methods
> > > ----------------------------------------------------------------------------------------
> > > 1.
> > > +static void
> > > +pg_decode_stream_change(LogicalDecodingContext *ctx,
> > > + ReorderBufferTXN *txn,
> > > + Relation relation,
> > > + ReorderBufferChange *change)
> > > +{
> > > + OutputPluginPrepareWrite(ctx, true);
> > > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> > > + OutputPluginWrite(ctx, true);
> > > +}
> > > +
> > > +static void
> > > +pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
> > > +   int nrelations, Relation relations[],
> > > +   ReorderBufferChange *change)
> > > +{
> > > + OutputPluginPrepareWrite(ctx, true);
> > > + appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
> > > + OutputPluginWrite(ctx, true);
> > > +}
> > >
> > > In the above and similar APIs, there are parameters like relation
> > > which are not used.  I think you should add some comments atop these
> > > APIs to explain why it is so? I guess it is because we want to keep
> > > them similar to non-stream version of APIs and we can't display
> > > relation or other information as the transaction is still in-progress.
> >
> > I think because the interfaces are designed that way because other
> > decoding plugins might need it e.g. in pgoutput we need change and
> > relation but not here.  We have other similar examples also e.g.
> > pg_decode_message has the parameter txn but not used.  Do you think we
> > still need to add comments?
> >
>
> In that case, we can leave but lets ensure that we are not exposing
> any parameter which is not used and if there is any due to some
> reason, we should document it. I will also look into this.
>
> > > 4.
> > > +static void
> > > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> > > +{
> > > + LogicalDecodingContext *ctx = cache->private_data;
> > > + LogicalErrorCallbackState state;
> > > + ErrorContextCallback errcallback;
> > > +
> > > + Assert(!ctx->fast_forward);
> > > +
> > > + /* We're only supposed to call this when streaming is supported. */
> > > + Assert(ctx->streaming);
> > > +
> > > + /* Push callback + info on the error context stack */
> > > + state.ctx = ctx;
> > > + state.callback_name = "stream_start";
> > > + /* state.report_location = apply_lsn; */
> > >
> > > Why can't we supply the report_location here?  I think here we need to
> > > report txn->first_lsn if this is the very first stream and
> > > txn->final_lsn if it is any consecutive one.
> >
> > I am not sure about this,  Because for the very first stream we will
> > report the location of the first lsn of the stream and for the
> > consecutive stream we will report the last lsn in the stream.
> >
>
> Yeah, that doesn't seem to be consistent.  How about if get it as an
> additional parameter?  The caller can pass the lsn of the very first
> change it is trying to decode in this stream.

Hmm, I think we need to call ReorderBufferIterTXNInit and
ReorderBufferIterTXNNext and get the first change of the stream; only
after that shall we call stream start, and then we can find out the
first LSN of the stream.  I will see how to do this so that it doesn't
look awkward.  Basically, as of now, our code has this layout:

1. stream_start;
2. ReorderBufferIterTXNInit(rb, txn, &iterstate);
while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
{
stream changes
}
3. stream stop

So if we want to know the first lsn of this stream, then we shall do
something like this:

1. ReorderBufferIterTXNInit(rb, txn, &iterstate);
while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
{
    2. if first_change
       stream_start;

   stream changes
}
3. stream stop
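
In C terms, layout 2 might look roughly like the sketch below (the
stream_started flag and the extra LSN argument to stream_start are
assumptions based on the suggestion above, not the current patch code):

bool        stream_started = false;
ReorderBufferChange *change;

ReorderBufferIterTXNInit(rb, txn, &iterstate);
while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
{
    /* Start the stream lazily, once the first change (and its LSN) is known. */
    if (!stream_started)
    {
        rb->stream_start(rb, txn, change->lsn);
        stream_started = true;
    }

    /* ... stream the individual change here ... */
}

/* Close the stream only if we actually opened one. */
if (stream_started)
    rb->stream_stop(rb, txn);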

> > >
> > > 11.
> > > - * HeapTupleSatisfiesHistoricMVCC.
> > > + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> > > + *
> > > + * We do build the hash table even if there are no CIDs. That's
> > > + * because when streaming in-progress transactions we may run into
> > > + * tuples with the CID before actually decoding them. Think e.g. about
> > > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> > > + * yet when applying the INSERT. So we build a hash table so that
> > > + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> > > + *
> > > + * XXX We might limit this behavior to streaming mode, and just bail
> > > + * out when decoding transaction at commit time (at which point it's
> > > + * guaranteed to see all CIDs).
> > >   */
> > >  static void
> > >  ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> > > *rb, ReorderBufferTXN *txn)
> > >   dlist_iter iter;
> > >   HASHCTL hash_ctl;
> > >
> > > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> > > - return;
> > > -
> > >
> > > I don't understand this change.  Why would "INSERT followed by
> > > TRUNCATE" could lead to a tuple which can come for decode before its
> > > CID?
> >
> > Actually, even if we haven't decoded the DDL operation but in the
> > actual system table the tuple might have been deleted from the next
> > operation.  e.g. while we are streaming the INSERT it is possible that
> > the truncate has already deleted that tuple and set the max for the
> > tuple.  So before streaming patch, we were only streaming the INSERT
> > only on commit so by that time we had got all the operation which has
> > done DDL and we would have already prepared tuple CID hash.
> >
>
> Okay, but I think for that case how good is that we always allow CID
> hash table to be built even if there are no catalog changes in TXN
> (see changes in ReorderBufferBuildTupleCidHash).  Can't we detect that
> while resolving the cmin/cmax?

Maybe in ResolveCminCmaxDuringDecoding we can check whether
tuplecid_data is NULL; if so, we can return it as unresolved, and then
the caller can decide what to do based on that.
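
A minimal sketch of that idea (not the actual patch code), inside
ResolveCminCmaxDuringDecoding:

    /*
     * If no CID hash was built (e.g. no catalog changes were decoded yet),
     * report the cmin/cmax as unresolved and let the caller decide.
     */
    if (tuplecid_data == NULL)
        return false;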


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, May 13, 2020 at 9:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > > > 4.
> > > > +static void
> > > > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> > > > +{
> > > > + LogicalDecodingContext *ctx = cache->private_data;
> > > > + LogicalErrorCallbackState state;
> > > > + ErrorContextCallback errcallback;
> > > > +
> > > > + Assert(!ctx->fast_forward);
> > > > +
> > > > + /* We're only supposed to call this when streaming is supported. */
> > > > + Assert(ctx->streaming);
> > > > +
> > > > + /* Push callback + info on the error context stack */
> > > > + state.ctx = ctx;
> > > > + state.callback_name = "stream_start";
> > > > + /* state.report_location = apply_lsn; */
> > > >
> > > > Why can't we supply the report_location here?  I think here we need to
> > > > report txn->first_lsn if this is the very first stream and
> > > > txn->final_lsn if it is any consecutive one.
> > >
> > > I am not sure about this,  Because for the very first stream we will
> > > report the location of the first lsn of the stream and for the
> > > consecutive stream we will report the last lsn in the stream.
> > >
> >
> > Yeah, that doesn't seem to be consistent.  How about if get it as an
> > additional parameter?  The caller can pass the lsn of the very first
> > change it is trying to decode in this stream.
>
> Hmm,  I think we need to call ReorderBufferIterTXNInit and
> ReorderBufferIterTXNNext and get the first change of the stream after
> that we shall call stream start then we can find out the first LSN of
> the stream.   I will see how to do so that it doesn't look awkward.
> Basically, as of now, our code is of this layout.
>
> 1. stream_start;
> 2. ReorderBufferIterTXNInit(rb, txn, &iterstate);
> while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
> {
> stream changes
> }
> 3. stream stop
>
> So if we want to know the first lsn of this stream then we shall do
> something like this
>
> 1. ReorderBufferIterTXNInit(rb, txn, &iterstate);
> while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
> {
>     2. if first_change
>        stream_start;
>
>    stream changes
> }
> 3. stream stop
>

Yeah, something like that would work.  I think you need to check that
it is the first change only for the 'streaming' mode.

> > > >
> > > > 11.
> > > > - * HeapTupleSatisfiesHistoricMVCC.
> > > > + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> > > > + *
> > > > + * We do build the hash table even if there are no CIDs. That's
> > > > + * because when streaming in-progress transactions we may run into
> > > > + * tuples with the CID before actually decoding them. Think e.g. about
> > > > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> > > > + * yet when applying the INSERT. So we build a hash table so that
> > > > + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> > > > + *
> > > > + * XXX We might limit this behavior to streaming mode, and just bail
> > > > + * out when decoding transaction at commit time (at which point it's
> > > > + * guaranteed to see all CIDs).
> > > >   */
> > > >  static void
> > > >  ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> > > > *rb, ReorderBufferTXN *txn)
> > > >   dlist_iter iter;
> > > >   HASHCTL hash_ctl;
> > > >
> > > > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> > > > - return;
> > > > -
> > > >
> > > > I don't understand this change.  Why would "INSERT followed by
> > > > TRUNCATE" could lead to a tuple which can come for decode before its
> > > > CID?
> > >
> > > Actually, even if we haven't decoded the DDL operation but in the
> > > actual system table the tuple might have been deleted from the next
> > > operation.  e.g. while we are streaming the INSERT it is possible that
> > > the truncate has already deleted that tuple and set the max for the
> > > tuple.  So before streaming patch, we were only streaming the INSERT
> > > only on commit so by that time we had got all the operation which has
> > > done DDL and we would have already prepared tuple CID hash.
> > >
> >
> > Okay, but I think for that case how good is that we always allow CID
> > hash table to be built even if there are no catalog changes in TXN
> > (see changes in ReorderBufferBuildTupleCidHash).  Can't we detect that
> > while resolving the cmin/cmax?
>
> Maybe in ResolveCminCmaxDuringDecoding we can see if tuplecid_data is
> NULL then we can return as unresolved and then caller can take a call
> based on that.
>

Yeah, and add appropriate comments about why we are doing so and in
what kind of scenario that can happen.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, May 7, 2020 at 6:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 5, 2020 at 7:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have fixed one more issue in 0010 patch.  The issue was that once
> > the transaction is serialized due to the incomplete toast after
> > streaming the serialized store was not cleaned up so it was streaming
> > the same tuple multiple times.
> >
>
> I have reviewed a few patches (003, 004, and 005) and below are my comments.
>
> v20-0003-Extend-the-output-plugin-API-with-stream-methods
> ----------------------------------------------------------------------------------------
> 2.
> +   <para>
> +    Similar to spill-to-disk behavior, streaming is triggered when the total
> +    amount of changes decoded from the WAL (for all in-progress transactions)
> +    exceeds limit defined by
> <varname>logical_decoding_work_mem</varname> setting.
> +    At that point the largest toplevel transaction (measured by
> amount of memory
> +    currently used for decoded changes) is selected and streamed.
> +   </para>
>
> I think we need to explain here the cases/exception where we need to
> spill even when stream is enabled and check if this is per latest
> implementation, otherwise, update it.

Done

> 3.
> + * To support streaming, we require change/commit/abort callbacks. The
> + * message callback is optional, similarly to regular output plugins.
>
> /similarly/similar

Done

> 4.
> +static void
> +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> +{
> + LogicalDecodingContext *ctx = cache->private_data;
> + LogicalErrorCallbackState state;
> + ErrorContextCallback errcallback;
> +
> + Assert(!ctx->fast_forward);
> +
> + /* We're only supposed to call this when streaming is supported. */
> + Assert(ctx->streaming);
> +
> + /* Push callback + info on the error context stack */
> + state.ctx = ctx;
> + state.callback_name = "stream_start";
> + /* state.report_location = apply_lsn; */
>
> Why can't we supply the report_location here?  I think here we need to
> report txn->first_lsn if this is the very first stream and
> txn->final_lsn if it is any consecutive one.

Done

> 5.
> +static void
> +stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> +{
> + LogicalDecodingContext *ctx = cache->private_data;
> + LogicalErrorCallbackState state;
> + ErrorContextCallback errcallback;
> +
> + Assert(!ctx->fast_forward);
> +
> + /* We're only supposed to call this when streaming is supported. */
> + Assert(ctx->streaming);
> +
> + /* Push callback + info on the error context stack */
> + state.ctx = ctx;
> + state.callback_name = "stream_stop";
> + /* state.report_location = apply_lsn; */
>
> Can't we report txn->final_lsn here

We are already setting this to txn->final_lsn in the 0006 patch, but I
have moved that into this patch now.

> 6. I think it will be good if we can provide an example of streaming
> changes via test_decoding at
> https://www.postgresql.org/docs/devel/test-decoding.html. I think we
> can also explain there why the user is not expected to see the actual
> data in the stream.

I have a few problems to solve here.
- For streaming transactions as well, shall we show the actual values,
or shall we keep it as it currently is in the patch
(appendStringInfo(ctx->out, "streaming change for TXN %u",
txn->xid);)?  I think we should show the actual values instead of what
we are doing now.
- In the example we cannot show a real case, because to make an
in-progress transaction stream its changes we might have to insert a
lot of tuples.  I think we can show partial output?

> v20-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> ----------------------------------------------------------------------------------------
> 7.
> + /*
> + * We don't expect direct calls to table_tuple_get_latest_tid with valid
> + * CheckXidAlive  for catalog or regular tables.
>
> There is an extra space between 'CheckXidAlive' and 'for'.  I can see
> similar problems in other places as well where this comment is used,
> fix those as well.

Done

> 8.
> +/*
> + * CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
> + * transaction.  Currently, it is used in logical decoding.  It's possible
> + * that such transactions can get aborted while the decoding is ongoing in
> + * which case we skip decoding that particular transaction. To ensure that we
> + * check whether the CheckXidAlive is aborted after fetching the tuple from
> + * system tables.  We also ensure that during logical decoding we never
> + * directly access the tableam or heap APIs because we are checking for the
> + * concurrent aborts only in systable_* APIs.
> + */
>
> In this comment, there is an inconsistency in the space used after
> completing the sentence. In the part "transaction. To", single space
> is used whereas at other places two spaces are used after a full stop.

Done


> v20-0005-Implement-streaming-mode-in-ReorderBuffer
> -----------------------------------------------------------------------------
> 9.
> Implement streaming mode in ReorderBuffer
>
> Instead of serializing the transaction to disk after reaching the
> maximum number of changes in memory (4096 changes), we consume the
> changes we have in memory and invoke new stream API methods. This
> happens in ReorderBufferStreamTXN() using about the same logic as
> in ReorderBufferCommit() logic.
>
> I think the above part of the commit message needs to be updated.

Done

> 10.
> Theoretically, we could get rid of the k-way merge, and append the
> changes to the toplevel xact directly (and remember the position
> in the list in case the subxact gets aborted later).
>
> I don't think this part of the commit message is correct as we
> sometimes need to spill even during streaming.  Please check the
> entire commit message and update according to the latest
> implementation.

Done

> 11.
> - * HeapTupleSatisfiesHistoricMVCC.
> + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> + *
> + * We do build the hash table even if there are no CIDs. That's
> + * because when streaming in-progress transactions we may run into
> + * tuples with the CID before actually decoding them. Think e.g. about
> + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> + * yet when applying the INSERT. So we build a hash table so that
> + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> + *
> + * XXX We might limit this behavior to streaming mode, and just bail
> + * out when decoding transaction at commit time (at which point it's
> + * guaranteed to see all CIDs).
>   */
>  static void
>  ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> *rb, ReorderBufferTXN *txn)
>   dlist_iter iter;
>   HASHCTL hash_ctl;
>
> - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> - return;
> -
>
> I don't understand this change.  Why would "INSERT followed by
> TRUNCATE" could lead to a tuple which can come for decode before its
> CID?  The patch has made changes based on this assumption in
> HeapTupleSatisfiesHistoricMVCC which appears to be very risky as the
> behavior could be dependent on whether we are streaming the changes
> for in-progress xact or at the commit of a transaction.  We might want
> to generate a test to once validate this behavior.
>
> Also, the comment refers to tqual.c which is wrong as this API is now
> in heapam_visibility.c.

Done.

> 12.
> + * setup CheckXidAlive if it's not committed yet. We don't check if the xid
> + * aborted. That will happen during catalog access.  Also reset the
> + * sysbegin_called flag.
>   */
> - if (txn->base_snapshot == NULL)
> + if (!TransactionIdDidCommit(xid))
>   {
> - Assert(txn->ninvalidations == 0);
> - ReorderBufferCleanupTXN(rb, txn);
> - return;
> + CheckXidAlive = xid;
> + bsysscan = false;
>   }
>
> In the comment, the flag name 'sysbegin_called' should be bsysscan.

Done



--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, May 13, 2020 at 11:35 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > v20-0003-Extend-the-output-plugin-API-with-stream-methods
> > > ----------------------------------------------------------------------------------------
> > > 1.
> > > +static void
> > > +pg_decode_stream_change(LogicalDecodingContext *ctx,
> > > + ReorderBufferTXN *txn,
> > > + Relation relation,
> > > + ReorderBufferChange *change)
> > > +{
> > > + OutputPluginPrepareWrite(ctx, true);
> > > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> > > + OutputPluginWrite(ctx, true);
> > > +}
> > > +
> > > +static void
> > > +pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
> > > +   int nrelations, Relation relations[],
> > > +   ReorderBufferChange *change)
> > > +{
> > > + OutputPluginPrepareWrite(ctx, true);
> > > + appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
> > > + OutputPluginWrite(ctx, true);
> > > +}
> > >
> > > In the above and similar APIs, there are parameters like relation
> > > which are not used.  I think you should add some comments atop these
> > > APIs to explain why it is so? I guess it is because we want to keep
> > > them similar to non-stream version of APIs and we can't display
> > > relation or other information as the transaction is still in-progress.
> >
> > I think because the interfaces are designed that way because other
> > decoding plugins might need it e.g. in pgoutput we need change and
> > relation but not here.  We have other similar examples also e.g.
> > pg_decode_message has the parameter txn but not used.  Do you think we
> > still need to add comments?
> >
>
> In that case, we can leave but lets ensure that we are not exposing
> any parameter which is not used and if there is any due to some
> reason, we should document it. I will also look into this.

Ok

> > > 4.
> > > +static void
> > > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> > > +{
> > > + LogicalDecodingContext *ctx = cache->private_data;
> > > + LogicalErrorCallbackState state;
> > > + ErrorContextCallback errcallback;
> > > +
> > > + Assert(!ctx->fast_forward);
> > > +
> > > + /* We're only supposed to call this when streaming is supported. */
> > > + Assert(ctx->streaming);
> > > +
> > > + /* Push callback + info on the error context stack */
> > > + state.ctx = ctx;
> > > + state.callback_name = "stream_start";
> > > + /* state.report_location = apply_lsn; */
> > >
> > > Why can't we supply the report_location here?  I think here we need to
> > > report txn->first_lsn if this is the very first stream and
> > > txn->final_lsn if it is any consecutive one.
> >
> > I am not sure about this,  Because for the very first stream we will
> > report the location of the first lsn of the stream and for the
> > consecutive stream we will report the last lsn in the stream.
> >
>
> Yeah, that doesn't seem to be consistent.  How about if get it as an
> additional parameter?  The caller can pass the lsn of the very first
> change it is trying to decode in this stream.

Done

> > > 11.
> > > - * HeapTupleSatisfiesHistoricMVCC.
> > > + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> > > + *
> > > + * We do build the hash table even if there are no CIDs. That's
> > > + * because when streaming in-progress transactions we may run into
> > > + * tuples with the CID before actually decoding them. Think e.g. about
> > > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> > > + * yet when applying the INSERT. So we build a hash table so that
> > > + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> > > + *
> > > + * XXX We might limit this behavior to streaming mode, and just bail
> > > + * out when decoding transaction at commit time (at which point it's
> > > + * guaranteed to see all CIDs).
> > >   */
> > >  static void
> > >  ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> > > *rb, ReorderBufferTXN *txn)
> > >   dlist_iter iter;
> > >   HASHCTL hash_ctl;
> > >
> > > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> > > - return;
> > > -
> > >
> > > I don't understand this change.  Why would "INSERT followed by
> > > TRUNCATE" could lead to a tuple which can come for decode before its
> > > CID?
> >
> > Actually, even if we haven't decoded the DDL operation but in the
> > actual system table the tuple might have been deleted from the next
> > operation.  e.g. while we are streaming the INSERT it is possible that
> > the truncate has already deleted that tuple and set the max for the
> > tuple.  So before streaming patch, we were only streaming the INSERT
> > only on commit so by that time we had got all the operation which has
> > done DDL and we would have already prepared tuple CID hash.
> >
>
> Okay, but I think for that case how good is that we always allow CID
> hash table to be built even if there are no catalog changes in TXN
> (see changes in ReorderBufferBuildTupleCidHash).  Can't we detect that
> while resolving the cmin/cmax?

Done

>
> Few more comments for v20-0005-Implement-streaming-mode-in-ReorderBuffer:
> ----------------------------------------------------------------------------------------------------------------
> 1.
> /*
> - * Binary heap comparison function.
> + * Binary heap comparison function (regular non-streaming iterator).
>   */
>  static int
>  ReorderBufferIterCompare(Datum a, Datum b, void *arg)
>
> It seems to me the above comment change is not required as per the latest patch.

Done

> 2.
>  * For subtransactions, we only mark them as streamed when there are
> + * any changes in them.
> + *
> + * We do it this way because of aborts - we don't want to send aborts
> + * for XIDs the downstream is not aware of. And of course, it always
> + * knows about the toplevel xact (we send the XID in all messages),
> + * but we never stream XIDs of empty subxacts.
> + */
> + if ((!txn->toptxn) || (txn->nentries_mem != 0))
> + txn->txn_flags |= RBTXN_IS_STREAMED;
>
> /when there are any changes in them/when there are changes in them.  I
> think we don't need 'any' in the above sentence.

Done

> 3.
> And, during catalog scan we can check the status of the xid and
> + * if it is aborted we will report a specific error that we can ignore.  We
> + * might have already streamed some of the changes for the aborted
> + * (sub)transaction, but that is fine because when we decode the abort we will
> + * stream abort message to truncate the changes in the subscriber.
> + */
> +static inline void
> +SetupCheckXidLive(TransactionId xid)
>
> In the above comment, I don't think it is right to say that we ignore
> the error raised due to the aborted transaction.  We need to say that
> we discard the already streamed changes on such an error.

Done.

> 4.
> +static inline void
> +SetupCheckXidLive(TransactionId xid)
> +{
>   /*
> - * If this transaction has no snapshot, it didn't make any changes to the
> - * database, so there's nothing to decode.  Note that
> - * ReorderBufferCommitChild will have transferred any snapshots from
> - * subtransactions if there were any.
> + * setup CheckXidAlive if it's not committed yet. We don't check if the xid
> + * aborted. That will happen during catalog access.  Also reset the
> + * sysbegin_called flag.
>   */
> - if (txn->base_snapshot == NULL)
> + if (!TransactionIdDidCommit(xid))
>   {
> - Assert(txn->ninvalidations == 0);
> - ReorderBufferCleanupTXN(rb, txn);
> - return;
> + CheckXidAlive = xid;
> + bsysscan = false;
>   }
>
> I think this function is inline as it needs to be called for each
> change. If that is the case and otherwise also, isn't it better that
> we check if passed xid is the same as CheckXidAlive before checking
> TransactionIdDidCommit as TransactionIdDidCommit can be costly and
> calling it for each change might not be a good idea?

Done.  Also, I think it is good to check TransactionIdIsInProgress
instead of !TransactionIdDidCommit.  I have changed that as well.
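
Roughly, the combined change could look like the following sketch (not the
exact patch code):

static inline void
SetupCheckXidLive(TransactionId xid)
{
    /* Avoid the (potentially costly) status lookup if we are already
     * watching this xid. */
    if (TransactionIdEquals(CheckXidAlive, xid))
        return;

    /*
     * Set up CheckXidAlive if the xid is still in progress.  Whether it
     * later aborts is detected during catalog access.  Also reset the
     * bsysscan flag.
     */
    if (TransactionIdIsInProgress(xid))
    {
        CheckXidAlive = xid;
        bsysscan = false;
    }
}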

> 5.
> setup CheckXidAlive if it's not committed yet. We don't check if the xid
> + * aborted. That will happen during catalog access.  Also reset the
> + * sysbegin_called flag.
>
> /if the xid aborted/if the xid is aborted.  missing comma after Also.

Done

> 6.
> ReorderBufferProcessTXN()
> {
> ..
> - /* build data to be able to lookup the CommandIds of catalog tuples */
> + /*
> + * build data to be able to lookup the CommandIds of catalog tuples
> + */
>   ReorderBufferBuildTupleCidHash(rb, txn);
> ..
> }
>
> Is there a need to change the formatting of the comment?

No need; changed it back.

>
> 7.
> ReorderBufferProcessTXN()
> {
> ..
>   if (using_subtxn)
> - BeginInternalSubTransaction("replay");
> + BeginInternalSubTransaction("stream");
>   else
>   StartTransactionCommand();
> ..
> }
>
> I am not sure changing unconditionally "replay" to "stream" is a good
> idea.  How about something like BeginInternalSubTransaction(streaming
> ? "stream" : "replay");?

Done

> 8.
> @@ -1588,8 +1766,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
>   * use as a normal record. It'll be cleaned up at the end
>   * of INSERT processing.
>   */
> - if (specinsert == NULL)
> - elog(ERROR, "invalid ordering of speculative insertion changes");
>
> You have removed this check but all other handling of specinsert is
> same as far as this patch is concerned.  Why so?

Seems like a merge issue, or a leftover from the old design of the
toast handling where we were streaming with the partial tuple.
Fixed now.

> 9.
> @@ -1676,8 +1860,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
>   * freed/reused while restoring spooled data from
>   * disk.
>   */
> - Assert(change->data.tp.newtuple != NULL);
> -
>   dlist_delete(&change->node);
>
> Why is this Assert removed?

Same cause as above so fixed.

> 10.
> @@ -1753,7 +1935,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
>   relations[nrelations++] = relation;
>   }
>
> - rb->apply_truncate(rb, txn, nrelations, relations, change);
> + if (streaming)
> + {
> + rb->stream_truncate(rb, txn, nrelations, relations, change);
> +
> + /* Remember that we have sent some data. */
> + change->txn->any_data_sent = true;
> + }
> + else
> + rb->apply_truncate(rb, txn, nrelations, relations, change);
>
> Can we encapsulate this in a separate function like
> ReorderBufferApplyTruncate or something like that?  Basically, rather
> than having streaming check in this function, lets do it in some other
> internal function.  And we can likewise do it for all the streaming
> checks in this function or at least whereever it is feasible.  That
> will make this function look clean.

Done for truncate and change.  I think we can create a few more such
functions for
start/stop and cleanup handling on error.  I will work on that.

> 11.
> + * We currently can only decode a transaction's contents when its commit
> + * record is read because that's the only place where we know about cache
> + * invalidations. Thus, once a toplevel commit is read, we iterate over the top
> + * and subtransactions (using a k-way merge) and replay the changes in lsn
> + * order.
> + */
> +void
> +ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> {
> ..
>
> I think the above comment needs to be updated after this patch. This
> API can now be used during the decode of both a in-progress and a
> committed transaction.

Done


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > 6. I think it will be good if we can provide an example of streaming
> > changes via test_decoding at
> > https://www.postgresql.org/docs/devel/test-decoding.html. I think we
> > can also explain there why the user is not expected to see the actual
> > data in the stream.
>
> I have a few problems to solve here.
> -  With streaming transaction also shall we show the actual values or
> we shall do like it is currently in the patch
> (appendStringInfo(ctx->out, "streaming change for TXN %u",
> txn->xid);).  I think we should show the actual values instead of what
> we are doing now.
>

I think the reason we don't want to display the tuple at this stage is
that it is not clear by this time whether the transaction will commit or
abort.  I am not sure if displaying the contents of aborted
transactions is a good idea, but if there is a reason for doing so, we
can do it later as well.

> - In the example we can not show a real example, because of the
> in-progress transaction to show the changes, we might have to
> implement a lot of tuple.  I think we can show the partial output?
>

I think we can display what the API will actually display; what is the
confusion here?

I have a few more comments on the previous version of the patch
v20-0005-Implement-streaming-mode-in-ReorderBuffer.  If you have already
fixed any of them, then leave those and fix the others.

Review comments:
------------------------------
1.
@@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
TransactionId xid,
  }

  case REORDER_BUFFER_CHANGE_MESSAGE:
- rb->message(rb, txn, change->lsn, true,
- change->data.msg.prefix,
- change->data.msg.message_size,
- change->data.msg.message);
+ if (streaming)
+ rb->stream_message(rb, txn, change->lsn, true,
+    change->data.msg.prefix,
+    change->data.msg.message_size,
+    change->data.msg.message);
+ else
+ rb->message(rb, txn, change->lsn, true,
+    change->data.msg.prefix,
+    change->data.msg.message_size,
+    change->data.msg.message);

Don't we need to set any_data_sent flag while streaming messages as we
do for other types of changes?

2.
+ if (streaming)
+ {
+ /*
+ * Set the last of the stream as the final lsn before calling
+ * stream stop.
+ */
+ if (!XLogRecPtrIsInvalid(prev_lsn))
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+ }

I am not sure if it is good to use final_lsn for this purpose.  See
comments for this variable in reorderbuffer.h.  Basically, it is used
for a specific purpose on different occasions.  Now, if we want to
start using it for a new purpose, we need to study its interaction
with all other places and update the comments as well.  Can we pass an
additional parameter to stream_stop() instead?

3.
+ /* remember the command ID and snapshot for the streaming run */
+ txn->command_id = command_id;
+
+ /* Avoid copying if it's already copied. */
+ if (snapshot_now->copied)
+ txn->snapshot_now = snapshot_now;
+ else
+ txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+   txn, command_id);

This code is used in two different places; can we try to keep it in
a single function?

4.
In ReorderBufferProcessTXN(), the patch is calling stream_stop in both
the try and the catch block.  If there is an error after calling it in
the try block, we might call it again via the catch block.  I think that
will lead to sending a stop message twice.  Won't that be a problem?
See the usage of iterstate in the catch block; we have made it safe from
a similar problem.
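
For reference, the iterstate handling follows roughly this pattern, and a
similar guard (sketched here, not taken from the patch) could prevent a
second stream_stop:

PG_TRY();
{
    /* ... apply or stream the changes ... */

    ReorderBufferIterTXNFinish(rb, iterstate);
    iterstate = NULL;       /* prevents a second finish in the catch block */

    /* ... stream_stop / commit callbacks ... */
}
PG_CATCH();
{
    /* Only clean up the iterator if it was not already finished above. */
    if (iterstate)
        ReorderBufferIterTXNFinish(rb, iterstate);

    /* A similar "already stopped" flag could ensure stream_stop is not
     * sent twice. */

    PG_RE_THROW();
}
PG_END_TRY();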

5.
+ if (streaming)
+ {
+ /* Discard the changes that we just streamed. */
+ ReorderBufferTruncateTXN(rb, txn);

- PG_RE_THROW();
+ /* Re-throw only if it's not an abort. */
+ if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+ {
+ MemoryContextSwitchTo(ecxt);
+ PG_RE_THROW();
+ }
+ else
+ {
+ FlushErrorState();
+ FreeErrorData(errdata);
+ errdata = NULL;
+

I think here we can write a few comments on why we are doing error-code
specific handling; basically, explain a bit about concurrent abort
handling and/or refer to the part of the comments where it is explained.

6.
PG_CATCH();
  {
+ MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+ ErrorData  *errdata = CopyErrorData();

I don't understand the usage of the memory context in this part of the
code.  Basically, you are switching to CurrentMemoryContext here, doing
some error handling, and then resetting back to some random context
before rethrowing the error.  If there is some purpose for it, then it
might be better if you can write a few comments to explain it.

7.
+ReorderBufferCommit()
{
..
+ /*
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_commit message.
+ *
+ * XXX Called after everything (origin ID and LSN, ...) is stored in the
+ * transaction, so we don't pass that directly.
+ *
+ * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+ */
+ if (rbtxn_is_streamed(txn))
+ {
+ ReorderBufferStreamCommit(rb, txn);
+ return;
+ }
+
..
}

"XXX Somewhat hackish redirection, perhaps needs to be refactored?"
What kind of refactoring we can do here?  To me, it looks okay.

8.
@@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
*rb, TransactionId xid,
  txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

  txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+ /*
+ * TOCHECK: Mark toplevel transaction as having catalog changes too
+ * if one of its children has.
+ */
+ if (txn->toptxn != NULL)
+ txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }

Why are we marking the top transaction here?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > > 6. I think it will be good if we can provide an example of streaming
> > > changes via test_decoding at
> > > https://www.postgresql.org/docs/devel/test-decoding.html. I think we
> > > can also explain there why the user is not expected to see the actual
> > > data in the stream.
> >
> > I have a few problems to solve here.
> > -  With streaming transaction also shall we show the actual values or
> > we shall do like it is currently in the patch
> > (appendStringInfo(ctx->out, "streaming change for TXN %u",
> > txn->xid);).  I think we should show the actual values instead of what
> > we are doing now.
> >
>
> I think why we don't want to display the tuple at this stage is
> because it is not clear by this time if the transaction will commit or
> abort.  I am not sure if displaying the contents of aborted
> transactions is a good idea but if there is a reason for doing so, we
> can do it later as well.

Ok.

>
> > - In the example we can not show a real example, because of the
> > in-progress transaction to show the changes, we might have to
> > implement a lot of tuple.  I think we can show the partial output?
> >
>
> I think we can display what API will actually display, what is the
> confusion here.

What I meant is that even with logical_decoding_work_mem=64kB, we need
quite a few changes in a transaction to stream it, so the example
output will be quite big.  So I suggested that we might not show a
real example; instead we would just show a few lines and cut the rest.
But I got your point; we can just show how it will look.

>
> I have a few more comments on the previous version of patch
> v20-0005-Implement-streaming-mode-in-ReorderBuffer.  If you have fixed
> any, then leave those and fix others.
>
> Review comments:
> ------------------------------
> 1.
> @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
> TransactionId xid,
>   }
>
>   case REORDER_BUFFER_CHANGE_MESSAGE:
> - rb->message(rb, txn, change->lsn, true,
> - change->data.msg.prefix,
> - change->data.msg.message_size,
> - change->data.msg.message);
> + if (streaming)
> + rb->stream_message(rb, txn, change->lsn, true,
> +    change->data.msg.prefix,
> +    change->data.msg.message_size,
> +    change->data.msg.message);
> + else
> + rb->message(rb, txn, change->lsn, true,
> +    change->data.msg.prefix,
> +    change->data.msg.message_size,
> +    change->data.msg.message);
>
> Don't we need to set any_data_sent flag while streaming messages as we
> do for other types of changes?

Actually, the pgoutput plugin doesn't send any data on stream_message.
But I agree that it is unclear how other plugins will handle this.  I
will analyze this part again; maybe we have to keep such a flag at the
plugin level, and whether a stop is sent or not can also be handled at
the plugin level.

> 2.
> + if (streaming)
> + {
> + /*
> + * Set the last of the stream as the final lsn before calling
> + * stream stop.
> + */
> + if (!XLogRecPtrIsInvalid(prev_lsn))
> + txn->final_lsn = prev_lsn;
> + rb->stream_stop(rb, txn);
> + }
>
> I am not sure if it is good to use final_lsn for this purpose.  See
> comments for this variable in reorderbuffer.h.  Basically, it is used
> for a specific purpose on different occasions.  Now, if we want to
> start using it for a new purpose, we need to study its interaction
> with all other places and update the comments as well.  Can we pass an
> additional parameter to stream_stop() instead?

I think it was in sync with the spill code, right?  I mean the last
change we spill is set as the final_lsn, and the same is done here.

Other comments look fine, so I will work on them and reply separately.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Fri, May 15, 2020 at 4:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> >
> > > - In the example we can not show a real example, because of the
> > > in-progress transaction to show the changes, we might have to
> > > implement a lot of tuple.  I think we can show the partial output?
> > >
> >
> > I think we can display what API will actually display, what is the
> > confusion here.
>
> What, I meant is that even with the logical_decoding_work_mem=64kb, we
> need to have quite a few changes in a transaction to stream it so the
> example output will be quite big in size.  So I told we might not show
> the real example instead we will just show a few lines and cut the
> remaining.  But, I got your point we can just show how it will look
> like.
>

Right.

> >
> > I have a few more comments on the previous version of patch
> > v20-0005-Implement-streaming-mode-in-ReorderBuffer.  If you have fixed
> > any, then leave those and fix others.
> >
> > Review comments:
> > ------------------------------
> > 1.
> > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
> > TransactionId xid,
> >   }
> >
> >   case REORDER_BUFFER_CHANGE_MESSAGE:
> > - rb->message(rb, txn, change->lsn, true,
> > - change->data.msg.prefix,
> > - change->data.msg.message_size,
> > - change->data.msg.message);
> > + if (streaming)
> > + rb->stream_message(rb, txn, change->lsn, true,
> > +    change->data.msg.prefix,
> > +    change->data.msg.message_size,
> > +    change->data.msg.message);
> > + else
> > + rb->message(rb, txn, change->lsn, true,
> > +    change->data.msg.prefix,
> > +    change->data.msg.message_size,
> > +    change->data.msg.message);
> >
> > Don't we need to set any_data_sent flag while streaming messages as we
> > do for other types of changes?
>
> Actually, pgoutput plugin don't send any data on stream_message.  But,
> I agree that how other plugin will handle.  I will analyze this part
> again, maybe we have to such flag at the plugin level and whether stop
> is sent to not can also be handled at the plugin level.
>

Okay, lets discuss this after your analysis.

> > 2.
> > + if (streaming)
> > + {
> > + /*
> > + * Set the last of the stream as the final lsn before calling
> > + * stream stop.
> > + */
> > + if (!XLogRecPtrIsInvalid(prev_lsn))
> > + txn->final_lsn = prev_lsn;
> > + rb->stream_stop(rb, txn);
> > + }
> >
> > I am not sure if it is good to use final_lsn for this purpose.  See
> > comments for this variable in reorderbuffer.h.  Basically, it is used
> > for a specific purpose on different occasions.  Now, if we want to
> > start using it for a new purpose, we need to study its interaction
> > with all other places and update the comments as well.  Can we pass an
> > additional parameter to stream_stop() instead?
>
> I think it was in sycn with the spill code right? I mean the last
> change we spill is set as the final_lsn and same is done here.
>

But we use final_lsn in ReorderBufferRestoreCleanup() for serialized
changes.  Now, in some cases, if we first do serialization, then
perform streaming, and then try to call ReorderBufferRestoreCleanup(),
it might not work as intended.  This might not happen today, but I
don't think we have any protection to avoid it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Fri, May 15, 2020 at 4:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 15, 2020 at 4:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > >
> > > > - In the example we can not show a real example, because of the
> > > > in-progress transaction to show the changes, we might have to
> > > > implement a lot of tuple.  I think we can show the partial output?
> > > >
> > >
> > > I think we can display what API will actually display, what is the
> > > confusion here.
> >
> > What, I meant is that even with the logical_decoding_work_mem=64kb, we
> > need to have quite a few changes in a transaction to stream it so the
> > example output will be quite big in size.  So I told we might not show
> > the real example instead we will just show a few lines and cut the
> > remaining.  But, I got your point we can just show how it will look
> > like.
> >
>
> Right.
>
> > >
> > > I have a few more comments on the previous version of patch
> > > v20-0005-Implement-streaming-mode-in-ReorderBuffer.  If you have fixed
> > > any, then leave those and fix others.
> > >
> > > Review comments:
> > > ------------------------------
> > > 1.
> > > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
> > > TransactionId xid,
> > >   }
> > >
> > >   case REORDER_BUFFER_CHANGE_MESSAGE:
> > > - rb->message(rb, txn, change->lsn, true,
> > > - change->data.msg.prefix,
> > > - change->data.msg.message_size,
> > > - change->data.msg.message);
> > > + if (streaming)
> > > + rb->stream_message(rb, txn, change->lsn, true,
> > > +    change->data.msg.prefix,
> > > +    change->data.msg.message_size,
> > > +    change->data.msg.message);
> > > + else
> > > + rb->message(rb, txn, change->lsn, true,
> > > +    change->data.msg.prefix,
> > > +    change->data.msg.message_size,
> > > +    change->data.msg.message);
> > >
> > > Don't we need to set any_data_sent flag while streaming messages as we
> > > do for other types of changes?
> >
> > Actually, pgoutput plugin don't send any data on stream_message.  But,
> > I agree that how other plugin will handle.  I will analyze this part
> > again, maybe we have to such flag at the plugin level and whether stop
> > is sent to not can also be handled at the plugin level.
> >
>
> Okay, lets discuss this after your analysis.
>
> > > 2.
> > > + if (streaming)
> > > + {
> > > + /*
> > > + * Set the last of the stream as the final lsn before calling
> > > + * stream stop.
> > > + */
> > > + if (!XLogRecPtrIsInvalid(prev_lsn))
> > > + txn->final_lsn = prev_lsn;
> > > + rb->stream_stop(rb, txn);
> > > + }
> > >
> > > I am not sure if it is good to use final_lsn for this purpose.  See
> > > comments for this variable in reorderbuffer.h.  Basically, it is used
> > > for a specific purpose on different occasions.  Now, if we want to
> > > start using it for a new purpose, we need to study its interaction
> > > with all other places and update the comments as well.  Can we pass an
> > > additional parameter to stream_stop() instead?
> >
> > I think it was in sycn with the spill code right? I mean the last
> > change we spill is set as the final_lsn and same is done here.
> >
>
> But we use final_lsn in ReorderBufferRestoreCleanup() for serialized
> changes.  Now, in some case if we first do serialization, then perform
> streaming and then tried to call ReorderBufferRestoreCleanup(),it
> might not work as intended.  Now, this might not happen today but I
> don't think we have any protection to avoid that.

If streaming is complete then we will remove the serialized flag, so it
will not cause any issue.  However, we can avoid setting final_lsn
here and instead pass a parameter to stream_stop with the last LSN of
the stream.
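
To illustrate, something like this (just a sketch; the extra lsn
parameter is the proposal here, not the callback's current signature):

if (streaming)
{
    /*
     * prev_lsn is the LSN of the last change sent in this stream; pass
     * it to the stop callback instead of overwriting txn->final_lsn.
     */
    rb->stream_stop(rb, txn, prev_lsn);
}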

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > > 6. I think it will be good if we can provide an example of streaming
> > > changes via test_decoding at
> > > https://www.postgresql.org/docs/devel/test-decoding.html. I think we
> > > can also explain there why the user is not expected to see the actual
> > > data in the stream.
> >
> > I have a few problems to solve here.
> > -  With streaming transaction also shall we show the actual values or
> > we shall do like it is currently in the patch
> > (appendStringInfo(ctx->out, "streaming change for TXN %u",
> > txn->xid);).  I think we should show the actual values instead of what
> > we are doing now.
> >
>
> I think why we don't want to display the tuple at this stage is
> because it is not clear by this time if the transaction will commit or
> abort.  I am not sure if displaying the contents of aborted
> transactions is a good idea but if there is a reason for doing so, we
> can do it later as well.
>
> > - In the example we can not show a real example, because of the
> > in-progress transaction to show the changes, we might have to
> > implement a lot of tuple.  I think we can show the partial output?
> >
>
> I think we can display what API will actually display, what is the
> confusion here.

Added example in the v22-0011 patch where I have added the API to get
streaming changes.

> I have a few more comments on the previous version of patch
> v20-0005-Implement-streaming-mode-in-ReorderBuffer.  If you have fixed
> any, then leave those and fix others.
>
> Review comments:
> ------------------------------
> 1.
> @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
> TransactionId xid,
>   }
>
>   case REORDER_BUFFER_CHANGE_MESSAGE:
> - rb->message(rb, txn, change->lsn, true,
> - change->data.msg.prefix,
> - change->data.msg.message_size,
> - change->data.msg.message);
> + if (streaming)
> + rb->stream_message(rb, txn, change->lsn, true,
> +    change->data.msg.prefix,
> +    change->data.msg.message_size,
> +    change->data.msg.message);
> + else
> + rb->message(rb, txn, change->lsn, true,
> +    change->data.msg.prefix,
> +    change->data.msg.message_size,
> +    change->data.msg.message);
>
> Don't we need to set any_data_sent flag while streaming messages as we
> do for other types of changes?

I think any_data_sent was added to avoid sending an abort to the
subscriber if we haven't sent any data, but this is not complete, as
the output plugin can also decide not to send anything.  So I think
this should not be done as part of this patch and can be done
separately.  I think there is already a thread for handling the
same [1].


> 2.
> + if (streaming)
> + {
> + /*
> + * Set the last of the stream as the final lsn before calling
> + * stream stop.
> + */
> + if (!XLogRecPtrIsInvalid(prev_lsn))
> + txn->final_lsn = prev_lsn;
> + rb->stream_stop(rb, txn);
> + }
>
> I am not sure if it is good to use final_lsn for this purpose.  See
> comments for this variable in reorderbuffer.h.  Basically, it is used
> for a specific purpose on different occasions.  Now, if we want to
> start using it for a new purpose, we need to study its interaction
> with all other places and update the comments as well.  Can we pass an
> additional parameter to stream_stop() instead?

Done

> 3.
> + /* remember the command ID and snapshot for the streaming run */
> + txn->command_id = command_id;
> +
> + /* Avoid copying if it's already copied. */
> + if (snapshot_now->copied)
> + txn->snapshot_now = snapshot_now;
> + else
> + txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
> +   txn, command_id);
>
> This code is used at two different places, can we try to keep this in
> a single function.

Done

> 4.
> In ReorderBufferProcessTXN(), the patch is calling stream_stop in both
> the try and catch block.  If there is an error after calling it in a
> try block, we might call it again via catch.  I think that will lead
> to sending a stop message twice.  Won't that be a problem?  See the
> usage of iterstate in the catch block, we have made it safe from a
> similar problem.

IMHO, we don't need that, because we only call stream_stop in the
catch block if the error type is ERRCODE_TRANSACTION_ROLLBACK.  So if
we have already stopped the stream in the TRY block, then we should
not get that error.  I have added comments to that effect.

> 5.
> + if (streaming)
> + {
> + /* Discard the changes that we just streamed. */
> + ReorderBufferTruncateTXN(rb, txn);
>
> - PG_RE_THROW();
> + /* Re-throw only if it's not an abort. */
> + if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
> + {
> + MemoryContextSwitchTo(ecxt);
> + PG_RE_THROW();
> + }
> + else
> + {
> + FlushErrorState();
> + FreeErrorData(errdata);
> + errdata = NULL;
> +
>
> I think here we can write few comments on why we are doing error-code
> specific handling, basically, explain a bit about concurrent abort
> handling and or refer to the part of comments where it is explained.

Done

> 6.
> PG_CATCH();
>   {
> + MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
> + ErrorData  *errdata = CopyErrorData();
>
> I don't understand the usage of memory context in this part of the
> code.  Basically, you are switching to CurrentMemoryContext here, do
> some error handling and then again reset back to some random context
> before rethrowing the error.  If there is some purpose for it, then it
> might be better if you can write a few comments to explain the same.

Basically, ccxt is the CurrentMemoryContext when we started the
streaming and ecxt is the context when we catch the error.  So,
before this change, it would rethrow in the context in which we
caught the error, i.e. ecxt.  What we are trying to do is switch back
to the normal context (ccxt) and copy the error data into that
context.  And, if we are not handling the error gracefully, we put it
back into the context it was in, and rethrow.
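
To make that concrete, the intended pattern is roughly the following
(a simplified sketch of the catch block, not the exact patch code):

PG_CATCH();
{
    /* Switch back to the context we were in before entering the TRY block. */
    MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
    ErrorData  *errdata = CopyErrorData(); /* copied into ccxt */

    if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
    {
        /* Concurrent abort: handle it gracefully and drop the error. */
        FlushErrorState();
        FreeErrorData(errdata);
    }
    else
    {
        /* Not handled here: go back to the error context and rethrow. */
        MemoryContextSwitchTo(ecxt);
        PG_RE_THROW();
    }
}
PG_END_TRY();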

>
> 7.
> +ReorderBufferCommit()
> {
> ..
> + /*
> + * If the transaction was (partially) streamed, we need to commit it in a
> + * 'streamed' way. That is, we first stream the remaining part of the
> + * transaction, and then invoke stream_commit message.
> + *
> + * XXX Called after everything (origin ID and LSN, ...) is stored in the
> + * transaction, so we don't pass that directly.
> + *
> + * XXX Somewhat hackish redirection, perhaps needs to be refactored?
> + */
> + if (rbtxn_is_streamed(txn))
> + {
> + ReorderBufferStreamCommit(rb, txn);
> + return;
> + }
> +
> ..
> }
>
> "XXX Somewhat hackish redirection, perhaps needs to be refactored?"
> What kind of refactoring we can do here?  To me, it looks okay.

I think it looks fine to me also.  So I have removed this comment.

> 8.
> @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> *rb, TransactionId xid,
>   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
>
>   txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> +
> + /*
> + * TOCHECK: Mark toplevel transaction as having catalog changes too
> + * if one of its children has.
> + */
> + if (txn->toptxn != NULL)
> + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
>  }
>
> Why are we marking top transaction here?

We need to mark the top transaction to decide whether to build the
tuplecid hash or not.  In non-streaming mode, we send changes only at
commit time, and at commit time we know whether the top transaction
has any catalog changes based on the invalidation messages, so we mark
the top transaction there in DecodeCommit.  Since here we are not
waiting until commit, we need to mark the top transaction as soon as
we mark any of its child transactions.

[1] https://www.postgresql.org/message-id/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Review comments:
> > ------------------------------
> > 1.
> > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
> > TransactionId xid,
> >   }
> >
> >   case REORDER_BUFFER_CHANGE_MESSAGE:
> > - rb->message(rb, txn, change->lsn, true,
> > - change->data.msg.prefix,
> > - change->data.msg.message_size,
> > - change->data.msg.message);
> > + if (streaming)
> > + rb->stream_message(rb, txn, change->lsn, true,
> > +    change->data.msg.prefix,
> > +    change->data.msg.message_size,
> > +    change->data.msg.message);
> > + else
> > + rb->message(rb, txn, change->lsn, true,
> > +    change->data.msg.prefix,
> > +    change->data.msg.message_size,
> > +    change->data.msg.message);
> >
> > Don't we need to set any_data_sent flag while streaming messages as we
> > do for other types of changes?
>
> I think any_data_sent, was added to avoid sending abort to the
> subscriber if we haven't sent any data,  but this is not complete as
> the output plugin can also take the decision not to send.  So I think
> this should not be done as part of this patch and can be done
> separately.  I think there is already a thread for handling the
> same[1]
>

Hmm, but prior to this patch, we never used to send (empty) aborts, and
now that will be possible.  It is probably okay to deal with that in
the other patch you mentioned, but I felt at least any_data_sent would
work for some cases.  OTOH, it appears to be a half-baked solution, so
we should probably refrain from adding it.  BTW, how does the pgoutput
plugin deal with it?  I see that apply_handle_stream_abort will
unconditionally try to unlink the file and it will probably fail.
Have you tested this scenario after your latest changes?

>
> > 4.
> > In ReorderBufferProcessTXN(), the patch is calling stream_stop in both
> > the try and catch block.  If there is an error after calling it in a
> > try block, we might call it again via catch.  I think that will lead
> > to sending a stop message twice.  Won't that be a problem?  See the
> > usage of iterstate in the catch block, we have made it safe from a
> > similar problem.
>
> IMHO, we don't need that, because we only call stream_stop in the
> catch block if the error type is ERRCODE_TRANSACTION_ROLLBACK.  So if
> in TRY block we have already stopped the stream then we should not get
> that error.  I have added the comments for the same.
>

I am still slightly nervous about it, as I don't see any solid
guarantee of that.  You are right as the code stands today, but code
added in the future might not keep it true.  I feel it is better to
have an Assert here to ensure that stream_stop won't be called a
second time.  I don't see any good way of doing it other than by
maintaining a flag or some state, but I think it will be good to
ensure this.
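
For example, something along these lines (just a sketch;
'stream_started' is an illustrative name, not a field in the current
patch):

/* Remember whether the current stream block is open. */
bool        stream_started = false;

if (streaming)
{
    rb->stream_start(rb, txn);
    stream_started = true;
}

/* ... decode and send the changes ... */

if (streaming && stream_started)
{
    rb->stream_stop(rb, txn);
    stream_started = false;
}

/* A second stop (e.g. from the catch block) would now be caught early. */
Assert(!streaming || !stream_started);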

>
> > 6.
> > PG_CATCH();
> >   {
> > + MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
> > + ErrorData  *errdata = CopyErrorData();
> >
> > I don't understand the usage of memory context in this part of the
> > code.  Basically, you are switching to CurrentMemoryContext here, do
> > some error handling and then again reset back to some random context
> > before rethrowing the error.  If there is some purpose for it, then it
> > might be better if you can write a few comments to explain the same.
>
> Basically, the ccxt is the CurrentMemoryContext when we started the
> streaming and ecxt it the context when we catch the error.  So
> ideally, before this change, it will rethrow in the context when we
> catch the error i.e. ecxt.  So what we are trying to do is put it back
> to normal context (ccxt) and copy the error data in the normal
> context.  And, if we are not handling it gracefully then put it back
> to the context it was in, and rethrow.
>

Okay, but when errorcode is *not* ERRCODE_TRANSACTION_ROLLBACK, don't
we need to clean up the reorderbuffer by calling
ReorderBufferCleanupTXN?  If so, then you can try to combine it with
the not-streaming else loop.

>
> > 8.
> > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> > *rb, TransactionId xid,
> >   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> >
> >   txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > +
> > + /*
> > + * TOCHECK: Mark toplevel transaction as having catalog changes too
> > + * if one of its children has.
> > + */
> > + if (txn->toptxn != NULL)
> > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> >  }
> >
> > Why are we marking top transaction here?
>
> We need to mark top transaction to decide whether to build tuplecid
> hash or not.  In non-streaming mode, we are only sending during the
> commit time, and during commit time we know whether the top
> transaction has any catalog changes or not based on the invalidation
> message so we are marking the top transaction there in DecodeCommit.
> Since here we are not waiting till commit so we need to mark the top
> transaction as soon as we mark any of its child transactions.
>

But how does it help?  We use this flag (via
ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn, which is
anyway done in DecodeCommit, and that too after setting this flag for
the top transaction if required.  So how will it help to set it while
processing the subxid?  Also, even if we have to do it, won't it
needlessly add the xid to the builder->committed.xip array?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, May 18, 2020 at 4:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >

Few comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple
1.
+ /*
+ * If this is a toast insert then set the corresponding bit.  Otherwise, if
+ * we have toast insert bit set and this is insert/update then clear the
+ * bit.
+ */
+ if (toast_insert)
+ toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+ else if (rbtxn_has_toast_insert(txn) &&
+ ChangeIsInsertOrUpdate(change->action))
+ {

Here, it might be better to add a comment on why we expect only
Insert/Update.  Also, it might be better to add an assert for
other operations.

2.
@@ -1865,8 +1920,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
  * disk.
  */
  dlist_delete(&change->node);
- ReorderBufferToastAppendChunk(rb, txn, relation,
-   change);
+ ReorderBufferToastAppendChunk(rb, txn, relation,
+   change);
  }

This seems to be a spurious change.

3.
+ /*
+ * If streaming is enable and we have serialized this transaction because
+ * it had incomplete tuple.  So if now we have got the complete tuple we
+ * can stream it.
+ */
+ if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
+ && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
+ {

This comment is just saying what you are doing in the if-check.  I
think you need to explain the rationale behind it. I don't like the
variable name 'can_stream' because it matches ReorderBufferCanStream
whereas it is for a different purpose, how about naming it as
'change_complete' or something like that.  The check has many
conditions, can we move it to a separate function to make the code
here look clean?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>
> 3.
> + /*
> + * If streaming is enable and we have serialized this transaction because
> + * it had incomplete tuple.  So if now we have got the complete tuple we
> + * can stream it.
> + */
> + if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
> + && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
> + {
>
> This comment is just saying what you are doing in the if-check.  I
> think you need to explain the rationale behind it. I don't like the
> variable name 'can_stream' because it matches ReorderBufferCanStream
> whereas it is for a different purpose, how about naming it as
> 'change_complete' or something like that.  The check has many
> conditions, can we move it to a separate function to make the code
> here look clean?
>

Do we really need this?  Immediately after this check, we are calling
ReorderBufferCheckMemoryLimit which will anyway stream the changes if
required.  Can we move the changes related to the detection of
incomplete data to a separate function?

Another comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple:

+ else if (rbtxn_has_toast_insert(txn) &&
+ ChangeIsInsertOrUpdate(change->action))
+ {
+ toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+ can_stream = true;
+ }
..
+#define ChangeIsInsertOrUpdate(action) \
+ (((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+ ((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+ ((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT))

How can we clear the RBTXN_HAS_TOAST_INSERT flag on
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT action?

IIUC, the basic idea used to handle incomplete changes (which is
possible in case of toast tuples and speculative inserts) is to mark
such TXNs as containing incomplete changes and then while finding the
largest top-level TXN for streaming, we ignore such TXN's and move to
next largest TXN.  If none of the TXNs have complete changes then we
choose the largest (sub)transaction and spill the same to make the
in-memory changes below logical_decoding_work_mem threshold.  This
idea can work but the strategy to choose the transaction is suboptimal
for cases where TXNs have some changes which are complete followed by
an incomplete toast or speculative tuple.  I was having an offlist
discussion with Robert on this problem and he suggested that it would
be better if we track the complete part of changes separately and then
we can avoid the drawback mentioned above.  I have thought about this
and I think it can work if we track the size and LSN of completed
changes.  I think we need to ensure that if there is a concurrent
abort then we discard all changes for the current (sub)transaction,
not only up to the completed-changes LSN, whereas if the streaming is
successful then we can truncate the changes only up to the
completed-changes LSN.  What do you think?
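
For example, the bookkeeping could look roughly like this (a sketch
only; the field names are illustrative, not from the patch):

/* In ReorderBufferTXN (hypothetical fields, for illustration only): */
Size        complete_size;  /* accumulated size of fully complete changes */
XLogRecPtr  complete_lsn;   /* LSN up to which the changes are complete */

/*
 * On streaming, truncate only the changes up to complete_lsn; on a
 * concurrent abort, discard all changes of the (sub)transaction, not
 * just the prefix up to complete_lsn.
 */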

I wonder why you have done this as 0010 in the patch series; it should
be 0006, after
0005-Implement-streaming-mode-in-ReorderBuffer.patch.  If we can do it
that way, it would be easier for me to review.  Is there a reason
for not doing so?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, May 19, 2020 at 2:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > 3.
> > + /*
> > + * If streaming is enable and we have serialized this transaction because
> > + * it had incomplete tuple.  So if now we have got the complete tuple we
> > + * can stream it.
> > + */
> > + if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
> > + && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
> > + {
> >
> > This comment is just saying what you are doing in the if-check.  I
> > think you need to explain the rationale behind it. I don't like the
> > variable name 'can_stream' because it matches ReorderBufferCanStream
> > whereas it is for a different purpose, how about naming it as
> > 'change_complete' or something like that.  The check has many
> > conditions, can we move it to a separate function to make the code
> > here look clean?
> >
>
> Do we really need this?  Immediately after this check, we are calling
> ReorderBufferCheckMemoryLimit which will anyway stream the changes if
> required.

Actually, ReorderBufferCheckMemoryLimit is only meant for checking
whether we need to stream the changes due to the memory limit.  But
suppose that when the memory limit was exceeded we could not stream
the transaction because there was only an incomplete toast insert, so
we serialized.  Now, when we get the tuple which makes the changes
complete, we are no longer crossing the memory limit, as the changes
were already serialized.  So I am not sure whether it is a good idea
to stream the transaction as soon as we get the complete changes, or
whether we should wait until the next time the memory limit is
exceeded and select the suitable candidate at that time.  Ideally, if
we are in streaming mode and the transaction is serialized, it means
it was already a candidate for streaming but could not be streamed
due to the incomplete changes, so shouldn't we stream it immediately
as soon as its changes are complete, even though we are now within
the memory limit?  Because our target is to stream, not spill, we
should try to stream the spilled changes at the first opportunity.

  Can we move the changes related to the detection of
> incomplete data to a separate function?

Ok.


>
> Another comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple:
>
> + else if (rbtxn_has_toast_insert(txn) &&
> + ChangeIsInsertOrUpdate(change->action))
> + {
> + toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
> + can_stream = true;
> + }
> ..
> +#define ChangeIsInsertOrUpdate(action) \
> + (((action) == REORDER_BUFFER_CHANGE_INSERT) || \
> + ((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
> + ((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT))
>
> How can we clear the RBTXN_HAS_TOAST_INSERT flag on
> REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT action?

A partial toast insert means we have inserted into the toast table but
not into the main table.  So even if it is a spec insert, we can form
the complete tuple; however, we still cannot stream it because we
haven't got the spec_confirm, but for that we are marking another
flag.  So if the insert is a spec insert, the toast insert will also
be a spec insert, and as part of those toast spec inserts we mark the
tuple as partial, so clearing that flag should happen when the spec
insert is done for the main table, right?


> IIUC, the basic idea used to handle incomplete changes (which is
> possible in case of toast tuples and speculative inserts) is to mark
> such TXNs as containing incomplete changes and then while finding the
> largest top-level TXN for streaming, we ignore such TXN's and move to
> next largest TXN.  If none of the TXNs have complete changes then we
> choose the largest (sub)transaction and spill the same to make the
> in-memory changes below logical_decoding_work_mem threshold.  This
> idea can work but the strategy to choose the transaction is suboptimal
> for cases where TXNs have some changes which are complete followed by
> an incomplete toast or speculative tuple.  I was having an offlist
> discussion with Robert on this problem and he suggested that it would
> be better if we track the complete part of changes separately and then
> we can avoid the drawback mentioned above.  I have thought about this
> and I think it can work if we track the size and LSN of completed
> changes.  I think we need to ensure that if there is concurrent abort
> then we discard all changes for current (sub)transaction not only up
> to completed changes LSN whereas if the streaming is successful then
> we can truncate the changes only up to completed changes LSN. What do
> you think?
>
> I wonder why you have done this as 0010 in the patch series, it should
> be as 0006 after the
> 0005-Implement-streaming-mode-in-ReorderBuffer.patch.  If we can do
> that way then it would be easier for me to review.  Is there a reason
> for not doing so?

No reason, I can do that.  Actually, later we can merge the changes
into 0005; I kept it separate for review.  Anyway, in the next version
I will make it 0006.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, May 19, 2020 at 3:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, May 19, 2020 at 2:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > 3.
> > > + /*
> > > + * If streaming is enable and we have serialized this transaction because
> > > + * it had incomplete tuple.  So if now we have got the complete tuple we
> > > + * can stream it.
> > > + */
> > > + if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
> > > + && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
> > > + {
> > >
> > > This comment is just saying what you are doing in the if-check.  I
> > > think you need to explain the rationale behind it. I don't like the
> > > variable name 'can_stream' because it matches ReorderBufferCanStream
> > > whereas it is for a different purpose, how about naming it as
> > > 'change_complete' or something like that.  The check has many
> > > conditions, can we move it to a separate function to make the code
> > > here look clean?
> > >
> >
> > Do we really need this?  Immediately after this check, we are calling
> > ReorderBufferCheckMemoryLimit which will anyway stream the changes if
> > required.
>
> Actually, ReorderBufferCheckMemoryLimit is only meant for checking
> whether we need to stream the changes due to the memory limit.  But
> suppose when memory limit exceeds that time we could not stream the
> transaction because there was only incomplete toast insert so we
> serialized.  Now,  when we get the tuple which makes the changes
> complete but now it is not crossing the memory limit as changes were
> already serialized.  So I am not sure whether it is a good idea to
> stream the transaction as soon as we get the complete changes or we
> shall wait till next time memory limit exceed and that time we select
> the suitable candidate.
>

I think it is better to wait till next time we exceed the memory threshold.

>  Ideally, we were are in streaming more and
> the transaction is serialized means it was already a candidate for
> streaming but could not stream due to the incomplete changes so
> shouldn't we stream it immediately as soon as its changes are complete
> even though now we are in memory limit.
>

The only time we need to stream or spill is when we exceed memory
threshold.  In the above case, it is possible that next time there is
some other candidate transaction that we can stream.

> >
> > Another comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple:
> >
> > + else if (rbtxn_has_toast_insert(txn) &&
> > + ChangeIsInsertOrUpdate(change->action))
> > + {
> > + toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
> > + can_stream = true;
> > + }
> > ..
> > +#define ChangeIsInsertOrUpdate(action) \
> > + (((action) == REORDER_BUFFER_CHANGE_INSERT) || \
> > + ((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
> > + ((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT))
> >
> > How can we clear the RBTXN_HAS_TOAST_INSERT flag on
> > REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT action?
>
> Partial toast insert means we have inserted in the toast but not in
> the main table.  So even if it is spec insert we can form the complete
> tuple, however, we can still not stream it because we haven't got
> spec_confirm but for that, we are marking another flag.  So if the
> insert is aspect insert the toast insert will also be spec insert and
> as part of that toast, spec inserts we are marking partial tuple so
> cleaning that flag should happen when the spec insert is done for the
> main table right?
>

Sounds reasonable.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > 4.
> > +static void
> > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> > +{
> > + LogicalDecodingContext *ctx = cache->private_data;
> > + LogicalErrorCallbackState state;
> > + ErrorContextCallback errcallback;
> > +
> > + Assert(!ctx->fast_forward);
> > +
> > + /* We're only supposed to call this when streaming is supported. */
> > + Assert(ctx->streaming);
> > +
> > + /* Push callback + info on the error context stack */
> > + state.ctx = ctx;
> > + state.callback_name = "stream_start";
> > + /* state.report_location = apply_lsn; */
> >
> > Why can't we supply the report_location here?  I think here we need to
> > report txn->first_lsn if this is the very first stream and
> > txn->final_lsn if it is any consecutive one.
>
> Done
>

Now, after your change in stream_start_cb_wrapper, we assign
report_location from the first_lsn passed as input to the function, but
write_location is still txn->first_lsn.  Shouldn't we assign the
passed-in first_lsn to write_location?  It seems assigning
txn->first_lsn won't be correct for streams other than the first one.
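
I mean something like this in the wrapper (a sketch based on the other
callback wrappers in logical.c; the write_location line is the point):

static void
stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
                        XLogRecPtr first_lsn)
{
    LogicalDecodingContext *ctx = cache->private_data;
    LogicalErrorCallbackState state;
    ErrorContextCallback errcallback;

    /* Push callback + info on the error context stack */
    state.ctx = ctx;
    state.callback_name = "stream_start";
    state.report_location = first_lsn;
    errcallback.callback = output_plugin_error_callback;
    errcallback.arg = (void *) &state;
    errcallback.previous = error_context_stack;
    error_context_stack = &errcallback;

    /* set output state */
    ctx->accept_writes = true;
    ctx->write_xid = txn->xid;
    ctx->write_location = first_lsn;    /* not txn->first_lsn */

    /* do the actual work: call the callback */
    ctx->callbacks.stream_start_cb(ctx, txn);

    /* Pop the error context stack */
    error_context_stack = errcallback.previous;
}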

> > 5.
> > +static void
> > +stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> > +{
> > + LogicalDecodingContext *ctx = cache->private_data;
> > + LogicalErrorCallbackState state;
> > + ErrorContextCallback errcallback;
> > +
> > + Assert(!ctx->fast_forward);
> > +
> > + /* We're only supposed to call this when streaming is supported. */
> > + Assert(ctx->streaming);
> > +
> > + /* Push callback + info on the error context stack */
> > + state.ctx = ctx;
> > + state.callback_name = "stream_stop";
> > + /* state.report_location = apply_lsn; */
> >
> > Can't we report txn->final_lsn here
>
> We are already setting this to the  txn->final_ls in 0006 patch, but I
> have moved it into this patch now.
>

Similar to the previous point, here also I think we need to assign the
report and write locations from the last_lsn passed to this API.

>
>
> > v20-0005-Implement-streaming-mode-in-ReorderBuffer
> > -----------------------------------------------------------------------------
> > 10.
> > Theoretically, we could get rid of the k-way merge, and append the
> > changes to the toplevel xact directly (and remember the position
> > in the list in case the subxact gets aborted later).
> >
> > I don't think this part of the commit message is correct as we
> > sometimes need to spill even during streaming.  Please check the
> > entire commit message and update according to the latest
> > implementation.
>
> Done
>

You seem to have forgotten to remove the other part of the message
("This adds a second iterator for the streaming case...."), which is
not relevant now.

> > 11.
> > - * HeapTupleSatisfiesHistoricMVCC.
> > + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> > + *
> > + * We do build the hash table even if there are no CIDs. That's
> > + * because when streaming in-progress transactions we may run into
> > + * tuples with the CID before actually decoding them. Think e.g. about
> > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> > + * yet when applying the INSERT. So we build a hash table so that
> > + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> > + *
> > + * XXX We might limit this behavior to streaming mode, and just bail
> > + * out when decoding transaction at commit time (at which point it's
> > + * guaranteed to see all CIDs).
> >   */
> >  static void
> >  ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> > *rb, ReorderBufferTXN *txn)
> >   dlist_iter iter;
> >   HASHCTL hash_ctl;
> >
> > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> > - return;
> > -
> >
> > I don't understand this change.  Why would "INSERT followed by
> > TRUNCATE" could lead to a tuple which can come for decode before its
> > CID?  The patch has made changes based on this assumption in
> > HeapTupleSatisfiesHistoricMVCC which appears to be very risky as the
> > behavior could be dependent on whether we are streaming the changes
> > for in-progress xact or at the commit of a transaction.  We might want
> > to generate a test to once validate this behavior.
> >
> > Also, the comment refers to tqual.c which is wrong as this API is now
> > in heapam_visibility.c.
>
> Done.
>

+ * INSERT.  So in such cases we assume the CIDs is from the future command
+ * and return as unresolve.
+ */
+ if (tuplecid_data == NULL)
+ return false;
+

Here let's reword the last line of the comment as: "So in such cases
we assume the CID is from the future command."

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > 3.
> > And, during catalog scan we can check the status of the xid and
> > + * if it is aborted we will report a specific error that we can ignore.  We
> > + * might have already streamed some of the changes for the aborted
> > + * (sub)transaction, but that is fine because when we decode the abort we will
> > + * stream abort message to truncate the changes in the subscriber.
> > + */
> > +static inline void
> > +SetupCheckXidLive(TransactionId xid)
> >
> > In the above comment, I don't think it is right to say that we ignore
> > the error raised due to the aborted transaction.  We need to say that
> > we discard the already streamed changes on such an error.
>
> Done.
>

In the same comment, there is a typo (/messageto/message to).

> > 4.
> > +static inline void
> > +SetupCheckXidLive(TransactionId xid)
> > +{
> >   /*
> > - * If this transaction has no snapshot, it didn't make any changes to the
> > - * database, so there's nothing to decode.  Note that
> > - * ReorderBufferCommitChild will have transferred any snapshots from
> > - * subtransactions if there were any.
> > + * setup CheckXidAlive if it's not committed yet. We don't check if the xid
> > + * aborted. That will happen during catalog access.  Also reset the
> > + * sysbegin_called flag.
> >   */
> > - if (txn->base_snapshot == NULL)
> > + if (!TransactionIdDidCommit(xid))
> >   {
> > - Assert(txn->ninvalidations == 0);
> > - ReorderBufferCleanupTXN(rb, txn);
> > - return;
> > + CheckXidAlive = xid;
> > + bsysscan = false;
> >   }
> >
> > I think this function is inline as it needs to be called for each
> > change. If that is the case and otherwise also, isn't it better that
> > we check if passed xid is the same as CheckXidAlive before checking
> > TransactionIdDidCommit as TransactionIdDidCommit can be costly and
> > calling it for each change might not be a good idea?
>
> Done,  Also I think it is good the check the TransactionIdIsInProgress
> instead of !TransactionIdDidCommit.  I have changed that as well.
>

What if it is aborted just before this check?  I think the decode API
won't be able to detect that and sys* API won't care to check because
CheckXidAlive won't be set for that case.

> > 5.
> > setup CheckXidAlive if it's not committed yet. We don't check if the xid
> > + * aborted. That will happen during catalog access.  Also reset the
> > + * sysbegin_called flag.
> >
> > /if the xid aborted/if the xid is aborted.  missing comma after Also.
>
> Done
>

You forgot to change as per the second part of the comment (missing
comma after Also).


>
> > 8.
> > @@ -1588,8 +1766,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> >   * use as a normal record. It'll be cleaned up at the end
> >   * of INSERT processing.
> >   */
> > - if (specinsert == NULL)
> > - elog(ERROR, "invalid ordering of speculative insertion changes");
> >
> > You have removed this check but all other handling of specinsert is
> > same as far as this patch is concerned.  Why so?
>
> Seems like a merge issue, or the leftover from the old design of the
> toast handling where we were streaming with the partial tuple.
> fixed now.
>
> > 9.
> > @@ -1676,8 +1860,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> >   * freed/reused while restoring spooled data from
> >   * disk.
> >   */
> > - Assert(change->data.tp.newtuple != NULL);
> > -
> >   dlist_delete(&change->node);
> >
> > Why is this Assert removed?
>
> Same cause as above so fixed.
>
> > 10.
> > @@ -1753,7 +1935,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> >   relations[nrelations++] = relation;
> >   }
> >
> > - rb->apply_truncate(rb, txn, nrelations, relations, change);
> > + if (streaming)
> > + {
> > + rb->stream_truncate(rb, txn, nrelations, relations, change);
> > +
> > + /* Remember that we have sent some data. */
> > + change->txn->any_data_sent = true;
> > + }
> > + else
> > + rb->apply_truncate(rb, txn, nrelations, relations, change);
> >
> > Can we encapsulate this in a separate function like
> > ReorderBufferApplyTruncate or something like that?  Basically, rather
> > than having streaming check in this function, lets do it in some other
> > internal function.  And we can likewise do it for all the streaming
> > checks in this function or at least whereever it is feasible.  That
> > will make this function look clean.
>
> Done for truncate and change.  I think we can create a few more such
> functions for
> start/stop and cleanup handling on error.  I will work on that.
>

Yeah, I think that would be better.

One minor comment change suggestion:
/*
+ * start stream or begin the transaction.  If this is the first
+ * change in the current stream.
+ */

We can write the above comment as "Start the stream or begin the
transaction for the first change in the current stream."

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, May 19, 2020 at 6:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >

I have further reviewed v22 and below are my comments:

v22-0005-Implement-streaming-mode-in-ReorderBuffer
--------------------------------------------------------------------------
1.
+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)

The above 'Note' is not correct as per the latest implementation.

v22-0006-Add-support-for-streaming-to-built-in-replicatio
----------------------------------------------------------------------------
2.
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,7 +14,6 @@
  *
  *-------------------------------------------------------------------------
  */
-
 #include "postgres.h"

Spurious line removal.

3.
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+    XLogRecPtr commit_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'c'); /* action STREAM COMMIT */
+
+ Assert(TransactionIdIsValid(txn->xid));
+
+ /* transaction ID (we're starting to stream, so must be valid) */
+ pq_sendint32(out, txn->xid);

The part of the comment "we're starting to stream, so must be valid"
is not correct, as we are not at the start of the stream here.  The
patch has used the same incorrect sentence in a few places; kindly fix
those as well.

4.
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
{
..

For this and other places in the patch, like the function
stream_open_file(), instead of using TopMemoryContext, can we consider
using a new memory context, LogicalStreamingContext or something like
that?  We can create LogicalStreamingContext under TopMemoryContext.  I
don't see any need to use TopMemoryContext here.
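
Something like this, for example (a sketch; the context name follows
the suggestion above):

static MemoryContext LogicalStreamingContext = NULL;
MemoryContext oldctx;

/* Created lazily, once per worker, under TopMemoryContext. */
if (LogicalStreamingContext == NULL)
    LogicalStreamingContext = AllocSetContextCreate(TopMemoryContext,
                                                    "LogicalStreamingContext",
                                                    ALLOCSET_DEFAULT_SIZES);

/* Do the subxact/stream bookkeeping allocations in that context. */
oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
/* palloc() the subxacts array, file name buffers, etc. here */
MemoryContextSwitchTo(oldctx);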

5.
+static void
+subxact_info_add(TransactionId xid)

This function assumes valid values for global variables like
stream_fd and stream_xid.  I think it is better to have Asserts for
those in this function before using them.  The Asserts are present in
handle_streamed_transaction, but I feel they should be in
subxact_info_add.
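
I.e. something like (a sketch, assuming stream_xid/stream_fd are the
globals the patch already uses):

static void
subxact_info_add(TransactionId xid)
{
    /* We must be inside an open stream for a known toplevel transaction. */
    Assert(TransactionIdIsValid(stream_xid));
    Assert(stream_fd != NULL);

    /* ... existing body ... */
}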

6.
+subxact_info_add(TransactionId xid)
/*
+ * In most cases we're checking the same subxact as we've already seen in
+ * the last call, so make ure just ignore it (this change comes later).
+ */
+ if (subxact_last == xid)
+ return;

Typo and minor correction, /ure just/sure to

7.
+subxact_info_write(Oid subid, TransactionId xid)
{
..
+ /*
+ * But we free the memory allocated for subxact info. There might be one
+ * exceptional transaction with many subxacts, and we don't want to keep
+ * the memory allocated forewer.
+ *
+ */

a. Typo, /forewer/forever
b. The extra line at the end of the comment is not required.

8.
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)

Do we really need to have a checksum for temporary files?  I have
checked a few other similar cases, like the SharedFileSet stuff for
parallel hash join, but didn't find them using checksums.  Can you
also check the other usages of temporary files and then let us decide
if we see any reason to have checksums for this?

Another point is that we don't seem to be doing this for the 'changes'
file, see stream_write_change.  So I am not sure there is any sense in
writing a checksum for the subxact file.

Tomas, do you see any reason for it?

9.
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+ char tempdirpath[MAXPGPATH];
+
+ TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+ /*
+ * We might need to create the tablespace's tempfile directory, if no
+ * one has yet done so.
+ */
+ if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create directory \"%s\": %m",
+ tempdirpath)));
+
+ snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+ tempdirpath, subid, xid);
+}

Temporary files created in PGDATA/base/pgsql_tmp follow a certain
naming convention (see docs [1]) which is not followed here.  You can
also refer to SharedFileSetPath and OpenTemporaryFile.  I think we can
just try to follow that convention and then additionally append subid,
xid and .subxacts.  Also, a similar change is required for
changes_filename.  I would like to know if there is a reason why we
want to use a different naming convention here?
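
For instance, roughly along these lines (only a sketch; the exact name
layout is illustrative, the point is the pgsql_tmp-style prefix used by
the other temp-file code):

static void
subxact_filename(char *path, Oid subid, TransactionId xid)
{
    char        tempdirpath[MAXPGPATH];

    TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);

    /* Follow the temporary-file convention: prefix with "pgsql_tmp". */
    snprintf(path, MAXPGPATH, "%s/%s-logical-%u-%u.subxacts",
             tempdirpath, PG_TEMP_FILE_PREFIX, subid, xid);
}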

10.
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_close_file(void)

The comment seems to be wrong.  I think this can be only called at
stream end, so it should be "This can only be called at the end of a
"streaming" block, i.e. at stream_stop message from the upstream."

11.
+ * the order the transactions are sent in. So streamed trasactions are
+ * handled separately by using schema_sent flag in ReorderBufferTXN.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
  Oid relid; /* relation oid */
-
+ TransactionId xid; /* transaction that created the record */
  /*
  * Did we send the schema?  If ancestor relid is set, its schema must also
  * have been sent for this to be true.
  */
  bool schema_sent;
+ List    *streamed_txns; /* streamed toplevel transactions with this
+ * schema */

The part of the comment "So streamed trasactions are handled separately
by using schema_sent flag in ReorderBufferTXN." doesn't seem to match
what we are doing in the latest version of the patch.

12.
maybe_send_schema()
{
..
+ if (in_streaming)
+ {
+ /*
+ * TOCHECK: We have to send schema after each catalog change and it may
+ * occur when streaming already started, so we have to track new catalog
+ * changes somehow.
+ */
+ schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
..
..
}

I think it is good to verify/test once what this comment says, but as
per the code we should be sending the schema after each catalog change,
as we invalidate the streamed_txns list in rel_sync_cache_relation_cb,
which must be called during relcache invalidation.  Do we see any
problem with that mechanism?

13.
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * it's subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+    ReorderBufferTXN *txn,
+    XLogRecPtr commit_lsn)

This comment is copied from pgoutput_stream_abort, so doesn't match
what this function is doing.


[1] - https://www.postgresql.org/docs/devel/storage-file-layout.html

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> v22-0006-Add-support-for-streaming-to-built-in-replicatio
> ----------------------------------------------------------------------------
>
Few more comments on v22-0006 patch:

1.
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+ int i;
+ char path[MAXPGPATH];
+ bool found = false;
+
+ subxact_filename(path, subid, xid);
+
+ if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));

Here, we have unlinked the files containing information of subxacts
but don't we need to free the corresponding memory (memory for
subxacts) as well?

2.
apply_handle_stream_abort()
{
..
+ subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+ if (unlink(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+
+ return;
..
}

Like the previous comment, it seems here also we need to free the
subxacts memory; additionally, we forgot to adjust the xids array.

3.
apply_handle_stream_abort()
{
..
+ /* XXX optimize the search by bsearch on sorted data */
+ for (i = nsubxacts; i > 0; i--)
+ {
+ if (subxacts[i - 1].xid == subxid)
+ {
+ subidx = (i - 1);
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ return;
..
}

Is it possible that we don't find the xid in the subxacts array?  If so,
I think we should mention the same in the comments; otherwise, we should
have an assert for found.

4.
apply_handle_stream_abort()
{
..
+ changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+ if (truncate(path, subxacts[subidx].offset))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not truncate file \"%s\": %m", path)));
..
}

Will truncate work on Windows?  I see in the code we use ftruncate, which
is defined as chsize in win32.h and win32_port.h.  I have not tested
this so I am not very sure about it.  I got the below warning when I
tried to compile this code on Windows.  I think it is better to use
ftruncate as it is used at other places in the code as well.

worker.c(798): warning C4013: 'truncate' undefined; assuming extern
returning int

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 18, 2020 at 4:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
>
> Few comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple
> 1.
> + /*
> + * If this is a toast insert then set the corresponding bit.  Otherwise, if
> + * we have toast insert bit set and this is insert/update then clear the
> + * bit.
> + */
> + if (toast_insert)
> + toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
> + else if (rbtxn_has_toast_insert(txn) &&
> + ChangeIsInsertOrUpdate(change->action))
> + {
>
> Here, it might better to add a comment on why we expect only
> Insert/Update?  Also, it might be better that we add an assert for
> other operations.

I have added comments on why we clear the flag on Insert/Update.
But I don't think we only expect insert/update; we might get a
toast delete as well, because a toast update will do a toast delete +
toast insert.  So when we get a toast delete we just don't want to do
anything.

>
> 2.
> @@ -1865,8 +1920,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
> ReorderBufferTXN *txn,
>   * disk.
>   */
>   dlist_delete(&change->node);
> - ReorderBufferToastAppendChunk(rb, txn, relation,
> -   change);
> + ReorderBufferToastAppendChunk(rb, txn, relation,
> +   change);
>   }
>
> This seems to be a spurious change.

Done

> 3.
> + /*
> + * If streaming is enable and we have serialized this transaction because
> + * it had incomplete tuple.  So if now we have got the complete tuple we
> + * can stream it.
> + */
> + if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
> + && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
> + {
>
> This comment is just saying what you are doing in the if-check.  I
> think you need to explain the rationale behind it. I don't like the
> variable name 'can_stream' because it matches ReorderBufferCanStream
> whereas it is for a different purpose, how about naming it as
> 'change_complete' or something like that.  The check has many
> conditions, can we move it to a separate function to make the code
> here look clean?

As per the other comments we have removed this part in the latest patch set.

Apart from these comment fixes, there are 2 more changes:
1.  The handling of the toast tuple has changed as per the offlist
discussion with you.
Basically, now, instead of not streaming the txn with the incomplete
tuple, we are streaming it up to the last complete lsn.  So if the txn
has incomplete changes but its total size is the largest, then we will still
stream it.  And, after streaming, we will truncate the transaction up
to the last complete lsn.

2. There is a bug fix in handling the stream abort in 0008 (earlier it
was 0006).

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, May 18, 2020 at 4:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Review comments:
> > > ------------------------------
> > > 1.
> > > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
> > > TransactionId xid,
> > >   }
> > >
> > >   case REORDER_BUFFER_CHANGE_MESSAGE:
> > > - rb->message(rb, txn, change->lsn, true,
> > > - change->data.msg.prefix,
> > > - change->data.msg.message_size,
> > > - change->data.msg.message);
> > > + if (streaming)
> > > + rb->stream_message(rb, txn, change->lsn, true,
> > > +    change->data.msg.prefix,
> > > +    change->data.msg.message_size,
> > > +    change->data.msg.message);
> > > + else
> > > + rb->message(rb, txn, change->lsn, true,
> > > +    change->data.msg.prefix,
> > > +    change->data.msg.message_size,
> > > +    change->data.msg.message);
> > >
> > > Don't we need to set any_data_sent flag while streaming messages as we
> > > do for other types of changes?
> >
> > I think any_data_sent, was added to avoid sending abort to the
> > subscriber if we haven't sent any data,  but this is not complete as
> > the output plugin can also take the decision not to send.  So I think
> > this should not be done as part of this patch and can be done
> > separately.  I think there is already a thread for handling the
> > same[1]
> >
>
> Hmm, but prior to this patch, we never use to send (empty) aborts but
> now that will be possible. It is probably okay to deal that with
> another patch mentioned by you but I felt at least any_data_sent will
> work for some cases.  OTOH, it appears to be half-baked solution, so
> we should probably refrain from adding it.  BTW, how do the pgoutput
> plugin deal with it? I see that apply_handle_stream_abort will
> unconditionally try to unlink the file and it will probably fail.
> Have you tested this scenario after your latest changes?

Yeah, I see, I think this is a problem, but it exists without my
latest change as well: if pgoutput ignores some changes because they are
not published, then we will see a similar error.  Shall we handle the
ENOENT error case from unlink?  I think the best idea is that we should
track the empty transaction.

> > > 4.
> > > In ReorderBufferProcessTXN(), the patch is calling stream_stop in both
> > > the try and catch block.  If there is an error after calling it in a
> > > try block, we might call it again via catch.  I think that will lead
> > > to sending a stop message twice.  Won't that be a problem?  See the
> > > usage of iterstate in the catch block, we have made it safe from a
> > > similar problem.
> >
> > IMHO, we don't need that, because we only call stream_stop in the
> > catch block if the error type is ERRCODE_TRANSACTION_ROLLBACK.  So if
> > in TRY block we have already stopped the stream then we should not get
> > that error.  I have added the comments for the same.
> >
>
> I am still slightly nervous about it as I don't see any solid
> guarantee for the same.  You are right as the code stands today but
> due to any code that gets added in the future, it might not remain
> true. I feel it is better to have an Assert here to ensure that
> stream_stop won't be called the second time.  I don't see any good way
> of doing it other than by maintaining flag or some state but I think
> it will be good to ensure this.

Done

> > > 6.
> > > PG_CATCH();
> > >   {
> > > + MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
> > > + ErrorData  *errdata = CopyErrorData();
> > >
> > > I don't understand the usage of memory context in this part of the
> > > code.  Basically, you are switching to CurrentMemoryContext here, do
> > > some error handling and then again reset back to some random context
> > > before rethrowing the error.  If there is some purpose for it, then it
> > > might be better if you can write a few comments to explain the same.
> >
> > Basically, the ccxt is the CurrentMemoryContext when we started the
> > streaming and ecxt it the context when we catch the error.  So
> > ideally, before this change, it will rethrow in the context when we
> > catch the error i.e. ecxt.  So what we are trying to do is put it back
> > to normal context (ccxt) and copy the error data in the normal
> > context.  And, if we are not handling it gracefully then put it back
> > to the context it was in, and rethrow.
> >
>
> Okay, but when errorcode is *not* ERRCODE_TRANSACTION_ROLLBACK, don't
> we need to clean up the reorderbuffer by calling
> ReorderBufferCleanupTXN?  If so, then you can try to combine it with
> the not-streaming else loop.

Done


> > > 8.
> > > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> > > *rb, TransactionId xid,
> > >   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> > >
> > >   txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > +
> > > + /*
> > > + * TOCHECK: Mark toplevel transaction as having catalog changes too
> > > + * if one of its children has.
> > > + */
> > > + if (txn->toptxn != NULL)
> > > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > >  }
> > >
> > > Why are we marking top transaction here?
> >
> > We need to mark top transaction to decide whether to build tuplecid
> > hash or not.  In non-streaming mode, we are only sending during the
> > commit time, and during commit time we know whether the top
> > transaction has any catalog changes or not based on the invalidation
> > message so we are marking the top transaction there in DecodeCommit.
> > Since here we are not waiting till commit so we need to mark the top
> > transaction as soon as we mark any of its child transactions.
> >
>
> But how does it help?  We use this flag (via
> ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn which is
> anyway done in DecodeCommit and that too after setting this flag for
> the top transaction if required.  So, how will it help in setting it
> while processing for subxid.  Also, even if we have to do it won't it
> add the xid needlessly in builder->committed.xip array?

In ReorderBufferBuildTupleCidHash, we use this flag to decide whether
to build the tuplecid hash or not based on whether it has catalog
changes or not.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 19, 2020 at 4:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, May 19, 2020 at 3:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 19, 2020 at 2:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > >
> > > > 3.
> > > > + /*
> > > > + * If streaming is enable and we have serialized this transaction because
> > > > + * it had incomplete tuple.  So if now we have got the complete tuple we
> > > > + * can stream it.
> > > > + */
> > > > + if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
> > > > + && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
> > > > + {
> > > >
> > > > This comment is just saying what you are doing in the if-check.  I
> > > > think you need to explain the rationale behind it. I don't like the
> > > > variable name 'can_stream' because it matches ReorderBufferCanStream
> > > > whereas it is for a different purpose, how about naming it as
> > > > 'change_complete' or something like that.  The check has many
> > > > conditions, can we move it to a separate function to make the code
> > > > here look clean?
> > > >
> > >
> > > Do we really need this?  Immediately after this check, we are calling
> > > ReorderBufferCheckMemoryLimit which will anyway stream the changes if
> > > required.
> >
> > Actually, ReorderBufferCheckMemoryLimit is only meant for checking
> > whether we need to stream the changes due to the memory limit.  But
> > suppose that when the memory limit was exceeded we could not stream the
> > transaction because there was only an incomplete toast insert, so we
> > serialized it.  Now we get the tuple which makes the changes
> > complete, but we are no longer crossing the memory limit as the changes were
> > already serialized.  So I am not sure whether it is a good idea to
> > stream the transaction as soon as we get the complete changes, or whether we
> > should wait till the next time the memory limit is exceeded and select
> > the suitable candidate at that time.
> >
>
> I think it is better to wait till next time we exceed the memory threshold.

Okay, done this way.


> >  Ideally, if we are in streaming mode and
> > the transaction is serialized, that means it was already a candidate for
> > streaming but could not be streamed due to the incomplete changes, so
> > shouldn't we stream it immediately as soon as its changes are complete
> > even though we are now within the memory limit?
> >
>
> The only time we need to stream or spill is when we exceed memory
> threshold.  In the above case, it is possible that next time there is
> some other candidate transaction that we can stream.
>
> > >
> > > Another comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple:
> > >
> > > + else if (rbtxn_has_toast_insert(txn) &&
> > > + ChangeIsInsertOrUpdate(change->action))
> > > + {
> > > + toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
> > > + can_stream = true;
> > > + }
> > > ..
> > > +#define ChangeIsInsertOrUpdate(action) \
> > > + (((action) == REORDER_BUFFER_CHANGE_INSERT) || \
> > > + ((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
> > > + ((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT))
> > >
> > > How can we clear the RBTXN_HAS_TOAST_INSERT flag on
> > > REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT action?
> >
> > Partial toast insert means we have inserted into the toast table but not into
> > the main table.  So even if it is a spec insert we can form the complete
> > tuple; however, we still cannot stream it because we haven't got
> > spec_confirm, but for that we are marking another flag.  So if the
> > insert is a spec insert, the toast insert will also be a spec insert, and
> > as part of those toast spec inserts we are marking the partial tuple, so
> > clearing that flag should happen when the spec insert is done for the
> > main table, right?
> >
>
> Sounds reasonable.

ok




--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 19, 2020 at 5:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > > 4.
> > > +static void
> > > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> > > +{
> > > + LogicalDecodingContext *ctx = cache->private_data;
> > > + LogicalErrorCallbackState state;
> > > + ErrorContextCallback errcallback;
> > > +
> > > + Assert(!ctx->fast_forward);
> > > +
> > > + /* We're only supposed to call this when streaming is supported. */
> > > + Assert(ctx->streaming);
> > > +
> > > + /* Push callback + info on the error context stack */
> > > + state.ctx = ctx;
> > > + state.callback_name = "stream_start";
> > > + /* state.report_location = apply_lsn; */
> > >
> > > Why can't we supply the report_location here?  I think here we need to
> > > report txn->first_lsn if this is the very first stream and
> > > txn->final_lsn if it is any consecutive one.
> >
> > Done
> >
>
> Now after your change in stream_start_cb_wrapper, we assign
> report_location as the first_lsn passed as input to the function, but
> write_location is still txn->first_lsn.  Shouldn't we assign the passed-in
> first_lsn to write_location?  It seems assigning txn->first_lsn won't
> be correct for streams other than the first one.

Done

>
> > > 5.
> > > +static void
> > > +stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> > > +{
> > > + LogicalDecodingContext *ctx = cache->private_data;
> > > + LogicalErrorCallbackState state;
> > > + ErrorContextCallback errcallback;
> > > +
> > > + Assert(!ctx->fast_forward);
> > > +
> > > + /* We're only supposed to call this when streaming is supported. */
> > > + Assert(ctx->streaming);
> > > +
> > > + /* Push callback + info on the error context stack */
> > > + state.ctx = ctx;
> > > + state.callback_name = "stream_stop";
> > > + /* state.report_location = apply_lsn; */
> > >
> > > Can't we report txn->final_lsn here
> >
> > We are already setting this to the  txn->final_ls in 0006 patch, but I
> > have moved it into this patch now.
> >
>
> Similar to previous point, here also, I think we need to assign report
> and write location as last_lsn passed to this API.

Done

> >
> > > v20-0005-Implement-streaming-mode-in-ReorderBuffer
> > > -----------------------------------------------------------------------------
> > > 10.
> > > Theoretically, we could get rid of the k-way merge, and append the
> > > changes to the toplevel xact directly (and remember the position
> > > in the list in case the subxact gets aborted later).
> > >
> > > I don't think this part of the commit message is correct as we
> > > sometimes need to spill even during streaming.  Please check the
> > > entire commit message and update according to the latest
> > > implementation.
> >
> > Done
> >
>
> You seem to forgot about removing the other part of message ("This
> adds a second iterator for the streaming case...." which is not
> relavant now.

Done


> > > 11.
> > > - * HeapTupleSatisfiesHistoricMVCC.
> > > + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> > > + *
> > > + * We do build the hash table even if there are no CIDs. That's
> > > + * because when streaming in-progress transactions we may run into
> > > + * tuples with the CID before actually decoding them. Think e.g. about
> > > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> > > + * yet when applying the INSERT. So we build a hash table so that
> > > + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> > > + *
> > > + * XXX We might limit this behavior to streaming mode, and just bail
> > > + * out when decoding transaction at commit time (at which point it's
> > > + * guaranteed to see all CIDs).
> > >   */
> > >  static void
> > >  ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> > > *rb, ReorderBufferTXN *txn)
> > >   dlist_iter iter;
> > >   HASHCTL hash_ctl;
> > >
> > > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> > > - return;
> > > -
> > >
> > > I don't understand this change.  Why would "INSERT followed by
> > > TRUNCATE" could lead to a tuple which can come for decode before its
> > > CID?  The patch has made changes based on this assumption in
> > > HeapTupleSatisfiesHistoricMVCC which appears to be very risky as the
> > > behavior could be dependent on whether we are streaming the changes
> > > for in-progress xact or at the commit of a transaction.  We might want
> > > to generate a test to once validate this behavior.
> > >
> > > Also, the comment refers to tqual.c which is wrong as this API is now
> > > in heapam_visibility.c.
> >
> > Done.
> >
>
> + * INSERT.  So in such cases we assume the CIDs is from the future command
> + * and return as unresolve.
> + */
> + if (tuplecid_data == NULL)
> + return false;
> +
>
> Here lets reword the last line of comment as ".  So in such cases we
> assume the CID is from the future command."

Done


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 19, 2020 at 6:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > > 3.
> > > And, during catalog scan we can check the status of the xid and
> > > + * if it is aborted we will report a specific error that we can ignore.  We
> > > + * might have already streamed some of the changes for the aborted
> > > + * (sub)transaction, but that is fine because when we decode the abort we will
> > > + * stream abort message to truncate the changes in the subscriber.
> > > + */
> > > +static inline void
> > > +SetupCheckXidLive(TransactionId xid)
> > >
> > > In the above comment, I don't think it is right to say that we ignore
> > > the error raised due to the aborted transaction.  We need to say that
> > > we discard the already streamed changes on such an error.
> >
> > Done.
> >
>
> In the same comment, there is typo (/messageto/message to).

Done

> > > 4.
> > > +static inline void
> > > +SetupCheckXidLive(TransactionId xid)
> > > +{
> > >   /*
> > > - * If this transaction has no snapshot, it didn't make any changes to the
> > > - * database, so there's nothing to decode.  Note that
> > > - * ReorderBufferCommitChild will have transferred any snapshots from
> > > - * subtransactions if there were any.
> > > + * setup CheckXidAlive if it's not committed yet. We don't check if the xid
> > > + * aborted. That will happen during catalog access.  Also reset the
> > > + * sysbegin_called flag.
> > >   */
> > > - if (txn->base_snapshot == NULL)
> > > + if (!TransactionIdDidCommit(xid))
> > >   {
> > > - Assert(txn->ninvalidations == 0);
> > > - ReorderBufferCleanupTXN(rb, txn);
> > > - return;
> > > + CheckXidAlive = xid;
> > > + bsysscan = false;
> > >   }
> > >
> > > I think this function is inline as it needs to be called for each
> > > change. If that is the case and otherwise also, isn't it better that
> > > we check if passed xid is the same as CheckXidAlive before checking
> > > TransactionIdDidCommit as TransactionIdDidCommit can be costly and
> > > calling it for each change might not be a good idea?
> >
> > Done.  Also, I think it is good to check TransactionIdIsInProgress
> > instead of !TransactionIdDidCommit.  I have changed that as well.
> >
>
> What if it is aborted just before this check?  I think the decode API
> won't be able to detect that and sys* API won't care to check because
> CheckXidAlive won't be set for that case.

Yeah, that's the problem; I think it should be TransactionIdDidCommit only.
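
For clarity, a minimal sketch of the shape being discussed (not the exact patch code; CheckXidAlive and bsysscan are the globals referenced above):

static inline void
SetupCheckXidLive(TransactionId xid)
{
    /*
     * Fast path: if we have already armed the check for this xid while
     * processing a previous change, skip the costly clog lookup.
     */
    if (CheckXidAlive == xid)
        return;

    /*
     * Arm CheckXidAlive only for transactions not known to be committed;
     * an abort will be detected later, during catalog access.
     */
    if (!TransactionIdDidCommit(xid))
    {
        CheckXidAlive = xid;
        bsysscan = false;
    }
}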

> > > 5.
> > > setup CheckXidAlive if it's not committed yet. We don't check if the xid
> > > + * aborted. That will happen during catalog access.  Also reset the
> > > + * sysbegin_called flag.
> > >
> > > /if the xid aborted/if the xid is aborted.  missing comma after Also.
> >
> > Done
> >
>
> You forgot to change as per the second part of the comment (missing
> comma after Also).

Done


> > > 8.
> > > @@ -1588,8 +1766,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> > >   * use as a normal record. It'll be cleaned up at the end
> > >   * of INSERT processing.
> > >   */
> > > - if (specinsert == NULL)
> > > - elog(ERROR, "invalid ordering of speculative insertion changes");
> > >
> > > You have removed this check but all other handling of specinsert is
> > > same as far as this patch is concerned.  Why so?
> >
> > Seems like a merge issue, or the leftover from the old design of the
> > toast handling where we were streaming with the partial tuple.
> > fixed now.
> >
> > > 9.
> > > @@ -1676,8 +1860,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> > >   * freed/reused while restoring spooled data from
> > >   * disk.
> > >   */
> > > - Assert(change->data.tp.newtuple != NULL);
> > > -
> > >   dlist_delete(&change->node);
> > >
> > > Why is this Assert removed?
> >
> > Same cause as above so fixed.
> >
> > > 10.
> > > @@ -1753,7 +1935,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> > >   relations[nrelations++] = relation;
> > >   }
> > >
> > > - rb->apply_truncate(rb, txn, nrelations, relations, change);
> > > + if (streaming)
> > > + {
> > > + rb->stream_truncate(rb, txn, nrelations, relations, change);
> > > +
> > > + /* Remember that we have sent some data. */
> > > + change->txn->any_data_sent = true;
> > > + }
> > > + else
> > > + rb->apply_truncate(rb, txn, nrelations, relations, change);
> > >
> > > Can we encapsulate this in a separate function like
> > > ReorderBufferApplyTruncate or something like that?  Basically, rather
> > > than having streaming check in this function, lets do it in some other
> > > internal function.  And we can likewise do it for all the streaming
> > > checks in this function or at least whereever it is feasible.  That
> > > will make this function look clean.
> >
> > Done for truncate and change.  I think we can create a few more such
> > functions for
> > start/stop and cleanup handling on error.  I will work on that.
> >
>
> Yeah, I think that would be better.

I have done some refactoring, please look into the latest version.
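
As an illustration of the direction (a sketch only; the helper name and exact signature are assumptions, not necessarily what the patch does), the truncate case could be wrapped like:

static void
ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
                           int nrelations, Relation *relations,
                           ReorderBufferChange *change, bool streaming)
{
    /* Route the change to either the streaming or the regular callback. */
    if (streaming)
        rb->stream_truncate(rb, txn, nrelations, relations, change);
    else
        rb->apply_truncate(rb, txn, nrelations, relations, change);
}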

> One minor comment change suggestion:
> /*
> + * start stream or begin the transaction.  If this is the first
> + * change in the current stream.
> + */
>
> We can write the above comment as "Start the stream or begin the
> transaction for the first change in the current stream."

Done


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, May 19, 2020 at 6:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
>
> I have further reviewed v22 and below are my comments:
>
> v22-0005-Implement-streaming-mode-in-ReorderBuffer
> --------------------------------------------------------------------------
> 1.
> + * Note: We never do both stream and serialize a transaction (we only spill
> + * to disk when streaming is not supported by the plugin), so only one of
> + * those two flags may be set at any given time.
> + */
> +#define rbtxn_is_streamed(txn) \
> +( \
> + ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
> +)
>
> The above 'Note' is not correct as per the latest implementation.

That is removed in 0010 in the latest version you can see in 0006.

> v22-0006-Add-support-for-streaming-to-built-in-replicatio
> ----------------------------------------------------------------------------
> 2.
> --- a/src/backend/replication/logical/launcher.c
> +++ b/src/backend/replication/logical/launcher.c
> @@ -14,7 +14,6 @@
>   *
>   *-------------------------------------------------------------------------
>   */
> -
>  #include "postgres.h"
>
> Spurious line removal.

Fixed

> 3.
> +void
> +logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
> +    XLogRecPtr commit_lsn)
> +{
> + uint8 flags = 0;
> +
> + pq_sendbyte(out, 'c'); /* action STREAM COMMIT */
> +
> + Assert(TransactionIdIsValid(txn->xid));
> +
> + /* transaction ID (we're starting to stream, so must be valid) */
> + pq_sendint32(out, txn->xid);
>
> The part of the comment "we're starting to stream, so must be valid"
> is not correct as we are not at the start of the stream here.  The
> patch has used the same incorrect sentence at few places, kindly fix
> those as well.

I have removed that part of the comment.

> 4.
> + * XXX Do we need to allocate it in TopMemoryContext?
> + */
> +static void
> +subxact_info_add(TransactionId xid)
> {
> ..
>
> For this and other places in a patch like in function
> stream_open_file(), instead of using TopMemoryContext, can we consider
> using a new memory context LogicalStreamingContext or something like
> that. We can create LogicalStreamingContext under TopMemoryContext.  I
> don't see any need of using TopMemoryContext here.

But when will we delete/reset the LogicalStreamingContext?  Because
we are planning to keep this memory as long as the worker is alive, it is
supposed to be in the top memory context.  If we create any other context
with the same life span as TopMemoryContext then what is the point?
Am I missing something?

> 5.
> +static void
> +subxact_info_add(TransactionId xid)
>
> This function has assumed a valid value for global variables like
> stream_fd and stream_xid.  I think it is better to have Assert for
> those in this function before using them.  The Assert for those are
> present in handle_streamed_transaction but I feel they should be in
> subxact_info_add.

Done

> 6.
> +subxact_info_add(TransactionId xid)
> /*
> + * In most cases we're checking the same subxact as we've already seen in
> + * the last call, so make ure just ignore it (this change comes later).
> + */
> + if (subxact_last == xid)
> + return;
>
> Typo and minor correction, /ure just/sure to

Done

> 7.
> +subxact_info_write(Oid subid, TransactionId xid)
> {
> ..
> + /*
> + * But we free the memory allocated for subxact info. There might be one
> + * exceptional transaction with many subxacts, and we don't want to keep
> + * the memory allocated forewer.
> + *
> + */
>
> a. Typo, /forewer/forever
> b. The extra line at the end of the comment is not required.

Done


> 8.
> + * XXX Maybe we should only include the checksum when the cluster is
> + * initialized with checksums?
> + */
> +static void
> +subxact_info_write(Oid subid, TransactionId xid)
>
> Do we really need to have the checksum for temporary files? I have
> checked a few other similar cases like SharedFileSet stuff for
> parallel hash join but didn't find them using checksums.  Can you also
> once see other usages of temporary files and then let us decide if we
> see any reason to have checksums for this?

Yeah, even I can see other places checksum is not used.

>
> Another point is we don't seem to be doing this for 'changes' file,
> see stream_write_change.  So, not sure, there is any sense to write
> checksum for subxact file.

I can see there is a comment atop this function:

* XXX The subxact file includes CRC32C of the contents. Maybe we should
* include something like that here too, but doing so will not be as
* straighforward, because we write the file in chunks.

>
> Tomas, do you see any reason for the same?


> 9.
> +subxact_filename(char *path, Oid subid, TransactionId xid)
> +{
> + char tempdirpath[MAXPGPATH];
> +
> + TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
> +
> + /*
> + * We might need to create the tablespace's tempfile directory, if no
> + * one has yet done so.
> + */
> + if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not create directory \"%s\": %m",
> + tempdirpath)));
> +
> + snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
> + tempdirpath, subid, xid);
> +}
>
> Temporary files created in PGDATA/base/pgsql_tmp follow a certain
> naming convention (see docs[1]) which is not followed here.  You can
> also refer SharedFileSetPath and OpenTemporaryFile.  I think we can
> just try to follow that convention and then additionally append subid,
> xid and .subxacts.  Also, a similar change is required for
> changes_filename.  I would like to know if there is a reason why we
> want to use different naming convention here?

I have changed it to this: pgsql_tmpPID-subid-xid.subxacts.
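
For illustration, a sketch of how the filename construction could follow that convention (the exact format string used in the patch may differ; PG_TEMP_FILE_PREFIX is the usual "pgsql_tmp" prefix):

static void
subxact_filename(char *path, Oid subid, TransactionId xid)
{
    char        tempdirpath[MAXPGPATH];

    TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);

    /* Follow the pgsql_tmp prefix convention used for other temp files. */
    snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.subxacts",
             tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
}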

> 10.
> + * This can only be called at the beginning of a "streaming" block, i.e.
> + * between stream_start/stream_stop messages from the upstream.
> + */
> +static void
> +stream_close_file(void)
>
> The comment seems to be wrong.  I think this can be only called at
> stream end, so it should be "This can only be called at the end of a
> "streaming" block, i.e. at stream_stop message from the upstream."

Right, I have fixed it.

> 11.
> + * the order the transactions are sent in. So streamed trasactions are
> + * handled separately by using schema_sent flag in ReorderBufferTXN.
> + *
>   * For partitions, 'pubactions' considers not only the table's own
>   * publications, but also those of all of its ancestors.
>   */
>  typedef struct RelationSyncEntry
>  {
>   Oid relid; /* relation oid */
> -
> + TransactionId xid; /* transaction that created the record */
>   /*
>   * Did we send the schema?  If ancestor relid is set, its schema must also
>   * have been sent for this to be true.
>   */
>   bool schema_sent;
> + List    *streamed_txns; /* streamed toplevel transactions with this
> + * schema */
>
> The part of comment "So streamed trasactions are handled separately by
> using schema_sent flag in ReorderBufferTXN." doesn't seem to match
> with what we are doing in the latest version of the patch.

Yeah, it's wrong,  I have fixed it.


> 12.
> maybe_send_schema()
> {
> ..
> + if (in_streaming)
> + {
> + /*
> + * TOCHECK: We have to send schema after each catalog change and it may
> + * occur when streaming already started, so we have to track new catalog
> + * changes somehow.
> + */
> + schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
> ..
> ..
> }
>
> I think it is good to once verify/test what this comment says but as
> per code we should be sending the schema after each catalog change as
> we invalidate the streamed_txns list in rel_sync_cache_relation_cb
> which must be called during relcache invalidation.  Do we see any
> problem with that mechanism?

I have tested this, I think we are already sending the schema after
each catalog change.
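
For reference, a minimal sketch of the per-streamed-transaction schema tracking (get_schema_sent_in_streamed_txn is the helper quoted above; the body here and the setter are only an illustration of the idea, and the patch may store the xids differently):

static bool
get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
{
    ListCell   *lc;

    foreach(lc, entry->streamed_txns)
    {
        if ((TransactionId) lfirst_int(lc) == xid)
            return true;
    }
    return false;
}

static void
set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
{
    /* The list must live as long as the cache entry itself. */
    MemoryContext oldctx = MemoryContextSwitchTo(CacheMemoryContext);

    entry->streamed_txns = lappend_int(entry->streamed_txns, (int) xid);
    MemoryContextSwitchTo(oldctx);
}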

> 13.
> +/*
> + * Notify downstream to discard the streamed transaction (along with all
> + * it's subtransactions, if it's a toplevel transaction).
> + */
> +static void
> +pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
> +    ReorderBufferTXN *txn,
> +    XLogRecPtr commit_lsn)
>
> This comment is copied from pgoutput_stream_abort, so doesn't match
> what this function is doing.

Done

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Fri, May 22, 2020 at 4:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > v22-0006-Add-support-for-streaming-to-built-in-replicatio
> > ----------------------------------------------------------------------------
> >
> Few more comments on v22-0006 patch:
>
> 1.
> +stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
> +{
> + int i;
> + char path[MAXPGPATH];
> + bool found = false;
> +
> + subxact_filename(path, subid, xid);
> +
> + if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not remove file \"%s\": %m", path)));
>
> Here, we have unlinked the files containing information of subxacts
> but don't we need to free the corresponding memory (memory for
> subxacts) as well?

Basically, stream_cleanup_files is used for
1) cleaning up the files on worker exit;
2) while writing the first segment of the xid, to ensure there are no
orphaned files with the same xid;
3) after apply commit, to clean up the files.

Whereas the subxacts memory is only used between stream start and stream
stop; as soon as we get stream stop we write the subxacts changes to the file
and free the memory.  So there is no case where we can have subxacts memory
at stream_cleanup_files, except on worker exit, but there we are
already exiting the worker.  IMHO we don't need to free the memory there.

> 2.
> apply_handle_stream_abort()
> {
> ..
> + subxact_filename(path, MyLogicalRepWorker->subid, xid);
> +
> + if (unlink(path) < 0)
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not remove file \"%s\": %m", path)));
> +
> + return;
> ..
> }
>
> Like the previous comment, it seems here also we need to free subxacts
> memory and additionally we forgot to adjust the xids array as well.

In this, we are allocating memory in subxact_info_read, but we are
again calling subxact_info_write which will free the memory.

> 3.
> apply_handle_stream_abort()
> {
> ..
> + /* XXX optimize the search by bsearch on sorted data */
> + for (i = nsubxacts; i > 0; i--)
> + {
> + if (subxacts[i - 1].xid == subxid)
> + {
> + subidx = (i - 1);
> + found = true;
> + break;
> + }
> + }
> +
> + if (!found)
> + return;
> ..
> }
>
> Is it possible that we didn't find the xid in subxacts array?  If so,
> I think we should mention the same in comments, otherwise, we should
> have an assert for found.

We may not find it in the case of an empty transaction; I have changed the comments.
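
In case it helps with the XXX above, a sketch of what a bsearch-style lookup could look like, assuming the subxacts array is kept ordered by xid (that ordering, and the SubXactInfo element type name, are assumptions to verify):

static int
subxact_find(SubXactInfo *subxacts, uint32 nsubxacts, TransactionId subxid)
{
    uint32      lo = 0;
    uint32      hi = nsubxacts;

    while (lo < hi)
    {
        uint32      mid = lo + (hi - lo) / 2;

        if (subxacts[mid].xid == subxid)
            return (int) mid;
        else if (TransactionIdPrecedes(subxacts[mid].xid, subxid))
            lo = mid + 1;
        else
            hi = mid;
    }

    return -1;                  /* not found, e.g. an empty subtransaction */
}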

> 4.
> apply_handle_stream_abort()
> {
> ..
> + changes_filename(path, MyLogicalRepWorker->subid, xid);
> +
> + if (truncate(path, subxacts[subidx].offset))
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not truncate file \"%s\": %m", path)));
> ..
> }
>
> Will truncate works on Windows?  I see in the code we ftruncate which
> is defined as chsize in win32.h and win32_port.h.  I have not tested
> this so I am not very sure about this.  I got a below warning when I
> tried to compile this code on Windows.  I think it is better to
> ftruncate as it is used at other places in the code as well.
>
> worker.c(798): warning C4013: 'truncate' undefined; assuming extern
> returning int

I have changed it to use ftruncate.
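
For illustration, one way the fd-based truncation could look (a sketch only; whether the patch opens a transient fd here or reuses an already-open handle is an assumption):

int         fd;

fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
if (fd < 0)
    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not open file \"%s\": %m", path)));

/* ftruncate() is mapped to chsize() on Windows, truncate() is not available. */
if (ftruncate(fd, subxacts[subidx].offset) != 0)
    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not truncate file \"%s\": %m", path)));

CloseTransientFile(fd);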

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Erik Rijkers
Date:
On 2020-05-25 16:37, Dilip Kumar wrote:
> On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> 
> wrote:
>> 
>> On Tue, May 19, 2020 at 6:01 PM Amit Kapila <amit.kapila16@gmail.com> 
>> wrote:
>> >
>> > On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> > >
>> 
>> I have further reviewed v22 and below are my comments:
>> 

>>    [v24.tar]

Hi,

I am not able to extract all files correctly from this tar.

The first file v24-0001-* seems to have some 'binary' junk at the top.

(The other 11 files seem normally readably)


Erik Rijkers





Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, May 25, 2020 at 8:48 PM Erik Rijkers <er@xs4all.nl> wrote:
>

> Hi,
>
> I am not able to extract all files correctly from this tar.
>
> The first file v24-0001-* seems to have some 'binary' junk at the top.
>
> (The other 11 files seem normally readably)

Okay, sending again.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, May 22, 2020 at 6:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Few comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple
> > 1.
> > + /*
> > + * If this is a toast insert then set the corresponding bit.  Otherwise, if
> > + * we have toast insert bit set and this is insert/update then clear the
> > + * bit.
> > + */
> > + if (toast_insert)
> > + toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
> > + else if (rbtxn_has_toast_insert(txn) &&
> > + ChangeIsInsertOrUpdate(change->action))
> > + {
> >
> > Here, it might better to add a comment on why we expect only
> > Insert/Update?  Also, it might be better that we add an assert for
> > other operations.
>
> I have added comments that why on Insert/Update we clean the flag.
> But I don't think we only expect insert/update,  we might get the
> toast delete right? because in toast update we will do toast delete +
> toast insert.  So when we get toast delete we just don't want to do
> anything.
>

Okay, that makes sense.

> >
> > 2.
> > @@ -1865,8 +1920,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
> > ReorderBufferTXN *txn,
> >   * disk.
> >   */
> >   dlist_delete(&change->node);
> > - ReorderBufferToastAppendChunk(rb, txn, relation,
> > -   change);
> > + ReorderBufferToastAppendChunk(rb, txn, relation,
> > +   change);
> >   }
> >
> > This seems to be a spurious change.
>
> Done
>
> 2. There is a bug fix in handling the stream abort in 0008 (earlier it
> was 0006).
>

The code changes look fine but it is not clear what was the exact
issue.  Can you explain?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, May 22, 2020 at 6:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, May 18, 2020 at 4:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > >
> > > > Review comments:
> > > > ------------------------------
> > > > 1.
> > > > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
> > > > TransactionId xid,
> > > >   }
> > > >
> > > >   case REORDER_BUFFER_CHANGE_MESSAGE:
> > > > - rb->message(rb, txn, change->lsn, true,
> > > > - change->data.msg.prefix,
> > > > - change->data.msg.message_size,
> > > > - change->data.msg.message);
> > > > + if (streaming)
> > > > + rb->stream_message(rb, txn, change->lsn, true,
> > > > +    change->data.msg.prefix,
> > > > +    change->data.msg.message_size,
> > > > +    change->data.msg.message);
> > > > + else
> > > > + rb->message(rb, txn, change->lsn, true,
> > > > +    change->data.msg.prefix,
> > > > +    change->data.msg.message_size,
> > > > +    change->data.msg.message);
> > > >
> > > > Don't we need to set any_data_sent flag while streaming messages as we
> > > > do for other types of changes?
> > >
> > > I think any_data_sent, was added to avoid sending abort to the
> > > subscriber if we haven't sent any data,  but this is not complete as
> > > the output plugin can also take the decision not to send.  So I think
> > > this should not be done as part of this patch and can be done
> > > separately.  I think there is already a thread for handling the
> > > same[1]
> > >
> >
> > Hmm, but prior to this patch, we never use to send (empty) aborts but
> > now that will be possible. It is probably okay to deal that with
> > another patch mentioned by you but I felt at least any_data_sent will
> > work for some cases.  OTOH, it appears to be half-baked solution, so
> > we should probably refrain from adding it.  BTW, how do the pgoutput
> > plugin deal with it? I see that apply_handle_stream_abort will
> > unconditionally try to unlink the file and it will probably fail.
> > Have you tested this scenario after your latest changes?
>
> Yeah, I see, I think this is a problem,  but this exists without my
> latest change as well, if pgoutput ignore some changes because it is
> not published then we will see a similar error.  Shall we handle the
> ENOENT error case from unlink?
>

Isn't this problem only for the subxact file, as we anyway create the changes
file as part of the start stream message, which should have come after the
abort?  If so, can't we detect whether the subxact file exists, probably by
using nsubxacts or something like that?  Can you please try to
reproduce this scenario once to ensure that we are not missing anything?

>
>
> > > > 8.
> > > > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> > > > *rb, TransactionId xid,
> > > >   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> > > >
> > > >   txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > > +
> > > > + /*
> > > > + * TOCHECK: Mark toplevel transaction as having catalog changes too
> > > > + * if one of its children has.
> > > > + */
> > > > + if (txn->toptxn != NULL)
> > > > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > >  }
> > > >
> > > > Why are we marking top transaction here?
> > >
> > > We need to mark top transaction to decide whether to build tuplecid
> > > hash or not.  In non-streaming mode, we are only sending during the
> > > commit time, and during commit time we know whether the top
> > > transaction has any catalog changes or not based on the invalidation
> > > message so we are marking the top transaction there in DecodeCommit.
> > > Since here we are not waiting till commit so we need to mark the top
> > > transaction as soon as we mark any of its child transactions.
> > >
> >
> > But how does it help?  We use this flag (via
> > ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn which is
> > anyway done in DecodeCommit and that too after setting this flag for
> > the top transaction if required.  So, how will it help in setting it
> > while processing for subxid.  Also, even if we have to do it won't it
> > add the xid needlessly in builder->committed.xip array?
>
> In ReorderBufferBuildTupleCidHash, we use this flag to decide whether
> to build the tuplecid hash or not based on whether it has catalog
> changes or not.
>

Okay, but you haven't answered the second part of the question: "won't
it add the xid of top transaction needlessly in builder->committed.xip
array, see function SnapBuildCommitTxn?"  IIUC, this can happen
without patch as well because DecodeCommit also sets the flags just
based on invalidation messages irrespective of whether the messages
are generated by top transaction or not, is that right?  If this is
correct, please explain why we are doing so in the comments.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 26, 2020 at 10:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 22, 2020 at 6:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Few comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple
> > > 1.
> > > + /*
> > > + * If this is a toast insert then set the corresponding bit.  Otherwise, if
> > > + * we have toast insert bit set and this is insert/update then clear the
> > > + * bit.
> > > + */
> > > + if (toast_insert)
> > > + toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
> > > + else if (rbtxn_has_toast_insert(txn) &&
> > > + ChangeIsInsertOrUpdate(change->action))
> > > + {
> > >
> > > Here, it might better to add a comment on why we expect only
> > > Insert/Update?  Also, it might be better that we add an assert for
> > > other operations.
> >
> > I have added comments that why on Insert/Update we clean the flag.
> > But I don't think we only expect insert/update,  we might get the
> > toast delete right? because in toast update we will do toast delete +
> > toast insert.  So when we get toast delete we just don't want to do
> > anything.
> >
>
> Okay, that makes sense.
>
> > >
> > > 2.
> > > @@ -1865,8 +1920,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
> > > ReorderBufferTXN *txn,
> > >   * disk.
> > >   */
> > >   dlist_delete(&change->node);
> > > - ReorderBufferToastAppendChunk(rb, txn, relation,
> > > -   change);
> > > + ReorderBufferToastAppendChunk(rb, txn, relation,
> > > +   change);
> > >   }
> > >
> > > This seems to be a spurious change.
> >
> > Done
> >
> > 2. There is a bug fix in handling the stream abort in 0008 (earlier it
> > was 0006).
> >
>
> The code changes look fine but it is not clear what was the exact
> issue.  Can you explain?

Basically, in the case of an empty subtransaction, we were reading the
subxacts info, but when we could not find the subxid in the subxacts
info we were not releasing the memory.  So the next subxact_info_read
expects that subxacts has already been freed, but we did not free it in
that !found case.
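
To make that concrete, a sketch of the kind of fix described (variable names follow the fragments quoted earlier in the thread; the exact cleanup in the patch may differ):

if (!found)
{
    /*
     * Empty subtransaction: nothing to truncate, but still release the
     * subxact bookkeeping so the next subxact_info_read() starts clean.
     */
    if (subxacts)
        pfree(subxacts);
    subxacts = NULL;
    nsubxacts = 0;
    return;
}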

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, May 25, 2020 at 8:07 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > 4.
> > + * XXX Do we need to allocate it in TopMemoryContext?
> > + */
> > +static void
> > +subxact_info_add(TransactionId xid)
> > {
> > ..
> >
> > For this and other places in a patch like in function
> > stream_open_file(), instead of using TopMemoryContext, can we consider
> > using a new memory context LogicalStreamingContext or something like
> > that. We can create LogicalStreamingContext under TopMemoryContext.  I
> > don't see any need of using TopMemoryContext here.
>
> But, when we will delete/reset the LogicalStreamingContext?
>

Why can't we reset it at each stream stop message?

>  because
> we are planning to keep this memory until the worker is alive so that
> supposed to be the top memory context.
>

Which part of the allocation do we want to keep till the worker is alive?
Why do we need the memory related to subxacts till the worker is alive?  As
we have it now, after reading the subxact info (subxact_info_read), we need
to ensure that it is freed after its usage, due to which we need to
remember and perform pfree at various places.

I think we should explore the possibility of switching to this new
context in the start stream message and resetting it in the
stop stream message.  That might help in avoiding
MemoryContextSwitchTo(TopMemoryContext) at various places.
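
For illustration, a minimal sketch of that idea (the context name comes from the discussion above; the function names and the exact create/reset points are assumptions):

static MemoryContext LogicalStreamingContext = NULL;

static void
stream_start_internal(void)
{
    /* Created once, under TopMemoryContext, on the first stream start. */
    if (LogicalStreamingContext == NULL)
        LogicalStreamingContext =
            AllocSetContextCreate(TopMemoryContext,
                                  "LogicalStreamingContext",
                                  ALLOCSET_DEFAULT_SIZES);
}

static void
stream_stop_internal(void)
{
    /* Release everything allocated while applying this stream block. */
    MemoryContextReset(LogicalStreamingContext);
}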

>  If we create any other context
> with the same life span as TopMemoryContext then what is the point?
>

It is helpful for debugging.  It is recommended that we don't use the
top memory context unless it is really required.  Read about it in
src/backend/utils/mmgr/README.

>
> > 8.
> > + * XXX Maybe we should only include the checksum when the cluster is
> > + * initialized with checksums?
> > + */
> > +static void
> > +subxact_info_write(Oid subid, TransactionId xid)
> >
> > Do we really need to have the checksum for temporary files? I have
> > checked a few other similar cases like SharedFileSet stuff for
> > parallel hash join but didn't find them using checksums.  Can you also
> > once see other usages of temporary files and then let us decide if we
> > see any reason to have checksums for this?
>
> Yeah, even I can see other places checksum is not used.
>

So, unless someone speaks up before you are ready for the next version
of the patch, can we remove it?

> >
> > Another point is we don't seem to be doing this for 'changes' file,
> > see stream_write_change.  So, not sure, there is any sense to write
> > checksum for subxact file.
>
> I can see there are comment atop this function
>
> * XXX The subxact file includes CRC32C of the contents. Maybe we should
> * include something like that here too, but doing so will not be as
> * straighforward, because we write the file in chunks.
>

You can remove this comment as well.  I don't know how advantageous it
is to checksum temporary files.  We can anyway add it later if there
is a reason for doing so.

>
>
> > 12.
> > maybe_send_schema()
> > {
> > ..
> > + if (in_streaming)
> > + {
> > + /*
> > + * TOCHECK: We have to send schema after each catalog change and it may
> > + * occur when streaming already started, so we have to track new catalog
> > + * changes somehow.
> > + */
> > + schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
> > ..
> > ..
> > }
> >
> > I think it is good to once verify/test what this comment says but as
> > per code we should be sending the schema after each catalog change as
> > we invalidate the streamed_txns list in rel_sync_cache_relation_cb
> > which must be called during relcache invalidation.  Do we see any
> > problem with that mechanism?
>
> I have tested this, I think we are already sending the schema after
> each catalog change.
>

Then remove "TOCHECK" in the above comment.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, May 26, 2020 at 2:44 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, May 26, 2020 at 10:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > >
> > > 2. There is a bug fix in handling the stream abort in 0008 (earlier it
> > > was 0006).
> > >
> >
> > The code changes look fine but it is not clear what was the exact
> > issue.  Can you explain?
>
> Basically, in case of an empty subtransaction, we were reading the
> subxacts info but when we could not find the subxid in the subxacts
> info we were not releasing the memory.  So on next subxact_info_read
> it will expect that subxacts should be freed but we did not free it in
> that !found case.
>

Okay, on looking at it again, the same code exists in
subxact_info_write as well.  It is better to have a function for it.
Can we have a structure like SubXactContext for all the variables used
for subxact?  As mentioned earlier I find the allocation/deallocation
of subxacts a bit ad-hoc, so there will always be a chance that we can
forget to free it.  Having it allocated in memory context which we can
reset later might reduce that risk.  One idea could be that we have a
special memory context for start and stop messages which can be used
to allocate the subxacts there.  In case of commit/abort, we can allow
subxacts information to be allocated in ApplyMessageContext which is
reset at the end of each protocol message.
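
As a rough sketch of what such a structure could group together (field names are illustrative, based on the variables discussed in this thread; SubXactInfo is assumed to be the per-entry struct holding the xid and file offset):

typedef struct SubXactContext
{
    uint32          nsubxacts;      /* number of subxacts recorded so far */
    uint32          nsubxacts_max;  /* allocated length of the array */
    TransactionId   subxact_last;   /* last subxact seen, to skip repeats */
    SubXactInfo    *subxacts;       /* per-subxact xid and file offset */
} SubXactContext;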

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Mahendra Singh Thalor
Date:
On Tue, 26 May 2020 at 16:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, May 26, 2020 at 2:44 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 26, 2020 at 10:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > >
> > > > 2. There is a bug fix in handling the stream abort in 0008 (earlier it
> > > > was 0006).
> > > >
> > >
> > > The code changes look fine but it is not clear what was the exact
> > > issue.  Can you explain?
> >
> > Basically, in case of an empty subtransaction, we were reading the
> > subxacts info but when we could not find the subxid in the subxacts
> > info we were not releasing the memory.  So on next subxact_info_read
> > it will expect that subxacts should be freed but we did not free it in
> > that !found case.
> >
>
> Okay, on looking at it again, the same code exists in
> subxact_info_write as well.  It is better to have a function for it.
> Can we have a structure like SubXactContext for all the variables used
> for subxact?  As mentioned earlier I find the allocation/deallocation
> of subxacts a bit ad-hoc, so there will always be a chance that we can
> forget to free it.  Having it allocated in memory context which we can
> reset later might reduce that risk.  One idea could be that we have a
> special memory context for start and stop messages which can be used
> to allocate the subxacts there.  In case of commit/abort, we can allow
> subxacts information to be allocated in ApplyMessageContext which is
> reset at the end of each protocol message.
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>
>

Hi all,
On top of the v16 patch set [1], I did some testing for DDLs and DMLs to measure WAL size and performance.  Below is the testing summary:

Test parameters:
wal_level = 'logical'
max_connections = '150'
wal_receiver_timeout = '600s'
max_wal_size = '2GB'
min_wal_size = '2GB'
autovacuum = 'off'
checkpoint_timeout = '1d'

Test results:

For each workload and operation type, the lines below show "LSN diff (in bytes) / time (in sec)", without patch -> with patch, followed by the % LSN change caused by the patch.

1.  1 DDL
    CREATE index:      17728 / 0.89116   -> 18016 / 0.804868   (+1.624548 %)
    Add col int(date): 976 / 0.764393    -> 1088 / 0.763602    (+11.475409 %)
    Add col text:      33904 / 0.80044   -> 34856 / 0.787108   (+2.80792 %)

2.  2 DDL
    CREATE index:      19872 / 0.860348  -> 20416 / 0.839065   (+2.73752 %)
    Add col int(date): 1632 / 0.763199   -> 1856 / 0.733147    (+13.7254902 %)
    Add col text:      34560 / 0.806086  -> 35624 / 0.829281   (+3.078703 %)

3.  3 DDL
    CREATE index:      22016 / 0.894891  -> 22816 / 0.828028   (+3.63372093 %)
    Add col int(date): 2288 / 0.776871   -> 2624 / 0.737177    (+14.685314 %)
    Add col text:      35216 / 0.803493  -> 36392 / 0.800194   (+3.339391186 %)

4.  4 DDL
    CREATE index:      24160 / 0.901686  -> 25240 / 0.887143   (+4.4701986 %)
    Add col int(date): 2944 / 0.768445   -> 3392 / 0.768382    (+15.217391 %)
    Add col text:      35872 / 0.77489   -> 37160 / 0.82777    (+3.590544 %)

5.  5 DDL
    CREATE index:      26328 / 0.901686  -> 27640 / 0.914078   (+4.9832877 %)
    Add col int(date): 3600 / 0.751879   -> 4160 / 0.74709     (+15.555555 %)
    Add col text:      36528 / 0.817928  -> 37928 / 0.820621   (+3.832676 %)

6.  6 DDL
    CREATE index:      28472 / 0.936385  -> 30040 / 0.958226   (+5.5071649 %)
    Add col int(date): 4256 / 0.745179   -> 4928 / 0.725321    (+15.78947368 %)
    Add col text:      37184 / 0.797043  -> 38696 / 0.814535   (+4.066265 %)

7.  8 DDL
    CREATE index:      32760 / 1.0022203 -> 34864 / 0.966777   (+6.422466 %)
    Add col int(date): 5568 / 0.757468   -> 6464 / 0.769072    (+16.091954 %)
    Add col text:      38496 / 0.83207   -> 40232 / 0.903604   (+4.509559 %)

8.  11 DDL
    CREATE index:      50296 / 1.0022203 -> 53144 / 0.966777   (+5.662478 %)
    Add col int(date): 7536 / 0.748332   -> 8792 / 0.750553    (+16.666666 %)
    Add col text:      40464 / 0.822266  -> 42560 / 0.797133   (+5.179913 %)

9.  15 DDL
    CREATE index:      58896 / 1.267253  -> 62768 / 1.27234    (+5.662478 %)
    Add col int(date): 10184 / 0.776875  -> 11864 / 0.746844   (+16.496465 %)
    Add col text:      43112 / 0.821916  -> 45632 / 0.812567   (+5.84524 %)

10. 1 DDL & 3 DML
    CREATE index:      18240 / 0.812551  -> 18536 / 0.819089   (+1.6228 %)
    Add col int(date): 1192 / 0.771993   -> 1312 / 0.785117    (+10.067114 %)
    Add col text:      34120 / 0.849467  -> 35080 / 0.855456   (+2.8113599 %)

11. 3 DDL & 5 DML
    CREATE index:      23656 / 0.926616  -> 24480 / 0.915517   (+3.4832606 %)
    Add col int(date): 2656 / 0.758029   -> 3016 / 0.797206    (+13.55421687 %)
    Add col text:      35584 / 0.829377  -> 36784 / 0.839176   (+3.372302 %)

12. 10 DDL & 5 DML
    CREATE index:      52760 / 1.101005  -> 55376 / 1.105241   (+4.958301744 %)
    Add col int(date): 7288 / 0.763065   -> 8456 / 0.779257    (+16.02634468 %)
    Add col text:      40216 / 0.837843  -> 42224 / 0.835206   (+4.993037 %)

13. 10 DML
    CREATE index:      1008 / 0.791091   -> 1072 / 0.807875    (+6.349206 %)
    Add col int(date): 1008 / 0.81105    -> 1072 / 0.771113    (+6.349206 %)
    Add col text:      1008 / 0.78817    -> 1072 / 0.759789    (+6.349206 %)

To see all operations, please see the test_results spreadsheet [2].

Summary:
Basically, the patch writes a per-command invalidation message, and to test that I ran different combinations of DDL and DML operations.  I have not observed any performance degradation with the patch.  For "create index" DDLs, the WAL %change is 1-7% for 1-15 DDLs.  For "add col int/date" DDLs, it is 11-17% for 1-15 DDLs, and for "add col text" DDLs, it is 2-6% for 1-15 DDLs.  For the mixed (DDL & DML) cases, it is 2-10%.

As for why the "add col int/date" case shows a comparatively high percentage of extra WAL: the absolute amount of extra WAL is not very high, but the WAL generated by an "add column int/date" DDL is only ~1000 bytes, so an additional ~100 bytes comes to around 10%.  For "add column text" the base is ~35000 bytes, so the percentage is smaller; those ~35000 bytes are mostly due to TOAST.
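
For reference, the "% LSN change" column above appears to be just the relative
WAL growth with the patch; a tiny throwaway program (mine, not part of the
patch set) reproduces the first "Add col int(date)" row:

#include <stdio.h>

/* (with - without) / without, in percent */
static double
lsn_pct_change(double without_patch, double with_patch)
{
    return (with_patch - without_patch) / without_patch * 100.0;
}

int
main(void)
{
    /* row 1, "Add col int(date)": 976 -> 1088 bytes */
    printf("%.6f\n", lsn_pct_change(976, 1088));    /* prints 11.475410 */
    return 0;
}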

[1]: https://www.postgresql.org/message-id/CAFiTN-vnnrk580ucZVYnub_UQ-ayROew8fQ2Yn5aFYMeF0U03w%40mail.gmail.com
[2]: https://docs.google.com/spreadsheets/d/1g11MrSd_I39505OnGoLFVslz3ykbZ1nmfR_gUiE_O9k/edit?usp=sharing

--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 26, 2020 at 7:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, May 25, 2020 at 8:48 PM Erik Rijkers <er@xs4all.nl> wrote:
> >
>
> > Hi,
> >
> > I am not able to extract all files correctly from this tar.
> >
> > The first file v24-0001-* seems to have some 'binary' junk at the top.
> >
> > (The other 11 files seem normally readably)
>
> Okay, sending again.

While reviewing/testing I have found a couple of problems in 0005 and
0006 which I have fixed in the attached version.

In 0005:  Basically, in the latest version we start a stream (or begin the
txn) only if there are any changes, because we do this inside the while
loop; so we also need to send stream_stop/commit only if we have actually
started the stream.
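
In other words (a sketch with a hypothetical iterator, not the actual 0005
hunk), the stream is opened lazily on the first change, so the stop has to
be guarded by the same flag:

/* Sketch only: open the stream lazily, close it only if it was opened. */
static void
stream_pending_changes(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
    ReorderBufferChange *change;
    bool        started = false;

    while ((change = get_next_change(txn)) != NULL)   /* hypothetical iterator */
    {
        if (!started)
        {
            rb->stream_start(rb, txn, change->lsn);   /* callback added by this series */
            started = true;
        }
        /* ... hand the change to the streaming output callbacks ... */
    }

    if (started)
        rb->stream_stop(rb, txn, txn->final_lsn);
}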

In 0006: If we are streaming the serialized changes and there are still a
few incomplete changes, then currently we are not deleting the spilled
file; but the spill file contains all the changes of the transaction,
because there is no way to partially truncate it.  So in the next stream
it would try to resend those.  I have fixed this by sending the spilled
transaction as soon as its changes are complete, so ideally we can always
delete the spilled file.  It is also a better solution because this
transaction was already spilled once, and that happened because we could
not stream it, so we had better stream it at the first opportunity; that
will reduce the replay lag, which is our whole purpose here.
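
The gist of that fix, as quoted later in this review, is a small check once
the pending change becomes complete:

/*
 * If the transaction was already serialized and its changes are now
 * complete, stream it immediately instead of waiting to hit the memory
 * limit again.
 */
if (rbtxn_is_serialized(txn))
    ReorderBufferStreamTXN(rb, toptxn);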

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 26, 2020 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 22, 2020 at 6:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, May 18, 2020 at 4:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > >
> > > > > Review comments:
> > > > > ------------------------------
> > > > > 1.
> > > > > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
> > > > > TransactionId xid,
> > > > >   }
> > > > >
> > > > >   case REORDER_BUFFER_CHANGE_MESSAGE:
> > > > > - rb->message(rb, txn, change->lsn, true,
> > > > > - change->data.msg.prefix,
> > > > > - change->data.msg.message_size,
> > > > > - change->data.msg.message);
> > > > > + if (streaming)
> > > > > + rb->stream_message(rb, txn, change->lsn, true,
> > > > > +    change->data.msg.prefix,
> > > > > +    change->data.msg.message_size,
> > > > > +    change->data.msg.message);
> > > > > + else
> > > > > + rb->message(rb, txn, change->lsn, true,
> > > > > +    change->data.msg.prefix,
> > > > > +    change->data.msg.message_size,
> > > > > +    change->data.msg.message);
> > > > >
> > > > > Don't we need to set any_data_sent flag while streaming messages as we
> > > > > do for other types of changes?
> > > >
> > > > I think any_data_sent, was added to avoid sending abort to the
> > > > subscriber if we haven't sent any data,  but this is not complete as
> > > > the output plugin can also take the decision not to send.  So I think
> > > > this should not be done as part of this patch and can be done
> > > > separately.  I think there is already a thread for handling the
> > > > same[1]
> > > >
> > >
> > > Hmm, but prior to this patch, we never use to send (empty) aborts but
> > > now that will be possible. It is probably okay to deal that with
> > > another patch mentioned by you but I felt at least any_data_sent will
> > > work for some cases.  OTOH, it appears to be half-baked solution, so
> > > we should probably refrain from adding it.  BTW, how do the pgoutput
> > > plugin deal with it? I see that apply_handle_stream_abort will
> > > unconditionally try to unlink the file and it will probably fail.
> > > Have you tested this scenario after your latest changes?
> >
> > Yeah, I see, I think this is a problem,  but this exists without my
> > latest change as well, if pgoutput ignore some changes because it is
> > not published then we will see a similar error.  Shall we handle the
> > ENOENT error case from unlink?
> Isn't this problem only for subxact file as we anyway create changes
> file as part of start stream message which should have come after
> abort?  If so, can't we detect whether subxact file exists probably by
> using nsubxacts or something like that?  Can you please once try to
> reproduce this scenario to ensure that we are not missing anything?

I have tested this, as of now, by default we create both changes and
subxact files irrespective of whether we get any subtransactions or
not.  Maybe this could be optimized that only if we have any subxact
then only create that file otherwise not?  What's your opinion on the
same.

> > > > > 8.
> > > > > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> > > > > *rb, TransactionId xid,
> > > > >   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> > > > >
> > > > >   txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > > > +
> > > > > + /*
> > > > > + * TOCHECK: Mark toplevel transaction as having catalog changes too
> > > > > + * if one of its children has.
> > > > > + */
> > > > > + if (txn->toptxn != NULL)
> > > > > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > > >  }
> > > > >
> > > > > Why are we marking top transaction here?
> > > >
> > > > We need to mark top transaction to decide whether to build tuplecid
> > > > hash or not.  In non-streaming mode, we are only sending during the
> > > > commit time, and during commit time we know whether the top
> > > > transaction has any catalog changes or not based on the invalidation
> > > > message so we are marking the top transaction there in DecodeCommit.
> > > > Since here we are not waiting till commit so we need to mark the top
> > > > transaction as soon as we mark any of its child transactions.
> > > >
> > >
> > > But how does it help?  We use this flag (via
> > > ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn which is
> > > anyway done in DecodeCommit and that too after setting this flag for
> > > the top transaction if required.  So, how will it help in setting it
> > > while processing for subxid.  Also, even if we have to do it won't it
> > > add the xid needlessly in builder->committed.xip array?
> >
> > In ReorderBufferBuildTupleCidHash, we use this flag to decide whether
> > to build the tuplecid hash or not based on whether it has catalog
> > changes or not.
> >
>
> Okay, but you haven't answered the second part of the question: "won't
> it add the xid of top transaction needlessly in builder->committed.xip
> array, see function SnapBuildCommitTxn?"  IIUC, this can happen
> without patch as well because DecodeCommit also sets the flags just
> based on invalidation messages irrespective of whether the messages
> are generated by top transaction or not, is that right?

Yes, with or without the patch it always adds the topxid.  I think
purpose for doing this with/without patch is not for the snapshot
instead we are marking the top itself that some of its subtxn has the
catalog changes, so that while building the tuplecid hash we can know
whether to build the hash or not.  But, having said that I feel in
ReorderBufferBuildTupleCidHash why do we need these two checks
if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
return;

I mean it should be enough to just have the check,  because if we have
added something to the tuplecids then catalog changes must be there
because that time we are setting the catalog changes to true.

if (dlist_is_empty(&txn->tuplecids))
return;

I think in the base code there are multiple things going on
1. If we get new CID we always set the catalog change in that
transaction but add the tuplecids in the top transaction.  So
basically, top transaction is so far not marked with catalog changes
but it has tuplecids.
2. Now, in DecodeCommit the top xid will be marked that it has catalog
changes based on the invalidation messages.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 26, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 25, 2020 at 8:07 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > 4.
> > > + * XXX Do we need to allocate it in TopMemoryContext?
> > > + */
> > > +static void
> > > +subxact_info_add(TransactionId xid)
> > > {
> > > ..
> > >
> > > For this and other places in a patch like in function
> > > stream_open_file(), instead of using TopMemoryContext, can we consider
> > > using a new memory context LogicalStreamingContext or something like
> > > that. We can create LogicalStreamingContext under TopMemoryContext.  I
> > > don't see any need of using TopMemoryContext here.
> >
> > But, when we will delete/reset the LogicalStreamingContext?
> >
>
> Why can't we reset it at each stream stop message?
> >  because
> > we are planning to keep this memory until the worker is alive so that
> > supposed to be the top memory context.
> >
>
> Which part of allocation do we want to keep till the worker is alive?

static TransactionId *xids = NULL; this we need to keep for the worker's lifespan.

> Why we need memory-related to subxacts till the worker is alive?  As
> we have now, after reading subxact info (subxact_info_read), we need
> to ensure that it is freed after its usage due to which we need to
> remember and perform pfree at various places.
>
> I think we should once see the possibility that such that we could
> switch to this new context in start stream message and reset it in
> stop stream message.  That might help in avoiding
> MemoryContextSwitchTo TopMemoryContext at various places.

Ok, I understand, I think subxacts can be allocated in new
LogicalStreamingContext which we can reset at the stream stop.  How
about xids?
shall we create another context that will stay until the worker lifespan?
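
For the subxacts part, a minimal sketch of the suggestion (assuming a context
created once under TopMemoryContext and reset on every stream stop; not the
actual patch code, and it needs utils/memutils.h):

static MemoryContext LogicalStreamingContext = NULL;

static void
stream_start_setup(void)
{
    if (LogicalStreamingContext == NULL)
        LogicalStreamingContext =
            AllocSetContextCreate(TopMemoryContext,
                                  "LogicalStreamingContext",
                                  ALLOCSET_DEFAULT_SIZES);
    /* per-stream allocations (e.g. subxacts) go into this context */
}

static void
stream_stop_cleanup(void)
{
    /* discard everything allocated while applying this stream */
    MemoryContextReset(LogicalStreamingContext);
}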

> >  If we create any other context
> > with the same life span as TopMemoryContext then what is the point?
> >
>
> It is helpful for debugging.  It is recommended that we don't use the
> top memory context unless it is really required.  Read about it in
> src/backend/utils/mmgr/README.

I see.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, May 28, 2020 at 12:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, May 26, 2020 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Isn't this problem only for subxact file as we anyway create changes
> > file as part of start stream message which should have come after
> > abort?  If so, can't we detect whether subxact file exists probably by
> > using nsubxacts or something like that?  Can you please once try to
> > reproduce this scenario to ensure that we are not missing anything?
>
> I have tested this, as of now, by default we create both changes and
> subxact files irrespective of whether we get any subtransactions or
> not.  Maybe this could be optimized that only if we have any subxact
> then only create that file otherwise not?  What's your opinion on the
> same.
>

Yeah, that makes sense.

> > > > > > 8.
> > > > > > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> > > > > > *rb, TransactionId xid,
> > > > > >   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> > > > > >
> > > > > >   txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > > > > +
> > > > > > + /*
> > > > > > + * TOCHECK: Mark toplevel transaction as having catalog changes too
> > > > > > + * if one of its children has.
> > > > > > + */
> > > > > > + if (txn->toptxn != NULL)
> > > > > > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > > > >  }
> > > > > >
> > > > > > Why are we marking top transaction here?
> > > > >
> > > > > We need to mark top transaction to decide whether to build tuplecid
> > > > > hash or not.  In non-streaming mode, we are only sending during the
> > > > > commit time, and during commit time we know whether the top
> > > > > transaction has any catalog changes or not based on the invalidation
> > > > > message so we are marking the top transaction there in DecodeCommit.
> > > > > Since here we are not waiting till commit so we need to mark the top
> > > > > transaction as soon as we mark any of its child transactions.
> > > > >
> > > >
> > > > But how does it help?  We use this flag (via
> > > > ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn which is
> > > > anyway done in DecodeCommit and that too after setting this flag for
> > > > the top transaction if required.  So, how will it help in setting it
> > > > while processing for subxid.  Also, even if we have to do it won't it
> > > > add the xid needlessly in builder->committed.xip array?
> > >
> > > In ReorderBufferBuildTupleCidHash, we use this flag to decide whether
> > > to build the tuplecid hash or not based on whether it has catalog
> > > changes or not.
> > >
> >
> > Okay, but you haven't answered the second part of the question: "won't
> > it add the xid of top transaction needlessly in builder->committed.xip
> > array, see function SnapBuildCommitTxn?"  IIUC, this can happen
> > without patch as well because DecodeCommit also sets the flags just
> > based on invalidation messages irrespective of whether the messages
> > are generated by top transaction or not, is that right?
>
> Yes, with or without the patch it always adds the topxid.  I think
> purpose for doing this with/without patch is not for the snapshot
> instead we are marking the top itself that some of its subtxn has the
> catalog changes so that while building the tuplecid has we can know
> whether to build the hash or not.  But, having said that I feel in
> ReorderBufferBuildTupleCidHash why do we need these two checks
> if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> return;
>
> I mean it should be enough to just have the check,  because if we have
> added something to the tuplecids then catalog changes must be there
> because that time we are setting the catalog changes to true.
>
> if (dlist_is_empty(&txn->tuplecids))
> return;
>
> I think in the base code there are multiple things going on
> 1. If we get new CID we always set the catalog change in that
> transaction but add the tuplecids in the top transaction.  So
> basically, top transaction is so far not marked with catalog changes
> but it has tuplecids.
> 2. Now, in DecodeCommit the top xid will be marked that it has catalog
> changes based on the invalidation messages.
>

I don't think it is advisable to remove that check from base code
unless we have a strong reason for doing so.  I think here you can
write better comments about why you are marking the flag for top
transaction and remove TOCHECK from the comment.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, May 28, 2020 at 12:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, May 26, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Why we need memory-related to subxacts till the worker is alive?  As
> > we have now, after reading subxact info (subxact_info_read), we need
> > to ensure that it is freed after its usage due to which we need to
> > remember and perform pfree at various places.
> >
> > I think we should once see the possibility that such that we could
> > switch to this new context in start stream message and reset it in
> > stop stream message.  That might help in avoiding
> > MemoryContextSwitchTo TopMemoryContext at various places.
>
> Ok, I understand, I think subxacts can be allocated in new
> LogicalStreamingContext which we can reset at the stream stop.  How
> about xids?
>

How about storing xids in ApplyContext?  We do store similar lifespan
things in that context, for ex. see store_flush_position.
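
For illustration, switching that allocation into ApplyContext might look like
this (a sketch; the helper name and max_xids are placeholders):

static TransactionId *xids = NULL;    /* lives for the worker's lifetime */

static void
ensure_xids_array(int max_xids)
{
    if (xids == NULL)
    {
        MemoryContext oldctx = MemoryContextSwitchTo(ApplyContext);

        xids = palloc0(max_xids * sizeof(TransactionId));
        MemoryContextSwitchTo(oldctx);
    }
}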

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, May 28, 2020 at 3:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, May 28, 2020 at 12:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 26, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Why we need memory-related to subxacts till the worker is alive?  As
> > > we have now, after reading subxact info (subxact_info_read), we need
> > > to ensure that it is freed after its usage due to which we need to
> > > remember and perform pfree at various places.
> > >
> > > I think we should once see the possibility that such that we could
> > > switch to this new context in start stream message and reset it in
> > > stop stream message.  That might help in avoiding
> > > MemoryContextSwitchTo TopMemoryContext at various places.
> >
> > Ok, I understand, I think subxacts can be allocated in new
> > LogicalStreamingContext which we can reset at the stream stop.  How
> > about xids?
> >
>
> How about storing xids in ApplyContext?  We do store similar lifespan
> things in that context, for ex. see store_flush_position.

That sounds good to me,   I will make this change in the next patch
set, along with other changes.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, May 28, 2020 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, May 28, 2020 at 12:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 26, 2020 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Isn't this problem only for subxact file as we anyway create changes
> > > file as part of start stream message which should have come after
> > > abort?  If so, can't we detect whether subxact file exists probably by
> > > using nsubxacts or something like that?  Can you please once try to
> > > reproduce this scenario to ensure that we are not missing anything?
> >
> > I have tested this, as of now, by default we create both changes and
> > subxact files irrespective of whether we get any subtransactions or
> > not.  Maybe this could be optimized that only if we have any subxact
> > then only create that file otherwise not?  What's your opinion on the
> > same.
> >
>
> Yeah, that makes sense.
>
> > > > > > > 8.
> > > > > > > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> > > > > > > *rb, TransactionId xid,
> > > > > > >   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> > > > > > >
> > > > > > >   txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > > > > > +
> > > > > > > + /*
> > > > > > > + * TOCHECK: Mark toplevel transaction as having catalog changes too
> > > > > > > + * if one of its children has.
> > > > > > > + */
> > > > > > > + if (txn->toptxn != NULL)
> > > > > > > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > > > > >  }
> > > > > > >
> > > > > > > Why are we marking top transaction here?
> > > > > >
> > > > > > We need to mark top transaction to decide whether to build tuplecid
> > > > > > hash or not.  In non-streaming mode, we are only sending during the
> > > > > > commit time, and during commit time we know whether the top
> > > > > > transaction has any catalog changes or not based on the invalidation
> > > > > > message so we are marking the top transaction there in DecodeCommit.
> > > > > > Since here we are not waiting till commit so we need to mark the top
> > > > > > transaction as soon as we mark any of its child transactions.
> > > > > >
> > > > >
> > > > > But how does it help?  We use this flag (via
> > > > > ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn which is
> > > > > anyway done in DecodeCommit and that too after setting this flag for
> > > > > the top transaction if required.  So, how will it help in setting it
> > > > > while processing for subxid.  Also, even if we have to do it won't it
> > > > > add the xid needlessly in builder->committed.xip array?
> > > >
> > > > In ReorderBufferBuildTupleCidHash, we use this flag to decide whether
> > > > to build the tuplecid hash or not based on whether it has catalog
> > > > changes or not.
> > > >
> > >
> > > Okay, but you haven't answered the second part of the question: "won't
> > > it add the xid of top transaction needlessly in builder->committed.xip
> > > array, see function SnapBuildCommitTxn?"  IIUC, this can happen
> > > without patch as well because DecodeCommit also sets the flags just
> > > based on invalidation messages irrespective of whether the messages
> > > are generated by top transaction or not, is that right?
> >
> > Yes, with or without the patch it always adds the topxid.  I think
> > purpose for doing this with/without patch is not for the snapshot
> > instead we are marking the top itself that some of its subtxn has the
> > catalog changes so that while building the tuplecid has we can know
> > whether to build the hash or not.  But, having said that I feel in
> > ReorderBufferBuildTupleCidHash why do we need these two checks
> > if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> > return;
> >
> > I mean it should be enough to just have the check,  because if we have
> > added something to the tuplecids then catalog changes must be there
> > because that time we are setting the catalog changes to true.
> >
> > if (dlist_is_empty(&txn->tuplecids))
> > return;
> >
> > I think in the base code there are multiple things going on
> > 1. If we get new CID we always set the catalog change in that
> > transaction but add the tuplecids in the top transaction.  So
> > basically, top transaction is so far not marked with catalog changes
> > but it has tuplecids.
> > 2. Now, in DecodeCommit the top xid will be marked that it has catalog
> > changes based on the invalidation messages.
> >
>
> I don't think it is advisable to remove that check from base code
> unless we have a strong reason for doing so.  I think here you can
> write better comments about why you are marking the flag for top
> transaction and remove TOCHECK from the comment.

Ok, I will do that.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, May 27, 2020 at 8:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, May 26, 2020 at 7:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> >
> > Okay, sending again.
>
> While reviewing/testing I have found a couple of problems in 0005 and
> 0006 which I have fixed in the attached version.
>

I haven't reviewed the new fixes yet but I have some comments on
0008-Add-support-for-streaming-to-built-in-replicatio.patch.
1.
I think the temporary files (and or handles) used for storing the
information of changes and subxacts are getting leaked in the patch.
At some places, it is taken care to close the file but cases like
apply_handle_stream_commit where if any error occurred in
apply_dispatch(), the file might not get closed.  The other place is
in apply_handle_stream_abort() where if there is an error in ftruncate
the file won't be closed.   Now, the bigger problem is with changes
related file which is opened in apply_handle_stream_start and closed
in apply_handle_stream_stop and if there is any error in-between, we
won't close it.

OTOH, I think the worker will exit on an error so it might not matter,
but then why are we closing it before the error at a few other places?
I think on error these temporary files should be removed instead of
relying on them to get removed the next time we receive changes for the
same transaction, which I feel is what we do in other cases where we
use temporary files, like for sorts or hash joins.

Also, what if the changes file size overflows "OS file size limit"?
If we agree that the above are problems then do you think we should
explore using BufFile interface (see storage/file/buffile.c) to avoid
all such problems?
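
For context, the BufFile interface looks roughly like this (a toy sketch; the
exact signatures may differ a bit between versions).  Because BufFile
transparently splits data across 1GB segment files and removes its temporary
files on close, it would sidestep both the size-limit and the leak concerns:

#include "postgres.h"
#include "storage/buffile.h"

static void
buffile_roundtrip(void)
{
    BufFile    *f = BufFileCreateTemp(false);   /* temp file, removed on close */
    int         v = 42;
    int         r = 0;

    BufFileWrite(f, &v, sizeof(v));
    BufFileSeek(f, 0, 0L, SEEK_SET);            /* rewind: fileno 0, offset 0 */
    BufFileRead(f, &r, sizeof(r));
    BufFileClose(f);
}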

2.
apply_handle_stream_abort()
{
..
+ /* discard the subxacts added later */
+ nsubxacts = subidx;
+
+ /* write the updated subxact list */
+ subxact_info_write(MyLogicalRepWorker->subid, xid);
..
}

Here, if subxacts becomes zero, then also subxact_info_write will
create a new file and write checksum.  I think subxact_info_write
should have a check for nsubxacts > 0 before writing to the file.

3.
apply_handle_stream_commit(StringInfo s)
{
..
+ /*
+ * send feedback to upstream
+ *
+ * XXX Probably should send a valid LSN. But which one?
+ */
+ send_feedback(InvalidXLogRecPtr, false, false);
..
}

Why do we need to send the feedback at this stage after applying each
message?  If we see a non-streamed case, we never send_feedback after
each message. So, following that, I don't see the need to send it here
but if you see any specific reason then do let me know?  And if we
have to send feedback, then we need to decide the appropriate values
as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, May 27, 2020 at 8:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>
> While reviewing/testing I have found a couple of problems in 0005 and
> 0006 which I have fixed in the attached version.
>
..
>
> In 0006: If we are streaming the serialized changed and there are
> still few incomplete changes, then currently we are not deleting the
> spilled file, but the spill file contains all the changes of the
> transaction because there is no way to partially truncate it.  So in
> the next stream, it will try to resend those.  I have fixed this by
> sending the spilled transaction as soon as its changes are complete so
> ideally, we can always delete the spilled file.  It is also a better
> solution because this transaction is already spilled once and that
> happened because we could not stream it,  so we better stream it on
> the first opportunity that will reduce the replay lag which is our
> whole purpose here.
>

I have reviewed these changes (in the patch
v25-0006-Bugfix-handling-of-incomplete-toast-spec-insert-) and below
are my comments.

1.
+ /*
+ * If the transaction is serialized and the the changes are complete in
+ * the top level transaction then immediately stream the transaction.
+ * The reason for not waiting for memory limit to get full is that in
+ * the streaming mode, if the transaction serialized that means we have
+ * already reached the memory limit but that time we could not stream
+ * this due to incomplete tuple so now stream it as soon as the tuple
+ * is complete.
+ */
+ if (rbtxn_is_serialized(txn))
+ ReorderBufferStreamTXN(rb, toptxn);

I think here it is important to explain why it is a must to stream a
prior serialized transaction as otherwise, later we won't be able to
know how to truncate a file.

2.
+ * If complete_truncate is set we completely truncate the transaction,
+ * otherwise we truncate upto last_complete_lsn if the transaction has
+ * incomplete changes.  Basically, complete_truncate is passed true only if
+ * concurrent abort is detected while processing the TXN.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+ bool partial_truncate)
 {

The description talks about complete_truncate flag whereas API is
using partial_truncate flag.  I think the description needs to be
changed.

3.
+ /* We have truncated upto last complete lsn so stop. */
+ if (partial_truncate && rbtxn_has_incomplete_tuple(toptxn) &&
+ (change->lsn > toptxn->last_complete_lsn))
+ {
+ /*
+ * If this is a top transaction then we can reset the
+ * last_complete_lsn and complete_size, because by now we would
+ * have stream all the changes upto last_complete_lsn.
+ */
+ if (txn->toptxn == NULL)
+ {
+ toptxn->last_complete_lsn = InvalidXLogRecPtr;
+ toptxn->complete_size = 0;
+ }
+ break;
+ }

I think here we can add an Assert to ensure that we don't partially
truncate when the transaction is serialized and add comments for the
same.
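
The suggested Assert could be as simple as (sketch):

/* A serialized transaction should never be truncated partially. */
Assert(!(partial_truncate && rbtxn_is_serialized(txn)));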

4.
+ /*
+ * Subtract the processed changes from the nentries/nentries_mem Refer
+ * detailed comment atop this variable in ReorderBufferTXN structure.
+ * We do this only ff we are truncating the partial changes otherwise
+ * reset these values directly to 0.
+ */
+ if (partial_truncate)
+ {
+ txn->nentries -= txn->nprocessed;
+ txn->nentries_mem -= txn->nprocessed;
+ }
+ else
+ {
+ txn->nentries = 0;
+ txn->nentries_mem = 0;
+ }

I think we can write this comment as "Adjust nentries/nentries_mem
based on the changes processed.  See comments where nprocessed is
declared."

5.
+ /*
+ * In streaming mode, sometime we can't stream all the changes due to the
+ * incomplete changes.  So we can not directly reset the values of
+ * nentries/nentries_mem to 0 after one stream is sent like we do in
+ * non-streaming mode.  So while sending one stream we keep count of the
+ * changes processed in thi stream and only those many changes we decrement
+ * from the nentries/nentries_mem.
+ */
+ uint64 nprocessed;

How about something like: "Number of changes processed.  This is used
to keep track of changes that remained to be streamed.  As of now,
this can happen either due to toast tuples or speculative insertions
where we need to wait for multiple changes before we can send them."

6.
+ /* Size of the commplete changes. */
+ Size complete_size;

Typo. /commplete/complete

7.
+ /*
+ * Increment the nprocessed count.  See the detailed comment
+ * for usage of this in ReorderBufferTXN structure.
+ */
+ change->txn->nprocessed++;

Ideally, this has to be incremented after processing the change.  So,
we can combine it with existing check in the patch as below:

if (streaming)
{
   change->txn->nprocessed++;

  if (rbtxn_has_incomplete_tuple(txn) &&
prev_lsn == txn->last_complete_lsn)
{
/* Only in streaming mode we should get here. */
Assert(streaming);
partial_truncate = true;
break;
}
}

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, May 27, 2020 at 5:19 PM Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
> On Tue, 26 May 2020 at 16:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Hi all,
> On top of the v16 patch set [1], I did some testing for DDLs and DMLs to measure WAL size and performance.
>
> [Test parameters and results table trimmed; see the previous message.]
>
> To see all operations, please see the test_results spreadsheet [2].
>

Why are you seeing any additional WAL in case-13 (10 DML) where there is no DDL?  I think it is because you have used savepoints in that case, which will add some additional WAL.  You seem to have 9 savepoints in that test, which should ideally generate 36 bytes of additional WAL (4 bytes per transaction id for each subtransaction).  Also, in the other cases where you took data for DDL and DML, you have used savepoints as well.  I suggest that for savepoints we do separate tests as you have done in case-13, but with 3, 5, 7, and 10 savepoints, and probably each transaction can update a row of 200 bytes or so.

I think you can take data for somewhat more realistic cases of DDL and DML combination like 3 DDL's with 10 DML and 3 DDL's with 15 DML operations.  In general, I think we will see many more DML's per DDL.  It is good to see the worst-case WAL and performance overhead as you have done.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Fri, May 29, 2020 at 2:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, May 27, 2020 at 8:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> >
> > While reviewing/testing I have found a couple of problems in 0005 and
> > 0006 which I have fixed in the attached version.
> >
> ..
> >
> > In 0006: If we are streaming the serialized changed and there are
> > still few incomplete changes, then currently we are not deleting the
> > spilled file, but the spill file contains all the changes of the
> > transaction because there is no way to partially truncate it.  So in
> > the next stream, it will try to resend those.  I have fixed this by
> > sending the spilled transaction as soon as its changes are complete so
> > ideally, we can always delete the spilled file.  It is also a better
> > solution because this transaction is already spilled once and that
> > happened because we could not stream it,  so we better stream it on
> > the first opportunity that will reduce the replay lag which is our
> > whole purpose here.
> >
>
> I have reviewed these changes (in the patch
> v25-0006-Bugfix-handling-of-incomplete-toast-spec-insert-) and below
> are my comments.
>
> 1.
> + /*
> + * If the transaction is serialized and the the changes are complete in
> + * the top level transaction then immediately stream the transaction.
> + * The reason for not waiting for memory limit to get full is that in
> + * the streaming mode, if the transaction serialized that means we have
> + * already reached the memory limit but that time we could not stream
> + * this due to incomplete tuple so now stream it as soon as the tuple
> + * is complete.
> + */
> + if (rbtxn_is_serialized(txn))
> + ReorderBufferStreamTXN(rb, toptxn);
>
> I think here it is important to explain why it is a must to stream a
> prior serialized transaction as otherwise, later we won't be able to
> know how to truncate a file.

Done

> 2.
> + * If complete_truncate is set we completely truncate the transaction,
> + * otherwise we truncate upto last_complete_lsn if the transaction has
> + * incomplete changes.  Basically, complete_truncate is passed true only if
> + * concurrent abort is detected while processing the TXN.
>   */
>  static void
> -ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> +ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
> + bool partial_truncate)
>  {
>
> The description talks about complete_truncate flag whereas API is
> using partial_truncate flag.  I think the description needs to be
> changed.

Fixed

> 3.
> + /* We have truncated upto last complete lsn so stop. */
> + if (partial_truncate && rbtxn_has_incomplete_tuple(toptxn) &&
> + (change->lsn > toptxn->last_complete_lsn))
> + {
> + /*
> + * If this is a top transaction then we can reset the
> + * last_complete_lsn and complete_size, because by now we would
> + * have stream all the changes upto last_complete_lsn.
> + */
> + if (txn->toptxn == NULL)
> + {
> + toptxn->last_complete_lsn = InvalidXLogRecPtr;
> + toptxn->complete_size = 0;
> + }
> + break;
> + }
>
> I think here we can add an Assert to ensure that we don't partially
> truncate when the transaction is serialized and add comments for the
> same.

Done

> 4.
> + /*
> + * Subtract the processed changes from the nentries/nentries_mem Refer
> + * detailed comment atop this variable in ReorderBufferTXN structure.
> + * We do this only ff we are truncating the partial changes otherwise
> + * reset these values directly to 0.
> + */
> + if (partial_truncate)
> + {
> + txn->nentries -= txn->nprocessed;
> + txn->nentries_mem -= txn->nprocessed;
> + }
> + else
> + {
> + txn->nentries = 0;
> + txn->nentries_mem = 0;
> + }
>
> I think we can write this comment as "Adjust nentries/nentries_mem
> based on the changes processed.  See comments where nprocessed is
> declared."
>
> 5.
> + /*
> + * In streaming mode, sometime we can't stream all the changes due to the
> + * incomplete changes.  So we can not directly reset the values of
> + * nentries/nentries_mem to 0 after one stream is sent like we do in
> + * non-streaming mode.  So while sending one stream we keep count of the
> + * changes processed in thi stream and only those many changes we decrement
> + * from the nentries/nentries_mem.
> + */
> + uint64 nprocessed;
>
> How about something like: "Number of changes processed.  This is used
> to keep track of changes that remained to be streamed.  As of now,
> this can happen either due to toast tuples or speculative insertions
> where we need to wait for multiple changes before we can send them."

Done

> 6.
> + /* Size of the commplete changes. */
> + Size complete_size;
>
> Typo. /commplete/complete
>
> 7.
> + /*
> + * Increment the nprocessed count.  See the detailed comment
> + * for usage of this in ReorderBufferTXN structure.
> + */
> + change->txn->nprocessed++;
>
> Ideally, this has to be incremented after processing the change.  So,
> we can combine it with existing check in the patch as below:
>
> if (streaming)
> {
>    change->txn->nprocessed++;
>
>   if (rbtxn_has_incomplete_tuple(txn) &&
> prev_lsn == txn->last_complete_lsn)
> {
> /* Only in streaming mode we should get here. */
> Assert(streaming);
> partial_truncate = true;
> break;
> }
> }

Done

Apart from this, there was one more issue in this patch
+ if (partial_truncate && rbtxn_has_incomplete_tuple(toptxn) &&
+ (change->lsn > toptxn->last_complete_lsn))
+ {
+ /*
+ * If this is a top transaction then we can reset the
+ * last_complete_lsn and complete_size, because by now we would
+ * have stream all the changes upto last_complete_lsn.
+ */
+ if (txn->toptxn == NULL)
+ {
+ toptxn->last_complete_lsn = InvalidXLogRecPtr;
+ toptxn->complete_size = 0;
+ }
+ break;

We should reset toptxn->last_complete_lsn and toptxn->complete_size
outside this (change->lsn > toptxn->last_complete_lsn) check, because
we might be in a subxact when we meet this condition; in that case we
never reach here for the toptxn and it would never get reset.  I have
fixed this.

Apart from this, there is one more fix in 0005: basically, CheckLiveXid
was never reset, so I have fixed that as well.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, May 28, 2020 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, May 28, 2020 at 12:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 26, 2020 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Isn't this problem only for subxact file as we anyway create changes
> > > file as part of start stream message which should have come after
> > > abort?  If so, can't we detect whether subxact file exists probably by
> > > using nsubxacts or something like that?  Can you please once try to
> > > reproduce this scenario to ensure that we are not missing anything?
> >
> > I have tested this, as of now, by default we create both changes and
> > subxact files irrespective of whether we get any subtransactions or
> > not.  Maybe this could be optimized that only if we have any subxact
> > then only create that file otherwise not?  What's your opinion on the
> > same.
> >
>
> Yeah, that makes sense.
>
> > > > > > > 8.
> > > > > > > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> > > > > > > *rb, TransactionId xid,
> > > > > > >   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> > > > > > >
> > > > > > >   txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > > > > > +
> > > > > > > + /*
> > > > > > > + * TOCHECK: Mark toplevel transaction as having catalog changes too
> > > > > > > + * if one of its children has.
> > > > > > > + */
> > > > > > > + if (txn->toptxn != NULL)
> > > > > > > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > > > > >  }
> > > > > > >
> > > > > > > Why are we marking top transaction here?
> > > > > >
> > > > > > We need to mark top transaction to decide whether to build tuplecid
> > > > > > hash or not.  In non-streaming mode, we are only sending during the
> > > > > > commit time, and during commit time we know whether the top
> > > > > > transaction has any catalog changes or not based on the invalidation
> > > > > > message so we are marking the top transaction there in DecodeCommit.
> > > > > > Since here we are not waiting till commit so we need to mark the top
> > > > > > transaction as soon as we mark any of its child transactions.
> > > > > >
> > > > >
> > > > > But how does it help?  We use this flag (via
> > > > > ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn which is
> > > > > anyway done in DecodeCommit and that too after setting this flag for
> > > > > the top transaction if required.  So, how will it help in setting it
> > > > > while processing for subxid.  Also, even if we have to do it won't it
> > > > > add the xid needlessly in builder->committed.xip array?
> > > >
> > > > In ReorderBufferBuildTupleCidHash, we use this flag to decide whether
> > > > to build the tuplecid hash or not based on whether it has catalog
> > > > changes or not.
> > > >
> > >
> > > Okay, but you haven't answered the second part of the question: "won't
> > > it add the xid of top transaction needlessly in builder->committed.xip
> > > array, see function SnapBuildCommitTxn?"  IIUC, this can happen
> > > without patch as well because DecodeCommit also sets the flags just
> > > based on invalidation messages irrespective of whether the messages
> > > are generated by top transaction or not, is that right?
> >
> > Yes, with or without the patch it always adds the topxid.  I think
> > purpose for doing this with/without patch is not for the snapshot
> > instead we are marking the top itself that some of its subtxn has the
> > catalog changes so that while building the tuplecid has we can know
> > whether to build the hash or not.  But, having said that I feel in
> > ReorderBufferBuildTupleCidHash why do we need these two checks
> > if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> > return;
> >
> > I mean it should be enough to just have the check,  because if we have
> > added something to the tuplecids then catalog changes must be there
> > because that time we are setting the catalog changes to true.
> >
> > if (dlist_is_empty(&txn->tuplecids))
> > return;
> >
> > I think in the base code there are multiple things going on
> > 1. If we get new CID we always set the catalog change in that
> > transaction but add the tuplecids in the top transaction.  So
> > basically, top transaction is so far not marked with catalog changes
> > but it has tuplecids.
> > 2. Now, in DecodeCommit the top xid will be marked that it has catalog
> > changes based on the invalidation messages.
> >
>
> I don't think it is advisable to remove that check from base code
> unless we have a strong reason for doing so.  I think here you can
> write better comments about why you are marking the flag for top
> transaction and remove TOCHECK from the comment.

Done.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 26, 2020 at 4:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, May 26, 2020 at 2:44 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 26, 2020 at 10:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > >
> > > > 2. There is a bug fix in handling the stream abort in 0008 (earlier it
> > > > was 0006).
> > > >
> > >
> > > The code changes look fine but it is not clear what was the exact
> > > issue.  Can you explain?
> >
> > Basically, in case of an empty subtransaction, we were reading the
> > subxacts info but when we could not find the subxid in the subxacts
> > info we were not releasing the memory.  So on next subxact_info_read
> > it will expect that subxacts should be freed but we did not free it in
> > that !found case.
> >
>
> Okay, on looking at it again, the same code exists in
> subxact_info_write as well.  It is better to have a function for it.
> Can we have a structure like SubXactContext for all the variables used
> for subxact?  As mentioned earlier I find the allocation/deallocation
> of subxacts a bit ad-hoc, so there will always be a chance that we can
> forget to free it.  Having it allocated in memory context which we can
> reset later might reduce that risk.  One idea could be that we have a
> special memory context for start and stop messages which can be used
> to allocate the subxacts there.  In case of commit/abort, we can allow
> subxacts information to be allocated in ApplyMessageContext which is
> reset at the end of each protocol message.

Changed as per this.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 26, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 25, 2020 at 8:07 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > 4.
> > > + * XXX Do we need to allocate it in TopMemoryContext?
> > > + */
> > > +static void
> > > +subxact_info_add(TransactionId xid)
> > > {
> > > ..
> > >
> > > For this and other places in a patch like in function
> > > stream_open_file(), instead of using TopMemoryContext, can we consider
> > > using a new memory context LogicalStreamingContext or something like
> > > that. We can create LogicalStreamingContext under TopMemoryContext.  I
> > > don't see any need of using TopMemoryContext here.
> >
> > But, when we will delete/reset the LogicalStreamingContext?
> >
>
> Why can't we reset it at each stream stop message?

Done this

>
> >  because
> > we are planning to keep this memory until the worker is alive so that
> > supposed to be the top memory context.
> >
>
> Which part of allocation do we want to keep till the worker is alive?
> Why we need memory-related to subxacts till the worker is alive?  As
> we have now, after reading subxact info (subxact_info_read), we need
> to ensure that it is freed after its usage due to which we need to
> remember and perform pfree at various places.
>
> I think we should once see the possibility that such that we could
> switch to this new context in start stream message and reset it in
> stop stream message.  That might help in avoiding
> MemoryContextSwitchTo TopMemoryContext at various places.
>
> >  If we create any other context
> > with the same life span as TopMemoryContext then what is the point?
> >
>
> It is helpful for debugging.  It is recommended that we don't use the
> top memory context unless it is really required.  Read about it in
> src/backend/utils/mmgr/README.

xids is now allocated in ApplyContext

> > > 8.
> > > + * XXX Maybe we should only include the checksum when the cluster is
> > > + * initialized with checksums?
> > > + */
> > > +static void
> > > +subxact_info_write(Oid subid, TransactionId xid)
> > >
> > > Do we really need to have the checksum for temporary files? I have
> > > checked a few other similar cases like SharedFileSet stuff for
> > > parallel hash join but didn't find them using checksums.  Can you also
> > > once see other usages of temporary files and then let us decide if we
> > > see any reason to have checksums for this?
> >
> > Yeah, even I can see other places checksum is not used.
> >
>
> So, unless someone speaks up before you are ready for the next version
> of the patch, can we remove it?

Done
> > > Another point is we don't seem to be doing this for 'changes' file,
> > > see stream_write_change.  So, not sure, there is any sense to write
> > > checksum for subxact file.
> >
> > I can see there are comment atop this function
> >
> > * XXX The subxact file includes CRC32C of the contents. Maybe we should
> > * include something like that here too, but doing so will not be as
> > * straighforward, because we write the file in chunks.
> >
>
> You can remove this comment as well.  I don't know how advantageous it
> is to checksum temporary files.  We can anyway add it later if there
> is a reason for doing so.

Done

> >
> > > 12.
> > > maybe_send_schema()
> > > {
> > > ..
> > > + if (in_streaming)
> > > + {
> > > + /*
> > > + * TOCHECK: We have to send schema after each catalog change and it may
> > > + * occur when streaming already started, so we have to track new catalog
> > > + * changes somehow.
> > > + */
> > > + schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
> > > ..
> > > ..
> > > }
> > >
> > > I think it is good to once verify/test what this comment says but as
> > > per code we should be sending the schema after each catalog change as
> > > we invalidate the streamed_txns list in rel_sync_cache_relation_cb
> > > which must be called during relcache invalidation.  Do we see any
> > > problem with that mechanism?
> >
> > I have tested this, I think we are already sending the schema after
> > each catalog change.
> >
>
> Then remove "TOCHECK" in the above comment.

Done


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, May 28, 2020 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, May 27, 2020 at 8:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 26, 2020 at 7:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > >
> > > Okay, sending again.
> >
> > While reviewing/testing I have found a couple of problems in 0005 and
> > 0006 which I have fixed in the attached version.
> >
>
> I haven't reviewed the new fixes yet but I have some comments on
> 0008-Add-support-for-streaming-to-built-in-replicatio.patch.
> 1.
> I think the temporary files (and or handles) used for storing the
> information of changes and subxacts are getting leaked in the patch.
> At some places, it is taken care to close the file but cases like
> apply_handle_stream_commit where if any error occurred in
> apply_dispatch(), the file might not get closed.  The other place is
> in apply_handle_stream_abort() where if there is an error in ftruncate
> the file won't be closed.   Now, the bigger problem is with changes
> related file which is opened in apply_handle_stream_start and closed
> in apply_handle_stream_stop and if there is any error in-between, we
> won't close it.
>
> OTOH, I think the worker will exit on an error so it might not matter
> but then why we are at few other places we are closing it before the
> error?  I think on error these temporary files should be removed
> instead of relying on them to get removed next time when we receive
> changes for the same transaction which I feel is what we do in other
> cases where we use temporary files like for sorts or hashjoins.
>
> Also, what if the changes file size overflows "OS file size limit"?
> If we agree that the above are problems then do you think we should
> explore using BufFile interface (see storage/file/buffile.c) to avoid
> all such problems?

I also think that the file size is a problem.  I think we can use
BufFile with some modifications.  We can not use BufFileCreateTemp,
for a few reasons:
1) files get deleted on close, but we have to open/close on every
stream start/stop.
2) even if we try to avoid closing, we would need to keep the BufFile
pointers around (each of which carries an 8 kB buffer) because there
is no option to pass the file name.

I think for our use case BufFileCreateShared is more suitable.  I
think we need to do some modifications so that we can use these APIs
without a SharedFileSet.  Otherwise, we would unnecessarily need to
create a SharedFileSet for each transaction and also need to maintain
it in an xid array or xid hash until transaction commit/abort.  So I
suggest the following modifications to the shared file set so that we
can conveniently use it (a rough sketch follows the list):
1. ChooseTablespace(const SharedFileSet *fileset, const char *name)
   if fileset is NULL then select DEFAULTTABLESPACE_OID
2. SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace)
   If fileset is NULL then in the directory path we can use MyProcPid
   or something instead of fileset->creator_pid.
3. Pass some parameter to BufFileOpenShared, so that it can open the
   file in RW mode instead of read-only mode.
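
To make the first two items concrete, here is a rough, untested sketch
of the NULL-fileset handling (the existing bodies of these functions in
src/backend/storage/file/sharedfileset.c are reproduced from memory, so
treat them as approximate):

static Oid
ChooseTablespace(const SharedFileSet *fileset, const char *name)
{
	uint32		hash;

	/* proposed: no fileset given, just fall back to the default tablespace */
	if (fileset == NULL)
		return DEFAULTTABLESPACE_OID;

	hash = hash_any((const unsigned char *) name, strlen(name));
	return fileset->tablespaces[hash % fileset->ntablespaces];
}

static void
SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace)
{
	char		tempdirpath[MAXPGPATH];

	TempTablespacePath(tempdirpath, tablespace);

	/* proposed: without a fileset, derive the directory name from our own PID */
	snprintf(path, MAXPGPATH, "%s/%s%lu.%u.sharedfileset",
			 tempdirpath, PG_TEMP_FILE_PREFIX,
			 fileset ? (unsigned long) fileset->creator_pid : (unsigned long) MyProcPid,
			 fileset ? fileset->number : 0);
}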


> 2.
> apply_handle_stream_abort()
> {
> ..
> + /* discard the subxacts added later */
> + nsubxacts = subidx;
> +
> + /* write the updated subxact list */
> + subxact_info_write(MyLogicalRepWorker->subid, xid);
> ..
> }
>
> Here, if subxacts becomes zero, then also subxact_info_write will
> create a new file and write checksum.

How will it create a new file?  In fact, it will write nsubxacts as 0
in the existing file, and I think we need to do that so that on the
next open we will know that nsubxacts is 0.

>   I think subxact_info_write
> should have a check for nsubxacts > 0 before writing to the file.

But, even if nsubxacts becomes 0, we want to write the file so that we
can overwrite the previous info.

> 3.
> apply_handle_stream_commit(StringInfo s)
> {
> ..
> + /*
> + * send feedback to upstream
> + *
> + * XXX Probably should send a valid LSN. But which one?
> + */
> + send_feedback(InvalidXLogRecPtr, false, false);
> ..
> }
>
> Why do we need to send the feedback at this stage after applying each
> message?  If we see a non-streamed case, we never send_feedback after
> each message. So, following that, I don't see the need to send it here
> but if you see any specific reason then do let me know?  And if we
> have to send feedback, then we need to decide the appropriate values
> as well.

Let me put more thought into this and then I will get back to you.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Amit Kapila
Дата:
On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, May 28, 2020 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Also, what if the changes file size overflows "OS file size limit"?
> > If we agree that the above are problems then do you think we should
> > explore using BufFile interface (see storage/file/buffile.c) to avoid
> > all such problems?
>
> I also think that the file size is a problem.  I think we can use
> BufFile with some modifications.  We can not use the
> BufFileCreateTemp, because of few reasons
> 1) files get deleted on close, but we have to open/close on every
> stream start/stop.
> 2) even if we try to avoid closing we need to the BufFile pointers
> (which take 8192k per file) because there is no option to pass the
> file name.
>
> I thin for our use case BufFileCreateShared is more suitable.  I think
> we need to do some modifications so that we can use these apps without
> SharedFileSet. Otherwise, we need to unnecessarily need to create
> SharedFileSet for each transaction and also need to maintain it in xid
> array or xid hash until transaction commit/abort.  So I suggest
> following modifications in shared files set so that we can
> conveniently use it.
> 1. ChooseTablespace(const SharedFileSet fileset, const char name)
>   if fileset is NULL then select the DEFAULTTABLESPACEOID
> 2. SharedFileSetPath(char path, SharedFileSet fileset, Oid tablespace)
> If fileset is NULL then in directory path we can use MyProcPID or
> something instead of fileset->creator_pid.
>

Hmm, I find these modifications a bit ad-hoc.  So, not sure if it is
better than the patch maintains sharedfileset information.

> 3. Pass some parameter to BufFileOpenShared, so that it can open the
> file in RW mode instead of read-only mode.
>

This seems okay.

>
> > 2.
> > apply_handle_stream_abort()
> > {
> > ..
> > + /* discard the subxacts added later */
> > + nsubxacts = subidx;
> > +
> > + /* write the updated subxact list */
> > + subxact_info_write(MyLogicalRepWorker->subid, xid);
> > ..
> > }
> >
> > Here, if subxacts becomes zero, then also subxact_info_write will
> > create a new file and write checksum.
>
> How, will it create the new file, in fact it will write nsubxacts as 0
> in the existing file, and I think we need to do that right so that in
> next open we will know that the nsubxact is 0.
>
>   I think subxact_info_write
> > should have a check for nsubxacts > 0 before writing to the file.
>
> But, even if nsubxacts become 0 we want to write the file so that we
> can overwrite the previous info.
>

Can't we just remove the file for such a case?

apply_handle_stream_abort()
{
..
+ /* XXX optimize the search by bsearch on sorted data */
+ for (i = nsubxacts; i > 0; i--)
+ {
+ if (subxacts[i - 1].xid == subxid)
+ {
+ subidx = (i - 1);
+ found = true;
+ break;
+ }
+ }
+
+ /*
+ * If it's an empty sub-transaction then we will not find the subxid
+ * here so just free the memory and return.
+ */
+ if (!found)
+ {
+ /* Free the subxacts memory */
+ if (subxacts)
+ pfree(subxacts);
+
+ subxacts = NULL;
+ subxact_last = InvalidTransactionId;
+ nsubxacts = 0;
+ nsubxacts_max = 0;
+
+ return;
+ }
..
}

I have one question regarding the above code.  Isn't it possible that
a particular subtransaction id doesn't have any change but others do
we have?  For ex. cases like below:

postgres=# begin;
BEGIN
postgres=*# insert into t1 values(1);
INSERT 0 1
postgres=*# savepoint s1;
SAVEPOINT
postgres=*# savepoint s2;
SAVEPOINT
postgres=*# insert into t1 values(2);
INSERT 0 1
postgres=*# insert into t1 values(3);
INSERT 0 1
postgres=*# Rollback to savepoint s1;
ROLLBACK
postgres=*# commit;

Here, we have performed Rolledback to savepoint s1 which doesn't have
any change of its own.  I think this would have handled but just
wanted to confirm.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Tue, Jun 2, 2020 at 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, May 28, 2020 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Also, what if the changes file size overflows "OS file size limit"?
> > > If we agree that the above are problems then do you think we should
> > > explore using BufFile interface (see storage/file/buffile.c) to avoid
> > > all such problems?
> >
> > I also think that the file size is a problem.  I think we can use
> > BufFile with some modifications.  We can not use the
> > BufFileCreateTemp, because of few reasons
> > 1) files get deleted on close, but we have to open/close on every
> > stream start/stop.
> > 2) even if we try to avoid closing we need to the BufFile pointers
> > (which take 8192k per file) because there is no option to pass the
> > file name.
> >
> > I thin for our use case BufFileCreateShared is more suitable.  I think
> > we need to do some modifications so that we can use these apps without
> > SharedFileSet. Otherwise, we need to unnecessarily need to create
> > SharedFileSet for each transaction and also need to maintain it in xid
> > array or xid hash until transaction commit/abort.  So I suggest
> > following modifications in shared files set so that we can
> > conveniently use it.
> > 1. ChooseTablespace(const SharedFileSet fileset, const char name)
> >   if fileset is NULL then select the DEFAULTTABLESPACEOID
> > 2. SharedFileSetPath(char path, SharedFileSet fileset, Oid tablespace)
> > If fileset is NULL then in directory path we can use MyProcPID or
> > something instead of fileset->creator_pid.
> >
>
> Hmm, I find these modifications a bit ad-hoc.  So, not sure if it is
> better than the patch maintains sharedfileset information.

I think we might do something better here, maybe by supplying a
function pointer or so, but maintaining a SharedFileSet, which contains
a tablespace/mutex that we don't need at all for our purpose, also
doesn't sound very appealing.  Let me see whether I can come up with
some clean way of avoiding the need for a SharedFileSet; if not, then
maybe we can go with the shared fileset idea.

> > 3. Pass some parameter to BufFileOpenShared, so that it can open the
> > file in RW mode instead of read-only mode.
> >
>
> This seems okay.
>
> >
> > > 2.
> > > apply_handle_stream_abort()
> > > {
> > > ..
> > > + /* discard the subxacts added later */
> > > + nsubxacts = subidx;
> > > +
> > > + /* write the updated subxact list */
> > > + subxact_info_write(MyLogicalRepWorker->subid, xid);
> > > ..
> > > }
> > >
> > > Here, if subxacts becomes zero, then also subxact_info_write will
> > > create a new file and write checksum.
> >
> > How, will it create the new file, in fact it will write nsubxacts as 0
> > in the existing file, and I think we need to do that right so that in
> > next open we will know that the nsubxact is 0.
> >
> >   I think subxact_info_write
> > > should have a check for nsubxacts > 0 before writing to the file.
> >
> > But, even if nsubxacts become 0 we want to write the file so that we
> > can overwrite the previous info.
> >
>
> Can't we just remove the file for such a case?

But, as of now, we expect that if it is not a first-time stream start
then the file exists.  Actually, currently it's very simple: if it is
not the first segment, we always expect that the file must exist,
otherwise it is an error.  Now, if it is not the first segment, then we
will need to handle multiple cases:

a) subxact_info_read needs to handle the missing-file case, because the
file may not exist either because there was no subxact in the last
stream or because it was deleted when nsubxacts became 0.
b) subxact_info_write: there will be multiple cases, e.g. if nsubxacts
was already 0 then we can avoid writing the file, but if it becomes 0
now then we need to remove the file.

Let me think more on that (a rough sketch of what the two functions
would need is below).
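
For example, the two functions could end up with something like this
(very rough sketch only; subxact_filename() and the surrounding
variables are placeholders for whatever the patch actually uses):

/* in subxact_info_write(): nothing to remember, so drop the file instead */
if (nsubxacts == 0)
{
	char		path[MAXPGPATH];

	subxact_filename(path, subid, xid);	/* placeholder path builder */
	durable_unlink(path, DEBUG1);		/* an already-missing file is fine */
	return;
}

/* in subxact_info_read(): a missing file simply means "no subxacts yet" */
fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
if (fd < 0 && errno == ENOENT)
{
	nsubxacts = 0;
	return;
}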


>
> apply_handle_stream_abort()
> {
> ..
> + /* XXX optimize the search by bsearch on sorted data */
> + for (i = nsubxacts; i > 0; i--)
> + {
> + if (subxacts[i - 1].xid == subxid)
> + {
> + subidx = (i - 1);
> + found = true;
> + break;
> + }
> + }
> +
> + /*
> + * If it's an empty sub-transaction then we will not find the subxid
> + * here so just free the memory and return.
> + */
> + if (!found)
> + {
> + /* Free the subxacts memory */
> + if (subxacts)
> + pfree(subxacts);
> +
> + subxacts = NULL;
> + subxact_last = InvalidTransactionId;
> + nsubxacts = 0;
> + nsubxacts_max = 0;
> +
> + return;
> + }
> ..
> }
>
> I have one question regarding the above code.  Isn't it possible that
> a particular subtransaction id doesn't have any change but others do
> we have?  For ex. cases like below:
>
> postgres=# begin;
> BEGIN
> postgres=*# insert into t1 values(1);
> INSERT 0 1
> postgres=*# savepoint s1;
> SAVEPOINT
> postgres=*# savepoint s2;
> SAVEPOINT
> postgres=*# insert into t1 values(2);
> INSERT 0 1
> postgres=*# insert into t1 values(3);
> INSERT 0 1
> postgres=*# Rollback to savepoint s1;
> ROLLBACK
> postgres=*# commit;
>
> Here, we have performed Rolledback to savepoint s1 which doesn't have
> any change of its own.  I think this would have handled but just
> wanted to confirm.

But internally, that will send an abort for s2 first, and for that we
will find the xid and truncate; later it will send an abort for s1,
which we will not find, so we do nothing.  Anyway, I will test it and
let you know.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Amit Kapila
Дата:
On Tue, Jun 2, 2020 at 7:53 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jun 2, 2020 at 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > I thin for our use case BufFileCreateShared is more suitable.  I think
> > > we need to do some modifications so that we can use these apps without
> > > SharedFileSet. Otherwise, we need to unnecessarily need to create
> > > SharedFileSet for each transaction and also need to maintain it in xid
> > > array or xid hash until transaction commit/abort.  So I suggest
> > > following modifications in shared files set so that we can
> > > conveniently use it.
> > > 1. ChooseTablespace(const SharedFileSet fileset, const char name)
> > >   if fileset is NULL then select the DEFAULTTABLESPACEOID
> > > 2. SharedFileSetPath(char path, SharedFileSet fileset, Oid tablespace)
> > > If fileset is NULL then in directory path we can use MyProcPID or
> > > something instead of fileset->creator_pid.
> > >
> >
> > Hmm, I find these modifications a bit ad-hoc.  So, not sure if it is
> > better than the patch maintains sharedfileset information.
>
> I think we might do something better here, maybe by supplying function
> pointer or so,  but maintaining sharedfileset which contains different
> tablespace/mutext which we don't need at all for our purpose also
> doesn't sound very appealing.
>

I think we can say something similar for Relation (rel cache entry as
well) maintained in LogicalRepRelMapEntry.  I think we only need a
pointer to that information.

>  Let me see if I can not come up with
> some clean way of avoiding the need to shared-fileset then maybe we
> can go with the shared fileset idea.
>

Fair enough.
..

> > >
> > > But, even if nsubxacts become 0 we want to write the file so that we
> > > can overwrite the previous info.
> > >
> >
> > Can't we just remove the file for such a case?
>
> But, as of now, we expect if it is not a first-time stream start then
> the file exists.
>

Isn't it primarily because we do subxact_info_write at stream stop,
which will create such a file irrespective of whether we have any
subxacts?  If so, isn't that an unnecessary write?

>    Actually, currently, it's very easy that if it is
> not the first segment we always expect that the file must exist,
> otherwise an error.
>

I think we can check if the file doesn't exist then we can initialize
nsubxacts as 0.

>   Now if it is not the first segment then we will
> need to handle multiple cases.
>
> a) subxact_info_read need to handle the error case, because the file
> may not exist because there was no subxact in last stream or it was
> deleted because nsubxact become 0.
> b) subxact_info_write,  there will be multiple cases that if nsubxact
> was already 0 then we can avoid writing the file, but if it become 0
> now we need to remove the file.
>
> Let me think more on that.
>

I feel we should be able to deal with these cases, but if you find any
difficulty then let us discuss.  I understand there is some ease in
always having a subxacts file, but OTOH it sounds quite awkward that we
need so many file operations just to detect whether the transaction has
any subtransactions.

> >
> > Here, we have performed Rolledback to savepoint s1 which doesn't have
> > any change of its own.  I think this would have handled but just
> > wanted to confirm.
>
> But internally, that will send abort for the s2 first, and for that,
> we will find xid and truncate, and later we will send abort for s1 but
> that we will not find and do nothing?  Anyway, I will test it and let
> you know.
>

It would be good if we can test and confirm this behavior once.  If it
is not very inconvenient then we can even try to include a test for
the same in the patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Amit Kapila
Дата:
On Fri, May 29, 2020 at 8:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>

The fixes in the latest patchset are correct.  Few minor comments:
v26-0005-Implement-streaming-mode-in-ReorderBuffer
+ /*
+ * Mark toplevel transaction as having catalog changes too if one of its
+ * children has so that the ReorderBufferBuildTupleCidHash can conveniently
+ * check just toplevel transaction and decide whethe we need to build the
+ * hash table or not.  In non-streaming mode we mark the toplevel
+ * transaction in DecodeCommit as we only stream on commit.

Typo, /whethe/whether
missing comma, /In non-streaming mode we/In non-streaming mode, we

v26-0008-Add-support-for-streaming-to-built-in-replicatio
+ /*
+ * This memory context used for per stream data when streaming mode is
+ * enabled.  This context is reeset on each stream stop.
+ */

Can we slightly modify the above comment as "This is used in the
streaming mode for the changes between the start and stop stream
messages.  We reset this context on the stream stop message."?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Wed, Jun 3, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jun 2, 2020 at 7:53 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Jun 2, 2020 at 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > I thin for our use case BufFileCreateShared is more suitable.  I think
> > > > we need to do some modifications so that we can use these apps without
> > > > SharedFileSet. Otherwise, we need to unnecessarily need to create
> > > > SharedFileSet for each transaction and also need to maintain it in xid
> > > > array or xid hash until transaction commit/abort.  So I suggest
> > > > following modifications in shared files set so that we can
> > > > conveniently use it.
> > > > 1. ChooseTablespace(const SharedFileSet fileset, const char name)
> > > >   if fileset is NULL then select the DEFAULTTABLESPACEOID
> > > > 2. SharedFileSetPath(char path, SharedFileSet fileset, Oid tablespace)
> > > > If fileset is NULL then in directory path we can use MyProcPID or
> > > > something instead of fileset->creator_pid.
> > > >
> > >
> > > Hmm, I find these modifications a bit ad-hoc.  So, not sure if it is
> > > better than the patch maintains sharedfileset information.
> >
> > I think we might do something better here, maybe by supplying function
> > pointer or so,  but maintaining sharedfileset which contains different
> > tablespace/mutext which we don't need at all for our purpose also
> > doesn't sound very appealing.
> >
>
> I think we can say something similar for Relation (rel cache entry as
> well) maintained in LogicalRepRelMapEntry.  I think we only need a
> pointer to that information.

Yeah, I see.

> >  Let me see if I can not come up with
> > some clean way of avoiding the need to shared-fileset then maybe we
> > can go with the shared fileset idea.
> >
>
> Fair enough.

While evaluating it further I feel there are a few more problems to
solve if we are using BufFile.  The first thing is that in the subxact
file we maintain the information of each xid and its offset in the
changes file.  So now we will also have to store the 'fileno', but we
can find that using BufFileTell.  Yet another problem is that currently
we don't have a truncate option in BufFile, but we need it if a
sub-transaction gets aborted.  I think we can implement an extra
interface for BufFile, and it should not be very hard as we already
know the fileno and the offset.  I will evaluate this part further and
let you know about the same.
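
Roughly what I have in mind (untested sketch; the SubXactInfo fields
are just how I imagine the bookkeeping, and BufFileTruncateShared is
the new interface being proposed, not existing API):

/* remember where this subxact's changes start in the changes BufFile */
int		fileno;
off_t	offset;

BufFileTell(stream_fd, &fileno, &offset);	/* existing BufFile API */
subxacts[nsubxacts].fileno = fileno;
subxacts[nsubxacts].offset = offset;

/* proposed new interface: chop the changes file back on subxact abort */
extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);

/* on rollback of subxact i, discard everything it wrote */
BufFileTruncateShared(stream_fd, subxacts[i].fileno, subxacts[i].offset);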

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Mahendra Singh Thalor
Дата:
On Fri, 29 May 2020 at 15:52, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, May 27, 2020 at 5:19 PM Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
>>
>> On Tue, 26 May 2020 at 16:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> Hi all,
>> On the top of v16 patch set [1], I did some testing for DDL's and DML's to test wal size and performance. Below is the testing summary;
>>
>> Test parameters:
>> wal_level = 'logical'
>> max_connections = '150'
>> wal_receiver_timeout = '600s'
>> max_wal_size = '2GB'
>> min_wal_size = '2GB'
>> autovacuum= 'off'
>> checkpoint_timeout= '1d'
>>
>> Test results:
>>
>> SN  Operation        Metric          CREATE index          Add col int(date)     Add col text
>>                                      LSN diff   time       LSN diff   time       LSN diff   time
>> 1   1 DDL            without patch   17728      0.89116    976        0.764393   33904      0.80044
>>                      with patch      18016      0.804868   1088       0.763602   34856      0.787108
>>                      % LSN change    1.624548              11.475409             2.80792
>> 2   2 DDL            without patch   19872      0.860348   1632       0.763199   34560      0.806086
>>                      with patch      20416      0.839065   1856       0.733147   35624      0.829281
>>                      % LSN change    2.73752               13.7254902            3.078703
>> 3   3 DDL            without patch   22016      0.894891   2288       0.776871   35216      0.803493
>>                      with patch      22816      0.828028   2624       0.737177   36392      0.800194
>>                      % LSN change    3.63372093            14.685314             3.339391186
>> 4   4 DDL            without patch   24160      0.901686   2944       0.768445   35872      0.77489
>>                      with patch      25240      0.887143   3392       0.768382   37160      0.82777
>>                      % LSN change    4.4701986             15.217391             3.590544
>> 5   5 DDL            without patch   26328      0.901686   3600       0.751879   36528      0.817928
>>                      with patch      27640      0.914078   4160       0.74709    37928      0.820621
>>                      % LSN change    4.9832877             15.555555             3.832676
>> 6   6 DDL            without patch   28472      0.936385   4256       0.745179   37184      0.797043
>>                      with patch      30040      0.958226   4928       0.725321   38696      0.814535
>>                      % LSN change    5.5071649             15.78947368           4.066265
>> 7   8 DDL            without patch   32760      1.0022203  5568       0.757468   38496      0.83207
>>                      with patch      34864      0.966777   6464       0.769072   40232      0.903604
>>                      % LSN change    6.422466              16.091954             4.509559
>> 8   11 DDL           without patch   50296      1.0022203  7536       0.748332   40464      0.822266
>>                      with patch      53144      0.966777   8792       0.750553   42560      0.797133
>>                      % LSN change    5.662478              16.666666             5.179913
>> 9   15 DDL           without patch   58896      1.267253   10184      0.776875   43112      0.821916
>>                      with patch      62768      1.27234    11864      0.746844   45632      0.812567
>>                      % LSN change    5.662478              16.496465             5.84524
>> 10  1 DDL & 3 DML    without patch   18240      0.812551   1192       0.771993   34120      0.849467
>>                      with patch      18536      0.819089   1312       0.785117   35080      0.855456
>>                      % LSN change    1.6228                10.067114             2.8113599
>> 11  3 DDL & 5 DML    without patch   23656      0.926616   2656       0.758029   35584      0.829377
>>                      with patch      24480      0.915517   3016       0.797206   36784      0.839176
>>                      % LSN change    3.4832606             13.55421687           3.372302
>> 12  10 DDL & 5 DML   without patch   52760      1.101005   7288       0.763065   40216      0.837843
>>                      with patch      55376      1.105241   8456       0.779257   42224      0.835206
>>                      % LSN change    4.958301744           16.02634468           4.993037
>> 13  10 DML           without patch   1008       0.791091   1008       0.81105    1008       0.78817
>>                      with patch      1072       0.807875   1072       0.771113   1072       0.759789
>>                      % LSN change    6.349206              6.349206              6.349206
>> (LSN diff in bytes, time in seconds)
>>
>> To see all operations, please see [2] test_results
>>
>
> Why are you seeing any additional WAL in case-13 (10 DML) where there is no DDL?  I think it is because you have used savepoints in that case which will add some additional WAL.  You seems to have 9 savepoints in that test which should ideally generate 36 bytes of additional WAL (4-byte per transaction id for each subtransaction).  Also, in other cases where you took data for DDL and DML, you have also used savepoints in those tests. I suggest for savepoints, let's do separate tests as you have done in case-13 but we can do it 3,5,7,10 savepoints and probably each transaction can update a row of 200 bytes or so.
>

Thanks Amit for reviewing results.

Yes, you are correct.  I used savepoints in the DML tests, so they were showing additional WAL.

As suggested above, I did testing for DMLs, DDLs and savepoints.  Below are the test results:

Test results:

SN  Operation        Metric          CREATE index          Add col int(date)     Add col text
                                     LSN diff   time       LSN diff   time       LSN diff   time
1   1 DDL            without patch   17728      0.89116    976        0.764393   33904      0.80044
                     with patch      18016      0.804868   1088       0.763602   34856      0.787108
                     % LSN change    1.624548              11.475409             2.80792
2   2 DDL            without patch   19872      0.860348   1632       0.763199   34560      0.806086
                     with patch      20416      0.839065   1856       0.733147   35624      0.829281
                     % LSN change    2.73752               13.7254902            3.078703
3   3 DDL            without patch   22016      0.894891   2288       0.776871   35216      0.803493
                     with patch      22816      0.828028   2624       0.737177   36392      0.800194
                     % LSN change    3.63372093            14.685314             3.339391186
4   4 DDL            without patch   24160      0.901686   2944       0.768445   35872      0.77489
                     with patch      25240      0.887143   3392       0.768382   37160      0.82777
                     % LSN change    4.4701986             15.217391             3.590544
5   5 DDL            without patch   26328      0.901686   3600       0.751879   36528      0.817928
                     with patch      27640      0.914078   4160       0.74709    37928      0.820621
                     % LSN change    4.9832877             15.555555             3.832676
6   6 DDL            without patch   28472      0.936385   4256       0.745179   37184      0.797043
                     with patch      30040      0.958226   4928       0.725321   38696      0.814535
                     % LSN change    5.5071649             15.78947368           4.066265
7   8 DDL            without patch   32760      1.0022203  5568       0.757468   38496      0.83207
                     with patch      34864      0.966777   6464       0.769072   40232      0.903604
                     % LSN change    6.422466              16.091954             4.509559
8   11 DDL           without patch   50296      1.0022203  7536       0.748332   40464      0.822266
                     with patch      53144      0.966777   8792       0.750553   42560      0.797133
                     % LSN change    5.662478              16.666666             5.179913
9   15 DDL           without patch   58896      1.267253   10184      0.776875   43112      0.821916
                     with patch      62768      1.27234    11864      0.746844   45632      0.812567
                     % LSN change    5.662478              16.496465             5.84524
10  1 DDL & 3 DML    without patch   18224      0.865753   1176       0.78074    34104      0.857664
                     with patch      18512      0.854788   1288       0.767758   35056      0.877604
                     % LSN change    1.58033362            9.523809              2.7914614
11  3 DDL & 5 DML    without patch   23632      0.954274   2632       0.785501   35560      0.87744
                     with patch      24432      0.927245   2968       0.857528   36736      0.867555
                     % LSN change    3.385203              12.765957             3.3070866
12  3 DDL & 10 DML   without patch   25088      0.941534   3040       0.812123   35968      0.877769
                     with patch      25920      0.898643   3376       0.804943   37144      0.879752
                     % LSN change    3.316326              11.052631             3.269579
13  3 DDL & 15 DML   without patch   26400      0.949599   3392       0.818491   36320      0.859353
                     with patch      27232      0.892505   3728       0.789752   37320      0.812386
                     % LSN change    3.151515              9.90566037            3.2378854
14  5 DDL & 15 DML   without patch   31904      0.994223   4704       0.838091   37632      0.867281
                     with patch      33272      0.968122   5264       0.816922   39032      0.876364
                     % LSN change    4.287863              11.904761             3.720238095
15  1 DML            without patch   328        0.817988
                     with patch      328        0.794927
                     % LSN change    0
16  3 DML            without patch   464        0.791229
                     with patch      464        0.806211
                     % LSN change    0
17  5 DML            without patch   608        0.794258
                     with patch      608        0.802001
                     % LSN change    0
18  10 DML           without patch   968        0.831733
                     with patch      968        0.852777
                     % LSN change    0
(LSN diff in bytes, time in seconds)

Results for savepoints:
1.  1 savepoint
      begin;
      insert into perftest values (1);
      savepoint s1;
      update perftest set c1 = 5 where c1 = 1;
      commit;
    without patch:  LSN diff 408 bytes, time 0.805615 sec
    with patch:     LSN diff 416 bytes, time 0.823121 sec   (% LSN change: 1.960784)
2.  2 savepoint
      begin;
      insert into perftest values (1);
      savepoint s1;
      update perftest set c1 = 5 where c1 = 1;
      savepoint s2;
      update perftest set c1 = 6 where c1 = 5;
      commit;
    without patch:  LSN diff 488 bytes, time 0.827147 sec
    with patch:     LSN diff 504 bytes, time 0.819165 sec   (% LSN change: 3.278688)
3.  3 savepoint
      begin;
      insert into perftest values (1);
      savepoint s1;
      update perftest set c1 = 2 where c1 = 1;
      savepoint s2;
      update perftest set c1 = 3 where c1 = 2;
      savepoint s3;
      update perftest set c1 = 4 where c1 = 3;
      commit;
    without patch:  LSN diff 560 bytes, time 0.806441 sec
    with patch:     LSN diff 584 bytes, time 0.821316 sec   (% LSN change: 4.28571428)
4.  5 savepoint
    without patch:  LSN diff 712 bytes, time 0.823774 sec
    with patch:     LSN diff 752 bytes, time 0.800037 sec   (% LSN change: 5.617977528)
5.  7 savepoint
    without patch:  LSN diff 864 bytes, time 0.829136 sec
    with patch:     LSN diff 920 bytes, time 0.793751 sec   (% LSN change: 6.48148148)
6.  10 savepoint
    without patch:  LSN diff 1096 bytes, time 0.77946 sec
    with patch:     LSN diff 1176 bytes, time 0.78711 sec   (% LSN change: 7.29927007)


To see all the operations (DDLs and DMLs), please see test_results

Testing summary:
Basically, we are writing a per-command invalidation message, and to test that I have tested different combinations of DDL and DML operations.  I have not observed any performance degradation with the patch.  For "create index" DDLs, the % change in WAL is 1-7% for 1-15 DDLs.  For "add col int/date" DDLs, it is 11-17% for 1-15 DDLs, and for "add col text" DDLs, it is 2-6% for 1-15 DDLs.  For mixed DDL & DML, it is 2-10%.

As to why we are seeing the 11-13% extra WAL: the absolute amount of extra WAL is not very high, but the WAL generated by an add column int/date is only ~1000 bytes, so an additional ~100 bytes works out to around 10%; for add column text it is ~35000 bytes, so the percentage is smaller.  For text, those ~35000 bytes are due to TOAST.
There is no change in WAL size for DML operations.  For savepoints, we are getting at most an 8-byte WAL increment per savepoint (for a sub-transaction we add 5 bytes to store the xid, but due to padding it becomes 8 bytes; sometimes, if the WAL is already aligned, we get a 0-byte increment).

--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Amit Kapila
Дата:
On Fri, May 29, 2020 at 8:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Apart from this one more fix in 0005,  basically, CheckLiveXid was
> never reset, so I have fixed that as well.
>

I have made a number of modifications in the 0001 patch and attached
is the result.  I have changed/added comments, done some cosmetic
cleanup, and ran pgindent.  The most notable change is to remove the
below code change:
DecodeXactOp()
{
..
- * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+ * However, it's critical to process records with subxid assignment even
  * when the snapshot is being built: it is possible to get later records
  * that require subxids to be properly assigned.
  */
  if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
- info != XLOG_XACT_ASSIGNMENT)
+ !TransactionIdIsValid(XLogRecGetTopXid(r)))
..
}

I have not only removed the change done by the patch but the check
related to XLOG_XACT_ASSIGNMENT as well.  That check has been added by
commit bac2fae05c to ensure that we process XLOG_XACT_ASSIGNMENT even
if snapshot state is not SNAPBUILD_FULL_SNAPSHOT.  Now, with this
patch that is not required because we are making the subtransaction
and top-level transaction much earlier than this.  I have verified
that it doesn't reopen the bug by running the test provided in the
original report [1].

Let me know what you think of the changes.  If you find them okay,
then feel free to include them in the next patch-set.

[1] - https://www.postgresql.org/message-id/CAONYFtOv%2BEr1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Thu, Jun 4, 2020 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jun 3, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jun 2, 2020 at 7:53 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Tue, Jun 2, 2020 at 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > >
> > > > > I thin for our use case BufFileCreateShared is more suitable.  I think
> > > > > we need to do some modifications so that we can use these apps without
> > > > > SharedFileSet. Otherwise, we need to unnecessarily need to create
> > > > > SharedFileSet for each transaction and also need to maintain it in xid
> > > > > array or xid hash until transaction commit/abort.  So I suggest
> > > > > following modifications in shared files set so that we can
> > > > > conveniently use it.
> > > > > 1. ChooseTablespace(const SharedFileSet fileset, const char name)
> > > > >   if fileset is NULL then select the DEFAULTTABLESPACEOID
> > > > > 2. SharedFileSetPath(char path, SharedFileSet fileset, Oid tablespace)
> > > > > If fileset is NULL then in directory path we can use MyProcPID or
> > > > > something instead of fileset->creator_pid.
> > > > >
> > > >
> > > > Hmm, I find these modifications a bit ad-hoc.  So, not sure if it is
> > > > better than the patch maintains sharedfileset information.
> > >
> > > I think we might do something better here, maybe by supplying function
> > > pointer or so,  but maintaining sharedfileset which contains different
> > > tablespace/mutext which we don't need at all for our purpose also
> > > doesn't sound very appealing.
> > >
> >
> > I think we can say something similar for Relation (rel cache entry as
> > well) maintained in LogicalRepRelMapEntry.  I think we only need a
> > pointer to that information.
>
> Yeah, I see.
>
> > >  Let me see if I can not come up with
> > > some clean way of avoiding the need to shared-fileset then maybe we
> > > can go with the shared fileset idea.
> > >
> >
> > Fair enough.
>
> While evaluating it further I feel there are a few more problems to
> solve if we are using BufFile,  First thing is that in subxact file we
> maintain the information of xid and its offset in the changes file.
> So now, we will also have to store 'fileno' but that we can find using
> BufFileTell.  Yet another problem is that currently, we don't
> have the truncate option in the BufFile,  but we need it if the
> sub-transaction gets aborted.  I think we can implement an extra
> interface with the BufFile and should not be very hard as we already
> know the fileno and the offset.  I will evaluate this part further and
> let you know about the same.

I have further evaluated this and also tested the concept with a POC
patch.  I will complete and share it soon; here is a sketch of the
idea.

As discussed, we will use shared BufFiles for the changes files and
subxact files.  There will be a separate LogicalStreamingResourceOwner,
which will be used to manage the VFDs of the shared buf files.  We can
create a per-stream resource owner, i.e. on stream start we will create
the resource owner and all the shared buffiles will be opened under
that resource owner, which will be deleted on stream stop.  We need to
remember the SharedFileSet so that for a subsequent stream of the same
transaction we can open the same file again; for this we will use a
hash table with the xid as key, keeping the stream_fileset and
subxact_fileset pointers as the payload.

+typedef struct StreamXidHash
+{
+       TransactionId   xid;
+       SharedFileSet  *stream_fileset;
+       SharedFileSet  *subxact_fileset;
+} StreamXidHash;

We have to do some extension to the buffile modules, some of them are
already discussed up-thread but still listing them all down here
- A new interface BufFileTruncateShared(BufFile *file, int fileno,
off_t offset), for truncating the subtransaction changes, if changes
are spread across multiple files those files will be deleted and we
will adjust the file count and current offset accordingly in BufFile.
- In BufFileOpenShared, we will have to implement a mode so that we
can open in write mode as well; currently only read-only mode is
supported.
- In SharedFileSetInit, if dsm_segment is NULL then we will not
register the file deletion on on_dsm_detach.
- As usual, we will clean up the files on stream abort/commit, or on
the worker exit.
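
To illustrate, the lookup at stream start could look roughly like this
(untested sketch; 'xidhash' is the worker-local hash table mentioned
above, and the lazy creation of the subxact fileset is omitted):

StreamXidHash *ent;
bool		found;

/* find (or create) the entry remembering this transaction's filesets */
ent = (StreamXidHash *) hash_search(xidhash, &xid, HASH_ENTER, &found);
if (!found)
{
	/* first stream for this xid: set up the changes fileset */
	ent->stream_fileset = palloc(sizeof(SharedFileSet));
	SharedFileSetInit(ent->stream_fileset, NULL);	/* NULL dsm_segment, as proposed */
	ent->subxact_fileset = NULL;
}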

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Fri, Jun 5, 2020 at 11:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 29, 2020 at 8:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Apart from this one more fix in 0005,  basically, CheckLiveXid was
> > never reset, so I have fixed that as well.
> >
>
> I have made a number of modifications in the 0001 patch and attached
> is the result.  I have changed/added comments, done some cosmetic
> cleanup, and ran pgindent.  The most notable change is to remove the
> below code change:
> DecodeXactOp()
> {
> ..
> - * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
> + * However, it's critical to process records with subxid assignment even
>   * when the snapshot is being built: it is possible to get later records
>   * that require subxids to be properly assigned.
>   */
>   if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
> - info != XLOG_XACT_ASSIGNMENT)
> + !TransactionIdIsValid(XLogRecGetTopXid(r)))
> ..
> }
>
> I have not only removed the change done by the patch but the check
> related to XLOG_XACT_ASSIGNMENT as well.  That check has been added by
> commit bac2fae05c to ensure that we process XLOG_XACT_ASSIGNMENT even
> if snapshot state is not SNAPBUILD_FULL_SNAPSHOT.  Now, with this
> patch that is not required because we are making the subtransaction
> and top-level transaction much earlier than this.  I have verified
> that it doesn't reopen the bug by running the test provided in the
> original report [1].
>
> Let me know what you think of the changes?  If you find them okay,
> then feel to include them in the next patch-set.
>
> [1] - https://www.postgresql.org/message-id/CAONYFtOv%2BEr1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg%40mail.gmail.com

Thanks for the patch, I will review it and include it in my next version.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Amit Kapila
Дата:
On Sun, Jun 7, 2020 at 5:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Jun 5, 2020 at 11:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Let me know what you think of the changes?  If you find them okay,
> > then feel to include them in the next patch-set.
> >
> > [1] - https://www.postgresql.org/message-id/CAONYFtOv%2BEr1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg%40mail.gmail.com
>
> Thanks for the patch, I will review it and include it in my next version.
>

Okay, I have done review of
0002-Issue-individual-invalidations-with-wal_level-lo.patch and below
are my comments:

1. I don't think it is a good idea that logical decoding process the
new XLOG_XACT_INVALIDATIONS and existing WAL records for invalidations
like XLOG_INVALIDATIONS and what we do in DecodeCommit (see code in
the check "if (parsed->nmsgs > 0)").  I think if that is required for
some particular reason then we should write detailed comments about
the same.  I have tried some experiments to see if those are really
required:
a. After applying patch 0002, I have tried by commenting out the
processing of invalidations via DecodeCommit and found some regression
tests were failing but the reason for failure was that we are not
setting RBTXN_HAS_CATALOG_CHANGES for the toptxn when subtxn has
catalog changes and when I did that all regression tests started
passing.  See the attached diff patch
(v27-0003-Incremental-patch-for-0002-to-test-removal-of-du) atop 0002
patch.
b. The processing of invalidations for XLOG_INVALIDATIONS is added by
commit c6ff84b06a for xid-less transactions.  See
https://postgr.es/m/CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.com
to know why that has been added.  Now, after this patch we will
process the same invalidations via XLOG_XACT_INVALIDATIONS and
XLOG_INVALIDATIONS which doesn't seem warranted.  Also, the below
assertion will fail for xid-less transactions (try create index
concurrently statement):
+ case XLOG_XACT_INVALIDATIONS:
+ {
+ TransactionId xid;
+ xl_xact_invalidations *invals;
+
+ xid = XLogRecGetXid(r);
+ invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+ Assert(TransactionIdIsValid(xid));

I feel we don't need the processing of XLOG_INVALIDATIONS in logical
decoding after this patch, but to prove that we first need to write a
test case which needs XLOG_INVALIDATIONS on HEAD, as commit
c6ff84b06a doesn't add one.  I think we need two code paths for
XLOG_XACT_INVALIDATIONS: if it is for an xid-less transaction, then
execute the actions immediately, as we do when processing
XLOG_INVALIDATIONS; otherwise, do what we are doing currently in the
patch (see the sketch below).  If the above point (b) is correct, I am
not sure it is a good idea to use RM_XACT_ID as the resource manager
for this WAL in LogLogicalInvalidations; what do you think?
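
In other words, the decoding side could branch on the xid
(ReorderBufferImmediateInvalidation is the existing function used for
XLOG_INVALIDATIONS; the rest follows the patch's naming and is
untested):

case XLOG_XACT_INVALIDATIONS:
	{
		TransactionId xid = XLogRecGetXid(r);
		xl_xact_invalidations *invals = (xl_xact_invalidations *) XLogRecGetData(r);

		if (!TransactionIdIsValid(xid))
			/* xid-less transaction (e.g. create index concurrently): apply now */
			ReorderBufferImmediateInvalidation(reorder, invals->nmsgs, invals->msgs);
		else
			ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
										 invals->nmsgs, invals->msgs);
		break;
	}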

I think one of the usages we still need is in ReorderBufferForget
because it can be called when we skip processing the txn.  See the
comments in DecodeCommit where we call this function.  If I am
correct, we need to probably collect all invalidations in
ReorderBufferTxn as we are collecting tuplecids and use them here.  We
can do the same during processing of XLOG_XACT_INVALIDATIONS.

I had also thought a bit about removing logging of invalidations at
commit time altogether but it seems processing hot-standby is somewhat
tightly coupled with existing WAL logging.  See xact_redo_commit (a
comment atop call to ProcessCommittedInvalidationMessages).  It says
we need to maintain the order when we process invalidations.  If we
can later find a way to avoid that we can probably remove it but for
now maybe we can live with it.

2.
+ /* not expected, but print something anyway */
+ else if (msg->id == SHAREDINVALSMGR_ID)
+ appendStringInfoString(buf, " smgr");
+ /* not expected, but print something anyway */
+ else if (msg->id == SHAREDINVALRELMAP_ID)

I think the above comment is not valid after we started logging at CCI.

3.
+
+ xid = XLogRecGetXid(r);
+ invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+ Assert(TransactionIdIsValid(xid));
+ ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+ invals->nmsgs, invals->msgs);

Here, it should check !ctx->forward as we do in DecodeCommit, do we
have any reason for not doing so.  We can test once by changing this.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Amit Kapila
Дата:
On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I think one of the usages we still need is in ReorderBufferForget
> because it can be called when we skip processing the txn.  See the
> comments in DecodeCommit where we call this function.  If I am
> correct, we need to probably collect all invalidations in
> ReorderBufferTxn as we are collecting tuplecids and use them here.  We
> can do the same during processing of XLOG_XACT_INVALIDATIONS.
>

One more point related to this is that after this patch series, we
need to consider executing all invalidations during transaction abort,
because it is possible that due to memory overflow we have already
processed some of the messages, which may also contain a few
XACT_INVALIDATION messages; so, to avoid cache pollution, we need to
execute all of them on abort.  We do a similar thing for
Rollback/Rollback To Savepoint, see AtEOXact_Inval and
AtEOSubXact_Inval.
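
As a sketch of what the abort side could look like
(ReorderBufferAbortInvalidations is a made-up name here;
txn->ninvalidations/invalidations are assumed to be the messages
collected while processing XLOG_XACT_INVALIDATIONS):

/* hypothetical helper, called from the abort path */
static void
ReorderBufferAbortInvalidations(ReorderBufferTXN *txn)
{
	/* execute whatever this (sub)transaction accumulated, so we don't
	 * leave polluted caches behind */
	for (uint32 i = 0; i < txn->ninvalidations; i++)
		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
}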

Few other comments on
0002-Issue-individual-invalidations-with-wal_level-lo.patch
---------------------------------------------------------------------------------------------------------------
1.
+ if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
+ {
+ ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+ MakeSharedInvalidMessagesArray);
+ invalMessages = SharedInvalidMessagesArray;
+ nmsgs  = numSharedInvalidMessagesArray;
+ SharedInvalidMessagesArray = NULL;
+ numSharedInvalidMessagesArray = 0;

a. Immediately after ProcessInvalidationMessagesMulti, isn't it better
to have an Assertion like Assert(!(numSharedInvalidMessagesArray > 0
&& SharedInvalidMessagesArray == NULL));?
b. Why check "if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)" is
required?  If you see xactGetCommittedInvalidationMessages where we do
something similar, we only check for valid value of transInvalInfo and
here we check the same in the caller of LogLogicalInvalidations, isn't
that sufficient?  If that is sufficient, we can either have the same
check here or have an Assert for the same.

2.
@@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void)
  if (transInvalInfo == NULL)
  return;

+ if (XLogLogicalInfoActive())
+ LogLogicalInvalidations();
+
  ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
  LocalExecuteInvalidationMessage);
Generally, we WAL log the action after performing it but here you are
writing WAL first.  Is there any specific reason?  If so, can we write
a comment about the same?

3.
+ * When wal_level=logical, write invalidations into WAL at each command end to
+ * support the decoding of the in-progress transaction.  As of now it was
+ * enough to log invalidation only at commit because we are only decoding the
+ * transaction at the commit time.   We only need to log the catalog cache and
+ * relcache invalidation.  There can not be any active MVCC scan in logical
+ * decoding so we don't need to log the snapshot invalidation.

I think this comment doesn't hold good after we have changed the patch
to LOG invalidations at the time of CCI.

4.
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations()

Add the function name atop of this function in comments to match the
style with other nearby functions.  How about modifying it to
something like: "Emit WAL for invalidations.  This is currently only
used for logging invalidations at the command end."

5.
+ *
+ * XXX Do we need to care about relcacheInitFileInval and
+ * the other fields added to ReorderBufferChange, or just
+ * about the message itself?
+ */

I don't think we need to do anything about relcacheInitFileInval.
This is used to remove the stale files (RELCACHE_INIT_FILENAME) that
have obsolete information about relcache.  The walsender process that
is doing decoding doesn't require us to do anything about this.  Also,
if you see before this patch, we don't do anything about relcache
files during decoding of invalidation messages.  In short, I think we
can remove this comment unless you see some use of it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Amit Kapila
Дата:
On Thu, Jun 4, 2020 at 5:06 PM Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
On Fri, 29 May 2020 at 15:52, Amit Kapila <amit.kapila16@gmail.com> wrote:

To see all the operations(DDL's and DML's), please see test_results

Testing summary:
Basically, we are writing per command invalidation message and for testing that I have tested with different combinations of the DDL and DML operation.  I have not observed any performance degradation with the patch. For "create index" DDL's, %change in wal is 1-7% for 1-15 DDL's. For "add col int/date" DDL's, it is 11-17% for 1-15 DDL's and for "add col text" DDL's, it is 2-6% for 1-15 DDL's. For mix (DDL & DML), it is 2-10%.

why are we seeing 11-13 % of the extra wall, basically,  the amount of extra WAL is not very high but the amount of WAL generated with add column int/date is just ~1000 bytes so additional 100 bytes will be around 10% and for add column text it is  ~35000 bytes so % is less. For text, these ~35000 bytes are due to toast
There is no change in wal size for DML operations. For savepoints, we are getting max 8 bytes per savepoint wal increment (basically for Sub-transaction, we are adding 5 bytes to store xid but due to padding, it is 8 bytes and some times if wal is already aligned, then we are getting 0 bytes increment)

So, if I read it correctly, there is no performance penalty with either of the patches, but there is some additional WAL, which in most cases is 2-5% but for the worst cases and some specific DDLs is up to 15%.  I think as this WAL overhead applies only when wal_level is logical, we might have to live with it, as the other alternative is to blow away all caches on any DDL in the walsenders, and that will have both CPU and network overhead as explained previously [1].  I feel if the WAL overhead pinches any workload, we might want to do it under some new GUC (which would disable streaming of transactions), but I don't think we need to go there.

What do you think?


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Tue, Jun 9, 2020 at 3:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jun 4, 2020 at 5:06 PM Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
>>
>> On Fri, 29 May 2020 at 15:52, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >
>>
>>
>> To see all the operations(DDL's and DML's), please see test_results
>>
>> Testing summary:
>> Basically, we are writing per command invalidation message and for testing that I have tested with different
>> combinations of the DDL and DML operation.  I have not observed any performance degradation with the patch. For "create
>> index" DDL's, %change in wal is 1-7% for 1-15 DDL's. For "add col int/date" DDL's, it is 11-17% for 1-15 DDL's and for
>> "add col text" DDL's, it is 2-6% for 1-15 DDL's. For mix (DDL & DML), it is 2-10%.
>>
>> why are we seeing 11-13 % of the extra wall, basically,  the amount of extra WAL is not very high but the amount of
>> WAL generated with add column int/date is just ~1000 bytes so additional 100 bytes will be around 10% and for add column
>> text it is  ~35000 bytes so % is less. For text, these ~35000 bytes are due to toast
>> There is no change in wal size for DML operations. For savepoints, we are getting max 8 bytes per savepoint wal
>> increment (basically for Sub-transaction, we are adding 5 bytes to store xid but due to padding, it is 8 bytes and some
>> times if wal is already aligned, then we are getting 0 bytes increment)
>
>
> So, if I read it correctly, there is no performance penalty with either of the patches but there is some additional
> WAL which in most cases is 2-5% but in worst cases and some specific DDL's it is upto 15%.  I think as this WAL overhead
> is when wal_level is logical, we might have to live with it as the other alternative is to blew up all caches on any DDL
> in WALSenders and that will have bot CPU and Network overhead as expalined previously [1].  I feel if the WAL overhead
> pinches any workload, we might want to do it under some new guc (which will disable streaming of transactions) but I
> don't think we need to go there.
>
> What do you think?

I feel the same, because the WAL overhead applies only with
wal_level=logical and mainly to DDL, and ideally there should not be a
large amount of DDL in the system compared to other operations.  So I
think we can live with the current approach.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Sun, Jun 7, 2020 at 5:06 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Jun 4, 2020 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Jun 3, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Jun 2, 2020 at 7:53 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Tue, Jun 2, 2020 at 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > > >
> > > > > > I thin for our use case BufFileCreateShared is more suitable.  I think
> > > > > > we need to do some modifications so that we can use these apps without
> > > > > > SharedFileSet. Otherwise, we need to unnecessarily need to create
> > > > > > SharedFileSet for each transaction and also need to maintain it in xid
> > > > > > array or xid hash until transaction commit/abort.  So I suggest
> > > > > > following modifications in shared files set so that we can
> > > > > > conveniently use it.
> > > > > > 1. ChooseTablespace(const SharedFileSet fileset, const char name)
> > > > > >   if fileset is NULL then select the DEFAULTTABLESPACEOID
> > > > > > 2. SharedFileSetPath(char path, SharedFileSet fileset, Oid tablespace)
> > > > > > If fileset is NULL then in directory path we can use MyProcPID or
> > > > > > something instead of fileset->creator_pid.
> > > > > >
> > > > >
> > > > > Hmm, I find these modifications a bit ad-hoc.  So, not sure if it is
> > > > > better than the patch maintains sharedfileset information.
> > > >
> > > > I think we might do something better here, maybe by supplying function
> > > > pointer or so,  but maintaining sharedfileset which contains different
> > > > tablespace/mutext which we don't need at all for our purpose also
> > > > doesn't sound very appealing.
> > > >
> > >
> > > I think we can say something similar for Relation (rel cache entry as
> > > well) maintained in LogicalRepRelMapEntry.  I think we only need a
> > > pointer to that information.
> >
> > Yeah, I see.
> >
> > > >  Let me see if I can not come up with
> > > > some clean way of avoiding the need to shared-fileset then maybe we
> > > > can go with the shared fileset idea.
> > > >
> > >
> > > Fair enough.
> >
> > While evaluating it further I feel there are a few more problems to
> > solve if we are using BufFile,  First thing is that in subxact file we
> > maintain the information of xid and its offset in the changes file.
> > So now, we will also have to store 'fileno' but that we can find using
> > BufFileTell.  Yet another problem is that currently, we don't
> > have the truncate option in the BufFile,  but we need it if the
> > sub-transaction gets aborted.  I think we can implement an extra
> > interface with the BufFile and should not be very hard as we already
> > know the fileno and the offset.  I will evaluate this part further and
> > let you know about the same.
>
> I have further evaluated this and also tested the concept with a POC
> patch.  Soon I will complete and share, here is the scatch of the
> idea.
>
> As discussed we will use SharedBufFile for changes files and subxact
> files.  There will be a separate LogicalStreamingResourceOwner, which
> will be used to manage the VFD of the shared buf files.  We can create
> a per stream resource owner i.e. on stream start we will create the
> resource owner and all the shared buffiles will be opened under that
> resource owner, which will be deleted on stream stop.   We need to
> remember the SharedFileSet so that for subsequent stream for the same
> transaction we can open the same file again, for this we will use a
> hash table with xid as a key and in that, we will keep stream_fileset
> and subxact_fileset's pointers as payload.
>
> +typedef struct StreamXidHash
> +{
> +       TransactionId   xid;
> +       SharedFileSet  *stream_fileset;
> +       SharedFileSet  *subxact_fileset;
> +} StreamXidHash;
>
> We have to do some extension to the buffile modules, some of them are
> already discussed up-thread but still listing them all down here
> - A new interface BufFileTruncateShared(BufFile *file, int fileno,
> off_t offset), for truncating the subtransaction changes, if changes
> are spread across multiple files those files will be deleted and we
> will adjust the file count and current offset accordingly in BufFile.
> - In BufFileOpenShared, we will have to implement a mode so that we
> can open in write mode as well, current only read-only mode supported.
> - In SharedFileSetInit, if dsm_segment is NULL then we will not
> register the file deletion on on_dsm_detach.
> - As usual, we will clean up the files on stream abort/commit, or on
> the worker exit.

Currently, I have a working prototype of using the BufFile
infrastructure for the temp files.  Meanwhile, I want to discuss a few
interface changes required in the BufFile infrastructure.

1. Support read-write mode for "BufFileOpenShared".  Basically, in
workers we will be opening the xid's changes and subxact files per
stream, so we need an RW mode even in the open.  I have passed a flag
for the same.

2. Files should not be closed at the end of the transaction:
currently, files opened with BufFileCreateShared/BufFileOpenShared are
registered to be closed at EOXACT.  Basically, we need to open the
changes file at stream start and keep it open until stream stop, so we
can not afford to have it closed at EOXACT.  I have added a flag for
the same.

3. As discussed above, we need to support truncate for handling
subtransaction abort, so I have added a new interface for the same.

4. Every time we open the changes file, we need to seek to the end, so
I have added support for SEEK_END.

Attached is the WIP patch describing my changes (a small usage sketch
follows).
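
For reference, this is roughly how the worker side ends up using it
under these changes (sketch only; 'ent' and 'path' are placeholders,
and the O_RDWR argument plus SEEK_END support are part of the proposed
changes, not the current API):

/* at stream start: reopen this transaction's changes file and append to it */
stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);

/* position after whatever the previous streams already wrote */
BufFileSeek(stream_fd, 0, 0, SEEK_END);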

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Amit Kapila
Дата:
On Wed, Jun 10, 2020 at 2:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>
> Currently, I am done with a working prototype of using the BufFile
> infrastructure for the tempfile.  Meanwhile, I want to discuss a few
> interface changes required for the BufFIle infrastructure.
>
> 1. Support read-write mode for "BufFileOpenShared",  Basically, in
> workers we will be opening the xid's changes and subxact files per
> stream, so we need an RW mode even in the open.  I have passed a flag
> for the same.
>

Generally, file open APIs have a mode parameter to indicate read-only
or read-write.  Using a flag here seems a bit odd to me.

> 2. Files should not be closed at the end of the transaction:
> Currently, files opened with BufFileCreateShared/BufFileOpenShared are
> registered to be closed on EOXACT.  Basically, we need to open the
> changes file on the stream start and keep it open until stream stop,
> so we can not afford to get it closed on the EOXACT.  I have added a
> flag for the same.
>

But where do we end a transaction before stream stop such that it could
lead to this file being closed?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Dilip Kumar
On Wed, Jun 10, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jun 10, 2020 at 2:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> >
> > Currently, I am done with a working prototype of using the BufFile
> > infrastructure for the tempfile.  Meanwhile, I want to discuss a few
> > interface changes required for the BufFIle infrastructure.
> >
> > 1. Support read-write mode for "BufFileOpenShared",  Basically, in
> > workers we will be opening the xid's changes and subxact files per
> > stream, so we need an RW mode even in the open.  I have passed a flag
> > for the same.
> >
>
> Generally file open APIs have mode as a parameter to indicate
> read_only or read_write.  Using flag here seems a bit odd to me.

Let me think about it, we can try to pass the mode.

> > 2. Files should not be closed at the end of the transaction:
> > Currently, files opened with BufFileCreateShared/BufFileOpenShared are
> > registered to be closed on EOXACT.  Basically, we need to open the
> > changes file on the stream start and keep it open until stream stop,
> > so we can not afford to get it closed on the EOXACT.  I have added a
> > flag for the same.
> >
>
> But where do we end the transaction before the stream stop which can
> lead to closure of this file?

Currently, I keep a transaction open only while creating/opening the
files and commit it immediately after that.  Maybe we can instead keep
the transaction open until stream stop; then we can avoid these changes
and also avoid creating an extra resource owner.  What are your
thoughts on this?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Amit Kapila
On Wed, Jun 10, 2020 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jun 10, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > 2. Files should not be closed at the end of the transaction:
> > > Currently, files opened with BufFileCreateShared/BufFileOpenShared are
> > > registered to be closed on EOXACT.  Basically, we need to open the
> > > changes file on the stream start and keep it open until stream stop,
> > > so we can not afford to get it closed on the EOXACT.  I have added a
> > > flag for the same.
> > >
> >
> > But where do we end the transaction before the stream stop which can
> > lead to closure of this file?
>
> Currently, I am keeping the transaction only while creating/opening
> the files and closing immediately after that,  maybe we can keep the
> transaction until stream stop, then we can avoid this changes,  and we
> can also avoid creating extra resource owner?  What is your thought on
> this?
>

I would prefer to keep the transaction until the stream stop unless
there are good reasons for not doing so.
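For clarity, here is a rough sketch of what "one transaction per stream chunk" would look like in the apply worker; the handler names exist in the patch, but the body shown is only illustrative:

/* fragment of worker.c -- illustrative only */

static void
apply_handle_stream_start(StringInfo s)
{
    /*
     * Start a transaction that stays open until the matching stream_stop.
     * This gives the BufFile/SharedFileSet machinery a resource owner and
     * lets PrepareTempTablespaces() do its work, without inventing a
     * separate per-stream resource owner.
     */
    StartTransactionCommand();

    /* ... look up/create the per-xid filesets and open the changes file ... */
}

static void
apply_handle_stream_stop(StringInfo s)
{
    /* ... write out subxact info and close the per-xid files ... */

    /* End the per-stream transaction started in stream_start. */
    Assert(IsTransactionState());
    CommitTransactionCommand();
}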

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Dilip Kumar
On Wed, Jun 10, 2020 at 5:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jun 10, 2020 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Jun 10, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > > 2. Files should not be closed at the end of the transaction:
> > > > Currently, files opened with BufFileCreateShared/BufFileOpenShared are
> > > > registered to be closed on EOXACT.  Basically, we need to open the
> > > > changes file on the stream start and keep it open until stream stop,
> > > > so we can not afford to get it closed on the EOXACT.  I have added a
> > > > flag for the same.
> > > >
> > >
> > > But where do we end the transaction before the stream stop which can
> > > lead to closure of this file?
> >
> > Currently, I am keeping the transaction only while creating/opening
> > the files and closing immediately after that,  maybe we can keep the
> > transaction until stream stop, then we can avoid this changes,  and we
> > can also avoid creating extra resource owner?  What is your thought on
> > this?
> >
>
> I would prefer to keep the transaction until the stream stop unless
> there are good reasons for not doing so.

I am ready with the first patch set, which replaces the temp file usage
in the worker with BufFile usage (patches v27-0013 and v27-0014).

Open items:
- As of now, I have kept the buffile changes and the worker changes that
use buffile as separate patches for review.  Later I will make the
buffile changes the base patch and merge the worker changes into the
0008 patch.

- Currently, while reading/writing the streaming/subxact files we report
a wait event, for example
'pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);', but
BufFileWrite/BufFileRead already report a read/write wait event
internally, so I think we can avoid reporting it ourselves.  I still
have to work on this part; once we reach consensus I can remove those
extra wait events from the patch.

- There are still a few open comments from your other mails that I have
to work on.  I will address those in the next version.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Amit Kapila
On Fri, Jun 12, 2020 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> - Currently, while reading/writing the streaming/subxact files we are
> reporting the wait event for example
> 'pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);',  but
> BufFileWrite/BufFileRead internally reports the read/write wait event.
> So I think we can avoid reporting that?
>

Yes, we can avoid that.  No other place using BufFileRead does any
such reporting.

>  Basically, this part is still
> I have to work upon, once we get the consensus then I can remove those
> extra wait event from the patch.
>

Okay, feel free to send an updated patch with the above change.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Dilip Kumar
On Fri, Jun 12, 2020 at 4:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jun 12, 2020 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > - Currently, while reading/writing the streaming/subxact files we are
> > reporting the wait event for example
> > 'pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);',  but
> > BufFileWrite/BufFileRead internally reports the read/write wait event.
> > So I think we can avoid reporting that?
> >
>
> Yes, we can avoid that.  No other place using BufFileRead does any
> such reporting.

I agree.

> >  Basically, this part is still
> > I have to work upon, once we get the consensus then I can remove those
> > extra wait event from the patch.
> >
>
> Okay, feel free to send an updated patch with the above change.

Sure, I will do that in the next patch set.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Amit Kapila
On Mon, Jun 15, 2020 at 9:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Jun 12, 2020 at 4:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > >  Basically, this part is still
> > > I have to work upon, once we get the consensus then I can remove those
> > > extra wait event from the patch.
> > >
> >
> > Okay, feel free to send an updated patch with the above change.
>
> Sure, I will do that in the next patch set.
>

I have few more comments on the patch
0013-Change-buffile-interface-required-for-streaming-.patch:

1.
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are read-only if the flag is set and are
+ * automatically closed at the end of the transaction but are not deleted on
+ * close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)

No need to say "are read-only if the flag is set".  I don't see any
flag passed to function so that part of the comment doesn't seem
appropriate.

2.
@@ -68,7 +68,8 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
  }

  /* Register our cleanup callback. */
- on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+ if (seg)
+ on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
 }

Add comments atop function to explain when we don't want to register
the dsm detach stuff?

3.
+ */
+ newFile = file->numFiles - 1;
+ newOffset = FileSize(file->files[file->numFiles - 1]);
  break;

FileSize can return a negative length to indicate failure, which we
should handle; see the other places in the code where FileSize is used.
But I have another question here: why do we need to implement SEEK_END
at all?  How do other users of the BufFile interface take care of this?
I see an API BufFileTell which can give the current read/write location
in the file; isn't that sufficient for your usage?  Also, how was this
handled in the patch before the switch to BufFile?
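If SEEK_END stays, the failure handling could look something like this (sketch only):

        case SEEK_END:
            /* Position at the very end of the last physical segment. */
            newFile = file->numFiles - 1;
            newOffset = FileSize(file->files[newFile]);
            if (newOffset < 0)
                ereport(ERROR,
                        (errcode_for_file_access(),
                         errmsg("could not determine size of temporary file \"%s\": %m",
                                FilePathName(file->files[newFile]))));
            break;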

4.
+ /* Loop over all the  files upto the fileno which we want to truncate. */
+ for (i = file->numFiles - 1; i >= fileno; i--)

"the  files", extra space in the above part of the comment.

5.
+ /*
+ * Except the fileno,  we can directly delete other files.

Before 'we', there is extra space.

6.
+ else
+ {
+ FileTruncate(file->files[i], offset, WAIT_EVENT_BUFFILE_READ);
+ newOffset = offset;
+ }

The wait event passed here doesn't seem to be appropriate.  You might
want to introduce a new wait event WAIT_EVENT_BUFFILE_TRUNCATE.  Also,
the error handling for FileTruncate is missing.
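i.e. something like this (sketch; the new wait event name is only a suggestion):

        else
        {
            /* Truncate the last remaining segment to the target offset. */
            if (FileTruncate(file->files[i], offset,
                             WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
                ereport(ERROR,
                        (errcode_for_file_access(),
                         errmsg("could not truncate file \"%s\": %m",
                                FilePathName(file->files[i]))));
            newOffset = offset;
        }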

7.
+ if ((i != fileno || offset == 0) && fileno != 0)
+ {
+ SharedSegmentName(segment_name, file->name, i);
+ SharedFileSetDelete(file->fileset, segment_name, true);
+ newFile--;
+ newOffset = MAX_PHYSICAL_FILESIZE;
+ }

Similar to the previous comment, I think we should handle the failure
of SharedFileSetDelete.

8. I think the comments related to BufFile shared API usage need to be
expanded in the code to explain the new usage.  For ex., see the below
comments atop buffile.c
* BufFile supports temporary files that can be made read-only and shared with
* other backends, as infrastructure for parallel execution.  Such files need
* to be created as a member of a SharedFileSet that all participants are
* attached to.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Amit Kapila
On Mon, Jun 15, 2020 at 6:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I have few more comments on the patch
> 0013-Change-buffile-interface-required-for-streaming-.patch:
>

Review comments on 0014-Worker-tempfile-use-the-shared-buffile-infrastru:
1.
The subxact file is only create if there
+ * are any suxact info under this xid.
+ */
+typedef struct StreamXidHash

Let's slightly reword that part of the comment as "The subxact file is
created iff there is any subxact info under this xid."

2.
@@ -710,6 +740,9 @@ apply_handle_stream_stop(StringInfo s)
  subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
  stream_close_file();

+ /* Commit the per-stream transaction */
+ CommitTransactionCommand();

Before calling commit, ensure that we are in a valid transaction.  I
think we can have an Assert for IsTransactionState().
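Something as simple as this should do (sketch):

    /* We must be inside the per-stream transaction started at stream_start. */
    Assert(IsTransactionState());

    /* Commit the per-stream transaction */
    CommitTransactionCommand();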

3.
@@ -761,11 +791,13 @@ apply_handle_stream_abort(StringInfo s)

  int64 i;
  int64 subidx;
- int fd;
+ BufFile    *fd;
  bool found = false;
  char path[MAXPGPATH];
+ StreamXidHash *ent;

  subidx = -1;
+ ensure_transaction();
  subxact_info_read(MyLogicalRepWorker->subid, xid);

Why call ensure_transaction here?  Is there any reason why we would not
have a valid transaction by this point?  If not, then it is better to
have an Assert for IsTransactionState().

4.
- if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+ if (BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
  {
- int save_errno = errno;
+ int save_errno = errno;

- CloseTransientFile(fd);
+ BufFileClose(fd);

On error, won't these files be closed automatically?  If so, why do we
need to close them explicitly here and before the other errors?

5.
if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
{
int save_errno = errno;

BufFileClose(fd);
errno = save_errno;
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not read file \"%s\": %m",

Can we change the error message to "could not read from streaming
transactions file .." or something like that and similarly we can
change the message for failure in reading changes file?

6.
if (BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
{
int save_errno = errno;

BufFileClose(fd);
errno = save_errno;
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not write to file \"%s\": %m",

Similar to previous, can we change it to "could not write to streaming
transactions file

7.
@@ -2855,17 +2844,32 @@ stream_open_file(Oid subid, TransactionId xid,
bool first_segment)
  * for writing, in append mode.
  */
  if (first_segment)
- flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
- else
- flags = (O_WRONLY | O_APPEND | PG_BINARY);
+ {
+ /*
+ * Shared fileset handle must be allocated in the persistent context.
+ */
+ SharedFileSet *fileset =
+ MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));

- stream_fd = OpenTransientFile(path, flags);
+ PrepareTempTablespaces();
+ SharedFileSetInit(fileset, NULL);

Why are we calling PrepareTempTablespaces here? It is already called
in SharedFileSetInit.

8.
+ /*
+ * Start a transaction on stream start, this transaction will be committed
+ * on the stream stop.  We need the transaction for handling the buffile,
+ * used for serializing the streaming data and subxact info.
+ */
+ ensure_transaction();

I think we need this for PrepareTempTablespaces to set the
temptablespaces.  Also, isn't it required for a cleanup of buffile
resources at the transaction end?  Are there any other reasons for it
as well?  The comment should be a bit more clear for why we need a
transaction here.

9.
* Open a file for streamed changes from a toplevel transaction identified
 * by stream_xid (global variable). If it's the first chunk of streamed
 * changes for this transaction, perform cleanup by removing existing
 * files after a possible previous crash.
..
stream_open_file(Oid subid, TransactionId xid, bool first_segment)

The above part comment atop stream_open_file needs to be changed after
new implementation.

10.
 * enabled.  This context is reeset on each stream stop.
*/
LogicalStreamingContext = AllocSetContextCreate(ApplyContext,

/reeset/reset

11.
stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
{
..
+ /* No entry created for this xid so simply return. */
+ if (ent == NULL)
+ return;
..
}

Is there any reason or scenario where this ent can be NULL?  If not,
it will be better to have an Assert for the same.

12.
subxact_info_write(Oid subid, TransactionId xid)
{
..
+ /*
+ * If there is no subtransaction then nothing to do,  but if already have
+ * subxact file then delete that.
+ */
+ if (nsubxacts == 0)
  {
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not create file \"%s\": %m",
- path)));
+ if (ent->subxact_fileset)
+ {
+ cleanup_subxact_info();
+ BufFileDeleteShared(ent->subxact_fileset, path);
+ ent->subxact_fileset = NULL;
..
}

Here don't we need to free the subxact_fileset before setting it to NULL?
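i.e. roughly (sketch):

        if (ent->subxact_fileset)
        {
            cleanup_subxact_info();
            BufFileDeleteShared(ent->subxact_fileset, path);
            /* release the fileset memory, not just the reference */
            pfree(ent->subxact_fileset);
            ent->subxact_fileset = NULL;
        }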

13.
+ /*
+ * Scan complete hash and delete the underlying files for the the xids.
+ * Also delete the memory for the shared file sets.
+ */

/the the/the.  Instead of "delete the memory", it would be better to
say "release the memory".

14.
+ /*
+ * We might not have created the suxact fileset if there is no sub
+ * transaction.
+ */

/suxact/subxact

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Dilip Kumar
On Tue, Jun 9, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I think one of the usages we still need is in ReorderBufferForget
> > because it can be called when we skip processing the txn.  See the
> > comments in DecodeCommit where we call this function.  If I am
> > correct, we need to probably collect all invalidations in
> > ReorderBufferTxn as we are collecting tuplecids and use them here.  We
> > can do the same during processing of XLOG_XACT_INVALIDATIONS.
> >
>
> One more point related to this is that after this patch series, we
> need to consider executing all invalidation during transaction abort.
> Because it is possible that due to memory overflow, we have processed
> some of the messages which also contain a few XACT_INVALIDATION
> messages, so to avoid cache pollution, we need to execute all of them
> in abort.  We also do the similar thing in Rollback/Rollback To
> Savepoint, see AtEOXact_Inval and AtEOSubXact_Inval.

I have analyzed this further and I think there is a problem with that
approach.  If, instead of keeping each invalidation as an individual
change, we combine them in ReorderBufferTxn's invalidation list, then
what happens if the (sub)transaction is aborted?  In that case we will
end up executing all those invalidations even though we never polluted
the cache, because we never tried to stream the transaction.  So this
will affect the normal case where we haven't streamed the transaction:
every time we would execute the invalidations logged by (sub)transactions
that were aborted.  One way is to build the list at the subtransaction
level and, just before sending the transaction (on commit), combine all
the (sub)transactions' invalidation lists.  But since we already have
the invalidations in the commit record, there is no point in adding this
complexity for the commit path.

My main worry is about the streaming transaction; the problems are:
- Immediately on the arrival of an individual invalidation, we cannot
directly add it to the top-level transaction's invalidation list,
because if the transaction later aborts before we stream (or we stream
directly on commit), we would end up with an unnecessarily long list of
invalidations coming from aborted subtransactions.
- If we keep collecting them in the individual subtransaction's
ReorderBufferTxn->invalidations, then the question is when to merge
them.  I think it is a good idea to merge them all as soon as we try to
stream, or on commit.  Since this solution of combining the
(sub)transactions' invalidations is required for the streaming case
anyway, we can use it as the common solution, whether we stream due to
memory overflow or due to commit.
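To make the "merge on stream/commit" idea concrete, a very rough sketch (the helper name is made up; ninvalidations/invalidations/subtxns are the existing ReorderBufferTXN fields):

/* fragment -- fold each subxact's invalidations into the toplevel txn */
static void
merge_subxact_invalidations(ReorderBufferTXN *txn)
{
    dlist_iter  iter;

    dlist_foreach(iter, &txn->subtxns)
    {
        ReorderBufferTXN *subtxn;
        Size        nmsgs;

        subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
        if (subtxn->ninvalidations == 0)
            continue;

        nmsgs = txn->ninvalidations + subtxn->ninvalidations;
        if (txn->invalidations == NULL)
            txn->invalidations = (SharedInvalidationMessage *)
                palloc(nmsgs * sizeof(SharedInvalidationMessage));
        else
            txn->invalidations = (SharedInvalidationMessage *)
                repalloc(txn->invalidations,
                         nmsgs * sizeof(SharedInvalidationMessage));

        memcpy(txn->invalidations + txn->ninvalidations,
               subtxn->invalidations,
               subtxn->ninvalidations * sizeof(SharedInvalidationMessage));
        txn->ninvalidations = nmsgs;
    }
}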

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Amit Kapila
On Tue, Jun 16, 2020 at 7:49 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jun 9, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > I think one of the usages we still need is in ReorderBufferForget
> > > because it can be called when we skip processing the txn.  See the
> > > comments in DecodeCommit where we call this function.  If I am
> > > correct, we need to probably collect all invalidations in
> > > ReorderBufferTxn as we are collecting tuplecids and use them here.  We
> > > can do the same during processing of XLOG_XACT_INVALIDATIONS.
> > >
> >
> > One more point related to this is that after this patch series, we
> > need to consider executing all invalidation during transaction abort.
> > Because it is possible that due to memory overflow, we have processed
> > some of the messages which also contain a few XACT_INVALIDATION
> > messages, so to avoid cache pollution, we need to execute all of them
> > in abort.  We also do the similar thing in Rollback/Rollback To
> > Savepoint, see AtEOXact_Inval and AtEOSubXact_Inval.
>
> I have analyzed this further and I think there is some problem with
> that. If Instead of keeping the invalidation as an individual change,
> if we try to combine them in ReorderBufferTxn's invalidation then what
> happens if the (sub)transaction is aborted.  Basically, in this case,
> we will end up executing all those invalidations for those we never
> polluted the cache if we never try to stream it.  So this will affect
> the normal case where we haven't streamed the transaction because
> every time we have executed the invalidation logged by transaction
> those are aborted.  One way is we develop the list at the
> sub-transaction level and just before sending the transaction (on
> commit) combine all the (sub) transaction's invalidation list.  But,
> I think since we already have the invalidation in the commit message
> then there is no point in adding this complexity.
> But, my main worry is about the streaming transaction, the problems are
> - Immediately on the arrival of individual invalidation, we can not
> directly add to the top-level transaction's invalidation list because
> later if the transaction aborted before we stream (or we directly
> stream on commit) then we will get an unnecessarily long list of
> invalidation which is done by aborted subtransaction.
>

Is there any correctness problem you see with this, or are you
concerned about the efficiency?  Please note that we already do
something similar in ReorderBufferForget, and if your concern is
efficiency then that applies to the existing cases as well.  If we want,
we can improve it later in many ways, one of which you have already
suggested; at this point the main thing is correctness, and aborts are
not frequent enough to worry too much about their performance.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Dilip Kumar
On Wed, Jun 17, 2020 at 9:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jun 16, 2020 at 7:49 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Jun 9, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > I think one of the usages we still need is in ReorderBufferForget
> > > > because it can be called when we skip processing the txn.  See the
> > > > comments in DecodeCommit where we call this function.  If I am
> > > > correct, we need to probably collect all invalidations in
> > > > ReorderBufferTxn as we are collecting tuplecids and use them here.  We
> > > > can do the same during processing of XLOG_XACT_INVALIDATIONS.
> > > >
> > >
> > > One more point related to this is that after this patch series, we
> > > need to consider executing all invalidation during transaction abort.
> > > Because it is possible that due to memory overflow, we have processed
> > > some of the messages which also contain a few XACT_INVALIDATION
> > > messages, so to avoid cache pollution, we need to execute all of them
> > > in abort.  We also do the similar thing in Rollback/Rollback To
> > > Savepoint, see AtEOXact_Inval and AtEOSubXact_Inval.
> >
> > I have analyzed this further and I think there is some problem with
> > that. If Instead of keeping the invalidation as an individual change,
> > if we try to combine them in ReorderBufferTxn's invalidation then what
> > happens if the (sub)transaction is aborted.  Basically, in this case,
> > we will end up executing all those invalidations for those we never
> > polluted the cache if we never try to stream it.  So this will affect
> > the normal case where we haven't streamed the transaction because
> > every time we have executed the invalidation logged by transaction
> > those are aborted.  One way is we develop the list at the
> > sub-transaction level and just before sending the transaction (on
> > commit) combine all the (sub) transaction's invalidation list.  But,
> > I think since we already have the invalidation in the commit message
> > then there is no point in adding this complexity.
> > But, my main worry is about the streaming transaction, the problems are
> > - Immediately on the arrival of individual invalidation, we can not
> > directly add to the top-level transaction's invalidation list because
> > later if the transaction aborted before we stream (or we directly
> > stream on commit) then we will get an unnecessarily long list of
> > invalidation which is done by aborted subtransaction.
> >
>
> Is there any problem you see with this or you are concerned with the
> efficiency?  Please note, we already do something similar in
> ReorderBufferForget and if your concern is efficiency then that
> applies to existing cases as well.  I think if we want we can improve
> it later in many ways and one of them you have already suggested, at
> this time, the main thing is correctness and also aborts are not
> frequent enough to worry too much about their performance.

As of now, I don't see a correctness problem; I was just concerned
about processing more invalidation messages in the aborted cases
compared to the current code, even when streaming is off or the
transaction was never streamed because the memory limit was not
crossed.  But I agree that it only affects the abort case, so I will
work on this and later maybe we can test the performance.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Amit Kapila
On Tue, Jun 16, 2020 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 15, 2020 at 6:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I have few more comments on the patch
> > 0013-Change-buffile-interface-required-for-streaming-.patch:
> >
>
> Review comments on 0014-Worker-tempfile-use-the-shared-buffile-infrastru:
>

changes_filename(char *path, Oid subid, TransactionId xid)
 {
- char tempdirpath[MAXPGPATH];
-
- TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
-
- /*
- * We might need to create the tablespace's tempfile directory, if no
- * one has yet done so.
- */
- if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not create directory \"%s\": %m",
- tempdirpath)));
-
- snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.changes",
- tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+ snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);

Today, I was studying this change and its impact.  Initially, I thought
that because the patch has removed the pgsql_tmp prefix from the
filename, it might create problems if temporary files remain on disk
after a crash.  Now that the patch uses the BufFile interface, that
seems to be taken care of internally, because it generates names like
"base/pgsql_tmp/pgsql_tmp13774.0.sharedfileset/16393-513.changes.0",
i.e. it ensures the file is created in a directory whose name starts
with pgsql_tmp.  I tried crashing the server in a situation where the
temp files remain, and after the restart they are removed.  So it seems
okay to generate file names like that, but I still suggest testing
other paths such as backup, where we ignore files whose names start
with PG_TEMP_FILE_PREFIX.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Dilip Kumar
On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Jun 7, 2020 at 5:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, Jun 5, 2020 at 11:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Let me know what you think of the changes?  If you find them okay,
> > > then feel to include them in the next patch-set.
> > >
> > > [1] -
https://www.postgresql.org/message-id/CAONYFtOv%2BEr1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg%40mail.gmail.com
> >
> > Thanks for the patch, I will review it and include it in my next version.

I have merged your changes 0002 in this version.

> Okay, I have done review of
> 0002-Issue-individual-invalidations-with-wal_level-lo.patch and below
> are my comments:
>
> 1. I don't think it is a good idea that logical decoding process the
> new XLOG_XACT_INVALIDATIONS and existing WAL records for invalidations
> like XLOG_INVALIDATIONS and what we do in DecodeCommit (see code in
> the check "if (parsed->nmsgs > 0)").  I think if that is required for
> some particular reason then we should write detailed comments about
> the same.  I have tried some experiments to see if those are really
> required:
> a. After applying patch 0002, I have tried by commenting out the
> processing of invalidations via DecodeCommit and found some regression
> tests were failing but the reason for failure was that we are not
> setting RBTXN_HAS_CATALOG_CHANGES for the toptxn when subtxn has
> catalog changes and when I did that all regression tests started
> passing.  See the attached diff patch
> (v27-0003-Incremental-patch-for-0002-to-test-removal-of-du) atop 0002
> patch.
> b. The processing of invalidations for XLOG_INVALIDATIONS is added by
> commit c6ff84b06a for xid-less transactions.  See
> https://postgr.es/m/CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.com
> to know why that has been added.  Now, after this patch we will
> process the same invalidations via XLOG_XACT_INVALIDATIONS and
> XLOG_INVALIDATIONS which doesn't seem warranted.  Also, the below
> assertion will fail for xid-less transactions (try create index
> concurrently statement):
> + case XLOG_XACT_INVALIDATIONS:
> + {
> + TransactionId xid;
> + xl_xact_invalidations *invals;
> +
> + xid = XLogRecGetXid(r);
> + invals = (xl_xact_invalidations *) XLogRecGetData(r);
> +
> + Assert(TransactionIdIsValid(xid));
>
> I feel we don't need the processing of XLOG_INVALIDATIONS in logical
> decoding after this patch but to prove that first we need to write a
> test case which need XLOG_INVALIDATIONS in the HEAD as commit
> c6ff84b06a doesn't add one.  I think we need two code paths in
> XLOG_XACT_INVALIDATIONS where if it is for xid-less transactions, then
> execute actions immediately as we are doing in processing of
> XLOG_INVALIDATIONS, otherwise, do what we are doing currently in the
> patch.  If the above point (b) is correct, I am not sure if it is a
> good idea to use RM_XACT_ID as resource manager if for this WAL in
> LogLogicalInvalidations, what do you think?
>
> I think one of the usages we still need is in ReorderBufferForget
> because it can be called when we skip processing the txn.  See the
> comments in DecodeCommit where we call this function.  If I am
> correct, we need to probably collect all invalidations in
> ReorderBufferTxn as we are collecting tuplecids and use them here.  We
> can do the same during processing of XLOG_XACT_INVALIDATIONS.
>
> I had also thought a bit about removing logging of invalidations at
> commit time altogether but it seems processing hot-standby is somewhat
> tightly coupled with existing WAL logging.  See xact_redo_commit (a
> comment atop call to ProcessCommittedInvalidationMessages).  It says
> we need to maintain the order when we process invalidations.  If we
> can later find a way to avoid that we can probably remove it but for
> now maybe we can live with it.

Yes, I have made the changes.  Basically, I am now using only
XLOG_XACT_INVALIDATIONS for generating all the invalidation messages,
so whenever we get a new set of XLOG_XACT_INVALIDATIONS we directly
append it to txn->invalidations.  I have tested the XLOG_INVALIDATIONS
part, but while sending this mail I realized that we could write an
automated test for it.  I will work on that soon.

> 2.
> + /* not expected, but print something anyway */
> + else if (msg->id == SHAREDINVALSMGR_ID)
> + appendStringInfoString(buf, " smgr");
> + /* not expected, but print something anyway */
> + else if (msg->id == SHAREDINVALRELMAP_ID)
>
> I think the above comment is not valid after we started logging at CCI.

Yup, fixed.

> 3.
> +
> + xid = XLogRecGetXid(r);
> + invals = (xl_xact_invalidations *) XLogRecGetData(r);
> +
> + Assert(TransactionIdIsValid(xid));
> + ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
> + invals->nmsgs, invals->msgs);
>
> Here, it should check !ctx->forward as we do in DecodeCommit, do we
> have any reason for not doing so.  We can test once by changing this.

Yeah, it should have this check.

This version mostly contains changes in 0002.  Apart from that, we
needed some changes in 0005 and 0006 to rebase them on 0002, and there
is one bug fix in 0005: txn->snapshot_now was not being set to NULL
after freeing, so it was getting freed twice.  I have also removed the
extra wait events from 0014, as BufFile already logs the wait event
internally, and made some changes because the BufFileWrite interface
changed in recent commits.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Dilip Kumar
On Tue, Jun 9, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I think one of the usages we still need is in ReorderBufferForget
> > because it can be called when we skip processing the txn.  See the
> > comments in DecodeCommit where we call this function.  If I am
> > correct, we need to probably collect all invalidations in
> > ReorderBufferTxn as we are collecting tuplecids and use them here.  We
> > can do the same during processing of XLOG_XACT_INVALIDATIONS.
> >
>
> One more point related to this is that after this patch series, we
> need to consider executing all invalidation during transaction abort.
> Because it is possible that due to memory overflow, we have processed
> some of the messages which also contain a few XACT_INVALIDATION
> messages, so to avoid cache pollution, we need to execute all of them
> in abort.  We also do the similar thing in Rollback/Rollback To
> Savepoint, see AtEOXact_Inval and AtEOSubXact_Inval.

Yes, we need to do that.  Now we are collecting all the invalidations
under txn->invalidations, so they get executed on abort.

>
> Few other comments on
> 0002-Issue-individual-invalidations-with-wal_level-lo.patch
> ---------------------------------------------------------------------------------------------------------------
> 1.
> + if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
> + {
> + ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
> + MakeSharedInvalidMessagesArray);
> + invalMessages = SharedInvalidMessagesArray;
> + nmsgs  = numSharedInvalidMessagesArray;
> + SharedInvalidMessagesArray = NULL;
> + numSharedInvalidMessagesArray = 0;
>
> a. Immediately after ProcessInvalidationMessagesMulti, isn't it better
> to have an Assertion like Assert(!(numSharedInvalidMessagesArray > 0
> && SharedInvalidMessagesArray == NULL));?

Done

> b. Why check "if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)" is
> required?  If you see xactGetCommittedInvalidationMessages where we do
> something similar, we only check for valid value of transInvalInfo and
> here we check the same in the caller of LogLogicalInvalidations, isn't
> that sufficient?  If that is sufficient, we can either have the same
> check here or have an Assert for the same.

I have put the same check here.

>
> 2.
> @@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void)
>   if (transInvalInfo == NULL)
>   return;
>
> + if (XLogLogicalInfoActive())
> + LogLogicalInvalidations();
> +
>   ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
>   LocalExecuteInvalidationMessage);
> Generally, we WAL log the action after performing it but here you are
> writing WAL first.  Is there any specific reason?  If so, can we write
> a comment about the same?

Yeah, there is no reason for the same so moved it down.

>
> 3.
> + * When wal_level=logical, write invalidations into WAL at each command end to
> + * support the decoding of the in-progress transaction.  As of now it was
> + * enough to log invalidation only at commit because we are only decoding the
> + * transaction at the commit time.   We only need to log the catalog cache and
> + * relcache invalidation.  There can not be any active MVCC scan in logical
> + * decoding so we don't need to log the snapshot invalidation.
>
> I think this comment doesn't hold good after we have changed the patch
> to LOG invalidations at the time of CCI.

Right, modified.

>
> 4.
> +
> +/*
> + * Emit WAL for invalidations.
> + */
> +static void
> +LogLogicalInvalidations()
>
> Add the function name atop of this function in comments to match the
> style with other nearby functions.  How about modifying it to
> something like: "Emit WAL for invalidations.  This is currently only
> used for logging invalidations at the command end."

Done

>
> 5.
> + *
> + * XXX Do we need to care about relcacheInitFileInval and
> + * the other fields added to ReorderBufferChange, or just
> + * about the message itself?
> + */
>
> I don't think we need to do anything about relcacheInitFileInval.
> This is used to remove the stale files (RELCACHE_INIT_FILENAME) that
> have obsolete information about relcache.  The walsender process that
> is doing decoding doesn't require us to do anything about this.  Also,
> if you see before this patch, we don't do anything about relcache
> files during decoding of invalidation messages.  In short, I think we
> can remove this comment unless you see some use of it.

Now, we have removed the Invalidation change itself so this comment is gone.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Dilip Kumar
On Mon, Jun 15, 2020 at 6:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 15, 2020 at 9:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, Jun 12, 2020 at 4:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > > >  Basically, this part is still
> > > > I have to work upon, once we get the consensus then I can remove those
> > > > extra wait event from the patch.
> > > >
> > >
> > > Okay, feel free to send an updated patch with the above change.
> >
> > Sure, I will do that in the next patch set.
> >
>
> I have few more comments on the patch
> 0013-Change-buffile-interface-required-for-streaming-.patch:
>
> 1.
> - * temp_file_limit of the caller, are read-only and are automatically closed
> - * at the end of the transaction but are not deleted on close.
> + * temp_file_limit of the caller, are read-only if the flag is set and are
> + * automatically closed at the end of the transaction but are not deleted on
> + * close.
>   */
>  File
> -PathNameOpenTemporaryFile(const char *path)
> +PathNameOpenTemporaryFile(const char *path, int mode)
>
> No need to say "are read-only if the flag is set".  I don't see any
> flag passed to function so that part of the comment doesn't seem
> appropriate.

Done

> 2.
> @@ -68,7 +68,8 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
>   }
>
>   /* Register our cleanup callback. */
> - on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
> + if (seg)
> + on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
>  }
>
> Add comments atop function to explain when we don't want to register
> the dsm detach stuff?

Done.  I am also planning to work on a cleaner function for
on_proc_exit, as we discussed offlist; I will do that in the next
version.

> 3.
> + */
> + newFile = file->numFiles - 1;
> + newOffset = FileSize(file->files[file->numFiles - 1]);
>   break;
>
> FileSize can return negative lengths to indicate failure which we
> should handle.

Done

>  See other places in the code where FileSize is used?
> But I have another question here which is why we need to implement
> SEEK_END?  How other usages of BufFile interface takes care of this?
> I see an API BufFileTell which can give the current read/write
> location in the file, isn't that sufficient for your usage?  Also, how
> before BufFile usage is this thing handled in the patch?

So far we have never supported opening a BufFile in write mode; we only
create it in write mode.  So as long as we have created the file and it
is still open, we can always use BufFileTell, which gives the current
end location of the file.  But once we close it and reopen it, it is
always positioned to read from the start of the file, per the current
use cases.  We need a way to jump to the end of the last file so that
we can append to it.
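In other words, the pattern on reopening is roughly this (sketch; the O_RDWR mode for BufFileOpenShared and the SEEK_END support in BufFileSeek are the proposed extensions, not existing API):

    /* Reopen this xid's changes file and append to it. */
    stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);

    /* Jump to the end of the last physical file. */
    if (BufFileSeek(stream_fd, 0, 0, SEEK_END) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not seek to the end of streamed changes file \"%s\": %m",
                        path)));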

> 4.
> + /* Loop over all the  files upto the fileno which we want to truncate. */
> + for (i = file->numFiles - 1; i >= fileno; i--)
>
> "the  files", extra space in the above part of the comment.

Fixed

> 5.
> + /*
> + * Except the fileno,  we can directly delete other files.
>
> Before 'we', there is extra space.

Done.

> 6.
> + else
> + {
> + FileTruncate(file->files[i], offset, WAIT_EVENT_BUFFILE_READ);
> + newOffset = offset;
> + }
>
> The wait event passed here doesn't seem to be appropriate.  You might
> want to introduce a new wait event WAIT_EVENT_BUFFILE_TRUNCATE.  Also,
> the error handling for FileTruncate is missing.

Done

> 7.
> + if ((i != fileno || offset == 0) && fileno != 0)
> + {
> + SharedSegmentName(segment_name, file->name, i);
> + SharedFileSetDelete(file->fileset, segment_name, true);
> + newFile--;
> + newOffset = MAX_PHYSICAL_FILESIZE;
> + }
>
> Similar to the previous comment, I think we should handle the failure
> of SharedFileSetDelete.
>
> 8. I think the comments related to BufFile shared API usage need to be
> expanded in the code to explain the new usage.  For ex., see the below
> comments atop buffile.c
> * BufFile supports temporary files that can be made read-only and shared with
> * other backends, as infrastructure for parallel execution.  Such files need
> * to be created as a member of a SharedFileSet that all participants are
> * attached to.

Other fixes (raised offlist by my colleague Neha Sharma):
1. In BufFileTruncateShared, the files were not closed before being
deleted (in 0013).
2. In apply_handle_stream_commit, the file name in the debug message was
printed before the name was populated (0014).
3. On concurrent abort we were truncating all the changes, including
some incomplete ones, so later when the complete changes arrive we no
longer have the earlier part.  For example, if the last stream contained
a specinsert and we delete those changes on concurrent abort detection,
we will later get a spec_confirm without the spec insert.  We could have
simply avoided deleting all the changes, but I think the better fix is:
once we detect a concurrent abort for a transaction, there is no need to
collect any further changes for it, so we simply skip them.  I have put
that fix in 0006.
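The fix in (3) essentially boils down to a check like this in ReorderBufferQueueChange (sketch; 'concurrent_abort' is the flag the patch adds to ReorderBufferTXN, and the name may change):

    /*
     * While streaming earlier changes we already detected that this
     * transaction was concurrently aborted, so there is no point in
     * collecting any further changes for it.
     */
    if (txn->concurrent_abort)
    {
        ReorderBufferReturnChange(rb, change);
        return;
    }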

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Dilip Kumar
On Tue, Jun 16, 2020 at 2:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 15, 2020 at 6:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I have few more comments on the patch
> > 0013-Change-buffile-interface-required-for-streaming-.patch:
> >
>
> Review comments on 0014-Worker-tempfile-use-the-shared-buffile-infrastru:
> 1.
> The subxact file is only create if there
> + * are any suxact info under this xid.
> + */
> +typedef struct StreamXidHash
>
> Lets slightly reword the part of the comment as "The subxact file is
> created iff there is any suxact info under this xid."

Done

>
> 2.
> @@ -710,6 +740,9 @@ apply_handle_stream_stop(StringInfo s)
>   subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
>   stream_close_file();
>
> + /* Commit the per-stream transaction */
> + CommitTransactionCommand();
>
> Before calling commit, ensure that we are in a valid transaction.  I
> think we can have an Assert for IsTransactionState().

Done

> 3.
> @@ -761,11 +791,13 @@ apply_handle_stream_abort(StringInfo s)
>
>   int64 i;
>   int64 subidx;
> - int fd;
> + BufFile    *fd;
>   bool found = false;
>   char path[MAXPGPATH];
> + StreamXidHash *ent;
>
>   subidx = -1;
> + ensure_transaction();
>   subxact_info_read(MyLogicalRepWorker->subid, xid);
>
> Why to call ensure_transaction here?  Is there any reason that we
> won't have a valid transaction by now?  If not, then its better to
> have an Assert for IsTransactionState().

We only start a transaction from stream_start to stream_stop, so at
stream_abort we will not have a transaction open.

> 4.
> - if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
> + if (BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
>   {
> - int save_errno = errno;
> + int save_errno = errno;
>
> - CloseTransientFile(fd);
> + BufFileClose(fd);
>
> On error, won't these files be close automatically?  If so, why at
> this place and before other errors, we need to close this?

Yes, that's correct.  I have fixed those.

> 5.
> if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
> {
> int save_errno = errno;
>
> BufFileClose(fd);
> errno = save_errno;
> ereport(ERROR,
> (errcode_for_file_access(),
> errmsg("could not read file \"%s\": %m",
>
> Can we change the error message to "could not read from streaming
> transactions file .." or something like that and similarly we can
> change the message for failure in reading changes file?

Done


> 6.
> if (BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
> {
> int save_errno = errno;
>
> BufFileClose(fd);
> errno = save_errno;
> ereport(ERROR,
> (errcode_for_file_access(),
> errmsg("could not write to file \"%s\": %m",
>
> Similar to previous, can we change it to "could not write to streaming
> transactions file

BufFileWrite is not returning failure anymore.

> 7.
> @@ -2855,17 +2844,32 @@ stream_open_file(Oid subid, TransactionId xid,
> bool first_segment)
>   * for writing, in append mode.
>   */
>   if (first_segment)
> - flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
> - else
> - flags = (O_WRONLY | O_APPEND | PG_BINARY);
> + {
> + /*
> + * Shared fileset handle must be allocated in the persistent context.
> + */
> + SharedFileSet *fileset =
> + MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
>
> - stream_fd = OpenTransientFile(path, flags);
> + PrepareTempTablespaces();
> + SharedFileSetInit(fileset, NULL);
>
> Why are we calling PrepareTempTablespaces here? It is already called
> in SharedFileSetInit.

My bad.  First I tried using SharedFileSetInit directly, but later that
got changed and I forgot to remove this part.

> 8.
> + /*
> + * Start a transaction on stream start, this transaction will be committed
> + * on the stream stop.  We need the transaction for handling the buffile,
> + * used for serializing the streaming data and subxact info.
> + */
> + ensure_transaction();
>
> I think we need this for PrepareTempTablespaces to set the
> temptablespaces.  Also, isn't it required for a cleanup of buffile
> resources at the transaction end?  Are there any other reasons for it
> as well?  The comment should be a bit more clear for why we need a
> transaction here.

I am not sure it makes sense to add a comment here about why buffile
and sharedfileset need a transaction.  Do you think we should instead
add a comment to the buffile/sharedfileset API saying that it must be
called inside a transaction?

> 9.
> * Open a file for streamed changes from a toplevel transaction identified
>  * by stream_xid (global variable). If it's the first chunk of streamed
>  * changes for this transaction, perform cleanup by removing existing
>  * files after a possible previous crash.
> ..
> stream_open_file(Oid subid, TransactionId xid, bool first_segment)
>
> The above part comment atop stream_open_file needs to be changed after
> new implementation.

Done

> 10.
>  * enabled.  This context is reeset on each stream stop.
> */
> LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
>
> /reeset/reset

Done


> 11.
> stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
> {
> ..
> + /* No entry created for this xid so simply return. */
> + if (ent == NULL)
> + return;
> ..
> }
>
> Is there any reason or scenario where this ent can be NULL?  If not,
> it will be better to have an Assert for the same.

Right, it should be an Assert; even if all the changes for the top
transaction are ignored, we should still have sent the stream_start.
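i.e. (sketch):

    /* The stream must have been started, so an entry has to exist. */
    ent = (StreamXidHash *) hash_search(xidhash, &xid, HASH_FIND, NULL);
    Assert(ent != NULL);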

> 12.
> subxact_info_write(Oid subid, TransactionId xid)
> {
> ..
> + /*
> + * If there is no subtransaction then nothing to do,  but if already have
> + * subxact file then delete that.
> + */
> + if (nsubxacts == 0)
>   {
> - ereport(ERROR,
> - (errcode_for_file_access(),
> - errmsg("could not create file \"%s\": %m",
> - path)));
> + if (ent->subxact_fileset)
> + {
> + cleanup_subxact_info();
> + BufFileDeleteShared(ent->subxact_fileset, path);
> + ent->subxact_fileset = NULL;
> ..
> }
>
> Here don't we need to free the subxact_fileset before setting it to NULL?

Yes, done

> 13.
> + /*
> + * Scan complete hash and delete the underlying files for the the xids.
> + * Also delete the memory for the shared file sets.
> + */
>
> /the the/the.  Instead of "delete the memory", it would be better to
> say "release the memory".

Done

>
> 14.
> + /*
> + * We might not have created the suxact fileset if there is no sub
> + * transaction.
> + */
>
> /suxact/subxact
Done




--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Amit Kapila
On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Yes, I have made the changes.  Basically, now I am only using the
> XLOG_XACT_INVALIDATIONS for generating all the invalidation messages.
> So whenever we are getting the new set of XLOG_XACT_INVALIDATIONS, we
> are directly appending it to the txn->invalidations.  I have tested
> the XLOG_INVALIDATIONS part but while sending this mail I realized
> that we could write some automated test for the same.
>

Can you share how you have tested it?

>  I will work on
> that soon.
>

Cool, I think having a regression test for this will be a good idea.

@@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb,
TransactionId xid, XLogRecPtr lsn)
  if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
  ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
     txn->invalidations);
- else
- Assert(txn->ninvalidations == 0);

Why is this Assert removed?

Apart from above, I have made a number of changes in
0002-WAL-Log-invalidations-at-command-end-with-wal_le to remove some
unnecessary changes, edited comments, ran pgindent and updated the
commit message.  If you are fine with these changes, then do include
them in your next version.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Amit Kapila
On Mon, Jun 22, 2020 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Yes, I have made the changes.  Basically, now I am only using the
> > XLOG_XACT_INVALIDATIONS for generating all the invalidation messages.
> > So whenever we are getting the new set of XLOG_XACT_INVALIDATIONS, we
> > are directly appending it to the txn->invalidations.  I have tested
> > the XLOG_INVALIDATIONS part but while sending this mail I realized
> > that we could write some automated test for the same.
> >
>
> Can you share how you have tested it?
>
> >  I will work on
> > that soon.
> >
>
> Cool, I think having a regression test for this will be a good idea.
>

Other than above tests, can we somehow verify that the invalidations
generated at commit time are the same as what we do with this patch?
We have verified with individual commands but it would be great if we
can verify for the regression tests.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Dilip Kumar
On Mon, Jun 22, 2020 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Yes, I have made the changes.  Basically, now I am only using the
> > XLOG_XACT_INVALIDATIONS for generating all the invalidation messages.
> > So whenever we are getting the new set of XLOG_XACT_INVALIDATIONS, we
> > are directly appending it to the txn->invalidations.  I have tested
> > the XLOG_INVALIDATIONS part but while sending this mail I realized
> > that we could write some automated test for the same.
> >
>
> Can you share how you have tested it?

I just ran create index concurrently and decoded the changes.

> >  I will work on
> > that soon.
> >
>
> Cool, I think having a regression test for this will be a good idea.

ok

> @@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb,
> TransactionId xid, XLogRecPtr lsn)
>   if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
>   ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
>      txn->invalidations);
> - else
> - Assert(txn->ninvalidations == 0);
>
> Why this Assert is removed?

Even if base_snapshot is NULL, we now collect txn->invalidations.
However, we haven't done any activity for that transaction, so we don't
need to execute the invalidations (same as the code before), but the
Assert is no longer valid.

> Apart from above, I have made a number of changes in
> 0002-WAL-Log-invalidations-at-command-end-with-wal_le to remove some
> unnecessary changes, edited comments, ran pgindent and updated the
> commit message.  If you are fine with these changes, then do include
> them in your next version.

Thanks, I will check those.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Amit Kapila
On Mon, Jun 22, 2020 at 4:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jun 22, 2020 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > Yes, I have made the changes.  Basically, now I am only using the
> > > XLOG_XACT_INVALIDATIONS for generating all the invalidation messages.
> > > So whenever we are getting the new set of XLOG_XACT_INVALIDATIONS, we
> > > are directly appending it to the txn->invalidations.  I have tested
> > > the XLOG_INVALIDATIONS part but while sending this mail I realized
> > > that we could write some automated test for the same.
> > >
> >
> > Can you share how you have tested it?
>
> I just ran create index concurrently and decoded the changes.
>

Hmm, I think that won't reproduce the exact problem.  What I wanted was
to run another command after "create index concurrently" that depends on
it, and see whether decoding fails once the XLOG_INVALIDATIONS code is
removed.  Once you get a failure, you can apply the 0002 patch and see
if the test passes.

>
> > @@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb,
> > TransactionId xid, XLogRecPtr lsn)
> >   if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
> >   ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
> >      txn->invalidations);
> > - else
> > - Assert(txn->ninvalidations == 0);
> >
> > Why this Assert is removed?
>
> Even if the base_snapshot is NULL, now we are collecting the
> txn->invalidation.
>

But there doesn't seem to be any check, even before this patch, that
directly prohibits accumulating invalidations in DecodeCommit.  We
have a check for base_snapshot in ReorderBufferCommit.  Did you get
any failure with that check?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Jun 22, 2020 at 5:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 22, 2020 at 4:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jun 22, 2020 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > Yes, I have made the changes.  Basically, now I am only using the
> > > > XLOG_XACT_INVALIDATIONS for generating all the invalidation messages.
> > > > So whenever we are getting the new set of XLOG_XACT_INVALIDATIONS, we
> > > > are directly appending it to the txn->invalidations.  I have tested
> > > > the XLOG_INVALIDATIONS part but while sending this mail I realized
> > > > that we could write some automated test for the same.
> > > >
> > >
> > > Can you share how you have tested it?
> >
> > I just ran create index concurrently and decoded the changes.
> >
>
> Hmm, I think that won't reproduce the exact problem.  What I wanted
> was to run another command after "create index concurrently" which
> depends on that and see if the decoding fails by removing the
> XLOG_INVALIDATIONS code.  Once you get some failure, you can apply the
> 0002 patch and see if the test is passed?

Okay, I will test that.

> >
> > > @@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb,
> > > TransactionId xid, XLogRecPtr lsn)
> > >   if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
> > >   ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
> > >      txn->invalidations);
> > > - else
> > > - Assert(txn->ninvalidations == 0);
> > >
> > > Why this Assert is removed?
> >
> > Even if the base_snapshot is NULL, now we are collecting the
> > txn->invalidation.
> >
>
> But there doesn't seem to be any check even before this patch which
> directly prohibits accumulating invalidations in DecodeCommit.  We
> have check for base_snapshot in ReorderBufferCommit.  Did you get any
> failure with that check?

Because earlier, ReorderBufferForget for the top transaction would be
called if the top transaction was aborted, and in the abort case we
don't log any invalidations, so that count would be 0.  However, the
same is not true now.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Jun 22, 2020 at 6:38 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jun 22, 2020 at 5:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > > @@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb,
> > > > TransactionId xid, XLogRecPtr lsn)
> > > >   if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
> > > >   ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
> > > >      txn->invalidations);
> > > > - else
> > > > - Assert(txn->ninvalidations == 0);
> > > >
> > > > Why this Assert is removed?
> > >
> > > Even if the base_snapshot is NULL, now we are collecting the
> > > txn->invalidation.
> > >
> >
> > But there doesn't seem to be any check even before this patch which
> > directly prohibits accumulating invalidations in DecodeCommit.  We
> > have check for base_snapshot in ReorderBufferCommit.  Did you get any
> > failure with that check?
>
> Because earlier ReorderBufferForget for toptxn will be called if the
> top transaction is aborted and in abort case, we are not logging any
> invalidation so that will be 0.  However same is not true now.
>

AFAICS, ReorderBufferForget() is called (via DecodeCommit) only when
we need to skip the transaction.  It doesn't seem to be called from
Abort path (DecodeAbort/ReorderBufferAbort doesn't use
ReorderBufferForget).  I am not sure which code path are you referring
here, can you please share the code flow which you are referring to
here.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, Jun 23, 2020 at 8:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 22, 2020 at 6:38 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jun 22, 2020 at 5:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > > > @@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb,
> > > > > TransactionId xid, XLogRecPtr lsn)
> > > > >   if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
> > > > >   ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
> > > > >      txn->invalidations);
> > > > > - else
> > > > > - Assert(txn->ninvalidations == 0);
> > > > >
> > > > > Why this Assert is removed?
> > > >
> > > > Even if the base_snapshot is NULL, now we are collecting the
> > > > txn->invalidation.
> > > >
> > >
> > > But there doesn't seem to be any check even before this patch which
> > > directly prohibits accumulating invalidations in DecodeCommit.  We
> > > have check for base_snapshot in ReorderBufferCommit.  Did you get any
> > > failure with that check?
> >
> > Because earlier ReorderBufferForget for toptxn will be called if the
> > top transaction is aborted and in abort case, we are not logging any
> > invalidation so that will be 0.  However same is not true now.
> >
>
> AFAICS, ReorderBufferForget() is called (via DecodeCommit) only when
> we need to skip the transaction.  It doesn't seem to be called from
> Abort path (DecodeAbort/ReorderBufferAbort doesn't use
> ReorderBufferForget).  I am not sure which code path are you referring
> here, can you please share the code flow which you are referring to
> here.

I think you are right.  During some intermediate code change it
crashed on that assert (I guess I might have been adding invalidations
to the sub-transaction, but I am not sure what that state was), and I
assumed that was the reason, as I explained above; now I see my
assumption was wrong.  I will put back that assert.  In testing I
could not hit that assert even after my changes, but I will give it
more thought in case our situation is different from the base code.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, Jun 23, 2020 at 10:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jun 23, 2020 at 8:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jun 22, 2020 at 6:38 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Mon, Jun 22, 2020 at 5:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > > > @@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb,
> > > > > > TransactionId xid, XLogRecPtr lsn)
> > > > > >   if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
> > > > > >   ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
> > > > > >      txn->invalidations);
> > > > > > - else
> > > > > > - Assert(txn->ninvalidations == 0);
> > > > > >
> > > > > > Why this Assert is removed?
> > > > >
> > > > > Even if the base_snapshot is NULL, now we are collecting the
> > > > > txn->invalidation.
> > > > >
> > > >
> > > > But there doesn't seem to be any check even before this patch which
> > > > directly prohibits accumulating invalidations in DecodeCommit.  We
> > > > have check for base_snapshot in ReorderBufferCommit.  Did you get any
> > > > failure with that check?
> > >
> > > Because earlier ReorderBufferForget for toptxn will be called if the
> > > top transaction is aborted and in abort case, we are not logging any
> > > invalidation so that will be 0.  However same is not true now.
> > >
> >
> > AFAICS, ReorderBufferForget() is called (via DecodeCommit) only when
> > we need to skip the transaction.  It doesn't seem to be called from
> > Abort path (DecodeAbort/ReorderBufferAbort doesn't use
> > ReorderBufferForget).  I am not sure which code path are you referring
> > here, can you please share the code flow which you are referring to
> > here.
>
> I think you are right,  during some intermediate code change, it
> crashed on that assert (I guess I might be adding invalidation to the
> sub-transaction but not sure what was that state) and I assumed that
> is the reason that I explained above but, now I see my assumption was
> wrong.  I will put back that assert.  By testing, I could not hit any
> case where we hit that assert even after my changes, still I will put
> more thought if by any chance our case is different then the base
> code.

Here is a POC patch to discuss the idea of cleaning up shared filesets
on proc exit.  As discussed offlist, I am maintaining a list of shared
filesets.  The first time, when the list is NULL, I register the
cleanup function with the on_proc_exit routine; for every subsequent
fileset I just append it to filesetlist.  There is also an interface
to unregister a shared fileset from the cleanup list, and that is done
by the caller whenever it deletes the shared fileset manually.  While
explaining it here, I noticed one possible issue: if we delete all the
elements, the list becomes NULL again, and on the next
SharedFileSetInit we will register the function once more.  Maybe that
is not a problem, but we could avoid registering multiple times by
using some flag in the file.
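
To make that concrete, here is a rough sketch of the intended flow (a
sketch only; the name SharedFileSetRegister is illustrative, the POC
does this inline in SharedFileSetInit, and memory-context details are
ignored here):

#include "postgres.h"

#include "nodes/pg_list.h"
#include "storage/ipc.h"
#include "storage/sharedfileset.h"

/* Process-local list of filesets created without a DSM segment. */
static List *filesetlist = NIL;

/* on_proc_exit callback: delete files of every still-registered fileset. */
static void
SharedFileSetOnProcExit(int status, Datum arg)
{
    ListCell   *l;

    foreach(l, filesetlist)
        SharedFileSetDeleteAll((SharedFileSet *) lfirst(l));

    filesetlist = NIL;
}

/* Called while initializing a fileset that has no DSM segment. */
static void
SharedFileSetRegister(SharedFileSet *fileset)
{
    /* The first registration also installs the proc-exit callback. */
    if (filesetlist == NIL)
        on_proc_exit(SharedFileSetOnProcExit, 0);

    filesetlist = lcons(fileset, filesetlist);
}

/* Called when the caller deletes the fileset manually. */
static void
SharedFileSetUnregister(SharedFileSet *fileset)
{
    filesetlist = list_delete_ptr(filesetlist, fileset);
}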

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Here is the POC patch to discuss the idea of a cleanup of shared
> fileset on proc exit.  As discussed offlist,  here I am maintaining
> the list of shared fileset.  First time when the list is NULL I am
> registering the cleanup function with on_proc_exit routine.  After
> that for subsequent fileset, I am just appending it to filesetlist.
> There is also an interface to unregister the shared file set from the
> cleanup list and that is done by the caller whenever we are deleting
> the shared fileset manually.  While explaining it here, I think there
> could be one issue if we delete all the element from the list will
> become NULL and on next SharedFileSetInit we will again register the
> function.  Maybe that is not a problem but we can avoid registering
> multiple times by using some flag in the file
>

I don't understand what you mean by "using some flag in the file".

Review comments on various patches.

poc_shared_fileset_cleanup_on_procexit
=================================
1.
- ent->subxact_fileset =
- MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
+ MemoryContext oldctx;

+ /* Shared fileset handle must be allocated in the persistent context */
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+ ent->subxact_fileset = palloc(sizeof(SharedFileSet));
  SharedFileSetInit(ent->subxact_fileset, NULL);
+ MemoryContextSwitchTo(oldctx);
  fd = BufFileCreateShared(ent->subxact_fileset, path);

Why is this change required for this patch and why we only cover
SharedFileSetInit in the Apply context and not BufFileCreateShared?
The comment is also not very clear on this point.

2.
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+ bool found = false;
+ ListCell *l;
+
+ Assert(filesetlist != NULL);
+
+ /* Loop over all the pending shared fileset entry */
+ foreach (l, filesetlist)
+ {
+ SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+ /* remove the entry from the list and delete the underlying files */
+ if (input_fileset->number == fileset->number)
+ {
+ SharedFileSetDeleteAll(fileset);
+ filesetlist = list_delete_cell(filesetlist, l);

Why are we calling SharedFileSetDeleteAll here when in the caller we
have already deleted the fileset as per below code?
BufFileDeleteShared(ent->stream_fileset, path);
+ SharedFileSetUnregister(ent->stream_fileset);

I think it will be good if somehow we can remove the fileset from
filesetlist during BufFileDeleteShared.  If that is possible, then we
don't need a separate API for SharedFileSetUnregister.

3.
+static List * filesetlist = NULL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid
tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const
char *name);
 static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name);
@@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
  /* Register our cleanup callback. */
  if (seg)
  on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+ else
+ {
+ if (filesetlist == NULL)
+ on_proc_exit(SharedFileSetOnProcExit, 0);

We use NIL for list initialization and comparison.  See lock_files usage.

4.
+SharedFileSetOnProcExit(int status, Datum arg)
+{
+ ListCell *l;
+
+ /* Loop over all the pending  shared fileset entry */
+ foreach (l, filesetlist)
+ {
+ SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+ SharedFileSetDeleteAll(fileset);
+ }

We can initialize filesetlist as NIL after the for loop as it will
make the code look clean.

Comments on other patches:
=========================
5.
> 3. On concurrent abort we are truncating all the changes including
> some incomplete changes,  so later when we get the complete changes we
> don't have the previous changes,  e.g, if we had specinsert in the
> last stream and due to concurrent abort detection if we delete that
> changes later we will get spec_confirm without spec insert.  We could
> have simply avoided deleting all the changes, but I think the better
> fix is once we detect the concurrent abort for any transaction, then
> why do we need to collect the changes for that, we can simply avoid
> that.  So I have put that fix. (0006)
>

On similar lines, I think we need to skip processing message, see else
part of code in ReorderBufferQueueMessage.

6.
In v29-0002-Issue-individual-invalidations-with-wal_level-lo,
xact_desc_invalidations seems to be a subset of
standby_desc_invalidations, can we have a common code for them?

7.
I think we can avoid sending v29-0007-Track-statistics-for-streaming
each time.  We can do this after the main patch is complete.
Also, we might need to change how and where these stats will be
tracked.  See the related discussion [1].

8. In v29-0005-Implement-streaming-mode-in-ReorderBuffer,
  * Return oldest transaction in reorderbuffer
@@ -863,6 +909,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb,
TransactionId xid,
  /* set the reference to top-level transaction */
  subtxn->toptxn = txn;

+ /* set the reference to toplevel transaction */
+ subtxn->toptxn = txn;
+

There is a double initialization of subtxn->toptxn.  You need to
remove this line from 0005 patch as we have now added it in an earlier
patch.

9.  I think you forgot to update the patch to execute invalidations in
Abort case or I might be missing something.  I don't see any changes
in ReorderBufferAbort. You have agreed in one of the emails above [2]
about handling the same.

10. In v29-0008-Add-support-for-streaming-to-built-in-replicatio,
 apply_handle_stream_commit(StringInfo s)
 {
 ..
 + /*
 + * send feedback to upstream
 + *
 + * XXX Probably should send a valid LSN. But which one?
 + */
 + send_feedback(InvalidXLogRecPtr, false, false);
 ..
 }

I have given a comment on this code that we don't need this feedback
and you mentioned on June 02 [3] that you will think on it and let me
know your opinion but I don't see a response from you yet.  Can you
get back to me regarding this point?

11. Add some comments as to why we have used Shared BufFile interface
instead of Temp BufFile interface?

12. In v29-0013-Change-buffile-interface-required-for-streaming,
+ * Initialize a space for temporary files that can be opened other backends.

/opened other backends/opened for access by other backends

[1] - https://www.postgresql.org/message-id/CA%2Bfd4k5_pPAYRTDrO2PbtTOe0eHQpBvuqmCr8ic39uTNmR49Eg%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAFiTN-t7WZZjFrAjSYj4fu%3DFZ2JKENN8ZHCUZaw-srnrHMWMrg%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAFiTN-tHpd%2BzXVemo9WqQUJS50p9m8jD%3DAWjsugKZQ4F-K8Pbw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Jun 22, 2020 at 11:56 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jun 16, 2020 at 2:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > 8.
> > + /*
> > + * Start a transaction on stream start, this transaction will be committed
> > + * on the stream stop.  We need the transaction for handling the buffile,
> > + * used for serializing the streaming data and subxact info.
> > + */
> > + ensure_transaction();
> >
> > I think we need this for PrepareTempTablespaces to set the
> > temptablespaces.  Also, isn't it required for a cleanup of buffile
> > resources at the transaction end?  Are there any other reasons for it
> > as well?  The comment should be a bit more clear for why we need a
> > transaction here.
>
> I am not sure that will it make sense to add a comment here that why
> buffile and sharedfileset need a transaction?
>

You can say usage of BufFile interface expects us to be in the
transaction for so and so reason....

> Do you think that we
> should add comment in buffile/shared fileset API that it should be
> called under a transaction?
>

I am fine with that as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Here is the POC patch to discuss the idea of a cleanup of shared
> > fileset on proc exit.  As discussed offlist,  here I am maintaining
> > the list of shared fileset.  First time when the list is NULL I am
> > registering the cleanup function with on_proc_exit routine.  After
> > that for subsequent fileset, I am just appending it to filesetlist.
> > There is also an interface to unregister the shared file set from the
> > cleanup list and that is done by the caller whenever we are deleting
> > the shared fileset manually.  While explaining it here, I think there
> > could be one issue if we delete all the element from the list will
> > become NULL and on next SharedFileSetInit we will again register the
> > function.  Maybe that is not a problem but we can avoid registering
> > multiple times by using some flag in the file
> >
>
> I don't understand what you mean by "using some flag in the file".

Basically, in the POC, as shown in the code snippet below, we register
the on_proc_exit function only if "filesetlist" is NULL.  But, as
described above, if all the items are deleted the list becomes NULL
again.  So what I meant is that instead of checking whether
filesetlist is NULL, we could keep a boolean variable recording
whether we have already registered the callback, and if so not do it
again.

@@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
  /* Register our cleanup callback. */
  if (seg)
  on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+ else
+ {
+ if (filesetlist == NULL)
+ on_proc_exit(SharedFileSetOnProcExit, 0);
+
+ filesetlist = lcons((void *)fileset, filesetlist);
+ }
 }
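
In other words, roughly this (a sketch of the boolean-flag
alternative; registered_cleanup is just an illustrative name):

static List *filesetlist = NIL;
static bool registered_cleanup = false;

/* in SharedFileSetInit(): */

  /* Register our cleanup callback. */
  if (seg)
      on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
  else
  {
      /*
       * Register the proc-exit callback only once per backend, even if
       * filesetlist later becomes empty and is repopulated.
       */
      if (!registered_cleanup)
      {
          on_proc_exit(SharedFileSetOnProcExit, 0);
          registered_cleanup = true;
      }

      filesetlist = lcons((void *) fileset, filesetlist);
  }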

>
> Review comments on various patches.
>
> poc_shared_fileset_cleanup_on_procexit
> =================================
> 1.
> - ent->subxact_fileset =
> - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
> + MemoryContext oldctx;
>
> + /* Shared fileset handle must be allocated in the persistent context */
> + oldctx = MemoryContextSwitchTo(ApplyContext);
> + ent->subxact_fileset = palloc(sizeof(SharedFileSet));
>   SharedFileSetInit(ent->subxact_fileset, NULL);
> + MemoryContextSwitchTo(oldctx);
>   fd = BufFileCreateShared(ent->subxact_fileset, path);
>
> Why is this change required for this patch and why we only cover
> SharedFileSetInit in the Apply context and not BufFileCreateShared?
> The comment is also not very clear on this point.

Because only the sharedfileset, and the filesetlist cell allocated
under SharedFileSetInit, need to live in the permanent context.
BufFileCreateShared only creates the BufFile and the VFD, which are
needed only within the current stream, so the transaction context is
enough for those.

> 2.
> +void
> +SharedFileSetUnregister(SharedFileSet *input_fileset)
> +{
> + bool found = false;
> + ListCell *l;
> +
> + Assert(filesetlist != NULL);
> +
> + /* Loop over all the pending shared fileset entry */
> + foreach (l, filesetlist)
> + {
> + SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
> +
> + /* remove the entry from the list and delete the underlying files */
> + if (input_fileset->number == fileset->number)
> + {
> + SharedFileSetDeleteAll(fileset);
> + filesetlist = list_delete_cell(filesetlist, l);
>
> Why are we calling SharedFileSetDeleteAll here when in the caller we
> have already deleted the fileset as per below code?
> BufFileDeleteShared(ent->stream_fileset, path);
> + SharedFileSetUnregister(ent->stream_fileset);
>
> I think it will be good if somehow we can remove the fileset from
> filesetlist during BufFileDeleteShared.  If that is possible, then we
> don't need a separate API for SharedFileSetUnregister.

But the filesetlist is maintained at the sharedfileset level, so even
if we delete the files in BufFileDeleteShared, we still need to call
an API from the sharedfileset layer to unregister the fileset.  Am I
missing something?

> 3.
> +static List * filesetlist = NULL;
> +
>  static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
> +static void SharedFileSetOnProcExit(int status, Datum arg);
>  static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid
> tablespace);
>  static void SharedFilePath(char *path, SharedFileSet *fileset, const
> char *name);
>  static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name);
> @@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
>   /* Register our cleanup callback. */
>   if (seg)
>   on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
> + else
> + {
> + if (filesetlist == NULL)
> + on_proc_exit(SharedFileSetOnProcExit, 0);
>
> We use NIL for list initialization and comparison.  See lock_files usage.

Right.

> 4.
> +SharedFileSetOnProcExit(int status, Datum arg)
> +{
> + ListCell *l;
> +
> + /* Loop over all the pending  shared fileset entry */
> + foreach (l, filesetlist)
> + {
> + SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
> + SharedFileSetDeleteAll(fileset);
> + }
>
> We can initialize filesetlist as NIL after the for loop as it will
> make the code look clean.

ok

Thanks for your feedback on this.  I will reply to other comments separately.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Wed, Jun 24, 2020 at 4:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > Here is the POC patch to discuss the idea of a cleanup of shared
> > > fileset on proc exit.  As discussed offlist,  here I am maintaining
> > > the list of shared fileset.  First time when the list is NULL I am
> > > registering the cleanup function with on_proc_exit routine.  After
> > > that for subsequent fileset, I am just appending it to filesetlist.
> > > There is also an interface to unregister the shared file set from the
> > > cleanup list and that is done by the caller whenever we are deleting
> > > the shared fileset manually.  While explaining it here, I think there
> > > could be one issue if we delete all the element from the list will
> > > become NULL and on next SharedFileSetInit we will again register the
> > > function.  Maybe that is not a problem but we can avoid registering
> > > multiple times by using some flag in the file
> > >
> >
> > I don't understand what you mean by "using some flag in the file".
>
> Basically, in POC as shown in below code snippet,  We are checking
> that if the "filesetlist" is NULL then only register the on_proc_exit
> function.  But, as described above if all the items are deleted the
> list will be NULL.  So I told that instead of checking the filesetlist
> is NULL,  we can have just a boolean variable that if we have
> registered the callback then don't do it again.
>

Can you check whether there is any precedent for this in the code?

>
> >
> > Review comments on various patches.
> >
> > poc_shared_fileset_cleanup_on_procexit
> > =================================
> > 1.
> > - ent->subxact_fileset =
> > - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
> > + MemoryContext oldctx;
> >
> > + /* Shared fileset handle must be allocated in the persistent context */
> > + oldctx = MemoryContextSwitchTo(ApplyContext);
> > + ent->subxact_fileset = palloc(sizeof(SharedFileSet));
> >   SharedFileSetInit(ent->subxact_fileset, NULL);
> > + MemoryContextSwitchTo(oldctx);
> >   fd = BufFileCreateShared(ent->subxact_fileset, path);
> >
> > Why is this change required for this patch and why we only cover
> > SharedFileSetInit in the Apply context and not BufFileCreateShared?
> > The comment is also not very clear on this point.
>
> Because only the sharedfileset and the filesetlist which is allocated
> under SharedFileSetInit, are required in the permanent context.
> BufFileCreateShared, only creates the Buffile and VFD which will be
> required only within the current stream so transaction context is
> enough.
>

Okay, then add some more comments to explain it or if you have
explained it elsewhere, then add a reference for the same.

> > 2.
> > +void
> > +SharedFileSetUnregister(SharedFileSet *input_fileset)
> > +{
> > + bool found = false;
> > + ListCell *l;
> > +
> > + Assert(filesetlist != NULL);
> > +
> > + /* Loop over all the pending shared fileset entry */
> > + foreach (l, filesetlist)
> > + {
> > + SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
> > +
> > + /* remove the entry from the list and delete the underlying files */
> > + if (input_fileset->number == fileset->number)
> > + {
> > + SharedFileSetDeleteAll(fileset);
> > + filesetlist = list_delete_cell(filesetlist, l);
> >
> > Why are we calling SharedFileSetDeleteAll here when in the caller we
> > have already deleted the fileset as per below code?
> > BufFileDeleteShared(ent->stream_fileset, path);
> > + SharedFileSetUnregister(ent->stream_fileset);
> >
> > I think it will be good if somehow we can remove the fileset from
> > filesetlist during BufFileDeleteShared.  If that is possible, then we
> > don't need a separate API for SharedFileSetUnregister.
>
> But the filesetlist is maintained at the sharedfileset level, so even
> if we delete from BufFileDeleteShared, we need to call an API from the
> sharedfileset layer to unregister the fileset.
>

Sure, but isn't it better if we can call such an API from BufFileDeleteShared?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Here is the POC patch to discuss the idea of a cleanup of shared
> > fileset on proc exit.  As discussed offlist,  here I am maintaining
> > the list of shared fileset.  First time when the list is NULL I am
> > registering the cleanup function with on_proc_exit routine.  After
> > that for subsequent fileset, I am just appending it to filesetlist.
> > There is also an interface to unregister the shared file set from the
> > cleanup list and that is done by the caller whenever we are deleting
> > the shared fileset manually.  While explaining it here, I think there
> > could be one issue if we delete all the element from the list will
> > become NULL and on next SharedFileSetInit we will again register the
> > function.  Maybe that is not a problem but we can avoid registering
> > multiple times by using some flag in the file
> >
>
> I don't understand what you mean by "using some flag in the file".
>
> Review comments on various patches.
>
> poc_shared_fileset_cleanup_on_procexit
> =================================
> 1.
> - ent->subxact_fileset =
> - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
> + MemoryContext oldctx;
>
> + /* Shared fileset handle must be allocated in the persistent context */
> + oldctx = MemoryContextSwitchTo(ApplyContext);
> + ent->subxact_fileset = palloc(sizeof(SharedFileSet));
>   SharedFileSetInit(ent->subxact_fileset, NULL);
> + MemoryContextSwitchTo(oldctx);
>   fd = BufFileCreateShared(ent->subxact_fileset, path);
>
> Why is this change required for this patch and why we only cover
> SharedFileSetInit in the Apply context and not BufFileCreateShared?
> The comment is also not very clear on this point.

Added the comments for the same.

> 2.
> +void
> +SharedFileSetUnregister(SharedFileSet *input_fileset)
> +{
> + bool found = false;
> + ListCell *l;
> +
> + Assert(filesetlist != NULL);
> +
> + /* Loop over all the pending shared fileset entry */
> + foreach (l, filesetlist)
> + {
> + SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
> +
> + /* remove the entry from the list and delete the underlying files */
> + if (input_fileset->number == fileset->number)
> + {
> + SharedFileSetDeleteAll(fileset);
> + filesetlist = list_delete_cell(filesetlist, l);
>
> Why are we calling SharedFileSetDeleteAll here when in the caller we
> have already deleted the fileset as per below code?
> BufFileDeleteShared(ent->stream_fileset, path);
> + SharedFileSetUnregister(ent->stream_fileset);

That's wrong; I have removed it.


> I think it will be good if somehow we can remove the fileset from
> filesetlist during BufFileDeleteShared.  If that is possible, then we
> don't need a separate API for SharedFileSetUnregister.

I have done it as discussed in later replies; basically,
SharedFileSetUnregister is now called from BufFileDeleteShared.
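
Roughly, the tail of BufFileDeleteShared now looks like this (a
sketch, not the exact hunk; for DSM-based filesets, which never get
added to the process-local list, the unregister call is simply a
no-op):

void
BufFileDeleteShared(SharedFileSet *fileset, const char *name)
{
    /* ... delete all segment files of "name", as before ... */

    /*
     * Also drop the fileset from the process-local cleanup list, so the
     * proc-exit callback doesn't touch its files again.
     */
    SharedFileSetUnregister(fileset);
}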

> 3.
> +static List * filesetlist = NULL;
> +
>  static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
> +static void SharedFileSetOnProcExit(int status, Datum arg);
>  static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid
> tablespace);
>  static void SharedFilePath(char *path, SharedFileSet *fileset, const
> char *name);
>  static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name);
> @@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
>   /* Register our cleanup callback. */
>   if (seg)
>   on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
> + else
> + {
> + if (filesetlist == NULL)
> + on_proc_exit(SharedFileSetOnProcExit, 0);
>
> We use NIL for list initialization and comparison.  See lock_files usage.

Done

> 4.
> +SharedFileSetOnProcExit(int status, Datum arg)
> +{
> + ListCell *l;
> +
> + /* Loop over all the pending  shared fileset entry */
> + foreach (l, filesetlist)
> + {
> + SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
> + SharedFileSetDeleteAll(fileset);
> + }
>
> We can initialize filesetlist as NIL after the for loop as it will
> make the code look clean.

Right.

> Comments on other patches:
> =========================
> 5.
> > 3. On concurrent abort we are truncating all the changes including
> > some incomplete changes,  so later when we get the complete changes we
> > don't have the previous changes,  e.g, if we had specinsert in the
> > last stream and due to concurrent abort detection if we delete that
> > changes later we will get spec_confirm without spec insert.  We could
> > have simply avoided deleting all the changes, but I think the better
> > fix is once we detect the concurrent abort for any transaction, then
> > why do we need to collect the changes for that, we can simply avoid
> > that.  So I have put that fix. (0006)
> >
>
> On similar lines, I think we need to skip processing message, see else
> part of code in ReorderBufferQueueMessage.

Basically, ReorderBufferQueueMessage also calls
ReorderBufferQueueChange internally for transactional messages.  But,
having said that, I realize the idea of skipping the changes inside
ReorderBufferQueueChange is not good, because by then we have already
allocated the memory for the change and the tuple, and it is not
correct to just return the changes because that would update the
memory accounting.  So I think we could do it at a more centralized
place, before we process the change.  Maybe in
LogicalDecodingProcessRecord, before going to the switch, we could
call a function from the reorderbuffer.c layer to see whether this
transaction has already been detected as aborted.  But I have to think
more about whether we can skip all the processing of that record or
not.

Your other comments look fine to me so I will send in the next patch
set and reply on them individually.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Thu, Jun 25, 2020 at 7:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > Here is the POC patch to discuss the idea of a cleanup of shared
> > > fileset on proc exit.  As discussed offlist,  here I am maintaining
> > > the list of shared fileset.  First time when the list is NULL I am
> > > registering the cleanup function with on_proc_exit routine.  After
> > > that for subsequent fileset, I am just appending it to filesetlist.
> > > There is also an interface to unregister the shared file set from the
> > > cleanup list and that is done by the caller whenever we are deleting
> > > the shared fileset manually.  While explaining it here, I think there
> > > could be one issue if we delete all the element from the list will
> > > become NULL and on next SharedFileSetInit we will again register the
> > > function.  Maybe that is not a problem but we can avoid registering
> > > multiple times by using some flag in the file
> > >
> >
> > I don't understand what you mean by "using some flag in the file".
> >
> > Review comments on various patches.
> >
> > poc_shared_fileset_cleanup_on_procexit
> > =================================
> > 1.
> > - ent->subxact_fileset =
> > - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
> > + MemoryContext oldctx;
> >
> > + /* Shared fileset handle must be allocated in the persistent context */
> > + oldctx = MemoryContextSwitchTo(ApplyContext);
> > + ent->subxact_fileset = palloc(sizeof(SharedFileSet));
> >   SharedFileSetInit(ent->subxact_fileset, NULL);
> > + MemoryContextSwitchTo(oldctx);
> >   fd = BufFileCreateShared(ent->subxact_fileset, path);
> >
> > Why is this change required for this patch and why we only cover
> > SharedFileSetInit in the Apply context and not BufFileCreateShared?
> > The comment is also not very clear on this point.
>
> Added the comments for the same.
>
> > 2.
> > +void
> > +SharedFileSetUnregister(SharedFileSet *input_fileset)
> > +{
> > + bool found = false;
> > + ListCell *l;
> > +
> > + Assert(filesetlist != NULL);
> > +
> > + /* Loop over all the pending shared fileset entry */
> > + foreach (l, filesetlist)
> > + {
> > + SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
> > +
> > + /* remove the entry from the list and delete the underlying files */
> > + if (input_fileset->number == fileset->number)
> > + {
> > + SharedFileSetDeleteAll(fileset);
> > + filesetlist = list_delete_cell(filesetlist, l);
> >
> > Why are we calling SharedFileSetDeleteAll here when in the caller we
> > have already deleted the fileset as per below code?
> > BufFileDeleteShared(ent->stream_fileset, path);
> > + SharedFileSetUnregister(ent->stream_fileset);
>
> That's wrong I have removed this.
>
>
> > I think it will be good if somehow we can remove the fileset from
> > filesetlist during BufFileDeleteShared.  If that is possible, then we
> > don't need a separate API for SharedFileSetUnregister.
>
> I have done as discussed on later replies, basically called
> SharedFileSetUnregister from BufFileDeleteShared.
>
> > 3.
> > +static List * filesetlist = NULL;
> > +
> >  static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
> > +static void SharedFileSetOnProcExit(int status, Datum arg);
> >  static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid
> > tablespace);
> >  static void SharedFilePath(char *path, SharedFileSet *fileset, const
> > char *name);
> >  static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name);
> > @@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
> >   /* Register our cleanup callback. */
> >   if (seg)
> >   on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
> > + else
> > + {
> > + if (filesetlist == NULL)
> > + on_proc_exit(SharedFileSetOnProcExit, 0);
> >
> > We use NIL for list initialization and comparison.  See lock_files usage.
>
> Done
>
> > 4.
> > +SharedFileSetOnProcExit(int status, Datum arg)
> > +{
> > + ListCell *l;
> > +
> > + /* Loop over all the pending  shared fileset entry */
> > + foreach (l, filesetlist)
> > + {
> > + SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
> > + SharedFileSetDeleteAll(fileset);
> > + }
> >
> > We can initialize filesetlist as NIL after the for loop as it will
> > make the code look clean.
>
> Right.
>
> > Comments on other patches:
> > =========================
> > 5.
> > > 3. On concurrent abort we are truncating all the changes including
> > > some incomplete changes,  so later when we get the complete changes we
> > > don't have the previous changes,  e.g, if we had specinsert in the
> > > last stream and due to concurrent abort detection if we delete that
> > > changes later we will get spec_confirm without spec insert.  We could
> > > have simply avoided deleting all the changes, but I think the better
> > > fix is once we detect the concurrent abort for any transaction, then
> > > why do we need to collect the changes for that, we can simply avoid
> > > that.  So I have put that fix. (0006)
> > >
> >
> > On similar lines, I think we need to skip processing message, see else
> > part of code in ReorderBufferQueueMessage.
>
> Basically, ReorderBufferQueueMessage also calls the
> ReorderBufferQueueChange internally for transactional changes.  But,
> having said that, I realize the idea of skipping the changes in
> ReorderBufferQueueChange is not good,  because by then we have already
> allocated the memory for the change and the tuple and it's not a
> correct to ReturnChanges because it will update the memory accounting.
> So I think we can do it at a more centralized place and before we
> process the change,  maybe in LogicalDecodingProcessRecord, before
> going to the switch we can call a function from the reorderbuffer.c
> layer to see whether this transaction is detected as aborted or not.
> But I have to think more on this line that can we skip all the
> processing of that record or not.
>
> Your other comments look fine to me so I will send in the next patch
> set and reply on them individually.

I think we cannot put this check in the higher-level functions like
LogicalDecodingProcessRecord or DecodeXXXOp, because we still need to
process that xid at least for the abort.  So I think it is better to
keep the check inside ReorderBufferQueueChange only, and we can free
the memory of the change there if the abort has been detected.  Also,
if we just skip those changes in ReorderBufferQueueChange, the effect
is localized to that particular transaction, which is already aborted.
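
For illustration, the check I have in mind is roughly this (a sketch
only; it assumes the streaming patch marks the transaction with some
flag, here called concurrent_abort, once a concurrent abort has been
detected):

void
ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid,
                         XLogRecPtr lsn, ReorderBufferChange *change)
{
    ReorderBufferTXN *txn;

    txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

    /*
     * If a concurrent abort was already detected while streaming this
     * transaction, collecting further changes is pointless.  Give the
     * change back to the buffer, which keeps the memory accounting
     * consistent, and bail out.
     */
    if (txn->concurrent_abort)
    {
        ReorderBufferReturnChange(rb, change);
        return;
    }

    /* ... existing queueing logic ... */
}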

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Fri, Jun 26, 2020 at 10:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Jun 25, 2020 at 7:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Comments on other patches:
> > > =========================
> > > 5.
> > > > 3. On concurrent abort we are truncating all the changes including
> > > > some incomplete changes,  so later when we get the complete changes we
> > > > don't have the previous changes,  e.g, if we had specinsert in the
> > > > last stream and due to concurrent abort detection if we delete that
> > > > changes later we will get spec_confirm without spec insert.  We could
> > > > have simply avoided deleting all the changes, but I think the better
> > > > fix is once we detect the concurrent abort for any transaction, then
> > > > why do we need to collect the changes for that, we can simply avoid
> > > > that.  So I have put that fix. (0006)
> > > >
> > >
> > > On similar lines, I think we need to skip processing message, see else
> > > part of code in ReorderBufferQueueMessage.
> >
> > Basically, ReorderBufferQueueMessage also calls the
> > ReorderBufferQueueChange internally for transactional changes.

Yes, that is correct, but I was thinking about the non-transactional
part, because of the code below.

else
{
ReorderBufferTXN *txn = NULL;
volatile Snapshot snapshot_now = snapshot;

if (xid != InvalidTransactionId)
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

Even though we are using txn here, I think we don't need to skip it
for aborted xacts, because even without the patch such messages get
decoded irrespective of transaction status.  What do you think?

> >  But,
> > having said that, I realize the idea of skipping the changes in
> > ReorderBufferQueueChange is not good,  because by then we have already
> > allocated the memory for the change and the tuple and it's not a
> > correct to ReturnChanges because it will update the memory accounting.
> > So I think we can do it at a more centralized place and before we
> > process the change,  maybe in LogicalDecodingProcessRecord, before
> > going to the switch we can call a function from the reorderbuffer.c
> > layer to see whether this transaction is detected as aborted or not.
> > But I have to think more on this line that can we skip all the
> > processing of that record or not.
> >
> > Your other comments look fine to me so I will send in the next patch
> > set and reply on them individually.
>
> I think we can not put this check, in the higher-level functions like
> LogicalDecodingProcessRecord or DecodeXXXOp because we need to process
> that xid at least for abort,  so I think it is good to keep the check,
> inside ReorderBufferQueueChange only and we can free the memory of the
> change if the abort is detected.  Also, if just skip those changes in
> ReorderBufferQueueChange then the effect will be localized to that
> particular transaction which is already aborted.
>

Fair enough.  And for cases like the non-transactional part of
ReorderBufferQueueMessage, I think we anyway need to process the
message irrespective of transaction status.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Thu, Jun 25, 2020 at 7:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Review comments on various patches.
> >
> > poc_shared_fileset_cleanup_on_procexit
> > =================================
> > 1.
> > - ent->subxact_fileset =
> > - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
> > + MemoryContext oldctx;
> >
> > + /* Shared fileset handle must be allocated in the persistent context */
> > + oldctx = MemoryContextSwitchTo(ApplyContext);
> > + ent->subxact_fileset = palloc(sizeof(SharedFileSet));
> >   SharedFileSetInit(ent->subxact_fileset, NULL);
> > + MemoryContextSwitchTo(oldctx);
> >   fd = BufFileCreateShared(ent->subxact_fileset, path);
> >
> > Why is this change required for this patch and why we only cover
> > SharedFileSetInit in the Apply context and not BufFileCreateShared?
> > The comment is also not very clear on this point.
>
> Added the comments for the same.
>

1.
+ /*
+ * Shared fileset handle must be allocated in the persistent context.
+ * Also, SharedFileSetInit allocate the memory for sharefileset list
+ * so we need to allocate that in the long term meemory context.
+ */

How about "We need to maintain shared fileset across multiple stream
open/close calls.  So, we allocate it in a persistent context."

2.
+ /*
+ * If the caller is following the dsm based cleanup then we don't
+ * maintain the filesetlist so return.
+ */
+ if (filesetlist == NULL)
+ return;

The check here should use 'NIL' instead of 'NULL'

Other than that the changes in this particular patch looks good to me.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Comments on other patches:
> =========================

Replying to the pending comments.

> 6.
> In v29-0002-Issue-individual-invalidations-with-wal_level-lo,
> xact_desc_invalidations seems to be a subset of
> standby_desc_invalidations, can we have a common code for them?

Done

> 7.
> I think we can avoid sending v29-0007-Track-statistics-for-streaming
> this each time.  We can do this after the main patch is complete.
> Also, we might need to change how and where these stats will be
> tracked.  See the related discussion [1].

Removed

> 8. In v29-0005-Implement-streaming-mode-in-ReorderBuffer,
>   * Return oldest transaction in reorderbuffer
> @@ -863,6 +909,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb,
> TransactionId xid,
>   /* set the reference to top-level transaction */
>   subtxn->toptxn = txn;
>
> + /* set the reference to toplevel transaction */
> + subtxn->toptxn = txn;
> +
>
> There is a double initialization of subtxn->toptxn.  You need to
> remove this line from 0005 patch as we have now added it in an earlier
> patch.

Done

> 9.  I think you forgot to update the patch to execute invalidations in
> Abort case or I might be missing something.  I don't see any changes
> in ReorderBufferAbort. You have agreed in one of the emails above [2]
> about handling the same.

Done, check 0005

> 10. In v29-0008-Add-support-for-streaming-to-built-in-replicatio,
>  apply_handle_stream_commit(StringInfo s)
>  {
>  ..
>  + /*
>  + * send feedback to upstream
>  + *
>  + * XXX Probably should send a valid LSN. But which one?
>  + */
>  + send_feedback(InvalidXLogRecPtr, false, false);
>  ..
>  }
>
> I have given a comment on this code that we don't need this feedback
> and you mentioned on June 02 [3] that you will think on it and let me
> know your opinion but I don't see a response from you yet.  Can you
> get back to me regarding this point?

Yeah, I have analyzed this and it seems we don't need it.  The
feedback mechanism here should be the same as in non-streaming mode,
so I don't see any reason for sending extra feedback on commit.

> 11. Add some comments as to why we have used Shared BufFile interface
> instead of Temp BufFile interface?

Done

> 12. In v29-0013-Change-buffile-interface-required-for-streaming,
> + * Initialize a space for temporary files that can be opened other backends.
>
> /opened other backends/opened for access by other backends

Done

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Fri, Jun 26, 2020 at 11:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jun 25, 2020 at 7:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Review comments on various patches.
> > >
> > > poc_shared_fileset_cleanup_on_procexit
> > > =================================
> > > 1.
> > > - ent->subxact_fileset =
> > > - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
> > > + MemoryContext oldctx;
> > >
> > > + /* Shared fileset handle must be allocated in the persistent context */
> > > + oldctx = MemoryContextSwitchTo(ApplyContext);
> > > + ent->subxact_fileset = palloc(sizeof(SharedFileSet));
> > >   SharedFileSetInit(ent->subxact_fileset, NULL);
> > > + MemoryContextSwitchTo(oldctx);
> > >   fd = BufFileCreateShared(ent->subxact_fileset, path);
> > >
> > > Why is this change required for this patch and why we only cover
> > > SharedFileSetInit in the Apply context and not BufFileCreateShared?
> > > The comment is also not very clear on this point.
> >
> > Added the comments for the same.
> >
>
> 1.
> + /*
> + * Shared fileset handle must be allocated in the persistent context.
> + * Also, SharedFileSetInit allocate the memory for sharefileset list
> + * so we need to allocate that in the long term meemory context.
> + */
>
> How about "We need to maintain shared fileset across multiple stream
> open/close calls.  So, we allocate it in a persistent context."

Done

> 2.
> + /*
> + * If the caller is following the dsm based cleanup then we don't
> + * maintain the filesetlist so return.
> + */
> + if (filesetlist == NULL)
> + return;
>
> The check here should use 'NIL' instead of 'NULL'

Done

> Other than that the changes in this particular patch looks good to me.
Added as the last patch in the series; in the next version I will
merge it into 0012 and 0013.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Jun 22, 2020 at 4:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 22, 2020 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > Yes, I have made the changes.  Basically, now I am only using the
> > > XLOG_XACT_INVALIDATIONS for generating all the invalidation messages.
> > > So whenever we are getting the new set of XLOG_XACT_INVALIDATIONS, we
> > > are directly appending it to the txn->invalidations.  I have tested
> > > the XLOG_INVALIDATIONS part but while sending this mail I realized
> > > that we could write some automated test for the same.
> > >
> >
> > Can you share how you have tested it?
> >
> > >  I will work on
> > > that soon.
> > >
> >
> > Cool, I think having a regression test for this will be a good idea.
> >
>
> Other than above tests, can we somehow verify that the invalidations
> generated at commit time are the same as what we do with this patch?
> We have verified with individual commands but it would be great if we
> can verify for the regression tests.

I have verified this using a few random test cases.  For that I made
some temporary code changes with an assert, as shown below.
Basically, on DecodeCommit we now call the
ReorderBufferAddInvalidations function only for assert checking.

-void
 ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
                               XLogRecPtr lsn, Size nmsgs,
-                              SharedInvalidationMessage *msgs)
+                              SharedInvalidationMessage *msgs, bool commit)
 {
        ReorderBufferTXN *txn;

        txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
-
+       if (commit)
+       {
+               Assert(txn->ninvalidations == nmsgs);
+               return;
+       }

The result is that a normal local test works fine.  But with the
regression suite it hits the assert in many places, because if a
rollback of a subtransaction is involved, the invalidation messages
are not logged at commit time, whereas with command-time invalidation
they are logged.

As of now, I have only put the assert on the count.  If we need to
verify the exact messages, we might have to somehow categorize the
invalidation messages, because the ordering of the messages will not
be the same.  For testing this we would have to arrange them by
category, i.e. relcache, catcache, and then compare them.
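
For example, the comparison could be done with something like this (a
debugging-only sketch that compares per-id counts rather than the
exact messages; catcache messages carry the catcache id, the other
message kinds use negative ids):

/*
 * Return true if the two invalidation arrays contain the same number of
 * messages of each id, ignoring ordering.
 */
static bool
invalidations_match(SharedInvalidationMessage *a, Size na,
                    SharedInvalidationMessage *b, Size nb)
{
    int         counts[256] = {0};
    Size        i;

    if (na != nb)
        return false;

    for (i = 0; i < na; i++)
        counts[(uint8) a[i].id]++;
    for (i = 0; i < nb; i++)
        counts[(uint8) b[i].id]--;

    for (i = 0; i < lengthof(counts); i++)
    {
        if (counts[i] != 0)
            return false;
    }

    return true;
}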


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Jun 29, 2020 at 4:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jun 22, 2020 at 4:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Other than above tests, can we somehow verify that the invalidations
> > generated at commit time are the same as what we do with this patch?
> > We have verified with individual commands but it would be great if we
> > can verify for the regression tests.
>
> I have verified this using a few random test cases.  For verifying
> this I have made some temporary code changes with an assert as shown
> below.  Basically, on DecodeCommit we call
> ReorderBufferAddInvalidations function only for an assert checking.
>
> -void
>  ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
>                                                           XLogRecPtr
> lsn, Size nmsgs,
> -
> SharedInvalidationMessage *msgs)
> +
> SharedInvalidationMessage *msgs, bool commit)
>  {
>         ReorderBufferTXN *txn;
>
>         txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> -
> +       if (commit)
> +       {
> +               Assert(txn->ninvalidations == nmsgs);
> +               return;
> +       }
>
> The result is that for a normal local test it works fine.  But with
> regression suit, it hit an assert at many places because if the
> rollback of the subtransaction is involved then at commit time
> invalidation messages those are not logged whereas with command time
> invalidation those are logged.
>

Yeah, somehow, we need to ignore rollback to savepoint tests and
verify for others.

> As of now, I have only put assert on the count,  if we need to verify
> the exact messages then we might need to somehow categories the
> invalidation messages because the ordering of the messages will not be
> the same.  For testing this we will have to arrange them by category
> i.e relcahce, catcache and then we can compare them.
>

Can't we do this by verifying that each message at commit time exists
in the list of invalidation messages we have collected via processing
XLOG_XACT_INVALIDATIONS?

One additional question on patch
v30-0003-Extend-the-output-plugin-API-with-stream-methods:
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr apply_lsn)
{
..
..
+ state.report_location = apply_lsn;
..
..
+ ctx->write_location = apply_lsn;
..
}

Can't we name the last parameter as 'commit_lsn' as that is how
documentation in the patch spells it and it sounds more appropriate?
Also, is there a reason for assigning report_location and
write_location differently than what we do in commit_cb_wrapper?
Basically, assign those as txn->final_lsn and txn->end_lsn
respectively.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, Jun 30, 2020 at 9:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 29, 2020 at 4:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jun 22, 2020 at 4:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Other than above tests, can we somehow verify that the invalidations
> > > generated at commit time are the same as what we do with this patch?
> > > We have verified with individual commands but it would be great if we
> > > can verify for the regression tests.
> >
> > I have verified this using a few random test cases.  For verifying
> > this I have made some temporary code changes with an assert as shown
> > below.  Basically, on DecodeCommit we call
> > ReorderBufferAddInvalidations function only for an assert checking.
> >
> > -void
> >  ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
> >                                                           XLogRecPtr
> > lsn, Size nmsgs,
> > -
> > SharedInvalidationMessage *msgs)
> > +
> > SharedInvalidationMessage *msgs, bool commit)
> >  {
> >         ReorderBufferTXN *txn;
> >
> >         txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> > -
> > +       if (commit)
> > +       {
> > +               Assert(txn->ninvalidations == nmsgs);
> > +               return;
> > +       }
> >
> > The result is that for a normal local test it works fine.  But with
> > regression suit, it hit an assert at many places because if the
> > rollback of the subtransaction is involved then at commit time
> > invalidation messages those are not logged whereas with command time
> > invalidation those are logged.
> >
>
> Yeah, somehow, we need to ignore rollback to savepoint tests and
> verify for others.

Yeah, I have run the regression suite and I can see a lot of failures.
Maybe we can somehow see the diff and confirm that all the failures
are due to rollback to savepoint only.  I will work on this.

>
> > As of now, I have only put assert on the count,  if we need to verify
> > the exact messages then we might need to somehow categories the
> > invalidation messages because the ordering of the messages will not be
> > the same.  For testing this we will have to arrange them by category
> > i.e relcahce, catcache and then we can compare them.
> >
>
> Can't we do this by verifying that each message at commit time exists
> in the list of invalidation messages we have collected via processing
> XLOG_XACT_INVALIDATIONS?

Let me figure out the easiest way to test this.

>
> One additional question on patch
> v30-0003-Extend-the-output-plugin-API-with-stream-methods:
> +static void
> +stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
> + XLogRecPtr apply_lsn)
> {
> ..
> ..
> + state.report_location = apply_lsn;
> ..
> ..
> + ctx->write_location = apply_lsn;
> ..
> }
>
> Can't we name the last parameter as 'commit_lsn' as that is how
> documentation in the patch spells it and it sounds more appropriate?

You are right, commit_lsn seems more appropriate here.

> Also, is there a reason for assigning report_location and
> write_location differently than what we do in commit_cb_wrapper?
> Basically, assign those as txn->final_lsn and txn->end_lsn
> respectively.

Yes, I think it should be handled in the same way as commit_cb_wrapper,
because before calling ReorderBufferStreamCommit in
ReorderBufferCommit, we are properly updating the final_lsn as well as
the end_lsn.
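
So in stream_commit_cb_wrapper it would become something like the
below, mirroring commit_cb_wrapper (sketch of the change I have in
mind):

    state.report_location = txn->final_lsn; /* final_lsn: start of the commit record */
    ...
    ctx->write_location = txn->end_lsn;     /* end_lsn: end of the commit record */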

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, Jun 30, 2020 at 10:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jun 30, 2020 at 9:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Can't we name the last parameter as 'commit_lsn' as that is how
> > documentation in the patch spells it and it sounds more appropriate?
>
> You are right commit_lsn seems more appropriate here.
>
> > Also, is there a reason for assigning report_location and
> > write_location differently than what we do in commit_cb_wrapper?
> > Basically, assign those as txn->final_lsn and txn->end_lsn
> > respectively.
>
> Yes, I think it should be handled in same way as commit_cb_wrapper.
> Because before calling ReorderBufferStreamCommit in
> ReorderBufferCommit, we are properly updating the final_lsn as well as
> the end_lsn.
>

Okay, I have made these changes in the attached patch and there are a
few more changes in
0003-Extend-the-output-plugin-API-with-stream-methods.
1. In pg_decode_stream_message, for transactional messages, we were
displaying the message contents, which is different from the other
streaming APIs.  I have changed it so that the streaming API doesn't
display message contents for transactional messages.

2.
+ /* in streaming mode, stream_change_cb is required */
+ if (ctx->callbacks.stream_change_cb == NULL)
+ ereport(ERROR,
+ (errmsg("Output plugin supports streaming, but has not registered "
+ "stream_change_cb callback.")));

The error messages seem a bit weird: (a) they don't include an error
code, and (b) they are not in PG style.  I have changed all the error
messages to fix these two issues and reworded them as well.

3. Rearranged the stream_* functions so that the optional functions
are at the end, and also arranged the other functions in a way that
looks more logical to me.

4. Updated comments, commit message, and edited docs in the patch.

I have made a few changes in
0004-Gracefully-handle-concurrent-aborts-of-transacti as well.
1. The variable bsysscan was not being reset in case of error.  I have
introduced a new function to reset both bsysscan and CheckXidAlive
during transaction abort (see the sketch at the end of this list).
Also, snapmgr.c doesn't seem the right place for these variables, so I
moved them to xact.c.  I think this will make the initialization of
CheckXidAlive in the PG_CATCH block of ReorderBufferProcessTXN
redundant.

2. Updated comments and commit message.
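
For reference, the reset function mentioned in point 1 is roughly like
the below (the function name is just what I picked for now; it is
called from the transaction abort path):

    /*
     * Reset the logical streaming state during (sub)transaction abort.
     */
    void
    ResetLogicalStreamingState(void)
    {
        /* forget the xid being decoded and the syscache-scan flag */
        CheckXidAlive = InvalidTransactionId;
        bsysscan = false;
    }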

Let me know what you think about the above changes.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, Jun 30, 2020 at 5:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Let me know what you think about the above changes.
>

I went ahead and made a few changes in
0005-Implement-streaming-mode-in-ReorderBuffer, which are explained
below.  I have a few questions and suggestions for the patch as well,
which are also covered in the points below.

1.
+ if (prev_lsn == InvalidXLogRecPtr)
+ {
+ if (streaming)
+ rb->stream_start(rb, txn, change->lsn);
+ else
+ rb->begin(rb, txn);
+ stream_started = true;
+ }

I don't think we want to move the begin callback here as that will
change the existing semantics, so it is better to keep begin at its
original position. I have made the required changes in the attached patch.

2.
ReorderBufferTruncateTXN()
{
..
+ dlist_foreach_modify(iter, &txn->changes)
+ {
+ ReorderBufferChange *change;
+
+ change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+ /* remove the change from it's containing list */
+ dlist_delete(&change->node);
+
+ ReorderBufferReturnChange(rb, change);
+ }
..
}

I think here we can add an Assert that we're not mixing changes from
different transactions.  See the changes in the patch.

3.
SetupCheckXidLive()
{
..
+ /*
+ * setup CheckXidAlive if it's not committed yet. We don't check if the xid
+ * aborted. That will happen during catalog access.  Also, reset the
+ * bsysscan flag.
+ */
+ if (!TransactionIdDidCommit(xid))
+ {
+ CheckXidAlive = xid;
+ bsysscan = false;
..
}

What is the need to reset the bsysscan flag here if we are already
resetting it on error (as in the previous patch sent by me)?

4.
ReorderBufferProcessTXN()
{
..
..
+ /* Reset the CheckXidAlive */
+ if (streaming)
+ CheckXidAlive = InvalidTransactionId;
..
}

Similar to the previous point, we don't need this either, because
AbortCurrentTransaction would have taken care of it.

5.
+ * XXX Do we need to check if the transaction has some changes to stream
+ * (maybe it got streamed right before the commit, which attempts to
+ * stream it again before the commit)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)

The above comment doesn't make much sense to me, so I have removed it.
Basically, if there are no changes before commit, we still need to
send the commit, and anyway, if there are no more changes,
ReorderBufferProcessTXN will not do anything.

6.
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
if (txn->snapshot_now == NULL)
+ {
+ dlist_iter subxact_i;
+
+ /* make sure this transaction is streamed for the first time */
+ Assert(!rbtxn_is_streamed(txn));
+
+ /* at the beginning we should have invalid command ID */
+ Assert(txn->command_id == InvalidCommandId);
+
+ dlist_foreach(subxact_i, &txn->subtxns)
+ {
+ ReorderBufferTXN *subtxn;
+
+ subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+ ReorderBufferTransferSnapToParent(txn, subtxn);
+ }
..
}

Here, it is possible that there is no base_snapshot for txn, so we
need a check for that similar to ReorderBufferCommit.

7.  Apart from the above, I made few changes in comments and ran pgindent.

8. We can't stream the transaction before we reach the
SNAPBUILD_CONSISTENT state because some other output plugin could
apply those changes, unlike what we do with the pgoutput plugin (which
writes to a file).  And I think applying the transactions without
reaching a consistent state would be wrong anyway.  So we should avoid
that, and if we do so, then we should have an Assert for streamed txns
in ReorderBufferForget rather than sending an abort for them.

9.
+ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
{
..
+ ReorderBufferToastReset(rb, txn);
+ if (specinsert != NULL)
+ ReorderBufferReturnChange(rb, specinsert);
..
}

Why do we need to do these here when we wouldn't have done so for
any exception other than ERRCODE_TRANSACTION_ROLLBACK?

10.  I have got the below failure once.  I have not investigated this
in detail as the patch is still in progress.  See if you have any
idea?
#   Failed test 'check extra columns contain local defaults'
#   at t/013_stream_subxact_ddl_abort.pl line 81.
#          got: '2|0'
#     expected: '1000|500'
# Looks like you failed 1 test of 2.
make[2]: *** [check] Error 1
make[1]: *** [check-subscription-recurse] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [check-world-src/test-recurse] Error 2

11. Can we test by introducing a new GUC such that all the
transactions (at least in the existing tests) start to stream?
Basically, it will allow us to disregard logical_decoding_work_mem and
ensure that all regression tests pass through the new code.  Note, I am
suggesting this just for testing purposes, not for actual integration
in the code.
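
To be clear, I am thinking of something as simple as the below
developer-only boolean (the GUC name and exact entry are just for
illustration), plus a corresponding tweak in
ReorderBufferCheckMemoryLimit to ignore the memory limit when it is
set:

    /* in guc.c, ConfigureNamesBool[] -- sketch, for testing only */
    {
        {"logical_decoding_force_stream", PGC_USERSET, DEVELOPER_OPTIONS,
            gettext_noop("Forces streaming of all changes in logical decoding."),
            NULL,
            GUC_NOT_IN_SAMPLE
        },
        &logical_decoding_force_stream,
        false,
        NULL, NULL, NULL
    },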

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, Jun 30, 2020 at 5:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jun 30, 2020 at 10:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Jun 30, 2020 at 9:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Can't we name the last parameter as 'commit_lsn' as that is how
> > > documentation in the patch spells it and it sounds more appropriate?
> >
> > You are right commit_lsn seems more appropriate here.
> >
> > > Also, is there a reason for assigning report_location and
> > > write_location differently than what we do in commit_cb_wrapper?
> > > Basically, assign those as txn->final_lsn and txn->end_lsn
> > > respectively.
> >
> > Yes, I think it should be handled in same way as commit_cb_wrapper.
> > Because before calling ReorderBufferStreamCommit in
> > ReorderBufferCommit, we are properly updating the final_lsn as well as
> > the end_lsn.
> >
>
> Okay, I have made these changes in the attached patch and there are
> few more changes in
> 0003-Extend-the-output-plugin-API-with-stream-methods.
> 1. In pg_decode_stream_message, for transactional messages, we were
> displaying message contents which is different from other streaming
> APIs.  I have changed it so that streaming API doesn't display message
> contents for transactional messages.

Ok, makes sense.

> 2.
> + /* in streaming mode, stream_change_cb is required */
> + if (ctx->callbacks.stream_change_cb == NULL)
> + ereport(ERROR,
> + (errmsg("Output plugin supports streaming, but has not registered "
> + "stream_change_cb callback.")));
>
> The error messages seem a bit weird.  (a) doesn't include error code,
> (b) not in PG style. I have changed all the error messages to fix
> these two issues and change the message as well

ok

> 3. Rearranged the functions stream_* so that the optional functions
> are at the end and also arranged other functions in a way that looks
> more logical to me.

Makes sense to me.

> 4. Updated comments, commit message, and edited docs in the patch.
>
> I have made a few changes in
> 0004-Gracefully-handle-concurrent-aborts-of-transacti as well.
> 1. The variable bsysscan was not being reset in case of error.  I have
> introduced a new function to reset both bsysscan and CheckXidAlive
> during transaction abort.  Also, snapmgr.c doesn't seem right place
> for these variables, so I moved them to xact.c.  I think this will
> make the initialization of CheckXidAlive during catch in
> ReorderBufferProcessTXN redundant.

That looks better.

> 2. Updated comments and commit message.
>
> Let me know what you think about the above changes.

All the above changes look good to me and I will include them in the next version.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jun 30, 2020 at 5:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Let me know what you think about the above changes.
> >
>
> I went ahead and made few changes in
> 0005-Implement-streaming-mode-in-ReorderBuffer which are explained
> below.  I have few questions and suggestions for the patch as well
> which are also covered in below points.
>
> 1.
> + if (prev_lsn == InvalidXLogRecPtr)
> + {
> + if (streaming)
> + rb->stream_start(rb, txn, change->lsn);
> + else
> + rb->begin(rb, txn);
> + stream_started = true;
> + }
>
> I don't think we want to move begin callback here that will change the
> existing semantics, so it is better to move begin at its original
> position. I have made the required changes in the attached patch.

Looks good to me.

> 2.
> ReorderBufferTruncateTXN()
> {
> ..
> + dlist_foreach_modify(iter, &txn->changes)
> + {
> + ReorderBufferChange *change;
> +
> + change = dlist_container(ReorderBufferChange, node, iter.cur);
> +
> + /* remove the change from it's containing list */
> + dlist_delete(&change->node);
> +
> + ReorderBufferReturnChange(rb, change);
> + }
> ..
> }
>
> I think here we can add an Assert that we're not mixing changes from
> different transactions.  See the changes in the patch.

Looks fine.

> 3.
> SetupCheckXidLive()
> {
> ..
> + /*
> + * setup CheckXidAlive if it's not committed yet. We don't check if the xid
> + * aborted. That will happen during catalog access.  Also, reset the
> + * bsysscan flag.
> + */
> + if (!TransactionIdDidCommit(xid))
> + {
> + CheckXidAlive = xid;
> + bsysscan = false;
> ..
> }
>
> What is the need to reset bsysscan flag here if we are already
> resetting on error (like in the previous patch sent by me)?

Yeah, now we don't need this.

> 4.
> ReorderBufferProcessTXN()
> {
> ..
> ..
> + /* Reset the CheckXidAlive */
> + if (streaming)
> + CheckXidAlive = InvalidTransactionId;
> ..
> }
>
> Similar to the previous point, we don't need this as well because
> AbortCurrentTransaction would have taken care of this.

Right

> 5.
> + * XXX Do we need to check if the transaction has some changes to stream
> + * (maybe it got streamed right before the commit, which attempts to
> + * stream it again before the commit)?
> + */
> +static void
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
>
> The above comment doesn't make much sense to me, so I have removed it.
> Basically, if there are no changes before commit, we still need to
> send commit and anyway if there are no more changes
> ReorderBufferProcessTXN will not do anything.

ok

> 6.
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> if (txn->snapshot_now == NULL)
> + {
> + dlist_iter subxact_i;
> +
> + /* make sure this transaction is streamed for the first time */
> + Assert(!rbtxn_is_streamed(txn));
> +
> + /* at the beginning we should have invalid command ID */
> + Assert(txn->command_id == InvalidCommandId);
> +
> + dlist_foreach(subxact_i, &txn->subtxns)
> + {
> + ReorderBufferTXN *subtxn;
> +
> + subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
> + ReorderBufferTransferSnapToParent(txn, subtxn);
> + }
> ..
> }
>
> Here, it is possible that there is no base_snapshot for txn, so we
> need a check for that similar to ReorderBufferCommit.
>
> 7.  Apart from the above, I made few changes in comments and ran pgindent.

Ok

> 8. We can't stream the transaction before we reach the
> SNAPBUILD_CONSISTENT state because some other output plugin can apply
> those changes unlike what we do with pgoutput plugin (which writes to
> file). And, I think applying the transactions without reaching a
> consistent state would be anyway wrong.  So, we should avoid that and
> if do that then we should have an Assert for streamed txns rather than
> sending abort for them in ReorderBufferForget.

I will work on this point.

> 9.
> +ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
> {
> ..
> + ReorderBufferToastReset(rb, txn);
> + if (specinsert != NULL)
> + ReorderBufferReturnChange(rb, specinsert);
> ..
> }
>
> Why do we need to do these here when we wouldn't have been done for
> any exception other than ERRCODE_TRANSACTION_ROLLBACK?

Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK"
gracefully and are continuing with further decoding, we need to return
this change.

> 10.  I have got the below failure once.  I have not investigated this
> in detail as the patch is still under progress.  See, if you have any
> idea?
> #   Failed test 'check extra columns contain local defaults'
> #   at t/013_stream_subxact_ddl_abort.pl line 81.
> #          got: '2|0'
> #     expected: '1000|500'
> # Looks like you failed 1 test of 2.
> make[2]: *** [check] Error 1
> make[1]: *** [check-subscription-recurse] Error 2
> make[1]: *** Waiting for unfinished jobs....
> make: *** [check-world-src/test-recurse] Error 2

Even I got the failure once and after that, it did not reproduce.  I
have executed it multiple times but it did not reproduce again.  Are
you able to reproduce it consistently?

> 11. Can we test by introducing a new GUC such that all the
> transactions (at least in existing tests) start to stream?  Basically,
> it will allow us to disregard logical_decoding_work_mem and ensure
> that all regression tests pass through new-code.  Note, I am
> suggesting this just for testing purposes, not for actual integration
> in the code.

Yeah,  that's a good suggestion.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, Jun 30, 2020 at 10:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jun 30, 2020 at 9:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jun 29, 2020 at 4:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Mon, Jun 22, 2020 at 4:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > >
> > > > Other than above tests, can we somehow verify that the invalidations
> > > > generated at commit time are the same as what we do with this patch?
> > > > We have verified with individual commands but it would be great if we
> > > > can verify for the regression tests.
> > >
> > > I have verified this using a few random test cases.  For verifying
> > > this I have made some temporary code changes with an assert as shown
> > > below.  Basically, on DecodeCommit we call
> > > ReorderBufferAddInvalidations function only for an assert checking.
> > >
> > > -void
> > >  ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
> > >                                                           XLogRecPtr
> > > lsn, Size nmsgs,
> > > -
> > > SharedInvalidationMessage *msgs)
> > > +
> > > SharedInvalidationMessage *msgs, bool commit)
> > >  {
> > >         ReorderBufferTXN *txn;
> > >
> > >         txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> > > -
> > > +       if (commit)
> > > +       {
> > > +               Assert(txn->ninvalidations == nmsgs);
> > > +               return;
> > > +       }
> > >
> > > The result is that for a normal local test it works fine.  But with
> > > regression suit, it hit an assert at many places because if the
> > > rollback of the subtransaction is involved then at commit time
> > > invalidation messages those are not logged whereas with command time
> > > invalidation those are logged.
> > >
> >
> > Yeah, somehow, we need to ignore rollback to savepoint tests and
> > verify for others.
>
> Yeah, I have run the regression suite,  I can see a lot of failure
> maybe we can somehow see the diff and confirm that all the failures
> are due to rollback to savepoint only.  I will work on this.

I have compared the changes logged at command end vs. those logged at
commit time.  I have ignored the invalidations for any transaction
which has an aborted subtransaction in it.  While testing this I found
one issue: if there are some invalidations generated between the last
command counter increment and the commit, then those were not logged.
I have fixed the issue by logging the pending invalidations in
RecordTransactionCommit.  I will include the changes in the next patch
set.
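
To be specific, the fix logs the pending invalidations in
RecordTransactionCommit, roughly like the below (this is the relevant
part of the change in the attached patch):

    /*
     * Log any pending invalidations which are added between the last
     * command counter increment and the commit.
     */
    if (XLogLogicalInfoActive())
        LogLogicalInvalidations();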

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > 9.
> > +ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > {
> > ..
> > + ReorderBufferToastReset(rb, txn);
> > + if (specinsert != NULL)
> > + ReorderBufferReturnChange(rb, specinsert);
> > ..
> > }
> >
> > Why do we need to do these here when we wouldn't have been done for
> > any exception other than ERRCODE_TRANSACTION_ROLLBACK?
>
> Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK"
> gracefully and we are continuing with further decoding so we need to
> return this change back.
>

Okay, then I suggest we should do these before calling stream_stop and
also move ReorderBufferResetTXN after calling stream_stop to follow a
pattern similar to the try block, unless there is a reason for not
doing so.  Also, it would be good if we can initialize specinsert to
NULL after returning the change, as we are doing at other places.

> > 10.  I have got the below failure once.  I have not investigated this
> > in detail as the patch is still under progress.  See, if you have any
> > idea?
> > #   Failed test 'check extra columns contain local defaults'
> > #   at t/013_stream_subxact_ddl_abort.pl line 81.
> > #          got: '2|0'
> > #     expected: '1000|500'
> > # Looks like you failed 1 test of 2.
> > make[2]: *** [check] Error 1
> > make[1]: *** [check-subscription-recurse] Error 2
> > make[1]: *** Waiting for unfinished jobs....
> > make: *** [check-world-src/test-recurse] Error 2
>
> Even I got the failure once and after that, it did not reproduce.  I
> have executed it multiple time but it did not reproduce again.  Are
> you able to reproduce it consistently?
>

No, I am also not able to reproduce it consistently, but I think this
can fail if a subscriber sends the replay_location before actually
replaying the changes.  First, I thought that the extra send_feedback we
have in apply_handle_stream_commit might have caused this, but I guess
that can't happen because we need the commit-time location for that
and we store it at the end of apply_handle_stream_commit after
applying all messages.  I am not sure what is going on here.  I think
we somehow need to reproduce this, or some variant of this test,
consistently to find the root cause.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > > 9.
> > > +ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > > {
> > > ..
> > > + ReorderBufferToastReset(rb, txn);
> > > + if (specinsert != NULL)
> > > + ReorderBufferReturnChange(rb, specinsert);
> > > ..
> > > }
> > >
> > > Why do we need to do these here when we wouldn't have been done for
> > > any exception other than ERRCODE_TRANSACTION_ROLLBACK?
> >
> > Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK"
> > gracefully and we are continuing with further decoding so we need to
> > return this change back.
> >
>
> Okay, then I suggest we should do these before calling stream_stop and
> also move ReorderBufferResetTXN after calling stream_stop  to follow a
> pattern similar to try block unless there is a reason for not doing
> so.  Also, it would be good if we can initialize specinsert with NULL
> after returning the change as we are doing at other places.

Okay

> > > 10.  I have got the below failure once.  I have not investigated this
> > > in detail as the patch is still under progress.  See, if you have any
> > > idea?
> > > #   Failed test 'check extra columns contain local defaults'
> > > #   at t/013_stream_subxact_ddl_abort.pl line 81.
> > > #          got: '2|0'
> > > #     expected: '1000|500'
> > > # Looks like you failed 1 test of 2.
> > > make[2]: *** [check] Error 1
> > > make[1]: *** [check-subscription-recurse] Error 2
> > > make[1]: *** Waiting for unfinished jobs....
> > > make: *** [check-world-src/test-recurse] Error 2
> >
> > Even I got the failure once and after that, it did not reproduce.  I
> > have executed it multiple time but it did not reproduce again.  Are
> > you able to reproduce it consistently?
> >
>
> No, I am also not able to reproduce it consistently but I think this
> can fail if a subscriber sends the replay_location before actually
> replaying the changes.  First, I thought that extra send_feedback we
> have in apply_handle_stream_commit might have caused this but I guess
> that can't happen because we need the commit time location for that
> and we are storing the same at the end of apply_handle_stream_commit
> after applying all messages.  I am not sure what is going on here.  I
> think we somehow need to reproduce this or some variant of this test
> consistently to find the root cause.

And I think it appeared for the first time for me, so maybe some
changes in the last few versions have exposed it.  I have noticed that
almost 50% of the time I am able to reproduce it after a clean build,
so I can trace back the version from which it started appearing; that
way it will be easy to narrow down.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Jul 6, 2020 at 11:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > > > 10.  I have got the below failure once.  I have not investigated this
> > > > in detail as the patch is still under progress.  See, if you have any
> > > > idea?
> > > > #   Failed test 'check extra columns contain local defaults'
> > > > #   at t/013_stream_subxact_ddl_abort.pl line 81.
> > > > #          got: '2|0'
> > > > #     expected: '1000|500'
> > > > # Looks like you failed 1 test of 2.
> > > > make[2]: *** [check] Error 1
> > > > make[1]: *** [check-subscription-recurse] Error 2
> > > > make[1]: *** Waiting for unfinished jobs....
> > > > make: *** [check-world-src/test-recurse] Error 2
> > >
> > > Even I got the failure once and after that, it did not reproduce.  I
> > > have executed it multiple time but it did not reproduce again.  Are
> > > you able to reproduce it consistently?
> > >
> >
> > No, I am also not able to reproduce it consistently but I think this
> > can fail if a subscriber sends the replay_location before actually
> > replaying the changes.  First, I thought that extra send_feedback we
> > have in apply_handle_stream_commit might have caused this but I guess
> > that can't happen because we need the commit time location for that
> > and we are storing the same at the end of apply_handle_stream_commit
> > after applying all messages.  I am not sure what is going on here.  I
> > think we somehow need to reproduce this or some variant of this test
> > consistently to find the root cause.
>
> And I think it appeared first time for me,  so maybe either induced
> from past few versions so some changes in the last few versions might
> have exposed it.  I have noticed that almost 50% of the time I am able
> to reproduce after the clean build so I can trace back from which
> version it started appearing that way it will be easy to narrow down.
>

One more comment
ReorderBufferLargestTopTXN
{
..
dlist_foreach(iter, &rb->toplevel_by_lsn)
  {
  ReorderBufferTXN *txn;
+ Size size = 0;
+ Size largest_size = 0;

  txn = dlist_container(ReorderBufferTXN, node, iter.cur);

- /* if the current transaction is larger, remember it */
- if ((!largest) || (txn->size > largest->size))
+ /*
+ * If this transaction have some incomplete changes then only consider
+ * the size upto last complete lsn.
+ */
+ if (rbtxn_has_incomplete_tuple(txn))
+ size = txn->complete_size;
+ else
+ size = txn->total_size;
+
+ /* If the current transaction is larger then remember it. */
+ if ((largest != NULL || size > largest_size) && size > 0)

Here, largest_size is a local variable inside the loop which is
initialized to 0 in each iteration, and that will lead to picking each
subsequent txn as the largest.  This seems wrong to me.
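
I think the size tracking needs to be hoisted out of the loop,
something like the below (untested sketch, keeping the names used in
the patch):

static ReorderBufferTXN *
ReorderBufferLargestTopTXN(ReorderBuffer *rb)
{
    dlist_iter  iter;
    Size        largest_size = 0;
    ReorderBufferTXN *largest = NULL;

    dlist_foreach(iter, &rb->toplevel_by_lsn)
    {
        ReorderBufferTXN *txn;
        Size        size;

        txn = dlist_container(ReorderBufferTXN, node, iter.cur);

        /*
         * If this transaction has some incomplete changes, only consider
         * the size up to the last complete lsn.
         */
        if (rbtxn_has_incomplete_tuple(txn))
            size = txn->complete_size;
        else
            size = txn->total_size;

        /* if the current transaction is larger, remember it */
        if (size > largest_size)
        {
            largest = txn;
            largest_size = size;
        }
    }

    return largest;
}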

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Jul 6, 2020 at 3:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jul 6, 2020 at 11:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > > > > 10.  I have got the below failure once.  I have not investigated this
> > > > > in detail as the patch is still under progress.  See, if you have any
> > > > > idea?
> > > > > #   Failed test 'check extra columns contain local defaults'
> > > > > #   at t/013_stream_subxact_ddl_abort.pl line 81.
> > > > > #          got: '2|0'
> > > > > #     expected: '1000|500'
> > > > > # Looks like you failed 1 test of 2.
> > > > > make[2]: *** [check] Error 1
> > > > > make[1]: *** [check-subscription-recurse] Error 2
> > > > > make[1]: *** Waiting for unfinished jobs....
> > > > > make: *** [check-world-src/test-recurse] Error 2
> > > >
> > > > Even I got the failure once and after that, it did not reproduce.  I
> > > > have executed it multiple time but it did not reproduce again.  Are
> > > > you able to reproduce it consistently?
> > > >
> > >
> > > No, I am also not able to reproduce it consistently but I think this
> > > can fail if a subscriber sends the replay_location before actually
> > > replaying the changes.  First, I thought that extra send_feedback we
> > > have in apply_handle_stream_commit might have caused this but I guess
> > > that can't happen because we need the commit time location for that
> > > and we are storing the same at the end of apply_handle_stream_commit
> > > after applying all messages.  I am not sure what is going on here.  I
> > > think we somehow need to reproduce this or some variant of this test
> > > consistently to find the root cause.
> >
> > And I think it appeared first time for me,  so maybe either induced
> > from past few versions so some changes in the last few versions might
> > have exposed it.  I have noticed that almost 50% of the time I am able
> > to reproduce after the clean build so I can trace back from which
> > version it started appearing that way it will be easy to narrow down.
> >
>
> One more comment
> ReorderBufferLargestTopTXN
> {
> ..
> dlist_foreach(iter, &rb->toplevel_by_lsn)
>   {
>   ReorderBufferTXN *txn;
> + Size size = 0;
> + Size largest_size = 0;
>
>   txn = dlist_container(ReorderBufferTXN, node, iter.cur);
>
> - /* if the current transaction is larger, remember it */
> - if ((!largest) || (txn->size > largest->size))
> + /*
> + * If this transaction have some incomplete changes then only consider
> + * the size upto last complete lsn.
> + */
> + if (rbtxn_has_incomplete_tuple(txn))
> + size = txn->complete_size;
> + else
> + size = txn->total_size;
> +
> + /* If the current transaction is larger then remember it. */
> + if ((largest != NULL || size > largest_size) && size > 0)
>
> Here largest_size is a local variable inside the loop which is
> initialized to 0 in each iteration and that will lead to picking each
> next txn as largest.  This seems wrong to me.

You are right, will fix.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Sun, Jul 5, 2020 at 8:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jun 30, 2020 at 10:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> >
> > Yeah, I have run the regression suite,  I can see a lot of failure
> > maybe we can somehow see the diff and confirm that all the failures
> > are due to rollback to savepoint only.  I will work on this.
>
> I have compared the changes logged at command end vs logged at commit
> time.  I have ignored the invalidation for the transaction which has
> any aborted subtransaction in it.  While testing this I found one
> issue, the issue is that if there are some invalidation generated
> between last command counter increment and the commit transaction then
> those were not logged.  I have fixed the issue by logging the pending
> invalidation in RecordTransactionCommit.  I will include the changes
> in the next patch set.
>

I think it would have been better if you could have given examples of
the cases where you need this extra logging.  Anyway, below are a few
minor comments on this patch:

1.
+ /*
+ * Log any pending invalidations which are adding between the last
+ * command counter increment and the commit.
+ */
+ if (XLogLogicalInfoActive())
+ LogLogicalInvalidations();

I think we can change this comment slightly and extend a bit to say
for which kind of special cases we are adding this. "Log any pending
invalidations which are added between the last CommandCounterIncrement
and the commit.  Normally for DDLs, we log this at each command end,
however for certain cases where we directly update the system table
the invalidations were not logged at command end."

Something like above based on cases that are not covered by command
end WAL logging.

2.
+ * Emit WAL for invalidations.  This is currently only used for logging
+ * invalidations at the command end.
+ */
+void
+LogLogicalInvalidations()

After this is getting used at a new place, it is better to modify the
above comment to something like: "Emit WAL for invalidations.  This is
currently only used for logging invalidations at the command end or at
commit time if any invalidations are pending."

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Wed, Jul 8, 2020 at 9:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Jul 5, 2020 at 8:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have compared the changes logged at command end vs logged at commit
> > time.  I have ignored the invalidation for the transaction which has
> > any aborted subtransaction in it.  While testing this I found one
> > issue, the issue is that if there are some invalidation generated
> > between last command counter increment and the commit transaction then
> > those were not logged.  I have fixed the issue by logging the pending
> > invalidation in RecordTransactionCommit.  I will include the changes
> > in the next patch set.
> >
>
> I think it would have been better if you could have given examples for
> such cases where you need this extra logging.  Anyway, below are few
> minor comments on this patch:
>
> 1.
> + /*
> + * Log any pending invalidations which are adding between the last
> + * command counter increment and the commit.
> + */
> + if (XLogLogicalInfoActive())
> + LogLogicalInvalidations();
>
> I think we can change this comment slightly and extend a bit to say
> for which kind of special cases we are adding this. "Log any pending
> invalidations which are added between the last CommandCounterIncrement
> and the commit.  Normally for DDLs, we log this at each command end,
> however for certain cases where we directly update the system table
> the invalidations were not logged at command end."
>
> Something like above based on cases that are not covered by command
> end WAL logging.
>
> 2.
> + * Emit WAL for invalidations.  This is currently only used for logging
> + * invalidations at the command end.
> + */
> +void
> +LogLogicalInvalidations()
>
> After this is getting used at a new place, it is better to modify the
> above comment to something like: "Emit WAL for invalidations.  This is
> currently only used for logging invalidations at the command end or at
> commit time if any invalidations are pending."
>

I have done some more review and below are my comments:

Review-v31-0010-Provide-new-api-to-get-the-streaming-changes
----------------------------------------------------------------------------------------------
1.
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1240,6 +1240,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';

+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int,
VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';

If we are going to add a new streaming API for get_changes, don't we
need one for pg_logical_slot_get_binary_changes,
pg_logical_slot_peek_changes and pg_logical_slot_peek_binary_changes
as well?  I was thinking, why not add a new parameter (streaming
boolean) instead of adding the new APIs?  This could be an optional
parameter which, if the user doesn't specify it, will be considered
false.  We already have optional parameters for APIs like
pg_create_logical_replication_slot.

2. You forgot to update sgml/func.sgml.  This will be required even if
we decide to add a new parameter instead of a new API.

3.
+ /* If called has not asked for streaming changes then disable it. */
+ ctx->streaming &= streaming;

/If called/If the caller

4.
diff --git a/.gitignore b/.gitignore
index 794e35b..6083744 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/

Why the patch contains this change?

5. If I apply the first six patches and run the regression tests, they
fail primarily because streaming got enabled by default.  And then when
I applied this patch, the tests passed because it disables streaming by
default.  I think this should be patch 0007.

Replication Origins
------------------------------
I think we also need to conclude on the origins-related discussion [1].
As far as I can see, the origin_id can be sent with the first startup
message.  The origin_lsn and origin_commit can be sent with the last
start of streaming commit if we want, but I am not sure if that is of
use.  If we need to send it earlier then we need to record it with other
WAL records.  The point is that those are set with
pg_replication_origin_xact_setup, but I am not sure how and when that
function is called.  The other alternative is that we can ignore that
for now and once the usage is clear we can enhance it.  What do you
think?

[1] - https://www.postgresql.org/message-id/CAA4eK1JwXaCezFw%2BkZwoxbLKYD0nWpC2rPgx7vUsaDAc0AZaow%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Ajin Cherian
Дата:
I was going through this thread, testing and reviewing the patches, and I think this is a great feature to have and one which customers would appreciate. I wanted to help out, and I saw a request for a test patch for a GUC to always enable streaming on logical replication. Here's one on top of patchset v31, just in case you still need it. By default the GUC is turned on; I ran the regression tests with it and didn't see any errors.

thanks,
Ajin
Fujitsu Australia

On Wed, Jul 8, 2020 at 8:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Jul 8, 2020 at 9:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Jul 5, 2020 at 8:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have compared the changes logged at command end vs logged at commit
> > time.  I have ignored the invalidation for the transaction which has
> > any aborted subtransaction in it.  While testing this I found one
> > issue, the issue is that if there are some invalidation generated
> > between last command counter increment and the commit transaction then
> > those were not logged.  I have fixed the issue by logging the pending
> > invalidation in RecordTransactionCommit.  I will include the changes
> > in the next patch set.
> >
>
> I think it would have been better if you could have given examples for
> such cases where you need this extra logging.  Anyway, below are few
> minor comments on this patch:
>
> 1.
> + /*
> + * Log any pending invalidations which are adding between the last
> + * command counter increment and the commit.
> + */
> + if (XLogLogicalInfoActive())
> + LogLogicalInvalidations();
>
> I think we can change this comment slightly and extend a bit to say
> for which kind of special cases we are adding this. "Log any pending
> invalidations which are added between the last CommandCounterIncrement
> and the commit.  Normally for DDLs, we log this at each command end,
> however for certain cases where we directly update the system table
> the invalidations were not logged at command end."
>
> Something like above based on cases that are not covered by command
> end WAL logging.
>
> 2.
> + * Emit WAL for invalidations.  This is currently only used for logging
> + * invalidations at the command end.
> + */
> +void
> +LogLogicalInvalidations()
>
> After this is getting used at a new place, it is better to modify the
> above comment to something like: "Emit WAL for invalidations.  This is
> currently only used for logging invalidations at the command end or at
> commit time if any invalidations are pending."
>

I have done some more review and below are my comments:

Review-v31-0010-Provide-new-api-to-get-the-streaming-changes
----------------------------------------------------------------------------------------------
1.
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1240,6 +1240,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';

+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int,
VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';

If we are going to add a new streaming API for get_changes, don't we
need for pg_logical_slot_get_binary_changes,
pg_logical_slot_peek_changes and pg_logical_slot_peek_binary_changes
as well?  I was thinking why not add a new parameter (streaming
boolean) instead of adding the new APIs.  This could be an optional
parameter which if user doesn't specify will be considered as false.
We already have optional parameters for APIs like
pg_create_logical_replication_slot.

2. You forgot to update sgml/func.sgml.  This will be required even if
we decide to add a new parameter instead of a new API.

3.
+ /* If called has not asked for streaming changes then disable it. */
+ ctx->streaming &= streaming;

/If called/If the caller

4.
diff --git a/.gitignore b/.gitignore
index 794e35b..6083744 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/

Why the patch contains this change?

5. If I apply the first six patches and run the regressions, it fails
primarily because streaming got enabled by default.  And then when I
applied this patch, the tests passed because it disables streaming by
default.  I think this should be patch 0007.

Replication Origins
------------------------------
I think we also need to conclude on origins related discussion [1].
As far as I can see, the origin_id can be sent with the first startup
message. The origin_lsn and origin_commit can be sent with the last
start of streaming commit if we want but not sure if that is of use.
If we need to send it earlier then we need to record it with other WAL
records.  The point is that those are set with
pg_replication_origin_xact_setup but not sure how and when that
function is called.  The other alternative is that we can ignore that
for now and once the usage is clear we can enhance it.  What do you
think?

[1] - https://www.postgresql.org/message-id/CAA4eK1JwXaCezFw%2BkZwoxbLKYD0nWpC2rPgx7vUsaDAc0AZaow%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> I was going through this thread and testing and reviewing the patches, I think this is a great feature to have and
> one which customers would appreciate. I wanted to help out, and I saw a request for a test patch for a GUC to always
> enable streaming on logical replication. Here's one on top of patchset v31, just in case you still need it. By default
> the GUC is turned on, I ran the regression tests with it and didn't see any errors.
>

Thanks for showing interest in the patch.  How have you ensured that
streaming is happening?  I don't think the proposed patch can ensure
it for every case, because we also rely on logical_decoding_work_mem to
decide whether to stream/spill; see ReorderBufferCheckMemoryLimit.  I
think with your patch it will allow streaming for cases where we have a
large amount of WAL to decode.

I feel you need to add some DEBUG messages (or some other way) to
ensure that all existing and new test cases related to logical
decoding will perform the streaming.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Ajin Cherian
Дата:


On Thu, Jul 9, 2020 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian <itsajin@gmail.com> wrote:

Thanks for showing the interest in patch.  How have you ensured that
streaming is happening?  I don't think the proposed patch can ensure
it for every case because we also rely on logical_decoding_work_mem to
decide whether to stream/spill, see ReorderBufferCheckMemoryLimit.  I
think with your patch it will allow streaming for cases where we have
large amount of WAL to decode.


Maybe I missed something, but I looked at ReorderBufferCheckMemoryLimit; even there it checks the same function ReorderBufferCanStream() and decides whether to stream or spill. Did I miss something?

    while (rb->size >= logical_decoding_work_mem * 1024L)
    {
        /*
         * Pick the largest transaction (or subtransaction) and evict it from
         * memory by streaming, if supported. Otherwise, spill to disk.
         */
        if (ReorderBufferCanStream(rb) &&
            (txn = ReorderBufferLargestTopTXN(rb)) != NULL)
        {
            /* we know there has to be one, because the size is not zero */
            Assert(txn && !txn->toptxn);
            Assert(txn->total_size > 0);
            Assert(rb->size >= txn->total_size);

            ReorderBufferStreamTXN(rb, txn);
        }
        else
        { 

I will also add the debug messages and tests as you suggested.

regards,
Ajin Cherian
Fujitsu Australia

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Thu, Jul 9, 2020 at 8:18 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Thu, Jul 9, 2020 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian <itsajin@gmail.com> wrote:
>>
>> Thanks for showing the interest in patch.  How have you ensured that
>> streaming is happening?  I don't think the proposed patch can ensure
>> it for every case because we also rely on logical_decoding_work_mem to
>> decide whether to stream/spill, see ReorderBufferCheckMemoryLimit.  I
>> think with your patch it will allow streaming for cases where we have
>> large amount of WAL to decode.
>>
>
> Maybe I missed something but I looked at ReorderBufferCheckMemoryLimit, even there it checks the same function
> ReorderBufferCanStream() and decides whether to stream or spill. Did I miss something?
>
>
>     while (rb->size >= logical_decoding_work_mem * 1024L)
>     {

There is a check before above loop:

ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
{
ReorderBufferTXN *txn;

/* bail out if we haven't exceeded the memory limit */
if (rb->size < logical_decoding_work_mem * 1024L)
return;

This will prevent the streaming/spill from occurring.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Thu, Jul 9, 2020 at 8:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 9, 2020 at 8:18 AM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > On Thu, Jul 9, 2020 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>
> >> On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian <itsajin@gmail.com> wrote:
> >>
> >> Thanks for showing the interest in patch.  How have you ensured that
> >> streaming is happening?  I don't think the proposed patch can ensure
> >> it for every case because we also rely on logical_decoding_work_mem to
> >> decide whether to stream/spill, see ReorderBufferCheckMemoryLimit.  I
> >> think with your patch it will allow streaming for cases where we have
> >> large amount of WAL to decode.
> >>
> >
> > Maybe I missed something but I looked at ReorderBufferCheckMemoryLimit, even there it checks the same function
> > ReorderBufferCanStream() and decides whether to stream or spill. Did I miss something?
> >
> >
> >     while (rb->size >= logical_decoding_work_mem * 1024L)
> >     {
>
> There is a check before above loop:
>
> ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
> {
> ReorderBufferTXN *txn;
>
> /* bail out if we haven't exceeded the memory limit */
> if (rb->size < logical_decoding_work_mem * 1024L)
> return;
>
> This will prevent the streaming/spill to occur.

I think if the GUC is set then maybe we can bypass this check so that
it can try to stream every single change?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Thu, Jul 9, 2020 at 8:55 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Jul 9, 2020 at 8:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Jul 9, 2020 at 8:18 AM Ajin Cherian <itsajin@gmail.com> wrote:
> > >
> > > On Thu, Jul 9, 2020 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >>
> > >> On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian <itsajin@gmail.com> wrote:
> > >>
> > >> Thanks for showing the interest in patch.  How have you ensured that
> > >> streaming is happening?  I don't think the proposed patch can ensure
> > >> it for every case because we also rely on logical_decoding_work_mem to
> > >> decide whether to stream/spill, see ReorderBufferCheckMemoryLimit.  I
> > >> think with your patch it will allow streaming for cases where we have
> > >> large amount of WAL to decode.
> > >>
> > >
> > > Maybe I missed something but I looked at ReorderBufferCheckMemoryLimit, even there it checks the same function
> > > ReorderBufferCanStream() and decides whether to stream or spill. Did I miss something?
> > >
> > >
> > >     while (rb->size >= logical_decoding_work_mem * 1024L)
> > >     {
> >
> > There is a check before above loop:
> >
> > ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
> > {
> > ReorderBufferTXN *txn;
> >
> > /* bail out if we haven't exceeded the memory limit */
> > if (rb->size < logical_decoding_work_mem * 1024L)
> > return;
> >
> > This will prevent the streaming/spill to occur.
>
> I think if the GUC is set then maybe we can bypass this check so that
> it can try to stream every single change?
>

Yeah, and probably we need to do something for the check "while
(rb->size >= logical_decoding_work_mem * 1024L)" as well.
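
i.e., something along these lines in ReorderBufferCheckMemoryLimit
(just a sketch, assuming a boolean GUC along the lines of Ajin's test
patch; I am calling it force_stream here only for illustration):

    /* bail out if we haven't exceeded the memory limit and are not forced */
    if (rb->size < logical_decoding_work_mem * 1024L && !force_stream)
        return;

    while (rb->size > 0 &&
           (rb->size >= logical_decoding_work_mem * 1024L || force_stream))
    {
        /* unchanged: pick the largest (sub)transaction and stream or spill it */
        ...
    }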

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Wed, Jul 8, 2020 at 3:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jul 8, 2020 at 9:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sun, Jul 5, 2020 at 8:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > I have compared the changes logged at command end vs logged at commit
> > > time.  I have ignored the invalidation for the transaction which has
> > > any aborted subtransaction in it.  While testing this I found one
> > > issue, the issue is that if there are some invalidation generated
> > > between last command counter increment and the commit transaction then
> > > those were not logged.  I have fixed the issue by logging the pending
> > > invalidation in RecordTransactionCommit.  I will include the changes
> > > in the next patch set.
> > >
> >
> > I think it would have been better if you could have given examples for
> > such cases where you need this extra logging.  Anyway, below are few
> > minor comments on this patch:
> >
> > 1.
> > + /*
> > + * Log any pending invalidations which are adding between the last
> > + * command counter increment and the commit.
> > + */
> > + if (XLogLogicalInfoActive())
> > + LogLogicalInvalidations();
> >
> > I think we can change this comment slightly and extend a bit to say
> > for which kind of special cases we are adding this. "Log any pending
> > invalidations which are added between the last CommandCounterIncrement
> > and the commit.  Normally for DDLs, we log this at each command end,
> > however for certain cases where we directly update the system table
> > the invalidations were not logged at command end."
> >
> > Something like above based on cases that are not covered by command
> > end WAL logging.
> >
> > 2.
> > + * Emit WAL for invalidations.  This is currently only used for logging
> > + * invalidations at the command end.
> > + */
> > +void
> > +LogLogicalInvalidations()
> >
> > After this is getting used at a new place, it is better to modify the
> > above comment to something like: "Emit WAL for invalidations.  This is
> > currently only used for logging invalidations at the command end or at
> > commit time if any invalidations are pending."
> >
>
> I have done some more review and below are my comments:
>
> Review-v31-0010-Provide-new-api-to-get-the-streaming-changes
> ----------------------------------------------------------------------------------------------
> 1.
> --- a/src/backend/catalog/system_views.sql
> +++ b/src/backend/catalog/system_views.sql
> @@ -1240,6 +1240,14 @@ LANGUAGE INTERNAL
>  VOLATILE ROWS 1000 COST 1000
>  AS 'pg_logical_slot_get_changes';
>
> +CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
> +    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int,
> VARIADIC options text[] DEFAULT '{}',
> +    OUT lsn pg_lsn, OUT xid xid, OUT data text)
> +RETURNS SETOF RECORD
> +LANGUAGE INTERNAL
> +VOLATILE ROWS 1000 COST 1000
> +AS 'pg_logical_slot_get_streaming_changes';
>
> If we are going to add a new streaming API for get_changes, don't we
> need for pg_logical_slot_get_binary_changes,
> pg_logical_slot_peek_changes and pg_logical_slot_peek_binary_changes
> as well?  I was thinking why not add a new parameter (streaming
> boolean) instead of adding the new APIs.  This could be an optional
> parameter which if user doesn't specify will be considered as false.
> We already have optional parameters for APIs like
> pg_create_logical_replication_slot.
>
> 2. You forgot to update sgml/func.sgml.  This will be required even if
> we decide to add a new parameter instead of a new API.
>
> 3.
> + /* If called has not asked for streaming changes then disable it. */
> + ctx->streaming &= streaming;
>
> /If called/If the caller
>
> 4.
> diff --git a/.gitignore b/.gitignore
> index 794e35b..6083744 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -42,3 +42,4 @@ lib*.pc
>  /Debug/
>  /Release/
>  /tmp_install/
> +/build/
>
> Why the patch contains this change?
>
> 5. If I apply the first six patches and run the regressions, it fails
> primarily because streaming got enabled by default.  And then when I
> applied this patch, the tests passed because it disables streaming by
> default.  I think this should be patch 0007.

I am only replying to the replication origin point here; the other
comments look fine to me, so I will work on those.

> Replication Origins
> ------------------------------
> I think we also need to conclude on origins related discussion [1].
> As far as I can see, the origin_id can be sent with the first startup
> message. The origin_lsn and origin_commit can be sent with the last
> start of streaming commit if we want but not sure if that is of use.
> If we need to send it earlier then we need to record it with other WAL
> records.  The point is that those are set with
> pg_replication_origin_xact_setup but not sure how and when that
> function is called.

pg_replication_origin_xact_setup is an exposed function, so it allows a
user to set an origin for their session so that all the operations done
from that session will be marked with that origin id.  And the clear use
case for this is to avoid sending such transactions by using
FilterByOrigin.  But I am not sure about the point that we discussed at
[1], i.e., what is the use of the origin and origin_lsn we send at
pgoutput_begin_txn.
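For reference, the filtering itself happens through the filter_by_origin
callback; a minimal sketch of such a callback (modelled on
pgoutput_origin_filter, where returning true means the change is skipped;
the actual behaviour in the patch may differ) is:

static bool
pgoutput_origin_filter(LogicalDecodingContext *ctx,
					   RepOriginId origin_id)
{
	/* skip everything that was not generated locally */
	return (origin_id != InvalidRepOriginId);
}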

> The other alternative is that we can ignore that
> for now and once the usage is clear we can enhance it.  What do you
> think?

That seems like a sensible option to me.

> [1] - https://www.postgresql.org/message-id/CAA4eK1JwXaCezFw%2BkZwoxbLKYD0nWpC2rPgx7vUsaDAc0AZaow%40mail.gmail.com


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Thu, Jul 9, 2020 at 2:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jul 8, 2020 at 3:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> Only replying to the replication origin point, other comment looks
> fine to me so I will work on those.
>
> > Replication Origins
> > ------------------------------
> > I think we also need to conclude on origins related discussion [1].
> > As far as I can see, the origin_id can be sent with the first startup
> > message. The origin_lsn and origin_commit can be sent with the last
> > start of streaming commit if we want but not sure if that is of use.
> > If we need to send it earlier then we need to record it with other WAL
> > records.  The point is that those are set with
> > pg_replication_origin_xact_setup but not sure how and when that
> > function is called.
>
> pg_replication_origin_xact_setup is exposed function so this will
> allow a user to set an origin for their session so that all the
> operation done from that session will be marked by that origin id.
>

Hmm, I think that can be done by pg_replication_origin_session_setup.

> And the clear use case for this is to avoid sending such transactions
> by using FilterByOrigin.  But I am not sure about the point that we
> discussed at [1] that what is the use of the origin and origin_lsn we
> send at pgoutput_begin_txn.
>

I could see the use of 'origin' with FilterByOrigin but not sure how
origin_lsn can be used?

> > The other alternative is that we can ignore that
> > for now and once the usage is clear we can enhance it.  What do you
> > think?
>
> That seems like a sensible option to me.
>

I have responded to that another thread.  Let us see if someone
responds to it.  Feel free to add if you have some points related to
that thread.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Ajin Cherian
Date:


On Thu, Jul 9, 2020 at 1:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> I think if the GUC is set then maybe we can bypass this check so that
> it can try to stream every single change?
>

Yeah and probably we need to do something for the check "while
(rb->size >= logical_decoding_work_mem * 1024L)" as well.


I have made this change as discussed, and the regression tests seem to run fine. I have added a debug message that records the streaming for each transaction number. I also had to bypass certain asserts in ReorderBufferLargestTopTXN(), as now we are going through the entire list of transactions and not just picking the biggest transaction.

regards,
Ajin 
Fujitsu Australia
Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Fri, Jul 10, 2020 at 9:21 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
>
>
> On Thu, Jul 9, 2020 at 1:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>>
>> > I think if the GUC is set then maybe we can bypass this check so that
>> > it can try to stream every single change?
>> >
>>
>> Yeah and probably we need to do something for the check "while
>> (rb->size >= logical_decoding_work_mem * 1024L)" as well.
>>
>>
> I have made this change, as discussed, the regression tests seem to run fine. I have added a debug that records the
> streaming for each transaction number. I also had to bypass certain asserts in ReorderBufferLargestTopTXN() as now we
> are going through the entire list of transactions and not just picking the biggest transaction.

So if always_stream_logical is true then we always go for streaming
even if the size limit is not reached, and that is good.  And if
always_stream_logical is set then we are setting ctx->streaming=true,
which is also good.  So now I don't think we need to change this part
of the code, because when we bypass the memory limit and set
ctx->streaming=true it will always select the streaming option unless
that is impossible.  With your changes, due to incomplete toast
changes, it sometimes cannot pick the largest top txn for streaming and
will then hang forever in the while loop; in that case, it should go
for spilling.

while (rb->size >= logical_decoding_work_mem * 1024L)
{
/*
* Pick the largest transaction (or subtransaction) and evict it from
* memory by streaming, if supported. Otherwise, spill to disk.
*/
if (ReorderBufferCanStream(rb) &&
(txn = ReorderBufferLargestTopTXN(rb)) != NULL)


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Ajin Cherian
Date:


On Fri, Jul 10, 2020 at 3:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
With your changes sometimes due to incomplete toast
changes, if it can not pick the largest top txn for streaming it will
hang forever in the while loop, in that case, it should go for
spilling.

while (rb->size >= logical_decoding_work_mem * 1024L)
{
/*
* Pick the largest transaction (or subtransaction) and evict it from
* memory by streaming, if supported. Otherwise, spill to disk.
*/
if (ReorderBufferCanStream(rb) &&
(txn = ReorderBufferLargestTopTXN(rb)) != NULL)



Which condition is this (not picking the largest top txn)? Wouldn't ReorderBufferLargestTopTXN then return NULL? If not, is there a way to know that a transaction cannot be streamed, so there can be an exit condition for the while loop?

regards,
Ajin Cherian
Fujitsu Australia

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Fri, Jul 10, 2020 at 11:01 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
>
>
> On Fri, Jul 10, 2020 at 3:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> With your changes sometimes due to incomplete toast
>> changes, if it can not pick the largest top txn for streaming it will
>> hang forever in the while loop, in that case, it should go for
>> spilling.
>>
>> while (rb->size >= logical_decoding_work_mem * 1024L)
>> {
>> /*
>> * Pick the largest transaction (or subtransaction) and evict it from
>> * memory by streaming, if supported. Otherwise, spill to disk.
>> */
>> if (ReorderBufferCanStream(rb) &&
>> (txn = ReorderBufferLargestTopTXN(rb)) != NULL)
>>
>>
>
> Which is this condition (of not picking largest top txn)? Wouldn't ReorderBufferLargestTopTXN then return a NULL? If
> not, is there a way to know that a transaction cannot be streamed, so there can be an exit condition for the while
> loop?


Okay, I see, so if ReorderBufferLargestTopTXN returns NULL you are
breaking the loop.  I did not see the other part of the patch, but I
agree that it will not go into an infinite loop.
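So, to summarize, the eviction loop would roughly fall back to spilling
whenever no streamable transaction can be picked (a sketch only; function
names as in the patch under discussion):

while (rb->size >= logical_decoding_work_mem * 1024L)
{
	if (ReorderBufferCanStream(rb) &&
		(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
	{
		/* stream the largest streamable top-level transaction */
		ReorderBufferStreamTXN(rb, txn);
	}
	else
	{
		/*
		 * Either streaming is not supported, or no top-level transaction
		 * can currently be streamed (e.g. because of incomplete toast
		 * changes), so spill the largest transaction to disk instead.
		 */
		txn = ReorderBufferLargestTXN(rb);
		ReorderBufferSerializeTXN(rb, txn);
	}
}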


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jun 30, 2020 at 5:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Let me know what you think about the above changes.
> >
>
> I went ahead and made few changes in
> 0005-Implement-streaming-mode-in-ReorderBuffer which are explained
> below.  I have few questions and suggestions for the patch as well
> which are also covered in below points.
>
> 1.
> + if (prev_lsn == InvalidXLogRecPtr)
> + {
> + if (streaming)
> + rb->stream_start(rb, txn, change->lsn);
> + else
> + rb->begin(rb, txn);
> + stream_started = true;
> + }
>
> I don't think we want to move begin callback here that will change the
> existing semantics, so it is better to move begin at its original
> position. I have made the required changes in the attached patch.
>
> 2.
> ReorderBufferTruncateTXN()
> {
> ..
> + dlist_foreach_modify(iter, &txn->changes)
> + {
> + ReorderBufferChange *change;
> +
> + change = dlist_container(ReorderBufferChange, node, iter.cur);
> +
> + /* remove the change from it's containing list */
> + dlist_delete(&change->node);
> +
> + ReorderBufferReturnChange(rb, change);
> + }
> ..
> }
>
> I think here we can add an Assert that we're not mixing changes from
> different transactions.  See the changes in the patch.
>
> 3.
> SetupCheckXidLive()
> {
> ..
> + /*
> + * setup CheckXidAlive if it's not committed yet. We don't check if the xid
> + * aborted. That will happen during catalog access.  Also, reset the
> + * bsysscan flag.
> + */
> + if (!TransactionIdDidCommit(xid))
> + {
> + CheckXidAlive = xid;
> + bsysscan = false;
> ..
> }
>
> What is the need to reset bsysscan flag here if we are already
> resetting on error (like in the previous patch sent by me)?
>
> 4.
> ReorderBufferProcessTXN()
> {
> ..
> ..
> + /* Reset the CheckXidAlive */
> + if (streaming)
> + CheckXidAlive = InvalidTransactionId;
> ..
> }
>
> Similar to the previous point, we don't need this as well because
> AbortCurrentTransaction would have taken care of this.
>
> 5.
> + * XXX Do we need to check if the transaction has some changes to stream
> + * (maybe it got streamed right before the commit, which attempts to
> + * stream it again before the commit)?
> + */
> +static void
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
>
> The above comment doesn't make much sense to me, so I have removed it.
> Basically, if there are no changes before commit, we still need to
> send commit and anyway if there are no more changes
> ReorderBufferProcessTXN will not do anything.
>
> 6.
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> if (txn->snapshot_now == NULL)
> + {
> + dlist_iter subxact_i;
> +
> + /* make sure this transaction is streamed for the first time */
> + Assert(!rbtxn_is_streamed(txn));
> +
> + /* at the beginning we should have invalid command ID */
> + Assert(txn->command_id == InvalidCommandId);
> +
> + dlist_foreach(subxact_i, &txn->subtxns)
> + {
> + ReorderBufferTXN *subtxn;
> +
> + subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
> + ReorderBufferTransferSnapToParent(txn, subtxn);
> + }
> ..
> }
>
> Here, it is possible that there is no base_snapshot for txn, so we
> need a check for that similar to ReorderBufferCommit.
>
> 7.  Apart from the above, I made few changes in comments and ran pgindent.
>
> 8. We can't stream the transaction before we reach the
> SNAPBUILD_CONSISTENT state because some other output plugin can apply
> those changes unlike what we do with pgoutput plugin (which writes to
> file). And, I think applying the transactions without reaching a
> consistent state would be anyway wrong.  So, we should avoid that and
> if do that then we should have an Assert for streamed txns rather than
> sending abort for them in ReorderBufferForget.

I was analyzing this point so currently, we only enable streaming in
StartReplicationSlot so basically in CreateReplicationSlot the
streaming will be always off because by that time plugins are not yet
startup that will happen only on StartReplicationSlot.  See below
snippet from patch 0007.  However, I agree that during the start of
replication on the slot we might decode some extra WAL for transactions
for which we have already got the commit confirmation, and we must have
a way to avoid that.  But I think we don't need to do anything for the
CONSISTENT snapshot point.  What's your thought on this?

@@ -1016,6 +1016,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
  WalSndPrepareWrite, WalSndWriteData,
  WalSndUpdateProgress);

+ /*
+ * Make sure streaming is disabled here - we may have the methods,
+ * but we don't have anywhere to send the data yet.
+ */
+ ctx->streaming = false;
+

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Mon, Jul 6, 2020 at 11:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > >
> > > > 9.
> > > > +ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > > > {
> > > > ..
> > > > + ReorderBufferToastReset(rb, txn);
> > > > + if (specinsert != NULL)
> > > > + ReorderBufferReturnChange(rb, specinsert);
> > > > ..
> > > > }
> > > >
> > > > Why do we need to do these here when we wouldn't have been done for
> > > > any exception other than ERRCODE_TRANSACTION_ROLLBACK?
> > >
> > > Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK"
> > > gracefully and we are continuing with further decoding so we need to
> > > return this change back.
> > >
> >
> > Okay, then I suggest we should do these before calling stream_stop and
> > also move ReorderBufferResetTXN after calling stream_stop  to follow a
> > pattern similar to try block unless there is a reason for not doing
> > so.  Also, it would be good if we can initialize specinsert with NULL
> > after returning the change as we are doing at other places.
>
> Okay
>
> > > > 10.  I have got the below failure once.  I have not investigated this
> > > > in detail as the patch is still under progress.  See, if you have any
> > > > idea?
> > > > #   Failed test 'check extra columns contain local defaults'
> > > > #   at t/013_stream_subxact_ddl_abort.pl line 81.
> > > > #          got: '2|0'
> > > > #     expected: '1000|500'
> > > > # Looks like you failed 1 test of 2.
> > > > make[2]: *** [check] Error 1
> > > > make[1]: *** [check-subscription-recurse] Error 2
> > > > make[1]: *** Waiting for unfinished jobs....
> > > > make: *** [check-world-src/test-recurse] Error 2
> > >
> > > Even I got the failure once and after that, it did not reproduce.  I
> > > have executed it multiple time but it did not reproduce again.  Are
> > > you able to reproduce it consistently?
> > >
> >
> > No, I am also not able to reproduce it consistently but I think this
> > can fail if a subscriber sends the replay_location before actually
> > replaying the changes.  First, I thought that extra send_feedback we
> > have in apply_handle_stream_commit might have caused this but I guess
> > that can't happen because we need the commit time location for that
> > and we are storing the same at the end of apply_handle_stream_commit
> > after applying all messages.  I am not sure what is going on here.  I
> > think we somehow need to reproduce this or some variant of this test
> > consistently to find the root cause.
>
> And I think it appeared first time for me,  so maybe either induced
> from past few versions so some changes in the last few versions might
> have exposed it.  I have noticed that almost 50% of the time I am able
> to reproduce after the clean build so I can trace back from which
> version it started appearing that way it will be easy to narrow down.

I think the reason for the failure is that we are not setting
remote_final_lsn in the streaming mode.  I have put in multiple logs and
executed the test, and from the logs it appeared that some of the
logical WAL did not get replayed due to the below check in
should_apply_changes_for_rel.
return (rel->state == SUBREL_STATE_READY || (rel->state ==
SUBREL_STATE_SYNCDONE && rel->statelsn <= remote_final_lsn));

I still need to do a detailed analysis of why this fails only in some
cases.  Basically, most of the time rel->state is SUBREL_STATE_READY,
so this check passes, but whenever the state is SUBREL_STATE_SYNCDONE
it failed because we never update remote_final_lsn.  I will try to set
this value in apply_handle_stream_commit and see whether it ever fails
or not.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Fri, Jul 10, 2020 at 3:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > 8. We can't stream the transaction before we reach the
> > SNAPBUILD_CONSISTENT state because some other output plugin can apply
> > those changes unlike what we do with pgoutput plugin (which writes to
> > file). And, I think applying the transactions without reaching a
> > consistent state would be anyway wrong.  So, we should avoid that and
> > if do that then we should have an Assert for streamed txns rather than
> > sending abort for them in ReorderBufferForget.
>
> I was analyzing this point so currently, we only enable streaming in
> StartReplicationSlot so basically in CreateReplicationSlot the
> streaming will be always off because by that time plugins are not yet
> startup that will happen only on StartReplicationSlot.
>

What do you mean by 'startup' in the above sentence?  AFAICS, we do
call startup_cb_wrapper in CreateInitDecodingContext which is called
from both CreateReplicationSlot and create_logical_replication_slot
before the start of decoding.  In CreateInitDecodingContext, we call
StartupDecodingContext which should load the plugin.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Mon, Jul 13, 2020 at 10:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jul 10, 2020 at 3:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > 8. We can't stream the transaction before we reach the
> > > SNAPBUILD_CONSISTENT state because some other output plugin can apply
> > > those changes unlike what we do with pgoutput plugin (which writes to
> > > file). And, I think applying the transactions without reaching a
> > > consistent state would be anyway wrong.  So, we should avoid that and
> > > if do that then we should have an Assert for streamed txns rather than
> > > sending abort for them in ReorderBufferForget.
> >
> > I was analyzing this point so currently, we only enable streaming in
> > StartReplicationSlot so basically in CreateReplicationSlot the
> > streaming will be always off because by that time plugins are not yet
> > startup that will happen only on StartReplicationSlot.
> >
>
> What do you mean by 'startup' in the above sentence?  AFAICS, we do
> call startup_cb_wrapper in CreateInitDecodingContext which is called
> from both CreateReplicationSlot and create_logical_replication_slot
> before the start of decoding.  In CreateInitDecodingContext, we call
> StartupDecodingContext which should load the plugin.

Yeah, you are right that we do call startup_cb_wrapper from
CreateInitDecodingContext as well.  I think I got confused by below
comment in patch 0007

@@ -1016,6 +1016,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
WalSndPrepareWrite, WalSndWriteData,
WalSndUpdateProgress);
+ /*
+ * Make sure streaming is disabled here - we may have the methods,
+ * but we don't have anywhere to send the data yet.
+ */
+ ctx->streaming = false;
+

Basically, during CreateReplicationSlot we forcefully disable the
streaming with the comment "we don't have anywhere to send the data
yet".  So during CreateReplicationSlot the streaming will always be
off, and once we are done with creating the slot we will have a
consistent snapshot.  So my point is: while decoding, can we just check
that we do not start streaming unless the current LSN has reached the
start_decoding_at point, and start only after that?  At that time we
can have an assert that the snapshot should be CONSISTENT.  However,
before doing that I need to check why we are setting ctx->streaming to
false after creating the slot.
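As a rough sketch of that idea (a hypothetical helper; the SnapBuild
routines are the existing ones from snapbuild.c, and the exact check in
the patch may differ), streaming could be gated on something like:

static bool
ReorderBufferCanStartStreaming(ReorderBuffer *rb)
{
	LogicalDecodingContext *ctx = rb->private_data;
	SnapBuild  *builder = ctx->snapshot_builder;

	/*
	 * Don't stream before we have a consistent snapshot and before we have
	 * passed the point (start_decoding_at) from which the client asked us
	 * to send changes.
	 */
	return ReorderBufferCanStream(rb) &&
		SnapBuildCurrentState(builder) == SNAPBUILD_CONSISTENT &&
		!SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr);
}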

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Sun, Jul 12, 2020 at 9:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jul 6, 2020 at 11:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > >
> > > > > 9.
> > > > > +ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > > > > {
> > > > > ..
> > > > > + ReorderBufferToastReset(rb, txn);
> > > > > + if (specinsert != NULL)
> > > > > + ReorderBufferReturnChange(rb, specinsert);
> > > > > ..
> > > > > }
> > > > >
> > > > > Why do we need to do these here when we wouldn't have been done for
> > > > > any exception other than ERRCODE_TRANSACTION_ROLLBACK?
> > > >
> > > > Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK"
> > > > gracefully and we are continuing with further decoding so we need to
> > > > return this change back.
> > > >
> > >
> > > Okay, then I suggest we should do these before calling stream_stop and
> > > also move ReorderBufferResetTXN after calling stream_stop  to follow a
> > > pattern similar to try block unless there is a reason for not doing
> > > so.  Also, it would be good if we can initialize specinsert with NULL
> > > after returning the change as we are doing at other places.
> >
> > Okay
> >
> > > > > 10.  I have got the below failure once.  I have not investigated this
> > > > > in detail as the patch is still under progress.  See, if you have any
> > > > > idea?
> > > > > #   Failed test 'check extra columns contain local defaults'
> > > > > #   at t/013_stream_subxact_ddl_abort.pl line 81.
> > > > > #          got: '2|0'
> > > > > #     expected: '1000|500'
> > > > > # Looks like you failed 1 test of 2.
> > > > > make[2]: *** [check] Error 1
> > > > > make[1]: *** [check-subscription-recurse] Error 2
> > > > > make[1]: *** Waiting for unfinished jobs....
> > > > > make: *** [check-world-src/test-recurse] Error 2
> > > >
> > > > Even I got the failure once and after that, it did not reproduce.  I
> > > > have executed it multiple time but it did not reproduce again.  Are
> > > > you able to reproduce it consistently?
> > > >
> > >
> > > No, I am also not able to reproduce it consistently but I think this
> > > can fail if a subscriber sends the replay_location before actually
> > > replaying the changes.  First, I thought that extra send_feedback we
> > > have in apply_handle_stream_commit might have caused this but I guess
> > > that can't happen because we need the commit time location for that
> > > and we are storing the same at the end of apply_handle_stream_commit
> > > after applying all messages.  I am not sure what is going on here.  I
> > > think we somehow need to reproduce this or some variant of this test
> > > consistently to find the root cause.
> >
> > And I think it appeared first time for me,  so maybe either induced
> > from past few versions so some changes in the last few versions might
> > have exposed it.  I have noticed that almost 50% of the time I am able
> > to reproduce after the clean build so I can trace back from which
> > version it started appearing that way it will be easy to narrow down.
>
> I think the reason for the failure is that we are not setting
> remote_final_lsn, in the streaming mode.  I have put multiple logs and
> executed in log and from logs it appeared that some of the logical wal
> did not get replayed due to below check in
> should_apply_changes_for_rel.
> return (rel->state == SUBREL_STATE_READY || (rel->state ==
> SUBREL_STATE_SYNCDONE && rel->statelsn <= remote_final_lsn));
>
> I still need to do the detailed analysis that why does this fail in
> some cases,  basically, most of the time the rel->state is
> SUBREL_STATE_READY so this check passes but whenever the state is
> SUBREL_STATE_SYNCDONE it failed because we never update
> remote_final_lsn.  I will try to set this value in
> apply_handle_stream_commit and see whether it ever fails or not.

I have verified that after setting the remote_final_lsn in the
apply_handle_stream_commit, I don't see that regression failure in
over 70 runs whereas without that change it failed 6 times in 50 runs.
Apart from this, I have noticed one more thing related to the same
point.  Basically, in the apply_handle_commit, we are calling
process_syncing_tables whereas we are not calling the same in
apply_handle_stream_commit.
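For reference, the fix I am describing would look roughly like this (a
sketch only; logicalrep_read_stream_commit and LogicalRepCommitData follow
the patch and the existing protocol code, and the last line adds the
process_syncing_tables call mentioned above):

static void
apply_handle_stream_commit(StringInfo s)
{
	TransactionId xid;
	LogicalRepCommitData commit_data;

	xid = logicalrep_read_stream_commit(s, &commit_data);

	/*
	 * Set remote_final_lsn before replaying the spooled changes, so that
	 * should_apply_changes_for_rel() sees the commit LSN for relations
	 * that are in SUBREL_STATE_SYNCDONE state.
	 */
	remote_final_lsn = commit_data.commit_lsn;

	/* ... replay the changes spooled to file for this transaction ... */

	/* process any tables that are being synchronized in parallel */
	process_syncing_tables(commit_data.end_lsn);
}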

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Mon, Jul 13, 2020 at 10:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jul 13, 2020 at 10:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Jul 10, 2020 at 3:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > >
> > > > 8. We can't stream the transaction before we reach the
> > > > SNAPBUILD_CONSISTENT state because some other output plugin can apply
> > > > those changes unlike what we do with pgoutput plugin (which writes to
> > > > file). And, I think applying the transactions without reaching a
> > > > consistent state would be anyway wrong.  So, we should avoid that and
> > > > if do that then we should have an Assert for streamed txns rather than
> > > > sending abort for them in ReorderBufferForget.
> > >
> > > I was analyzing this point so currently, we only enable streaming in
> > > StartReplicationSlot so basically in CreateReplicationSlot the
> > > streaming will be always off because by that time plugins are not yet
> > > startup that will happen only on StartReplicationSlot.
> > >
> >
> > What do you mean by 'startup' in the above sentence?  AFAICS, we do
> > call startup_cb_wrapper in CreateInitDecodingContext which is called
> > from both CreateReplicationSlot and create_logical_replication_slot
> > before the start of decoding.  In CreateInitDecodingContext, we call
> > StartupDecodingContext which should load the plugin.
>
> Yeah, you are right that we do call startup_cb_wrapper from
> CreateInitDecodingContext as well.  I think I got confused by below
> comment in patch 0007
>
> @@ -1016,6 +1016,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
> WalSndPrepareWrite, WalSndWriteData,
> WalSndUpdateProgress);
> + /*
> + * Make sure streaming is disabled here - we may have the methods,
> + * but we don't have anywhere to send the data yet.
> + */
> + ctx->streaming = false;
> +
>
> Basically, during CreateReplicationSlot we forcefully disable the
> streaming with the comment "we don't have anywhere to send the data
> yet".  So my point is during CreateReplicationSlot time the streaming
> will always be off and once we are done with creating the slot we will
> be having consistent snapshot.  So my point is can we just check that
> while decoding unless the current LSN reaches the start_decoding_at
> point we should not start streaming and after that we can start.  At
> that time we can have an assert that the snapshot should be
> CONSISTENT.  However, before doing that I need to check on this point
> that why after creating slot we are setting ctx->streaming to false.
>

I think you can refer to commit message as well for that "We however
must explicitly disable streaming replication during replication slot
creation, even if the plugin supports it. We don't need to replicate
the changes accumulated during this phase, and moreover, we don't have
a replication connection open so we don't have where to send the data
anyway.".  I don't think this is a good way to hack the streaming flag
because for SQL API's, we don't have a good reason to disable the
streaming in this way.  I guess if we had a condition related to
reaching CONSISTENT snapshot during streaming then we won't need to
hack the streaming flag in this way.  Once we reach the CONSISTENT
snapshot state, we come out of the creation of a replication slot (see
how we use DecodingContextReady to achieve that) phase.  So, I feel we
should remove the ctx->streaming setting to false and add a CONSISTENT
snapshot check during streaming unless you have a reason for not doing
so.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Mon, Jul 13, 2020 at 10:47 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sun, Jul 12, 2020 at 9:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jul 6, 2020 at 11:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > >
> > > > > > 10.  I have got the below failure once.  I have not investigated this
> > > > > > in detail as the patch is still under progress.  See, if you have any
> > > > > > idea?
> > > > > > #   Failed test 'check extra columns contain local defaults'
> > > > > > #   at t/013_stream_subxact_ddl_abort.pl line 81.
> > > > > > #          got: '2|0'
> > > > > > #     expected: '1000|500'
> > > > > > # Looks like you failed 1 test of 2.
> > > > > > make[2]: *** [check] Error 1
> > > > > > make[1]: *** [check-subscription-recurse] Error 2
> > > > > > make[1]: *** Waiting for unfinished jobs....
> > > > > > make: *** [check-world-src/test-recurse] Error 2
> > > > >
> > > > > Even I got the failure once and after that, it did not reproduce.  I
> > > > > have executed it multiple time but it did not reproduce again.  Are
> > > > > you able to reproduce it consistently?
> > > > >
> > > >
...
..
> >
> > I think the reason for the failure is that we are not setting
> > remote_final_lsn, in the streaming mode.  I have put multiple logs and
> > executed in log and from logs it appeared that some of the logical wal
> > did not get replayed due to below check in
> > should_apply_changes_for_rel.
> > return (rel->state == SUBREL_STATE_READY || (rel->state ==
> > SUBREL_STATE_SYNCDONE && rel->statelsn <= remote_final_lsn));
> >
> > I still need to do the detailed analysis that why does this fail in
> > some cases,  basically, most of the time the rel->state is
> > SUBREL_STATE_READY so this check passes but whenever the state is
> > SUBREL_STATE_SYNCDONE it failed because we never update
> > remote_final_lsn.  I will try to set this value in
> > apply_handle_stream_commit and see whether it ever fails or not.
>
> I have verified that after setting the remote_final_lsn in the
> apply_handle_stream_commit, I don't see that regression failure in
> over 70 runs whereas without that change it failed 6 times in 50 runs.
>

Your analysis and fix seem correct to me.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jul 13, 2020 at 10:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jul 13, 2020 at 10:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Jul 10, 2020 at 3:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > >
> > > > > 8. We can't stream the transaction before we reach the
> > > > > SNAPBUILD_CONSISTENT state because some other output plugin can apply
> > > > > those changes unlike what we do with pgoutput plugin (which writes to
> > > > > file). And, I think applying the transactions without reaching a
> > > > > consistent state would be anyway wrong.  So, we should avoid that and
> > > > > if do that then we should have an Assert for streamed txns rather than
> > > > > sending abort for them in ReorderBufferForget.
> > > >
> > > > I was analyzing this point so currently, we only enable streaming in
> > > > StartReplicationSlot so basically in CreateReplicationSlot the
> > > > streaming will be always off because by that time plugins are not yet
> > > > startup that will happen only on StartReplicationSlot.
> > > >
> > >
> > > What do you mean by 'startup' in the above sentence?  AFAICS, we do
> > > call startup_cb_wrapper in CreateInitDecodingContext which is called
> > > from both CreateReplicationSlot and create_logical_replication_slot
> > > before the start of decoding.  In CreateInitDecodingContext, we call
> > > StartupDecodingContext which should load the plugin.
> >
> > Yeah, you are right that we do call startup_cb_wrapper from
> > CreateInitDecodingContext as well.  I think I got confused by below
> > comment in patch 0007
> >
> > @@ -1016,6 +1016,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
> > WalSndPrepareWrite, WalSndWriteData,
> > WalSndUpdateProgress);
> > + /*
> > + * Make sure streaming is disabled here - we may have the methods,
> > + * but we don't have anywhere to send the data yet.
> > + */
> > + ctx->streaming = false;
> > +
> >
> > Basically, during CreateReplicationSlot we forcefully disable the
> > streaming with the comment "we don't have anywhere to send the data
> > yet".  So my point is during CreateReplicationSlot time the streaming
> > will always be off and once we are done with creating the slot we will
> > be having consistent snapshot.  So my point is can we just check that
> > while decoding unless the current LSN reaches the start_decoding_at
> > point we should not start streaming and after that we can start.  At
> > that time we can have an assert that the snapshot should be
> > CONSISTENT.  However, before doing that I need to check on this point
> > that why after creating slot we are setting ctx->streaming to false.
> >
>
> I think you can refer to commit message as well for that "We however
> must explicitly disable streaming replication during replication slot
> creation, even if the plugin supports it. We don't need to replicate
> the changes accumulated during this phase, and moreover, we don't have
> a replication connection open so we don't have where to send the data
> anyway.".  I don't think this is a good way to hack the streaming flag
> because for SQL API's, we don't have a good reason to disable the
> streaming in this way.  I guess if we had a condition related to
> reaching CONSISTENT snapshot during streaming then we won't need to
> hack the streaming flag in this way.  Once we reach the CONSISTENT
> snapshot state, we come out of the creation of a replication slot (see
> how we use DecodingContextReady to achieve that) phase.  So, I feel we
> should remove the ctx->streaming setting to false and add a CONSISTENT
> snapshot check during streaming unless you have a reason for not doing
> so.

I was worried about the point that streaming on/off is sent by the
subscriber on START REPLICATION, not on CREATE REPLICATION SLOT, so if
we keep streaming on during create then it may not be right.  But I
agree with your point that it is better to avoid streaming during slot
creation via a CONSISTENT snapshot check instead of disabling it this
way.  And anyway, as soon as we reach the consistent snapshot we will
stop processing further records, so we will not attempt to stream
during slot creation.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Mon, Jul 13, 2020 at 2:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > I think you can refer to commit message as well for that "We however
> > must explicitly disable streaming replication during replication slot
> > creation, even if the plugin supports it. We don't need to replicate
> > the changes accumulated during this phase, and moreover, we don't have
> > a replication connection open so we don't have where to send the data
> > anyway.".  I don't think this is a good way to hack the streaming flag
> > because for SQL API's, we don't have a good reason to disable the
> > streaming in this way.  I guess if we had a condition related to
> > reaching CONSISTENT snapshot during streaming then we won't need to
> > hack the streaming flag in this way.  Once we reach the CONSISTENT
> > snapshot state, we come out of the creation of a replication slot (see
> > how we use DecodingContextReady to achieve that) phase.  So, I feel we
> > should remove the ctx->streaming setting to false and add a CONSISTENT
> > snapshot check during streaming unless you have a reason for not doing
> > so.
>
> I was worried about the point that streaming on/off is sent by the
> subscriber on START REPLICATION, not on CREATE REPLICATION SLOT, so if
> we keep streaming on during create then it may not be right.
>

Then, how is that used on the publisher-side?  AFAICS, the streaming
is enabled based on whether streaming callbacks are provided and we do
that in 0003-Extend-the-logical-decoding-output-plugin-API-wi patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Mon, Jul 13, 2020 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jul 13, 2020 at 2:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > I think you can refer to commit message as well for that "We however
> > > must explicitly disable streaming replication during replication slot
> > > creation, even if the plugin supports it. We don't need to replicate
> > > the changes accumulated during this phase, and moreover, we don't have
> > > a replication connection open so we don't have where to send the data
> > > anyway.".  I don't think this is a good way to hack the streaming flag
> > > because for SQL API's, we don't have a good reason to disable the
> > > streaming in this way.  I guess if we had a condition related to
> > > reaching CONSISTENT snapshot during streaming then we won't need to
> > > hack the streaming flag in this way.  Once we reach the CONSISTENT
> > > snapshot state, we come out of the creation of a replication slot (see
> > > how we use DecodingContextReady to achieve that) phase.  So, I feel we
> > > should remove the ctx->streaming setting to false and add a CONSISTENT
> > > snapshot check during streaming unless you have a reason for not doing
> > > so.
> >
> > I was worried about the point that streaming on/off is sent by the
> > subscriber on START REPLICATION, not on CREATE REPLICATION SLOT, so if
> > we keep streaming on during create then it may not be right.
> >
>
> Then, how is that used on the publisher-side?  AFAICS, the streaming
> is enabled based on whether streaming callbacks are provided and we do
> that in 0003-Extend-the-logical-decoding-output-plugin-API-wi patch.

Basically, first, we enable based on whether we have the callbacks or
not but later once we get the START REPLICATION command from the
subscriber then we set it to false if the streaming is not enabled
from the subscriber side.  You can refer below code in patch 0007.

pgoutput_startup
{
parse_output_parameters(ctx->output_plugin_options,
&data->protocol_version,
- &data->publication_names);
+ &data->publication_names,
+ &enable_streaming);
/* Check if we support requested protocol */
if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx,
OutputPluginOptions *opt,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("publication_names parameter missing")));
+ /*
+ * Decide whether to enable streaming. It is disabled by default, in
+ * which case we just update the flag in decoding context. Otherwise
+ * we only allow it with sufficient version of the protocol, and when
+ * the output plugin supports it.
+ */
+ if (!enable_streaming)
+ ctx->streaming = false;
}
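For completeness, the matching option handling in parse_output_parameters
could look roughly like this (the option name "streaming" and the local
variable names here are illustrative, not taken from the posted patch):

	else if (strcmp(defel->defname, "streaming") == 0)
	{
		if (streaming_given)
			ereport(ERROR,
					(errcode(ERRCODE_SYNTAX_ERROR),
					 errmsg("conflicting or redundant options")));
		streaming_given = true;

		/* the subscriber requests streaming of in-progress transactions */
		*enable_streaming = defGetBoolean(defel);
	}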

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Mon, Jul 13, 2020 at 3:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jul 13, 2020 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jul 13, 2020 at 2:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > >
> > > > I think you can refer to commit message as well for that "We however
> > > > must explicitly disable streaming replication during replication slot
> > > > creation, even if the plugin supports it. We don't need to replicate
> > > > the changes accumulated during this phase, and moreover, we don't have
> > > > a replication connection open so we don't have where to send the data
> > > > anyway.".  I don't think this is a good way to hack the streaming flag
> > > > because for SQL API's, we don't have a good reason to disable the
> > > > streaming in this way.  I guess if we had a condition related to
> > > > reaching CONSISTENT snapshot during streaming then we won't need to
> > > > hack the streaming flag in this way.  Once we reach the CONSISTENT
> > > > snapshot state, we come out of the creation of a replication slot (see
> > > > how we use DecodingContextReady to achieve that) phase.  So, I feel we
> > > > should remove the ctx->streaming setting to false and add a CONSISTENT
> > > > snapshot check during streaming unless you have a reason for not doing
> > > > so.
> > >
> > > I was worried about the point that streaming on/off is sent by the
> > > subscriber on START REPLICATION, not on CREATE REPLICATION SLOT, so if
> > > we keep streaming on during create then it may not be right.
> > >
> >
> > Then, how is that used on the publisher-side?  AFAICS, the streaming
> > is enabled based on whether streaming callbacks are provided and we do
> > that in 0003-Extend-the-logical-decoding-output-plugin-API-wi patch.
>
> Basically, first, we enable based on whether we have the callbacks or
> not but later once we get the START REPLICATION command from the
> subscriber then we set it to false if the streaming is not enabled
> from the subscriber side.  You can refer below code in patch 0007.
>
> pgoutput_startup
> {
> parse_output_parameters(ctx->output_plugin_options,
> &data->protocol_version,
> - &data->publication_names);
> + &data->publication_names,
> + &enable_streaming);
> /* Check if we support requested protocol */
> if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
> @@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx,
> OutputPluginOptions *opt,
> (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> errmsg("publication_names parameter missing")));
> + /*
> + * Decide whether to enable streaming. It is disabled by default, in
> + * which case we just update the flag in decoding context. Otherwise
> + * we only allow it with sufficient version of the protocol, and when
> + * the output plugin supports it.
> + */
> + if (!enable_streaming)
> + ctx->streaming = false;
> }
>

Okay, in that case, we can do both enable and disable streaming in
this function itself rather than allow the caller to later modify it.
I suggest similarly we can enable/disable it for SQL API in
pg_decode_startup via output_plugin_options.  This way it will look
consistent for both SQL APIs and for command-based replication.  If we
can do so, then probably adding an Assert for Consistent Snapshot
while performing streaming should be okay.
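A minimal sketch of the SQL-API side of this, for test_decoding's
pg_decode_startup (the option name "stream-changes" is purely
illustrative):

	bool		enable_streaming = false;
	ListCell   *option;

	foreach(option, ctx->output_plugin_options)
	{
		DefElem    *elem = (DefElem *) lfirst(option);

		/* a hypothetical option to request streaming from the SQL API */
		if (strcmp(elem->defname, "stream-changes") == 0)
			enable_streaming = defGetBoolean(elem);
	}

	/* stream only if both the plugin and the caller asked for it */
	ctx->streaming &= enable_streaming;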

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Mon, Jul 13, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jul 13, 2020 at 3:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jul 13, 2020 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Jul 13, 2020 at 2:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > >
> > > > > I think you can refer to commit message as well for that "We however
> > > > > must explicitly disable streaming replication during replication slot
> > > > > creation, even if the plugin supports it. We don't need to replicate
> > > > > the changes accumulated during this phase, and moreover, we don't have
> > > > > a replication connection open so we don't have where to send the data
> > > > > anyway.".  I don't think this is a good way to hack the streaming flag
> > > > > because for SQL API's, we don't have a good reason to disable the
> > > > > streaming in this way.  I guess if we had a condition related to
> > > > > reaching CONSISTENT snapshot during streaming then we won't need to
> > > > > hack the streaming flag in this way.  Once we reach the CONSISTENT
> > > > > snapshot state, we come out of the creation of a replication slot (see
> > > > > how we use DecodingContextReady to achieve that) phase.  So, I feel we
> > > > > should remove the ctx->streaming setting to false and add a CONSISTENT
> > > > > snapshot check during streaming unless you have a reason for not doing
> > > > > so.
> > > >
> > > > I was worried about the point that streaming on/off is sent by the
> > > > subscriber on START REPLICATION, not on CREATE REPLICATION SLOT, so if
> > > > we keep streaming on during create then it may not be right.
> > > >
> > >
> > > Then, how is that used on the publisher-side?  AFAICS, the streaming
> > > is enabled based on whether streaming callbacks are provided and we do
> > > that in 0003-Extend-the-logical-decoding-output-plugin-API-wi patch.
> >
> > Basically, first, we enable based on whether we have the callbacks or
> > not but later once we get the START REPLICATION command from the
> > subscriber then we set it to false if the streaming is not enabled
> > from the subscriber side.  You can refer below code in patch 0007.
> >
> > pgoutput_startup
> > {
> > parse_output_parameters(ctx->output_plugin_options,
> > &data->protocol_version,
> > - &data->publication_names);
> > + &data->publication_names,
> > + &enable_streaming);
> > /* Check if we support requested protocol */
> > if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
> > @@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx,
> > OutputPluginOptions *opt,
> > (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > errmsg("publication_names parameter missing")));
> > + /*
> > + * Decide whether to enable streaming. It is disabled by default, in
> > + * which case we just update the flag in decoding context. Otherwise
> > + * we only allow it with sufficient version of the protocol, and when
> > + * the output plugin supports it.
> > + */
> > + if (!enable_streaming)
> > + ctx->streaming = false;
> > }
> >
>
> Okay, in that case, we can do both enable and disable streaming in
> this function itself rather than allow the caller to later modify it.
> I suggest similarly we can enable/disable it for SQL API in
> pg_decode_startup via output_plugin_options.  This way it will look
> consistent for both SQL APIs and for command-based replication.  If we
> can do so, then probably adding an Assert for Consistent Snapshot
> while performing streaming should be okay.

Sounds good to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Mon, Jul 13, 2020 at 4:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jul 13, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Okay, in that case, we can do both enable and disable streaming in
> > this function itself rather than allow the caller to later modify it.
> > I suggest similarly we can enable/disable it for SQL API in
> > pg_decode_startup via output_plugin_options.  This way it will look
> > consistent for both SQL APIs and for command-based replication.  If we
> > can do so, then probably adding an Assert for Consistent Snapshot
> > while performing streaming should be okay.
>
> Sounds good to me.
>

Please find the latest patches.  I have made changes only in the
subscriber-side patches (0007 and 0008 as per the current patch-set).
The main changes are:
1. As discussed above, removed the SendFeedback call from apply_handle_stream_commit.
2. In SharedFilesetInit, ensured the callback is registered only once.
3. In stream_open_file, slightly changed the handling around MemoryContexts.
4. Merged the subscriber-side patches.
5. Added/Edited comments in 0007 and 0008.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Tue, Jul 14, 2020 at 5:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jul 13, 2020 at 4:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jul 13, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Okay, in that case, we can do both enable and disable streaming in
> > > this function itself rather than allow the caller to later modify it.
> > > I suggest similarly we can enable/disable it for SQL API in
> > > pg_decode_startup via output_plugin_options.  This way it will look
> > > consistent for both SQL APIs and for command-based replication.  If we
> > > can do so, then probably adding an Assert for Consistent Snapshot
> > > while performing streaming should be okay.
> >
> > Sounds good to me.
> >
>
> Please find the latest patches.  I have made changes only in the
> subscriber-side patches (0007 and 0008 as per the current patch-set).
> The main changes are:
> 1. As discussed above, remove SendFeedback call from apply_handle_stream_commit
> 2. In SharedFilesetInit, ensure to register callback once
> 3. In stream_open_file, change slight handling around MemoryContexts
> 4. Merged the subscriber-side patches.
> 5. Added/Edited comments in 0007 and 0008.

I have reviewed your changes and they look good to me.  Please find
the latest version of the patch set.  The major changes:
- Fixed a couple of review comments suggested upthread in 0003 and 0005.
- Handled the case of not streaming until we reach the
start_decoding_at LSN in 0005.
- Simplified 0006 by avoiding sending transactions with incomplete
changes and added a comment atop ReorderBufferLargestTopTXN.
- Moved 0010 to 0007 and handled the pending comments in the same.
- In 0009, fixed a couple of the defects mentioned above, plus one
additional defect: if we do ALTER SUBSCRIPTION to turn streaming
off/on, it was not working.
- In 0009, sending the origin id.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Jul 13, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jul 13, 2020 at 3:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jul 13, 2020 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Jul 13, 2020 at 2:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > >
> > > > > I think you can refer to commit message as well for that "We however
> > > > > must explicitly disable streaming replication during replication slot
> > > > > creation, even if the plugin supports it. We don't need to replicate
> > > > > the changes accumulated during this phase, and moreover, we don't have
> > > > > a replication connection open so we don't have where to send the data
> > > > > anyway.".  I don't think this is a good way to hack the streaming flag
> > > > > because for SQL API's, we don't have a good reason to disable the
> > > > > streaming in this way.  I guess if we had a condition related to
> > > > > reaching CONSISTENT snapshot during streaming then we won't need to
> > > > > hack the streaming flag in this way.  Once we reach the CONSISTENT
> > > > > snapshot state, we come out of the creation of a replication slot (see
> > > > > how we use DecodingContextReady to achieve that) phase.  So, I feel we
> > > > > should remove the ctx->streaming setting to false and add a CONSISTENT
> > > > > snapshot check during streaming unless you have a reason for not doing
> > > > > so.
> > > >
> > > > I was worried about the point that streaming on/off is sent by the
> > > > subscriber on START REPLICATION, not on CREATE REPLICATION SLOT, so if
> > > > we keep streaming on during create then it may not be right.
> > > >
> > >
> > > Then, how is that used on the publisher-side?  AFAICS, the streaming
> > > is enabled based on whether streaming callbacks are provided and we do
> > > that in 0003-Extend-the-logical-decoding-output-plugin-API-wi patch.
> >
> > Basically, first, we enable based on whether we have the callbacks or
> > not but later once we get the START REPLICATION command from the
> > subscriber then we set it to false if the streaming is not enabled
> > from the subscriber side.  You can refer below code in patch 0007.
> >
> > pgoutput_startup
> > {
> > parse_output_parameters(ctx->output_plugin_options,
> > &data->protocol_version,
> > - &data->publication_names);
> > + &data->publication_names,
> > + &enable_streaming);
> > /* Check if we support requested protocol */
> > if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
> > @@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx,
> > OutputPluginOptions *opt,
> > (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > errmsg("publication_names parameter missing")));
> > + /*
> > + * Decide whether to enable streaming. It is disabled by default, in
> > + * which case we just update the flag in decoding context. Otherwise
> > + * we only allow it with sufficient version of the protocol, and when
> > + * the output plugin supports it.
> > + */
> > + if (!enable_streaming)
> > + ctx->streaming = false;
> > }
> >
>
> Okay, in that case, we can do both enable and disable streaming in
> this function itself rather than allow the caller to later modify it.
> I suggest similarly we can enable/disable it for SQL API in
> pg_decode_startup via output_plugin_options.  This way it will look
> consistent for both SQL APIs and for command-based replication.  If we
> can do so, then probably adding an Assert for Consistent Snapshot
> while performing streaming should be okay.

Done this way in the latest patch set.
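For reference, a minimal sketch of what that looks like on the SQL API side
(the option name and local variable are illustrative; the actual patch may
spell them differently):

/* in test_decoding's pg_decode_startup(), inside the loop over
 * ctx->output_plugin_options */
else if (strcmp(elem->defname, "stream-changes") == 0)
{
    if (elem->arg == NULL)
        enable_streaming = true;
    else if (!parse_bool(strVal(elem->arg), &enable_streaming))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("could not parse value \"%s\" for parameter \"%s\"",
                        strVal(elem->arg), elem->defname)));
}

/* after the loop: streaming is allowed only if the plugin has the stream
 * callbacks (ctx->streaming was preset based on that) AND the consumer
 * asked for it */
ctx->streaming &= enable_streaming;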

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Mon, Jul 13, 2020 at 10:47 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sun, Jul 12, 2020 at 9:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jul 6, 2020 at 11:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > >
> > > > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > >
> > > > >
> > > > > > 9.
> > > > > > +ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > > > > > {
> > > > > > ..
> > > > > > + ReorderBufferToastReset(rb, txn);
> > > > > > + if (specinsert != NULL)
> > > > > > + ReorderBufferReturnChange(rb, specinsert);
> > > > > > ..
> > > > > > }
> > > > > >
> > > > > > Why do we need to do these here when we wouldn't have been done for
> > > > > > any exception other than ERRCODE_TRANSACTION_ROLLBACK?
> > > > >
> > > > > Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK"
> > > > > gracefully and we are continuing with further decoding so we need to
> > > > > return this change back.
> > > > >
> > > >
> > > > Okay, then I suggest we should do these before calling stream_stop and
> > > > also move ReorderBufferResetTXN after calling stream_stop  to follow a
> > > > pattern similar to try block unless there is a reason for not doing
> > > > so.  Also, it would be good if we can initialize specinsert with NULL
> > > > after returning the change as we are doing at other places.
> > >
> > > Okay
> > >
> > > > > > 10.  I have got the below failure once.  I have not investigated this
> > > > > > in detail as the patch is still under progress.  See, if you have any
> > > > > > idea?
> > > > > > #   Failed test 'check extra columns contain local defaults'
> > > > > > #   at t/013_stream_subxact_ddl_abort.pl line 81.
> > > > > > #          got: '2|0'
> > > > > > #     expected: '1000|500'
> > > > > > # Looks like you failed 1 test of 2.
> > > > > > make[2]: *** [check] Error 1
> > > > > > make[1]: *** [check-subscription-recurse] Error 2
> > > > > > make[1]: *** Waiting for unfinished jobs....
> > > > > > make: *** [check-world-src/test-recurse] Error 2
> > > > >
> > > > > Even I got the failure once and after that, it did not reproduce.  I
> > > > > have executed it multiple time but it did not reproduce again.  Are
> > > > > you able to reproduce it consistently?
> > > > >
> > > >
> > > > No, I am also not able to reproduce it consistently but I think this
> > > > can fail if a subscriber sends the replay_location before actually
> > > > replaying the changes.  First, I thought that extra send_feedback we
> > > > have in apply_handle_stream_commit might have caused this but I guess
> > > > that can't happen because we need the commit time location for that
> > > > and we are storing the same at the end of apply_handle_stream_commit
> > > > after applying all messages.  I am not sure what is going on here.  I
> > > > think we somehow need to reproduce this or some variant of this test
> > > > consistently to find the root cause.
> > >
> > > And I think it appeared for the first time for me, so some change in
> > > the last few versions might have exposed it.  I have noticed that
> > > almost 50% of the time I am able to reproduce it after a clean build,
> > > so I can trace back from which version it started appearing; that way
> > > it will be easy to narrow it down.
> >
> > I think the reason for the failure is that we are not setting
> > remote_final_lsn in the streaming mode.  I added multiple logs and ran
> > the test, and from the logs it appeared that some of the logical WAL
> > did not get replayed due to the below check in
> > should_apply_changes_for_rel:
> > return (rel->state == SUBREL_STATE_READY || (rel->state ==
> > SUBREL_STATE_SYNCDONE && rel->statelsn <= remote_final_lsn));
> >
> > I still need to do a detailed analysis of why this fails only in some
> > cases.  Basically, most of the time rel->state is SUBREL_STATE_READY
> > so this check passes, but whenever the state is SUBREL_STATE_SYNCDONE
> > it fails because we never update remote_final_lsn.  I will try to set
> > this value in apply_handle_stream_commit and see whether it ever fails
> > or not.
>
> I have verified that after setting the remote_final_lsn in the
> apply_handle_stream_commit, I don't see that regression failure in
> over 70 runs whereas without that change it failed 6 times in 50 runs.
> Apart from this, I have noticed one more thing related to the same
> point.  Basically, in the apply_handle_commit, we are calling
> process_syncing_tables whereas we are not calling the same in
> apply_handle_stream_commit.

I have set the remote_final_lsn as well as called
process_syncing_tables, in apply_handle_stream_commit.  Please see the
latest patch set v33.
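For readers following along, a rough sketch of the shape of that fix in the
apply worker; the reader/spool helper names are assumptions for illustration,
not necessarily what the patch uses:

static void
apply_handle_stream_commit(StringInfo s)
{
    TransactionId xid;
    LogicalRepCommitData commit_data;

    xid = logicalrep_read_stream_commit(s, &commit_data); /* assumed reader */

    /*
     * Record the commit LSN before replaying the spooled changes so that
     * should_apply_changes_for_rel() can compare rel->statelsn against it,
     * exactly as apply_handle_commit() does.
     */
    remote_final_lsn = commit_data.commit_lsn;

    /* ... replay the changes spooled for this streamed transaction ... */

    /* ... commit the local transaction, advance the replication origin ... */

    /* Let table synchronization workers make progress, as in apply_handle_commit(). */
    process_syncing_tables(commit_data.end_lsn);
}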

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Ajin Cherian
Date:


On Wed, Jul 15, 2020 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
  Please see the
latest patch set v33.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



I have a minor comment. You've defined a new function ReorderBufferStartStreaming(), but the function doesn't actually start streaming; it is used to find out whether you can start streaming, and it returns a boolean. Can't you name it accordingly?
Probably ReorderBufferCanStartStreaming(). I understand that it internally calls ReorderBufferCanStream(), which is similar sounding, but I think that should not matter.

regards,
Ajin Cherian
Fujitsu Australia

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Wed, Jul 15, 2020 at 4:51 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Wed, Jul 15, 2020 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>>   Please see the
>> latest patch set v33.
>>
>>
>>
>
> I have a minor comment. You've defined a new function ReorderBufferStartStreaming(), but the function doesn't
> actually start streaming; it is used to find out whether you can start streaming, and it returns a boolean.
> Can't you name it accordingly?
> Probably ReorderBufferCanStartStreaming(). I understand that it internally calls ReorderBufferCanStream(), which
> is similar sounding, but I think that should not matter.
>

+1.  I am actually editing some of the patches and I have already
named it as you are suggesting.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Wed, Jul 15, 2020 at 9:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>
> I have reviewed your changes and those look good to me,  please find
> the latest version of the patch set.
>

I have done an additional round of review and below are the changes I
made in the attached patch-set.
1. Changed comments in 0002.
2. In 0005, apart from changing a few comments and function name, I
have changed below code:
+ if (ReorderBufferCanStream(rb) &&
+ !SnapBuildXactNeedsSkip(builder, ctx->reader->ReadRecPtr))
Here, I think it is better to compare it with EndRecPtr.  I feel that in
a boundary case the next record could be the same as start_decoding_at,
so why avoid streaming in that case?
3. In 0006, made below changes:
    a. Removed function ReorderBufferFreeChange and added a new
parameter in ReorderBufferReturnChange to achieve the same purpose.
    b. Changed quite a few comments, function names, added additional
Asserts, and few other cosmetic changes.
4. In 0007, made below changes:
    a. Removed the unnecessary change in .gitignore
    b. Changed the newly added option name to "stream-change".

Apart from the above, I have merged patches 0004, 0005, 0006 and 0007 as
those seem like one piece of functionality to me.  For the sake of review, the
patch-set that contains merged patches is attached separately as
v34-combined.

Let me know what you think of the changes?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jul 15, 2020 at 9:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> >
> > I have reviewed your changes and those look good to me,  please find
> > the latest version of the patch set.
> >
>
> I have done an additional round of review and below are the changes I
> made in the attached patch-set.
> 1. Changed comments in 0002.
> 2. In 0005, apart from changing a few comments and function name, I
> have changed below code:
> + if (ReorderBufferCanStream(rb) &&
> + !SnapBuildXactNeedsSkip(builder, ctx->reader->ReadRecPtr))
> Here, I think it is better to compare it with EndRecPtr.  I feel in
> boundary case the next record could be the same as start_decoding_at,
> so why to avoid streaming in that case?

Makes sense to me.

> 3. In 0006, made below changes:
>     a. Removed function ReorderBufferFreeChange and added a new
> parameter in ReorderBufferReturnChange to achieve the same purpose.
>     b. Changed quite a few comments, function names, added additional
> Asserts, and few other cosmetic changes.
> 4. In 0007, made below changes:
>     a. Removed the unnecessary change in .gitignore
>     b. Changed the newly added option name to "stream-change".
>
> Apart from above, I have merged patches 0004, 0005, 0006 and 0007 as
> those seems one functionality to me.  For the sake of review, the
> patch-set that contains merged patches is attached separately as
> v34-combined.
>
> Let me know what you think of the changes?

I have reviewed the changes and looks fine to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Thu, Jul 16, 2020 at 12:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Let me know what you think of the changes?
>
> I have reviewed the changes and looks fine to me.
>

Thanks, I am planning to start committing a few of the infrastructure
patches (especially first two) by early next week as we have resolved
all the open issues and done an extensive review of the entire
patch-set.  In the attached version, there is a slight change in one
of the commit messages as compared to the previous version.  I would
like to describe in brief the first two patches for the sake of
convenience.  Let me know if you or anyone else sees any problems with
these.

The first patch in the series allows us to WAL-log subtransaction and
top-level XID association.  The logical decoding infrastructure needs
to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.  So we also write the assignment info into WAL
immediately, as part of the next WAL record (to minimize overhead)
only when *wal_level=logical*.  We can not remove the existing
XLOG_XACT_ASSIGNMENT WAL as that is required for avoiding overflow in
the hot standby snapshot.

The second patch writes WAL for invalidations at command end with
wal_level=logical.  When wal_level=logical, write invalidations at
command end into WAL so that decoding can use this information.  This
patch is required to allow the streaming of in-progress transactions
in logical decoding.  We still add the invalidations to the cache and
write them to WAL at commit time in RecordTransactionCommit(). This
uses the existing XLOG_INVALIDATIONS xlog record type, from the
RM_STANDBY_ID resource manager (see LogStandbyInvalidations for
details).  So existing code relying on those invalidations (e.g. redo)
does not need to be changed. The invalidations written at command end
uses a new xlog record type XLOG_XACT_INVALIDATIONS, from RM_XACT_ID
resource manager. See LogLogicalInvalidations for details.  These new
xlog records are ignored by existing redo procedures, which still rely
on the invalidations written to commit records.  The invalidations are
decoded and accumulated in top-transaction, and then executed during
replay.  This obviates the need to decode the invalidations as part of
a commit record.

The performance testing has shown that there is no performance penalty
with either of the patches but there is some additional WAL which in
most cases is 2-5% but in worst cases and for some specific DDL's it
is up to 15% with the second patch, however, that happens at
wal_level=logical only.  We have considered an alternative to blow up
all caches on any DDL in WALSenders and that will have both CPU and
network overhead.  For detailed results and analysis see [1][2].

[1] - https://www.postgresql.org/message-id/CAKYtNAqWkPpPFrdEbpPrCan3G_QAcankZarRKKd7cj6vQigM7w%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1L3PoiBw6uogB7jD5rmdT-GmEF4kOEccS1AWKuBhSkQkQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Thu, Jul 16, 2020 at 4:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 16, 2020 at 12:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Let me know what you think of the changes?
> >
> > I have reviewed the changes and looks fine to me.
> >
>
> Thanks, I am planning to start committing a few of the infrastructure
> patches (especially first two) by early next week as we have resolved
> all the open issues and done an extensive review of the entire
> patch-set.  In the attached version, there is a slight change in one
> of the commit messages as compared to the previous version.  I would
> like to describe in brief the first two patches for the sake of
> convenience.  Let me know if you or anyone else sees any problems with
> these.
>
> The first patch in the series allows us to WAL-log subtransaction and
> top-level XID association.  The logical decoding infrastructure needs
> to know which top-level
> transaction the subxact belongs to, in order to decode all the
> changes. Until now that might be delayed until commit, due to the
> caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring
> incremental decoding.  So we also write the assignment info into WAL
> immediately, as part of the next WAL record (to minimize overhead)
> only when *wal_level=logical*.  We can not remove the existing
> XLOG_XACT_ASSIGNMENT WAL as that is required for avoiding overflow in
> the hot standby snapshot.
>
> The second patch writes WAL for invalidations at command end with
> wal_level=logical.  When wal_level=logical, write invalidations at
> command end into WAL so that decoding can use this information.  This
> patch is required to allow the streaming of in-progress transactions
> in logical decoding.  We still add the invalidations to the cache and
> write them to WAL at commit time in RecordTransactionCommit(). This
> uses the existing XLOG_INVALIDATIONS xlog record type, from the
> RM_STANDBY_ID resource manager (see LogStandbyInvalidations for
> details).  So existing code relying on those invalidations (e.g. redo)
> does not need to be changed. The invalidations written at command end
> uses a new xlog record type XLOG_XACT_INVALIDATIONS, from RM_XACT_ID
> resource manager. See LogLogicalInvalidations for details.  These new
> xlog records are ignored by existing redo procedures, which still rely
> on the invalidations written to commit records.  The invalidations are
> decoded and accumulated in top-transaction, and then executed during
> replay.  This obviates the need to decode the invalidations as part of
> a commit record.
>
> The performance testing has shown that there is no performance penalty
> with either of the patches but there is some additional WAL which in
> most cases is 2-5% but in worst cases and for some specific DDL's it
> is up to 15% with the second patch, however, that happens at
> wal_level=logical only.  We have considered an alternative to blow up
> all caches on any DDL in WALSenders and that will have both CPU and
> network overhead.  For detailed results and analysis see [1][2].
>
> [1] - https://www.postgresql.org/message-id/CAKYtNAqWkPpPFrdEbpPrCan3G_QAcankZarRKKd7cj6vQigM7w%40mail.gmail.com
> [2] - https://www.postgresql.org/message-id/CAA4eK1L3PoiBw6uogB7jD5rmdT-GmEF4kOEccS1AWKuBhSkQkQ%40mail.gmail.com
>

The patch set required to rebase after committing the binary format
option support in the create subscription command.  I have rebased the
patch set on the latest head and also added a test case to test
streaming in binary format.
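For illustration, a subscription exercising both options together could be
created roughly like this (connection string and names made up):

CREATE SUBSCRIPTION tap_sub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION tap_pub
    WITH (binary = on, streaming = on);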

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Mon, Jul 20, 2020 at 12:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Jul 16, 2020 at 4:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Jul 16, 2020 at 12:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > >
> > > > Let me know what you think of the changes?
> > >
> > > I have reviewed the changes and looks fine to me.
> > >
> >
> > Thanks, I am planning to start committing a few of the infrastructure
> > patches (especially first two) by early next week as we have resolved
> > all the open issues and done an extensive review of the entire
> > patch-set.  In the attached version, there is a slight change in one
> > of the commit messages as compared to the previous version.  I would
> > like to describe in brief the first two patches for the sake of
> > convenience.  Let me know if you or anyone else sees any problems with
> > these.
> >
> > The first patch in the series allows us to WAL-log subtransaction and
> > top-level XID association.  The logical decoding infrastructure needs
> > to know which top-level
> > transaction the subxact belongs to, in order to decode all the
> > changes. Until now that might be delayed until commit, due to the
> > caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring
> > incremental decoding.  So we also write the assignment info into WAL
> > immediately, as part of the next WAL record (to minimize overhead)
> > only when *wal_level=logical*.  We can not remove the existing
> > XLOG_XACT_ASSIGNMENT WAL as that is required for avoiding overflow in
> > the hot standby snapshot.
> >

Pushed, this patch.

> >
>
> The patch set required to rebase after committing the binary format
> option support in the create subscription command.  I have rebased the
> patch set on the latest head and also added a test case to test
> streaming in binary format.
>

While going through commit 9de77b5453, I noticed below change:

@@ -424,6 +424,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
        PQfreemem(pubnames_literal);
        pfree(pubnames_str);

+       if (options->proto.logical.binary &&
+           PQserverVersion(conn->streamConn) >= 140000)
+           appendStringInfoString(&cmd, ", binary 'true'");
+

Now, the similar change in this patch series is as below:

@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
  appendStringInfo(&cmd, "proto_version '%u'",
  options->proto.logical.proto_version);

+ if (options->proto.logical.streaming)
+ appendStringInfo(&cmd, ", streaming 'on'");
+

I think we also need a version check similar to commit 9de77b5453 to
ensure that we send the new option only when connected to a newer
version (>=14) primary server.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Mon, Jul 20, 2020 at 2:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jul 20, 2020 at 12:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Jul 16, 2020 at 4:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Jul 16, 2020 at 12:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > >
> > > > > Let me know what you think of the changes?
> > > >
> > > > I have reviewed the changes and looks fine to me.
> > > >
> > >
> > > Thanks, I am planning to start committing a few of the infrastructure
> > > patches (especially first two) by early next week as we have resolved
> > > all the open issues and done an extensive review of the entire
> > > patch-set.  In the attached version, there is a slight change in one
> > > of the commit messages as compared to the previous version.  I would
> > > like to describe in brief the first two patches for the sake of
> > > convenience.  Let me know if you or anyone else sees any problems with
> > > these.
> > >
> > > The first patch in the series allows us to WAL-log subtransaction and
> > > top-level XID association.  The logical decoding infrastructure needs
> > > to know which top-level
> > > transaction the subxact belongs to, in order to decode all the
> > > changes. Until now that might be delayed until commit, due to the
> > > caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring
> > > incremental decoding.  So we also write the assignment info into WAL
> > > immediately, as part of the next WAL record (to minimize overhead)
> > > only when *wal_level=logical*.  We can not remove the existing
> > > XLOG_XACT_ASSIGNMENT WAL as that is required for avoiding overflow in
> > > the hot standby snapshot.
> > >
>
> Pushed, this patch.
>
> > >
> >
> > The patch set required to rebase after committing the binary format
> > option support in the create subscription command.  I have rebased the
> > patch set on the latest head and also added a test case to test
> > streaming in binary format.
> >
>
> While going through commit 9de77b5453, I noticed below change:
>
> @@ -424,6 +424,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
>         PQfreemem(pubnames_literal);
>         pfree(pubnames_str);
>
> +       if (options->proto.logical.binary &&
> +           PQserverVersion(conn->streamConn) >= 140000)
> +           appendStringInfoString(&cmd, ", binary 'true'");
> +
>
> Now, the similar change in this patch series is as below:
>
> @@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
>   appendStringInfo(&cmd, "proto_version '%u'",
>   options->proto.logical.proto_version);
>
> + if (options->proto.logical.streaming)
> + appendStringInfo(&cmd, ", streaming 'on'");
> +
>
> I think we also need a version check similar to commit 9de77b5453 to
> ensure that we send the new option only when connected to a newer
> version (>=14) primary server.

I have changed that in the attached patch.
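So the guarded form ends up looking much like the binary-option precedent,
roughly:

if (options->proto.logical.streaming &&
    PQserverVersion(conn->streamConn) >= 140000)
    appendStringInfoString(&cmd, ", streaming 'on'");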

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Mon, Jul 20, 2020 at 4:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jul 20, 2020 at 2:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jul 20, 2020 at 12:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Thu, Jul 16, 2020 at 4:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Thu, Jul 16, 2020 at 12:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > >
> > > > > On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > >
> > > > > >
> > > > > > Let me know what you think of the changes?
> > > > >
> > > > > I have reviewed the changes and looks fine to me.
> > > > >
> > > >
> > > > Thanks, I am planning to start committing a few of the infrastructure
> > > > patches (especially first two) by early next week as we have resolved
> > > > all the open issues and done an extensive review of the entire
> > > > patch-set.  In the attached version, there is a slight change in one
> > > > of the commit messages as compared to the previous version.  I would
> > > > like to describe in brief the first two patches for the sake of
> > > > convenience.  Let me know if you or anyone else sees any problems with
> > > > these.
> > > >
> > > > The first patch in the series allows us to WAL-log subtransaction and
> > > > top-level XID association.  The logical decoding infrastructure needs
> > > > to know which top-level
> > > > transaction the subxact belongs to, in order to decode all the
> > > > changes. Until now that might be delayed until commit, due to the
> > > > caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring
> > > > incremental decoding.  So we also write the assignment info into WAL
> > > > immediately, as part of the next WAL record (to minimize overhead)
> > > > only when *wal_level=logical*.  We can not remove the existing
> > > > XLOG_XACT_ASSIGNMENT WAL as that is required for avoiding overflow in
> > > > the hot standby snapshot.
> > > >
> >
> > Pushed, this patch.
> >
> > > >
> > >
> > > The patch set required to rebase after committing the binary format
> > > option support in the create subscription command.  I have rebased the
> > > patch set on the latest head and also added a test case to test
> > > streaming in binary format.
> > >
> >
> > While going through commit 9de77b5453, I noticed below change:
> >
> > @@ -424,6 +424,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
> >         PQfreemem(pubnames_literal);
> >         pfree(pubnames_str);
> >
> > +       if (options->proto.logical.binary &&
> > +           PQserverVersion(conn->streamConn) >= 140000)
> > +           appendStringInfoString(&cmd, ", binary 'true'");
> > +
> >
> > Now, the similar change in this patch series is as below:
> >
> > @@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
> >   appendStringInfo(&cmd, "proto_version '%u'",
> >   options->proto.logical.proto_version);
> >
> > + if (options->proto.logical.streaming)
> > + appendStringInfo(&cmd, ", streaming 'on'");
> > +
> >
> > I think we also need a version check similar to commit 9de77b5453 to
> > ensure that we send the new option only when connected to a newer
> > version (>=14) primary server.
>
> I have changed that in the attached patch.

There was one warning in release mode in the last version in 0004 so
attaching a new version.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Ajin Cherian
Date:


On Mon, Jul 20, 2020 at 11:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:


There was one warning in release mode in the last version in 0004 so
attaching a new version.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Hello,

I have tried to rework the patch that adds stats for the streaming of logical replication, but now based on the new logical replication stats framework developed by Masahiko-san and rebased by Amit in [1]. This uses v38 of the streaming logical replication patch as well as v1 of the stats framework patch as its base. I will rebase this as the stats framework is updated. Let me know if you have any comments.

regards,
Ajin Cherian
Fujitsu Australia

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Mon, Jul 20, 2020 at 6:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> There was one warning in release mode in the last version in 0004 so
> attaching a new version.
>

Today, I was reviewing patch
v38-0001-WAL-Log-invalidations-at-command-end-with-wal_le and found a
small problem with it.

+ /*
+ * Execute the invalidations for xid-less transactions,
+ * otherwise, accumulate them so that they can be processed at
+ * the commit time.
+ */
+ if (!ctx->fast_forward)
+ {
+ if (TransactionIdIsValid(xid))
+ {
+ ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
+   invals->nmsgs, invals->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+   buf->origptr);
+ }

I think we need to call ReorderBufferXidSetCatalogChanges even when
ctx->fast_forward is true, because we are dependent on that flag for
the snapshot build (see SnapBuildCommitTxn).  We already do it that way
in DecodeCommit: even though we skip adding invalidations for
fast-forward cases, we do set the flag to indicate that this txn has
catalog changes.  Is there any reason to do things differently here?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Wed, Jul 22, 2020 at 9:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jul 20, 2020 at 6:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > There was one warning in release mode in the last version in 0004 so
> > attaching a new version.
> >
>
> Today, I was reviewing patch
> v38-0001-WAL-Log-invalidations-at-command-end-with-wal_le and found a
> small problem with it.
>
> + /*
> + * Execute the invalidations for xid-less transactions,
> + * otherwise, accumulate them so that they can be processed at
> + * the commit time.
> + */
> + if (!ctx->fast_forward)
> + {
> + if (TransactionIdIsValid(xid))
> + {
> + ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
> +   invals->nmsgs, invals->msgs);
> + ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
> +   buf->origptr);
> + }
>
> I think we need to set ReorderBufferXidSetCatalogChanges even when
> ctx->fast-forward is true because we are dependent on that flag for
> snapshot build (see SnapBuildCommitTxn).  We are already doing the
> same way in DecodeCommit where even though we skip adding
> invalidations for fast-forward cases but we do set the flag to
> indicate that this txn has catalog changes.  Is there any reason to do
> things differently here?

I think it is wrong; we should call ReorderBufferXidSetCatalogChanges
even if we are in fast-forward mode.
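Concretely, the corrected shape discussed here is roughly the following
(a sketch using the variable names from the quoted hunk):

if (TransactionIdIsValid(xid))
{
    /*
     * Mark the transaction as containing catalog changes even in
     * fast-forward mode; the snapshot builder depends on this flag
     * (see SnapBuildCommitTxn), just as DecodeCommit already does.
     */
    ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);

    /* Accumulate the invalidation messages only when actually decoding. */
    if (!ctx->fast_forward)
        ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
                                      invals->nmsgs, invals->msgs);
}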

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Wed, Jul 22, 2020 at 10:20 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jul 22, 2020 at 9:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jul 20, 2020 at 6:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > There was one warning in release mode in the last version in 0004 so
> > > attaching a new version.
> > >
> >
> > Today, I was reviewing patch
> > v38-0001-WAL-Log-invalidations-at-command-end-with-wal_le and found a
> > small problem with it.
> >
> > + /*
> > + * Execute the invalidations for xid-less transactions,
> > + * otherwise, accumulate them so that they can be processed at
> > + * the commit time.
> > + */
> > + if (!ctx->fast_forward)
> > + {
> > + if (TransactionIdIsValid(xid))
> > + {
> > + ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
> > +   invals->nmsgs, invals->msgs);
> > + ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
> > +   buf->origptr);
> > + }
> >
> > I think we need to set ReorderBufferXidSetCatalogChanges even when
> > ctx->fast-forward is true because we are dependent on that flag for
> > snapshot build (see SnapBuildCommitTxn).  We are already doing the
> > same way in DecodeCommit where even though we skip adding
> > invalidations for fast-forward cases but we do set the flag to
> > indicate that this txn has catalog changes.  Is there any reason to do
> > things differently here?
>
> I think it is wrong,  we should set the
> ReorderBufferXidSetCatalogChanges, even if it is in fast-forward mode.
>

Thanks for the change.  I have one more minor comment in the patch
0001-WAL-Log-invalidations-at-command-end-with-wal_le.

 /*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+ int nmsgs; /* number of shared inval msgs */
+ SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+} xl_xact_invalidations;

I see that we already have a structure xl_xact_invals in the code
which has the same members, so I think it is better to use that
instead of defining a new one.
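i.e., the logging routine can simply reuse the commit-record layout, roughly
(a sketch assuming the existing MinSizeOfXactInvals macro; the actual patch
may differ):

xl_xact_invals xlrec;

xlrec.nmsgs = nmsgs;

XLogBeginInsert();
XLogRegisterData((char *) &xlrec, MinSizeOfXactInvals);
XLogRegisterData((char *) msgs, nmsgs * sizeof(SharedInvalidationMessage));
XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);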

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Wed, Jul 22, 2020 at 4:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jul 22, 2020 at 10:20 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Jul 22, 2020 at 9:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Jul 20, 2020 at 6:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > There was one warning in release mode in the last version in 0004 so
> > > > attaching a new version.
> > > >
> > >
> > > Today, I was reviewing patch
> > > v38-0001-WAL-Log-invalidations-at-command-end-with-wal_le and found a
> > > small problem with it.
> > >
> > > + /*
> > > + * Execute the invalidations for xid-less transactions,
> > > + * otherwise, accumulate them so that they can be processed at
> > > + * the commit time.
> > > + */
> > > + if (!ctx->fast_forward)
> > > + {
> > > + if (TransactionIdIsValid(xid))
> > > + {
> > > + ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
> > > +   invals->nmsgs, invals->msgs);
> > > + ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
> > > +   buf->origptr);
> > > + }
> > >
> > > I think we need to set ReorderBufferXidSetCatalogChanges even when
> > > ctx->fast-forward is true because we are dependent on that flag for
> > > snapshot build (see SnapBuildCommitTxn).  We are already doing the
> > > same way in DecodeCommit where even though we skip adding
> > > invalidations for fast-forward cases but we do set the flag to
> > > indicate that this txn has catalog changes.  Is there any reason to do
> > > things differently here?
> >
> > I think it is wrong,  we should set the
> > ReorderBufferXidSetCatalogChanges, even if it is in fast-forward mode.
> >
>
> Thanks for the change.  I have one more minor comment in the patch
> 0001-WAL-Log-invalidations-at-command-end-with-wal_le.
>
>  /*
> + * Invalidations logged with wal_level=logical.
> + */
> +typedef struct xl_xact_invalidations
> +{
> + int nmsgs; /* number of shared inval msgs */
> + SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
> +} xl_xact_invalidations;
>
> I see that we already have a structure xl_xact_invals in the code
> which has the same members, so I think it is better to use that
> instead of defining a new one.

You are right.  I have changed it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Wed, Jul 22, 2020 at 4:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> You are right.  I have changed it.
>

Thanks, I have pushed the second patch in this series which is
0001-WAL-Log-invalidations-at-command-end-with-wal_le in your latest
patch.  I will continue working on remaining patches.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Thu, Jul 23, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jul 22, 2020 at 4:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > You are right.  I have changed it.
> >
>
> Thanks, I have pushed the second patch in this series which is
> 0001-WAL-Log-invalidations-at-command-end-with-wal_le in your latest
> patch.  I will continue working on remaining patches.
>

I have reviewed and made a number of changes in the next patch which
extends the logical decoding output plugin API with stream methods.
(v41-0001-Extend-the-logical-decoding-output-plugin-API-wi).

1. I think we need handling of include_xids and include_timestamp, but
not skip_empty_xacts, in the new APIs; as of now, none of these options
was respected.  We need 'include_xids' handling because we need to
include the xid with stream messages, and similarly 'include_timestamp'
for stream commit messages.  OTOH, I think we never use streaming mode
for empty xacts, so we don't need to bother about skip_empty_xacts in
the streaming APIs.
2. Then I made a number of changes in documentation, comments, and
other cosmetic changes.

Kindly review/test and let me know if you see any problems with the
above changes.
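For point 1, as an illustration of the kind of handling meant, a stream
callback in test_decoding can honour include_xids roughly like this (a
sketch, not the patch itself):

static void
pg_decode_stream_change(LogicalDecodingContext *ctx,
                        ReorderBufferTXN *txn,
                        Relation relation,
                        ReorderBufferChange *change)
{
    TestDecodingData *data = ctx->output_plugin_private;

    OutputPluginPrepareWrite(ctx, true);
    if (data->include_xids)
        appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
    else
        appendStringInfoString(ctx->out, "streaming change for transaction");
    OutputPluginWrite(ctx, true);
}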

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Fri, Jul 24, 2020 at 5:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 23, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Jul 22, 2020 at 4:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > You are right.  I have changed it.
> > >
> >
> > Thanks, I have pushed the second patch in this series which is
> > 0001-WAL-Log-invalidations-at-command-end-with-wal_le in your latest
> > patch.  I will continue working on remaining patches.
> >
>
> I have reviewed and made a number of changes in the next patch which
> extends the logical decoding output plugin API with stream methods.
> (v41-0001-Extend-the-logical-decoding-output-plugin-API-wi).
>
> 1. I think we need handling of include_xids and include_timestamp but
> not skip_empty_xacts in the new APIs, as of now, none of the options
> were respected.  We need 'include_xids' handling because we need to
> include xid with stream messages and similarly 'include_timestamp' for
> stream commit messages.  OTOH, I think we never use streaming mode for
> empty xacts, so we don't need to bother about skip_empty_xacts in
> streaming APIs.
> 2. Then I made a number of changes in documentation, comments, and
> other cosmetic changes.
>
> Kindly review/test and let me know if you see any problems with the
> above changes.

Your changes look fine to me.  Additionally, I have changed a test
case of getting the streaming changes in 0002.  Instead of just
showing the count, I am showing that the transaction is actually
streaming.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Fri, Jul 24, 2020 at 7:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Your changes look fine to me.  Additionally, I have changed a test
> case of getting the streaming changes in 0002.  Instead of just
> showing the count, I am showing that the transaction is actually
> streaming.
>

If you want to show the changes then there is no need to display 157
rows; probably a few (10-15) should be sufficient.  If we can do that
by increasing the size of the rows then good; otherwise, I think it is
better to retain the test that displays the count.

Today, I have again looked at the first patch
(v42-0001-Extend-the-logical-decoding-output-plugin-API-wi) and didn't
find any more problems with it, so I am planning to commit it unless
you or someone else wants to add more to it.   Just for ease of others,
"the next patch extends the logical decoding output plugin API with
stream methods".   It adds seven methods to the output plugin API,
adding support for streaming changes for large in-progress
transactions. The methods are stream_start, stream_stop, stream_abort,
stream_commit, stream_change, stream_message, and stream_truncate.
Most of this is a simple extension of the existing methods, with the
semantic difference that the transaction (or subtransaction) is
incomplete and may be aborted later (which is something the regular
API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these new
stream methods.  The stream_start/stream_stop callbacks are used to demarcate a
chunk of changes streamed for a particular toplevel transaction.

This commit simply adds these new APIs and the upcoming patch to
"allow the streaming mode in ReorderBuffer" will use these APIs.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Sat, Jul 25, 2020 at 5:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jul 24, 2020 at 7:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Your changes look fine to me.  Additionally, I have changed a test
> > case of getting the streaming changes in 0002.  Instead of just
> > showing the count, I am showing that the transaction is actually
> > streaming.
> >
>
> If you want to show the changes then there is no need to display 157
> rows probably a few (10-15) should be sufficient.  If we can do that
> by increasing the size of the row then good, otherwise, I think it is
> better to retain the test to display the count.

I think in existing test cases also we are displaying multiple lines,
e.g. toast.out is showing 235 rows.  But maybe I will try to reduce it
to a smaller number of rows.

> Today, I have again looked at the first patch
> (v42-0001-Extend-the-logical-decoding-output-plugin-API-wi) and didn't
> find any more problems with it so planning to commit the same unless
> you or someone else want to add more to it.   Just for ease of others,
> "the next patch extends the logical decoding output plugin API with
> stream methods".   It adds seven methods to the output plugin API,
> adding support for streaming changes for large in-progress
> transactions. The methods are stream_start, stream_stop, stream_abort,
> stream_commit, stream_change, stream_message, and stream_truncate.
> Most of this is a simple extension of the existing methods, with the
> semantic difference that the transaction (or subtransaction) is
> incomplete and may be aborted later (which is something the regular
> API does not really need to deal with).
>
> This also extends the 'test_decoding' plugin, implementing these new
> stream methods.  The stream_start/start_stop are used to demarcate a
> chunk of changes streamed for a particular toplevel transaction.
>
> This commit simply adds these new APIs and the upcoming patch to
> "allow the streaming mode in ReorderBuffer" will use these APIs.

LGTM

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Sun, Jul 26, 2020 at 11:04 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Jul 25, 2020 at 5:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Jul 24, 2020 at 7:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > Your changes look fine to me.  Additionally, I have changed a test
> > > case of getting the streaming changes in 0002.  Instead of just
> > > showing the count, I am showing that the transaction is actually
> > > streaming.
> > >
> >
> > If you want to show the changes then there is no need to display 157
> > rows probably a few (10-15) should be sufficient.  If we can do that
> > by increasing the size of the row then good, otherwise, I think it is
> > better to retain the test to display the count.
>
> I think in existing test cases also we are displaying multiple lines
> e.g. toast.out is showing 235 rows.  But maybe I will try to reduce it
> to the less number of rows.

Changed, now only 27 rows.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Sun, Jul 26, 2020 at 11:04 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > Today, I have again looked at the first patch
> > (v42-0001-Extend-the-logical-decoding-output-plugin-API-wi) and didn't
> > find any more problems with it so planning to commit the same unless
> > you or someone else want to add more to it.   Just for ease of others,
> > "the next patch extends the logical decoding output plugin API with
> > stream methods".   It adds seven methods to the output plugin API,
> > adding support for streaming changes for large in-progress
> > transactions. The methods are stream_start, stream_stop, stream_abort,
> > stream_commit, stream_change, stream_message, and stream_truncate.
> > Most of this is a simple extension of the existing methods, with the
> > semantic difference that the transaction (or subtransaction) is
> > incomplete and may be aborted later (which is something the regular
> > API does not really need to deal with).
> >
> > This also extends the 'test_decoding' plugin, implementing these new
> > stream methods.  The stream_start/start_stop are used to demarcate a
> > chunk of changes streamed for a particular toplevel transaction.
> >
> > This commit simply adds these new APIs and the upcoming patch to
> > "allow the streaming mode in ReorderBuffer" will use these APIs.
>
> LGTM
>

Pushed.  Feel free to submit the remaining patches.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Tue, Jul 28, 2020 at 9:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Jul 26, 2020 at 11:04 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > > Today, I have again looked at the first patch
> > > (v42-0001-Extend-the-logical-decoding-output-plugin-API-wi) and didn't
> > > find any more problems with it so planning to commit the same unless
> > > you or someone else want to add more to it.   Just for ease of others,
> > > "the next patch extends the logical decoding output plugin API with
> > > stream methods".   It adds seven methods to the output plugin API,
> > > adding support for streaming changes for large in-progress
> > > transactions. The methods are stream_start, stream_stop, stream_abort,
> > > stream_commit, stream_change, stream_message, and stream_truncate.
> > > Most of this is a simple extension of the existing methods, with the
> > > semantic difference that the transaction (or subtransaction) is
> > > incomplete and may be aborted later (which is something the regular
> > > API does not really need to deal with).
> > >
> > > This also extends the 'test_decoding' plugin, implementing these new
> > > stream methods.  The stream_start/start_stop are used to demarcate a
> > > chunk of changes streamed for a particular toplevel transaction.
> > >
> > > This commit simply adds these new APIs and the upcoming patch to
> > > "allow the streaming mode in ReorderBuffer" will use these APIs.
> >
> > LGTM
> >
>
> Pushed.  Feel free to submit the remaining patches.

Thanks, please find the rebased patch set.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Ajin Cherian
Date:


On Wed, Jul 29, 2020 at 3:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:


Thanks, please find the rebased patch set.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

I was running some tests on this patch, generally trying to see how the patch affects logical replication when doing bulk inserts. This issue has been raised in the past; see, for example, [1].
My test setup is:
1. Two postgres servers running - A and B
2. Create a pgbench setup on A (pgbench -i -s 5 postgres).
3. Replicate the 3 tables (schema only) on B.
4. Three publications on A for the 3 pgbench tables: pgbench_accounts, pgbench_branches and pgbench_tellers.
5. Three subscriptions on B for the same tables (streaming on and off based on the scenarios described below).

Run pgbench with: pgbench -c 4 -T 100 postgres
While pgbench is running, do a bulk insert on some other table not in the publication list (say t1): INSERT INTO t1 (select i FROM generate_series(1,10000000) i);

Four scenarios:
1. Pgbench with logical replication enabled without bulk insert
Avg TPS (out of 10 runs): 641 TPS
2. Pgbench without logical replication enabled with bulk insert (no pub/sub)
Avg TPS (out of 10 runs): 665 TPS
3. Pgbench with logical replication enabled with bulk insert
Avg TPS (out of 10 runs): 278 TPS
4. Pgbench with logical replication streaming on with bulk insert
Avg TPS (out of 10 runs): 440 TPS

As you can see, the bulk inserts, although on a totally unaffected table, do impact the TPS. But what is good is that enabling streaming improves the TPS (about a 58% improvement).
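(For scenario 4, the only subscriber-side difference is toggling the new
option on the three subscriptions, e.g. with made-up subscription names:)

ALTER SUBSCRIPTION sub_pgbench_accounts SET (streaming = on);
ALTER SUBSCRIPTION sub_pgbench_branches SET (streaming = on);
ALTER SUBSCRIPTION sub_pgbench_tellers  SET (streaming = on);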


regards,
Ajin Cherian
Fujitsu Australia

 

 

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Thu, Jul 30, 2020 at 12:28 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> I was running some tests on this patch. I was generally trying to see how the patch affects logical replication when
> doing bulk inserts. This issue has been raised in the past, for eg: this [1].
> My test setup is:
> 1. Two postgres servers running - A and B
> 2. Create a pgbench setup on A. (pgbench -i -s 5 postgres)
> 3. replicate the 3 tables (schema only) on B.
> 4. Three publishers on A for the 3 tables of pgbench; pgbench_accounts, pgbench_branches and pgbench_tellers;
> 5. Three subscribers on B for the same tables. (streaming on and off based on the scenarios described below)
>
> run pgbench with : pgbench -c 4 -T 100 postgres
> While pgbench is running, Do a bulk insert on some other table not in the publication list (say t1); INSERT INTO t1
> (select i FROM generate_series(1,10000000) i);
>
> Four scenarios:
> 1. Pgbench with logical replication enabled without bulk insert
> Avg TPS (out of 10 runs): 641 TPS
> 2.Pgbench without logical replication enabled with bulk insert (no pub/sub)
> Avg TPS (out of 10 runs): 665 TPS
> 3, Pgbench with logical replication enabled with bulk insert
> Avg TPS (out of 10 runs): 278 TPS
> 4. Pgbench with logical replication streaming on with bulk insert
> Avg TPS (out of 10 runs): 440 TPS
>
> As you can see, the bulk inserts, although on a totally unaffected table, does impact the TPS. But what is good is
> that, enabling streaming improves the TPS (about 58% improvement)
>

Thanks for doing these tests.  It is a good win, and probably the reason
is that after the patch we won't serialize such big transactions (as
shown in Konstantin's email [1]); they will simply be skipped.
Basically, it will try to stream such transactions and will skip them
as they are not required to be sent.

[1] - https://www.postgresql.org/message-id/5f5143cc-9f73-3909-3ef7-d3895cc6cc90%40postgrespro.ru

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Ajin Cherian
Date:

Attaching an updated patch for the streaming stats, based on v2 of Sawada-san's replication slot stats framework and v44 of this patch series. This is one patch that has both the stats framework from Sawada-san (1) as well as my update for streaming, so it can be applied easily on top of v44.

regards,
Ajin Cherian
Fujitsu Australia 
Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Wed, Jul 29, 2020 at 10:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Thanks, please find the rebased patch set.
>

Few comments on v44-0001-Implement-streaming-mode-in-ReorderBuffer:
============================================================
1.
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM
generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM
generate_series(1, 20) g(i);
+COMMIT;

Is the above comment true?  Because it seems to me that Insert is
getting streamed in the main transaction.

2.
+<programlisting>
+postgres[33712]=#* SELECT * FROM
pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes',
'1');
+    lsn    | xid |                       data
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+

Is the above example correct?  Because we should include XID in the
stream message only when include_xids option is specified.

3.
 /*
- * Queue a change into a transaction so it can be replayed upon commit.
+ * Record the partial change for the streaming of in-progress transactions.  We
+ * can stream only complete changes so if we have a partial change like toast
+ * table insert or speculative then we mark such a 'txn' so that it can't be
+ * streamed.

/speculative then/speculative insert then

4.  I think we can explain the problems (like we can see the wrong
tuple or see two versions of the same tuple or whatever else wrong can
happen, if possible with some example) related to concurrent aborts
somewhere in comments.



-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Tue, Aug 4, 2020 at 10:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jul 29, 2020 at 10:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Thanks, please find the rebased patch set.
> >
>
> Few comments on v44-0001-Implement-streaming-mode-in-ReorderBuffer:
> ============================================================
> 1.
> +-- streaming with subxact, nothing in main
> +BEGIN;
> +savepoint s1;
> +SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
> +INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM
> generate_series(1, 35) g(i);
> +TRUNCATE table stream_test;
> +rollback to s1;
> +INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM
> generate_series(1, 20) g(i);
> +COMMIT;
>
> Is the above comment true?  Because it seems to me that Insert is
> getting streamed in the main transaction.

Changed the comments.

> 2.
> +<programlisting>
> +postgres[33712]=#* SELECT * FROM
> pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes',
> '1');
> +    lsn    | xid |                       data
> +-----------+-----+--------------------------------------------------
> + 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
> + 0/16B21F8 | 503 | streaming change for TXN 503
> + 0/16B2300 | 503 | streaming change for TXN 503
> + 0/16B2408 | 503 | streaming change for TXN 503
> + 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
> + 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
> + 0/16BECA8 | 503 | streaming change for TXN 503
> + 0/16BEDB0 | 503 | streaming change for TXN 503
> + 0/16BEEB8 | 503 | streaming change for TXN 503
> + 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
> +(10 rows)
> +</programlisting>
> + </para>
> +
>
> Is the above example correct?  Because we should include XID in the
> stream message only when include_xids option is specified.

include_xids is true unless we explicitly set it to false.
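For example (a rough sketch only, reusing the slot name and test_decoding options from the example above), turning the option off drops the XID from the decoded text, while the xid output column of the function is still populated:

-- Sketch: with include-xids disabled the rows no longer carry "TXN <xid>".
SELECT data FROM pg_logical_slot_get_changes('test_slot', NULL, NULL,
       'include-xids', '0', 'stream-changes', '1');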

> 3.
>  /*
> - * Queue a change into a transaction so it can be replayed upon commit.
> + * Record the partial change for the streaming of in-progress transactions.  We
> + * can stream only complete changes so if we have a partial change like toast
> + * table insert or speculative then we mark such a 'txn' so that it can't be
> + * streamed.
>
> /speculative then/speculative insert then

Done

> 4.  I think we can explain the problems (like we can see the wrong
> tuple or see two versions of the same tuple or whatever else wrong can
> happen, if possible with some example) related to concurrent aborts
> somewhere in comments.

Done

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Aug 4, 2020 at 12:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Aug 4, 2020 at 10:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > 4.  I think we can explain the problems (like we can see the wrong
> > tuple or see two versions of the same tuple or whatever else wrong can
> > happen, if possible with some example) related to concurrent aborts
> > somewhere in comments.
>
> Done
>

I have slightly modified the comment added for the above point and
apart from that added/modified a few comments at other places.  I have
also slightly edited the commit message.

@@ -2196,6 +2778,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb,
TransactionId xid,
  change->lsn = lsn;
  change->txn = txn;
  change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
+ change->txn = txn;

This change is not required as the same information is assigned a few
lines before.  So, I have removed this change as well.  Let me know
what you think of the above changes?

Can we add a test for incomplete changes (probably with toast
insertion but we can do it for spec_insert case as well) in
ReorderBuffer such that it needs to first serialize the changes and
then stream it?  I have manually verified such scenarios but it is
good to have the test for the same.
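A rough sketch of such a test could look like the following (an illustration only, not the committed test; the table, slot, data sizes, and option names are assumptions here):

-- Force serialization first: while the toast chunks of a large,
-- non-compressible value are being decoded the change is incomplete,
-- so the transaction spills to disk instead of being streamed.
SET logical_decoding_work_mem = '64kB';
CREATE TABLE stream_test(data text);
SELECT 'init' FROM pg_create_logical_replication_slot('test_slot', 'test_decoding');

BEGIN;
INSERT INTO stream_test
SELECT (SELECT string_agg(md5(i::text || '-' || g.i::text), '')
        FROM generate_series(1, 3000) i)
FROM generate_series(1, 5) g(i);
COMMIT;

-- Once the complete tuple is available, the transaction is streamed,
-- including the previously serialized changes.
SELECT count(*) > 0 AS streamed
FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes', '1')
WHERE data LIKE 'streaming change%';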

-- 
With Regards,
Amit Kapila.

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Aug 5, 2020 at 6:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Aug 4, 2020 at 12:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Aug 4, 2020 at 10:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > > 4.  I think we can explain the problems (like we can see the wrong
> > > tuple or see two versions of the same tuple or whatever else wrong can
> > > happen, if possible with some example) related to concurrent aborts
> > > somewhere in comments.
> >
> > Done
> >
>
> I have slightly modified the comment added for the above point and
> apart from that added/modified a few comments at other places.  I have
> also slightly edited the commit message.
>
> @@ -2196,6 +2778,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb,
> TransactionId xid,
>   change->lsn = lsn;
>   change->txn = txn;
>   change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
> + change->txn = txn;
>
> This change is not required as the same information is assigned a few
> lines before.  So, I have removed this change as well.  Let me know
> what you think of the above changes?

Changes look fine to me.

> Can we add a test for incomplete changes (probably with toast
> insertion but we can do it for spec_insert case as well) in
> ReorderBuffer such that it needs to first serialize the changes and
> then stream it?  I have manually verified such scenarios but it is
> good to have the test for the same.

I have added a new test for the same in the stream.sql file.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Aug 5, 2020 at 7:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Aug 5, 2020 at 6:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > Can we add a test for incomplete changes (probably with toast
> > insertion but we can do it for spec_insert case as well) in
> > ReorderBuffer such that it needs to first serialize the changes and
> > then stream it?  I have manually verified such scenarios but it is
> > good to have the test for the same.
>
> I have added a new test for the same in the stream.sql file.
>

Thanks, I have slightly changed the test so that we can consume DDL
changes separately.  I have made a number of other adjustments, like
changing a few more comments (to make them consistent with nearby
comments), removing an unnecessary header file inclusion, and running
pgindent.
The next patch (v47-0001-Implement-streaming-mode-in-ReorderBuffer) in
this series looks good to me.  I am planning to push it after one more
read-through unless you or anyone else has any comments on the same.
The patch I am talking about has the following functionality:

Implement streaming mode in ReorderBuffer. Instead of serializing the
transaction to disk after reaching the logical_decoding_work_mem limit
in memory, we consume the changes we have in memory and invoke stream
API methods added by commit 45fdc9738b. However, sometimes if we have
incomplete toast or speculative insert we spill to the disk because we
can't stream till we have the complete tuple.  And, as soon as we get
the complete tuple we stream the transaction including the serialized
changes. Now that we can stream in-progress transactions, the
concurrent aborts may cause failures when the output plugin consults
catalogs (both system and user-defined). We handle such failures by
returning ERRCODE_TRANSACTION_ROLLBACK sqlerrcode from system table
scan APIs to the backend or WALSender decoding a specific uncommitted
transaction. The decoding logic on the receipt of such a sqlerrcode
aborts the decoding of the current transaction and continues with the
decoding of other transactions. We also provide a new option via SQL
APIs to fetch the changes being streamed.

This patch's functionality can be independently verified by SQL APIs
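For instance, a minimal sketch of such a check (modelled on the test quoted earlier in this thread; the slot and table names are assumptions):

SET logical_decoding_work_mem = '64kB';
CREATE TABLE stream_test(data text);
SELECT 'init' FROM pg_create_logical_replication_slot('test_slot', 'test_decoding');

BEGIN;
INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM
generate_series(1, 35) g(i);
COMMIT;

-- With the new option the large transaction shows up as
-- "opening/closing a streamed block" plus "streaming change" rows.
SELECT data FROM pg_logical_slot_get_changes('test_slot', NULL, NULL,
       'stream-changes', '1');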

-- 
With Regards,
Amit Kapila.

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Aug 5, 2020 at 7:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Aug 5, 2020 at 6:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > > Can we add a test for incomplete changes (probably with toast
> > > insertion but we can do it for spec_insert case as well) in
> > > ReorderBuffer such that it needs to first serialize the changes and
> > > then stream it?  I have manually verified such scenarios but it is
> > > good to have the test for the same.
> >
> > I have added a new test for the same in the stream.sql file.
> >
>
> Thanks, I have slightly changed the test so that we can consume DDL
> changes separately.  I have made a number of other adjustments like
> changing few more comments (to make them consistent with nearby
> comments), removed unnecessary inclusion of header file, ran pgindent.
> The next patch (v47-0001-Implement-streaming-mode-in-ReorderBuffer) in
> this series looks good to me.  I am planning to push it after one more
> read-through unless you or anyone else has any comments on the same.
> The patch I am talking about has the following functionality:
>
> Implement streaming mode in ReorderBuffer. Instead of serializing the
> transaction to disk after reaching the logical_decoding_work_mem limit
> in memory, we consume the changes we have in memory and invoke stream
> API methods added by commit 45fdc9738b. However, sometimes if we have
> incomplete toast or speculative insert we spill to the disk because we
> can't stream till we have the complete tuple.  And, as soon as we get
> the complete tuple we stream the transaction including the serialized
> changes. Now that we can stream in-progress transactions, the
> concurrent aborts may cause failures when the output plugin consults
> catalogs (both system and user-defined). We handle such failures by
> returning ERRCODE_TRANSACTION_ROLLBACK sqlerrcode from system table
> scan APIs to the backend or WALSender decoding a specific uncommitted
> transaction. The decoding logic on the receipt of such a sqlerrcode
> aborts the decoding of the current transaction and continues with the
> decoding of other transactions. We also provide a new option via SQL
> APIs to fetch the changes being streamed.
>
> This patch's functionality can be independently verified by SQL APIs

Your changes look fine to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Aug 7, 2020 at 2:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
..
> > This patch's functionality can be independently verified by SQL APIs
>
> Your changes look fine to me.
>

I have pushed that patch last week and attached are the remaining
patches. I have made a few changes in the next patch
0001-Extend-the-BufFile-interface.patch and have some comments on it
which are as below:

1.
  case SEEK_END:
- /* could be implemented, not needed currently */
+
+ /*
+ * Get the file size of the last file to get the last offset of
+ * that file.
+ */
+ newFile = file->numFiles - 1;
+ newOffset = FileSize(file->files[file->numFiles - 1]);
+ if (newOffset < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not determine size of temporary file \"%s\" from
BufFile \"%s\": %m",
+ FilePathName(file->files[file->numFiles - 1]),
+ file->name)));
+ break;
  break;

There is no need for multiple breaks in the above code. I have fixed
this one in the attached patch.

2.
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+ int newFile = file->numFiles;
+ off_t newOffset = file->curOffset;
+ char segment_name[MAXPGPATH];
+ int i;
+
+ /* Loop over all the files upto the fileno which we want to truncate. */
+ for (i = file->numFiles - 1; i >= fileno; i--)
+ {
+ /*
+ * Except the fileno, we can directly delete other files.  If the
+ * offset is 0 then we can delete the fileno file as well unless it is
+ * the first file.
+ */
+ if ((i != fileno || offset == 0) && fileno != 0)
+ {
+ SharedSegmentName(segment_name, file->name, i);
+ FileClose(file->files[i]);
+ if (!SharedFileSetDelete(file->fileset, segment_name, true))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not delete shared fileset \"%s\": %m",
+ segment_name)));
+ newFile--;
+ newOffset = MAX_PHYSICAL_FILESIZE;
+ }
+ else
+ {
+ if (FileTruncate(file->files[i], offset,
+ WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not truncate file \"%s\": %m",
+ FilePathName(file->files[i]))));
+
+ newOffset = offset;
+ }
+ }
+
+ file->numFiles = newFile;
+ file->curOffset = newOffset;
+}

In the end, you have only set the 'numFiles' and 'curOffset' members of
BufFile and left the others. I think other members like 'curFile' also
need to be set, especially for the case where we have deleted segments
at the end. Also, shouldn't we set 'pos' and 'nbytes' as we do in
BufFileSeek? If there is some reason not to set these other members
then maybe it is better to add a comment to make it clear.

Another thing we need to think about here is whether we need to flush
a dirty buffer's data. Consider a case where we truncate
the file up to a position that falls within the buffer. Now we might
truncate the file and part of the buffer contents will become invalid;
if we later flush such a buffer then the file can contain garbage.
Maybe this will be handled if we update the position in the buffer
appropriately, but all of this should be explained in comments.
If what I said is correct, then we still can skip buffer flush in some
cases as we do in BufFileSeek. Also, consider if we need to do other
handling (convert seek to "start of next seg" to "end of last seg") as
we do after changing the seek position in BufFileSeek.

3.
/*
 * Initialize a space for temporary files that can be opened by other backends.
 * Other backends must attach to it before accessing it.  Associate this
 * SharedFileSet with 'seg'.  Any contained files will be deleted when the
 * last backend detaches.
 *
 * We can also use this interface if the temporary files are used only by
 * single backend but the files need to be opened and closed multiple times
 * and also the underlying files need to survive across transactions.  For
 * such cases, dsm segment 'seg' should be passed as NULL.  We remove such
 * files on proc exit.
 *
 * Files will be distributed over the tablespaces configured in
 * temp_tablespaces.
 *
 * Under the covers the set is one or more directories which will eventually
 * be deleted when there are no backends attached.
 */
void
SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
{
..

I think we can remove the part of the above comment after 'eventually
be deleted' (see last sentence in comment) because now the files can
be removed in more than one way and we have explained that in the
comments before this last sentence of the comment. If you can rephrase
it differently to cover the other case as well, then that is fine too.

-- 
With Regards,
Amit Kapila.

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Aug 13, 2020 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Aug 7, 2020 at 2:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> ..
> > > This patch's functionality can be independently verified by SQL APIs
> >
> > Your changes look fine to me.
> >
>
> I have pushed that patch last week and attached are the remaining
> patches. I have made a few changes in the next patch
> 0001-Extend-the-BufFile-interface.patch and have some comments on it
> which are as below:
>

Few more comments on the latest patches:
v48-0002-Add-support-for-streaming-to-built-in-replicatio
1. It appears to me that we don't remove the temporary folders created
by the apply worker. So, we have folders like
pgsql_tmp15324.0.sharedfileset in base/pgsql_tmp directory even when
the apply worker exits. I think we can remove these by calling
PathNameDeleteTemporaryDir in SharedFileSetUnregister while removing
the fileset from registered filesetlist.

2.
+typedef struct SubXactInfo
+{
+ TransactionId xid; /* XID of the subxact */
+ int fileno; /* file number in the buffile */
+ off_t offset; /* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;

Will it be better if we move all the subxact related variables (like
nsubxacts, nsubxacts_max and subxact_last) inside SubXactInfo struct
as all the information anyway is related to sub-transactions?
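One possible shape for that (just a sketch; the struct name and the exact field set here are assumptions, not necessarily what the patch ends up with):

/* Sketch: keep all bookkeeping for streamed sub-transactions together. */
typedef struct ApplySubXactData
{
	uint32		nsubxacts;		/* number of sub-transactions recorded */
	uint32		nsubxacts_max;	/* allocated length of the subxacts array */
	TransactionId subxact_last; /* xid of the last sub-transaction seen */
	SubXactInfo *subxacts;		/* per-subxact file number and offset */
} ApplySubXactData;

static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};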

3.
+ /*
+ * If there is no subtransaction then nothing to do,  but if already have
+ * subxact file then delete that.
+ */

extra space before 'but' in the above sentence is not required.

v48-0001-Extend-the-BufFile-interface
4.
- * SharedFileSets can also be used by backends when the temporary files need
- * to be opened/closed multiple times and the underlying files need to survive
+ * SharedFileSets can be used by backends when the temporary files need to be
+ * opened/closed multiple times and the underlying files need to survive
  * across transactions.
  *

No need of 'also' in the above sentence.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Thomas Munro
Date:
On Thu, Aug 13, 2020 at 6:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have pushed that patch last week and attached are the remaining
> patches. I have made a few changes in the next patch
> 0001-Extend-the-BufFile-interface.patch and have some comments on it
> which are as below:

Hi Amit,

I noticed that Konstantin Knizhnik's CF entry 2386 calls
table_scan_XXX() functions from an extension, namely
contrib/auto_explain, and started failing to build on Windows after
commit 7259736a.  This seems to be due to the new global variables
CheckXidAlive and bsysscan, which probably need PGDLLIMPORT if they
are accessed from inline functions that are part of the API that we
expect extensions to be allowed to call.
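For reference, a sketch of the kind of change that seems to be needed (assuming the declarations live in a backend header such as src/include/access/xact.h; the exact location may differ):

/* Sketch: export the variables so extensions can reference them on Windows. */
extern PGDLLIMPORT TransactionId CheckXidAlive;
extern PGDLLIMPORT bool bsysscan;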



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Aug 14, 2020 at 10:11 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Thu, Aug 13, 2020 at 6:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I have pushed that patch last week and attached are the remaining
> > patches. I have made a few changes in the next patch
> > 0001-Extend-the-BufFile-interface.patch and have some comments on it
> > which are as below:
>
> Hi Amit,
>
> I noticed that Konstantin Knizhnik's CF entry 2386 calls
> table_scan_XXX() functions from an extension, namely
> contrib/auto_explain, and started failing to build on Windows after
> commit 7259736a.  This seems to be due to the new global variables
> CheckXidAlive and bsysscan, which probably need PGDLLIMPORT if they
> are accessed from inline functions that are part of the API that we
> expect extensions to be allowed to call.
>

Yeah, that makes sense. I will take care of that later today or
tomorrow. We have not noticed that because currently none of the
extensions is using those functions. BTW, I noticed that after
failure, the next run is green, why so? Is the next run not on
windows?

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Thomas Munro
Date:
On Fri, Aug 14, 2020 at 6:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yeah, that makes sense. I will take care of that later today or
> tomorrow. We have not noticed that because currently none of the
> extensions is using those functions. BTW, I noticed that after
> failure, the next run is green, why so? Is the next run not on
> windows?

The three cfbot results are for applying the patch, testing on Windows
and testing on Ubuntu in that order.  It's not at all clear and I'll
probably find a better way to display it when I get around to adding
some more operating systems, maybe with some OS icons or something
like that...



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Sat, Aug 15, 2020 at 4:14 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, Aug 14, 2020 at 6:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Yeah, that makes sense. I will take care of that later today or
> > tomorrow. We have not noticed that because currently none of the
> > extensions is using those functions. BTW, I noticed that after
> > failure, the next run is green, why so? Is the next run not on
> > windows?
>
> The three cfbot results are for applying the patch, testing on Windows
> and testing on Ubuntu in that order.  It's not at all clear and I'll
> probably find a better way to display it when I get around to adding
> some more operating systems, maybe with some OS icons or something
> like that...
>

Good to know, anyway, I have pushed a patch to mark those variables
with PGDLLIMPORT.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Aug 13, 2020 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Aug 7, 2020 at 2:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> ..
> > > This patch's functionality can be independently verified by SQL APIs
> >
> > Your changes look fine to me.
> >
>
> I have pushed that patch last week and attached are the remaining
> patches. I have made a few changes in the next patch
> 0001-Extend-the-BufFile-interface.patch and have some comments on it
> which are as below:
>
> 1.
>   case SEEK_END:
> - /* could be implemented, not needed currently */
> +
> + /*
> + * Get the file size of the last file to get the last offset of
> + * that file.
> + */
> + newFile = file->numFiles - 1;
> + newOffset = FileSize(file->files[file->numFiles - 1]);
> + if (newOffset < 0)
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not determine size of temporary file \"%s\" from
> BufFile \"%s\": %m",
> + FilePathName(file->files[file->numFiles - 1]),
> + file->name)));
> + break;
>   break;
>
> There is no need for multiple breaks in the above code. I have fixed
> this one in the attached patch.

Ok.

> 2.
> +void
> +BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
> +{
> + int newFile = file->numFiles;
> + off_t newOffset = file->curOffset;
> + char segment_name[MAXPGPATH];
> + int i;
> +
> + /* Loop over all the files upto the fileno which we want to truncate. */
> + for (i = file->numFiles - 1; i >= fileno; i--)
> + {
> + /*
> + * Except the fileno, we can directly delete other files.  If the
> + * offset is 0 then we can delete the fileno file as well unless it is
> + * the first file.
> + */
> + if ((i != fileno || offset == 0) && fileno != 0)
> + {
> + SharedSegmentName(segment_name, file->name, i);
> + FileClose(file->files[i]);
> + if (!SharedFileSetDelete(file->fileset, segment_name, true))
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not delete shared fileset \"%s\": %m",
> + segment_name)));
> + newFile--;
> + newOffset = MAX_PHYSICAL_FILESIZE;
> + }
> + else
> + {
> + if (FileTruncate(file->files[i], offset,
> + WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not truncate file \"%s\": %m",
> + FilePathName(file->files[i]))));
> +
> + newOffset = offset;
> + }
> + }
> +
> + file->numFiles = newFile;
> + file->curOffset = newOffset;
> +}
>
> In the end, you have only set the 'numFiles' and 'curOffset' members of
> BufFile and left the others. I think other members like 'curFile' also
> need to be set, especially for the case where we have deleted segments
> at the end.

Yes, this must be set.

> Also, shouldn't we set 'pos' and 'nbytes' as we do in BufFileSeek?
> If there is some reason not to set these other members then maybe it
> is better to add a comment to make it clear.

IMHO, we can directly call BufFileFlush; this will reset pos and
nbytes, and we can directly set curOffset to the absolute location.
The next BufFileRead/BufFileWrite will re-read the buffer, so
everything will be fine.

> Another thing we need to think here whether we need to flush the
> buffer data for the dirty buffer? Consider a case where we truncate
> the file up to a position that falls in the buffer. Now we might
> truncate the file and part of buffer contents will become invalid,
> next time if we flush such a buffer then the file can contain the
> garbage or maybe this will be handled if we update the position in
> buffer appropriately but all of this should be explained in comments.
> If what I said is correct, then we still can skip buffer flush in some
> cases as we do in BufFileSeek.

I think in all the cases we can flush the buffer and reset the pos and nbytes.

> Also, consider if we need to do other
> handling (convert seek to "start of next seg" to "end of last seg") as
> we do after changing the seek position in BufFileSeek.

We also do this when we truncate the complete file; see this:
+ if ((i != fileno || offset == 0) && fileno != 0)
+ {
+ SharedSegmentName(segment_name, file->name, i);
+ FileClose(file->files[i]);
+ if (!SharedFileSetDelete(file->fileset, segment_name, true))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not delete shared fileset \"%s\": %m",
+ segment_name)));
+ newFile--;
+ newOffset = MAX_PHYSICAL_FILESIZE;
+ }

> 3.
> /*
>  * Initialize a space for temporary files that can be opened by other backends.
>  * Other backends must attach to it before accessing it.  Associate this
>  * SharedFileSet with 'seg'.  Any contained files will be deleted when the
>  * last backend detaches.
>  *
>  * We can also use this interface if the temporary files are used only by
>  * single backend but the files need to be opened and closed multiple times
>  * and also the underlying files need to survive across transactions.  For
>  * such cases, dsm segment 'seg' should be passed as NULL.  We remove such
>  * files on proc exit.
>  *
>  * Files will be distributed over the tablespaces configured in
>  * temp_tablespaces.
>  *
>  * Under the covers the set is one or more directories which will eventually
>  * be deleted when there are no backends attached.
>  */
> void
> SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
> {
> ..
>
> I think we can remove the part of the above comment after 'eventually
> be deleted' (see last sentence in comment) because now the files can
> be removed in more than one way and we have explained that in the
> comments before this last sentence of the comment. If you can rephrase
> it differently to cover the other case as well, then that is fine too.

I think it makes sense to remove, so I have removed it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Aug 13, 2020 at 6:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 13, 2020 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Aug 7, 2020 at 2:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > ..
> > > > This patch's functionality can be independently verified by SQL APIs
> > >
> > > Your changes look fine to me.
> > >
> >
> > I have pushed that patch last week and attached are the remaining
> > patches. I have made a few changes in the next patch
> > 0001-Extend-the-BufFile-interface.patch and have some comments on it
> > which are as below:
> >
>
> Few more comments on the latest patches:
> v48-0002-Add-support-for-streaming-to-built-in-replicatio
> 1. It appears to me that we don't remove the temporary folders created
> by the apply worker. So, we have folders like
> pgsql_tmp15324.0.sharedfileset in base/pgsql_tmp directory even when
> the apply worker exits. I think we can remove these by calling
> PathNameDeleteTemporaryDir in SharedFileSetUnregister while removing
> the fileset from registered filesetlist.

I think we need to call SharedFileSetDeleteAll(input_fileset) from
SharedFileSetUnregister, so that all the directories created for this
fileset are removed.

> 2.
> +typedef struct SubXactInfo
> +{
> + TransactionId xid; /* XID of the subxact */
> + int fileno; /* file number in the buffile */
> + off_t offset; /* offset in the file */
> +} SubXactInfo;
> +
> +static uint32 nsubxacts = 0;
> +static uint32 nsubxacts_max = 0;
> +static SubXactInfo *subxacts = NULL;
> +static TransactionId subxact_last = InvalidTransactionId;
>
> Will it be better if we move all the subxact related variables (like
> nsubxacts, nsubxacts_max and subxact_last) inside SubXactInfo struct
> as all the information anyway is related to sub-transactions?

I have moved them all to a structure.

> 3.
> + /*
> + * If there is no subtransaction then nothing to do,  but if already have
> + * subxact file then delete that.
> + */
>
> extra space before 'but' in the above sentence is not required.

Fixed

> v48-0001-Extend-the-BufFile-interface
> 4.
> - * SharedFileSets can also be used by backends when the temporary files need
> - * to be opened/closed multiple times and the underlying files need to survive
> + * SharedFileSets can be used by backends when the temporary files need to be
> + * opened/closed multiple times and the underlying files need to survive
>   * across transactions.
>   *
>
> No need of 'also' in the above sentence.

Fixed


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Sat, Aug 15, 2020 at 3:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Aug 13, 2020 at 6:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Aug 13, 2020 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Aug 7, 2020 at 2:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > ..
> > > > > This patch's functionality can be independently verified by SQL APIs
> > > >
> > > > Your changes look fine to me.
> > > >
> > >
> > > I have pushed that patch last week and attached are the remaining
> > > patches. I have made a few changes in the next patch
> > > 0001-Extend-the-BufFile-interface.patch and have some comments on it
> > > which are as below:
> > >
> >
> > Few more comments on the latest patches:
> > v48-0002-Add-support-for-streaming-to-built-in-replicatio
> > 1. It appears to me that we don't remove the temporary folders created
> > by the apply worker. So, we have folders like
> > pgsql_tmp15324.0.sharedfileset in base/pgsql_tmp directory even when
> > the apply worker exits. I think we can remove these by calling
> > PathNameDeleteTemporaryDir in SharedFileSetUnregister while removing
> > the fileset from registered filesetlist.
>
> I think we need to call SharedFileSetDeleteAll(input_fileset), from
> SharedFileSetUnregister, so that all the directories created for this
> fileset are removed
>
> > 2.
> > +typedef struct SubXactInfo
> > +{
> > + TransactionId xid; /* XID of the subxact */
> > + int fileno; /* file number in the buffile */
> > + off_t offset; /* offset in the file */
> > +} SubXactInfo;
> > +
> > +static uint32 nsubxacts = 0;
> > +static uint32 nsubxacts_max = 0;
> > +static SubXactInfo *subxacts = NULL;
> > +static TransactionId subxact_last = InvalidTransactionId;
> >
> > Will it be better if we move all the subxact related variables (like
> > nsubxacts, nsubxacts_max and subxact_last) inside SubXactInfo struct
> > as all the information anyway is related to sub-transactions?
>
> I have moved them all to a structure.
>
> > 3.
> > + /*
> > + * If there is no subtransaction then nothing to do,  but if already have
> > + * subxact file then delete that.
> > + */
> >
> > extra space before 'but' in the above sentence is not required.
>
> Fixed
>
> > v48-0001-Extend-the-BufFile-interface
> > 4.
> > - * SharedFileSets can also be used by backends when the temporary files need
> > - * to be opened/closed multiple times and the underlying files need to survive
> > + * SharedFileSets can be used by backends when the temporary files need to be
> > + * opened/closed multiple times and the underlying files need to survive
> >   * across transactions.
> >   *
> >
> > No need of 'also' in the above sentence.
>
> Fixed
>

In the last patch, v49-0001, there is one issue.  Basically, I have called
BufFileFlush in all the cases.  But, ideally, we cannot call this if
the underlying files are deleted/truncated because those files/blocks
might not exist now.  So I think if the truncate position is within
the same buffer we just need to adjust the buffer; otherwise we just
need to set curFile and curOffset to the absolute values and set
pos and nbytes to 0.  The attached patch fixes this issue.

+ errmsg("could not truncate file \"%s\": %m",
+ FilePathName(file->files[i]))));
+ curOffset = offset;
+ }
+ }
+
+ /* Otherwise, must reposition buffer, so flush any dirty data */
+ BufFileFlush(file);
+

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>
> In last patch v49-0001, there is one issue,  Basically, I have called
> BufFileFlush in all the cases.  But, ideally, we can not call this if
> the underlying files are deleted/truncated because those files/blocks
> might not exist now.  So I think if the truncate position is within
> the same buffer we just need to adjust the buffer,  otherwise we just
> need to set the currFile and currOffset to the absolute number and set
> the pos and nbytes 0.  Attached patch fixes this issue.
>

Few comments on the latest patch v50-0001-Extend-the-BufFile-interface
1.
+
+ /*
+ * If the truncate point is within existing buffer then we can just
+ * adjust pos-within-buffer, without flushing buffer.  Otherwise,
+ * we don't need to do anything because we have already deleted/truncated
+ * the underlying files.
+ */
+ if (curFile == file->curFile &&
+ curOffset >= file->curOffset &&
+ curOffset <= file->curOffset + file->nbytes)
+ {
+ file->pos = (int) (curOffset - file->curOffset);
+ return;
+ }

I think in this case you have set the position correctly but what
about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes'
because the contents of the buffer are still valid but I don't think
the same is true here.

2.
+ int curFile = file->curFile;
+ off_t curOffset = file->curOffset;

I find the previous naming (newFile, newOffset) was better as it
distinguishes them from BufFile variables.

3.
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
..
+ /* Delete all files in the set */
+ SharedFileSetDeleteAll(input_fileset);
..
}

I am not sure if this is completely correct because we call this
function (SharedFileSetUnregister) from BufFileDeleteShared which
would have already removed all the required files. This raises the
question in my mind whether it is correct to call
SharedFileSetUnregister from BufFileDeleteShared from the API
perspective as one might not want to remove the entire fileset at that
point of time. It will work for your use case (where while removing
buffile you also want to remove the entire fileset) but not sure if it
is generic enough. For your case, I wonder if we can directly call
SharedFileSetDeleteAll and we can have a call like
SharedFileSetUnregister which will be called from it.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> >
> > In last patch v49-0001, there is one issue,  Basically, I have called
> > BufFileFlush in all the cases.  But, ideally, we can not call this if
> > the underlying files are deleted/truncated because those files/blocks
> > might not exist now.  So I think if the truncate position is within
> > the same buffer we just need to adjust the buffer,  otherwise we just
> > need to set the currFile and currOffset to the absolute number and set
> > the pos and nbytes 0.  Attached patch fixes this issue.
> >
>
> Few comments on the latest patch v50-0001-Extend-the-BufFile-interface
> 1.
> +
> + /*
> + * If the truncate point is within existing buffer then we can just
> + * adjust pos-within-buffer, without flushing buffer.  Otherwise,
> + * we don't need to do anything because we have already deleted/truncated
> + * the underlying files.
> + */
> + if (curFile == file->curFile &&
> + curOffset >= file->curOffset &&
> + curOffset <= file->curOffset + file->nbytes)
> + {
> + file->pos = (int) (curOffset - file->curOffset);
> + return;
> + }
>
> I think in this case you have set the position correctly but what
> about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes'
> because the contents of the buffer are still valid but I don't think
> the same is true here.
>

I think you need to set 'nbytes' to curOffset as per your current
patch as that is the new size of the file.
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno,
off_t offset)
                curOffset <= file->curOffset + file->nbytes)
        {
                file->pos = (int) (curOffset - file->curOffset);
+               file->nbytes = (int) curOffset;
                return;
        }

Also, what about file 'numFiles', that can also change due to the
removal of certain files, shouldn't that be also set in this case?

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Aug 19, 2020 at 10:11 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> >
> > In last patch v49-0001, there is one issue,  Basically, I have called
> > BufFileFlush in all the cases.  But, ideally, we can not call this if
> > the underlying files are deleted/truncated because those files/blocks
> > might not exist now.  So I think if the truncate position is within
> > the same buffer we just need to adjust the buffer,  otherwise we just
> > need to set the currFile and currOffset to the absolute number and set
> > the pos and nbytes 0.  Attached patch fixes this issue.
> >
>
> Few comments on the latest patch v50-0001-Extend-the-BufFile-interface
> 1.
> +
> + /*
> + * If the truncate point is within existing buffer then we can just
> + * adjust pos-within-buffer, without flushing buffer.  Otherwise,
> + * we don't need to do anything because we have already deleted/truncated
> + * the underlying files.
> + */
> + if (curFile == file->curFile &&
> + curOffset >= file->curOffset &&
> + curOffset <= file->curOffset + file->nbytes)
> + {
> + file->pos = (int) (curOffset - file->curOffset);
> + return;
> + }
>
> I think in this case you have set the position correctly but what
> about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes'
> because the contents of the buffer are still valid but I don't think
> the same is true here.

Right, I think we need to set nbytes to the new file->pos, as shown below:

> + file->pos = (int) (curOffset - file->curOffset);
>  file->nbytes = file->pos


> 2.
> + int curFile = file->curFile;
> + off_t curOffset = file->curOffset;
>
> I find the previous naming (newFile, newOffset) was better as it
> distinguishes them from BufFile variables.

Ok

> 3.
> +void
> +SharedFileSetUnregister(SharedFileSet *input_fileset)
> +{
> ..
> + /* Delete all files in the set */
> + SharedFileSetDeleteAll(input_fileset);
> ..
> }
>
> I am not sure if this is completely correct because we call this
> function (SharedFileSetUnregister) from BufFileDeleteShared which
> would have already removed all the required files. This raises the
> question in my mind whether it is correct to call
> SharedFileSetUnregister from BufFileDeleteShared from the API
> perspective as one might not want to remove the entire fileset at that
> point of time. It will work for your use case (where while removing
> buffile you also want to remove the entire fileset) but not sure if it
> is generic enough. For your case, I wonder if we can directly call
> SharedFileSetDeleteAll and we can have a call like
> SharedFileSetUnregister which will be called from it.

Yeah, it makes more sense to me that we directly call
SharedFileSetDeleteAll instead of calling BufFileDeleteShared, and that
we call SharedFileSetUnregister from SharedFileSetDeleteAll.
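A rough sketch of the resulting call order (based only on the function names discussed in this thread, not on the final patch):

/*
 * Sketch: SharedFileSetDeleteAll() removes the directories created for the
 * fileset and, for a backend-local fileset, also unregisters it so that it
 * is not cleaned up a second time at proc exit.
 */
void
SharedFileSetDeleteAll(SharedFileSet *fileset)
{
	char		dirpath[MAXPGPATH];
	int			i;

	for (i = 0; i < fileset->ntablespaces; i++)
	{
		SharedFileSetPath(dirpath, fileset, fileset->tablespaces[i]);
		PathNameDeleteTemporaryDir(dirpath);
	}

	/* Proposed addition: drop it from the registered fileset list here. */
	SharedFileSetUnregister(fileset);
}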

I will make these changes and send the patch after some testing.




--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Aug 19, 2020 at 12:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > >
> > > In last patch v49-0001, there is one issue,  Basically, I have called
> > > BufFileFlush in all the cases.  But, ideally, we can not call this if
> > > the underlying files are deleted/truncated because those files/blocks
> > > might not exist now.  So I think if the truncate position is within
> > > the same buffer we just need to adjust the buffer,  otherwise we just
> > > need to set the currFile and currOffset to the absolute number and set
> > > the pos and nbytes 0.  Attached patch fixes this issue.
> > >
> >
> > Few comments on the latest patch v50-0001-Extend-the-BufFile-interface
> > 1.
> > +
> > + /*
> > + * If the truncate point is within existing buffer then we can just
> > + * adjust pos-within-buffer, without flushing buffer.  Otherwise,
> > + * we don't need to do anything because we have already deleted/truncated
> > + * the underlying files.
> > + */
> > + if (curFile == file->curFile &&
> > + curOffset >= file->curOffset &&
> > + curOffset <= file->curOffset + file->nbytes)
> > + {
> > + file->pos = (int) (curOffset - file->curOffset);
> > + return;
> > + }
> >
> > I think in this case you have set the position correctly but what
> > about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes'
> > because the contents of the buffer are still valid but I don't think
> > the same is true here.
> >
>
> I think you need to set 'nbytes' to curOffset as per your current
> patch as that is the new size of the file.
> --- a/src/backend/storage/file/buffile.c
> +++ b/src/backend/storage/file/buffile.c
> @@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno,
> off_t offset)
>                 curOffset <= file->curOffset + file->nbytes)
>         {
>                 file->pos = (int) (curOffset - file->curOffset);
> +               file->nbytes = (int) curOffset;
>                 return;
>         }
>
> Also, what about file 'numFiles', that can also change due to the
> removal of certain files, shouldn't that be also set in this case

Right, we need to set the numFile.  I will fix this as well.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Aug 19, 2020 at 1:35 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Aug 19, 2020 at 12:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > >
> > > > In last patch v49-0001, there is one issue,  Basically, I have called
> > > > BufFileFlush in all the cases.  But, ideally, we can not call this if
> > > > the underlying files are deleted/truncated because those files/blocks
> > > > might not exist now.  So I think if the truncate position is within
> > > > the same buffer we just need to adjust the buffer,  otherwise we just
> > > > need to set the currFile and currOffset to the absolute number and set
> > > > the pos and nbytes 0.  Attached patch fixes this issue.
> > > >
> > >
> > > Few comments on the latest patch v50-0001-Extend-the-BufFile-interface
> > > 1.
> > > +
> > > + /*
> > > + * If the truncate point is within existing buffer then we can just
> > > + * adjust pos-within-buffer, without flushing buffer.  Otherwise,
> > > + * we don't need to do anything because we have already deleted/truncated
> > > + * the underlying files.
> > > + */
> > > + if (curFile == file->curFile &&
> > > + curOffset >= file->curOffset &&
> > > + curOffset <= file->curOffset + file->nbytes)
> > > + {
> > > + file->pos = (int) (curOffset - file->curOffset);
> > > + return;
> > > + }
> > >
> > > I think in this case you have set the position correctly but what
> > > about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes'
> > > because the contents of the buffer are still valid but I don't think
> > > the same is true here.
> > >
> >
> > I think you need to set 'nbytes' to curOffset as per your current
> > patch as that is the new size of the file.
> > --- a/src/backend/storage/file/buffile.c
> > +++ b/src/backend/storage/file/buffile.c
> > @@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno,
> > off_t offset)
> >                 curOffset <= file->curOffset + file->nbytes)
> >         {
> >                 file->pos = (int) (curOffset - file->curOffset);
> > +               file->nbytes = (int) curOffset;
> >                 return;
> >         }
> >
> > Also, what about file 'numFiles', that can also change due to the
> > removal of certain files, shouldn't that be also set in this case
>
> Right, we need to set the numFile.  I will fix this as well.

I think there are a couple more problems in the truncate APIs.
Basically, if the curFile and curOffset are already smaller than the
truncate location, the truncate should not change them.  So the
truncate should only change the curFile and curOffset if it is
truncating the part of the file that the curFile or curOffset is
pointing into.  I will work on those along with your other comments and
submit the updated patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Aug 20, 2020 at 1:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Aug 19, 2020 at 1:35 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Aug 19, 2020 at 12:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > >
> > > > >
> > > > > In last patch v49-0001, there is one issue,  Basically, I have called
> > > > > BufFileFlush in all the cases.  But, ideally, we can not call this if
> > > > > the underlying files are deleted/truncated because those files/blocks
> > > > > might not exist now.  So I think if the truncate position is within
> > > > > the same buffer we just need to adjust the buffer,  otherwise we just
> > > > > need to set the currFile and currOffset to the absolute number and set
> > > > > the pos and nbytes 0.  Attached patch fixes this issue.
> > > > >
> > > >
> > > > Few comments on the latest patch v50-0001-Extend-the-BufFile-interface
> > > > 1.
> > > > +
> > > > + /*
> > > > + * If the truncate point is within existing buffer then we can just
> > > > + * adjust pos-within-buffer, without flushing buffer.  Otherwise,
> > > > + * we don't need to do anything because we have already deleted/truncated
> > > > + * the underlying files.
> > > > + */
> > > > + if (curFile == file->curFile &&
> > > > + curOffset >= file->curOffset &&
> > > > + curOffset <= file->curOffset + file->nbytes)
> > > > + {
> > > > + file->pos = (int) (curOffset - file->curOffset);
> > > > + return;
> > > > + }
> > > >
> > > > I think in this case you have set the position correctly but what
> > > > about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes'
> > > > because the contents of the buffer are still valid but I don't think
> > > > the same is true here.
> > > >
> > >
> > > I think you need to set 'nbytes' to curOffset as per your current
> > > patch as that is the new size of the file.
> > > --- a/src/backend/storage/file/buffile.c
> > > +++ b/src/backend/storage/file/buffile.c
> > > @@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno,
> > > off_t offset)
> > >                 curOffset <= file->curOffset + file->nbytes)
> > >         {
> > >                 file->pos = (int) (curOffset - file->curOffset);
> > > +               file->nbytes = (int) curOffset;
> > >                 return;
> > >         }
> > >
> > > Also, what about file 'numFiles', that can also change due to the
> > > removal of certain files, shouldn't that be also set in this case
> >
> > Right, we need to set the numFile.  I will fix this as well.
>
> I think there are a couple of more problems in the truncate APIs,
> basically, if the curFile and curOffset are already smaller than the
> truncate location the truncate should not change that.  So the
> truncate should only change the curFile and curOffset if it is
> truncating the part of the file where the curFile or curOffset is
> pointing.
>

Right, I think this can happen if one has changed those by BufFileSeek
before doing truncate. We should fix that case as well.

>  I will work on those along with your other comments and
> submit the updated patch.
>

Thanks.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 20, 2020 at 1:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Aug 19, 2020 at 1:35 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Wed, Aug 19, 2020 at 12:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > > >
> > > > > >
> > > > > > In last patch v49-0001, there is one issue,  Basically, I have called
> > > > > > BufFileFlush in all the cases.  But, ideally, we can not call this if
> > > > > > the underlying files are deleted/truncated because those files/blocks
> > > > > > might not exist now.  So I think if the truncate position is within
> > > > > > the same buffer we just need to adjust the buffer,  otherwise we just
> > > > > > need to set the currFile and currOffset to the absolute number and set
> > > > > > the pos and nbytes 0.  Attached patch fixes this issue.
> > > > > >
> > > > >
> > > > > Few comments on the latest patch v50-0001-Extend-the-BufFile-interface
> > > > > 1.
> > > > > +
> > > > > + /*
> > > > > + * If the truncate point is within existing buffer then we can just
> > > > > + * adjust pos-within-buffer, without flushing buffer.  Otherwise,
> > > > > + * we don't need to do anything because we have already deleted/truncated
> > > > > + * the underlying files.
> > > > > + */
> > > > > + if (curFile == file->curFile &&
> > > > > + curOffset >= file->curOffset &&
> > > > > + curOffset <= file->curOffset + file->nbytes)
> > > > > + {
> > > > > + file->pos = (int) (curOffset - file->curOffset);
> > > > > + return;
> > > > > + }
> > > > >
> > > > > I think in this case you have set the position correctly but what
> > > > > about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes'
> > > > > because the contents of the buffer are still valid but I don't think
> > > > > the same is true here.
> > > > >
> > > >
> > > > I think you need to set 'nbytes' to curOffset as per your current
> > > > patch as that is the new size of the file.
> > > > --- a/src/backend/storage/file/buffile.c
> > > > +++ b/src/backend/storage/file/buffile.c
> > > > @@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno,
> > > > off_t offset)
> > > >                 curOffset <= file->curOffset + file->nbytes)
> > > >         {
> > > >                 file->pos = (int) (curOffset - file->curOffset);
> > > > +               file->nbytes = (int) curOffset;
> > > >                 return;
> > > >         }
> > > >
> > > > Also, what about file 'numFiles', that can also change due to the
> > > > removal of certain files, shouldn't that be also set in this case
> > >
> > > Right, we need to set the numFile.  I will fix this as well.
> >
> > I think there are a couple of more problems in the truncate APIs,
> > basically, if the curFile and curOffset are already smaller than the
> > truncate location the truncate should not change that.  So the
> > truncate should only change the curFile and curOffset if it is
> > truncating the part of the file where the curFile or curOffset is
> > pointing.
> >
>
> Right, I think this can happen if one has changed those by BufFileSeek
> before doing truncate. We should fix that case as well.

Right.

> >  I will work on those along with your other comments and
> > submit the updated patch.

I have fixed this in the attached patch along with your other
comments.  I have also attached a contrib module that is just used for
testing the truncate API.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Right, I think this can happen if one has changed those by BufFileSeek
> > before doing truncate. We should fix that case as well.
>
> Right.
>
> > >  I will work on those along with your other comments and
> > > submit the updated patch.
>
> I have fixed this in the attached patch along with your other
> comments.  I have also attached a contrib module that is just used for
> testing the truncate API.
>

Few comments:
==============
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
{
..
+ if ((i != fileno || offset == 0) && i != 0)
+ {
+ SharedSegmentName(segment_name, file->name, i);
+ FileClose(file->files[i]);
+ if (!SharedFileSetDelete(file->fileset, segment_name, true))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not delete shared fileset \"%s\": %m",
+ segment_name)));
+ numFiles--;
+ newOffset = MAX_PHYSICAL_FILESIZE;
+
+ if (i == fileno)
+ newFile--;
+ }

Here, shouldn't it be i <= fileno? Because we need to move back the
curFile up to newFile whenever curFile is greater than newFile

2.
+ /*
+ * If the new location is smaller then the current location in file then
+ * we need to set the curFile and the curOffset to the new values and also
+ * reset the pos and nbytes.  Otherwise nothing to do.
+ */
+ else if ((newFile < file->curFile) ||
+ newOffset < file->curOffset + file->pos)
+ {
+ file->curFile = newFile;
+ file->curOffset = newOffset;
+ file->pos = 0;
+ file->nbytes = 0;
+ }

Shouldn't there be && instead of || because if newFile is greater than
curFile then there is no meaning to update it?

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Right, I think this can happen if one has changed those by BufFileSeek
> > > before doing truncate. We should fix that case as well.
> >
> > Right.
> >
> > > >  I will work on those along with your other comments and
> > > > submit the updated patch.
> >
> > I have fixed this in the attached patch along with your other
> > comments.  I have also attached a contrib module that is just used for
> > testing the truncate API.
> >
>
> Few comments:
> ==============
> +void
> +BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
> {
> ..
> + if ((i != fileno || offset == 0) && i != 0)
> + {
> + SharedSegmentName(segment_name, file->name, i);
> + FileClose(file->files[i]);
> + if (!SharedFileSetDelete(file->fileset, segment_name, true))
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not delete shared fileset \"%s\": %m",
> + segment_name)));
> + numFiles--;
> + newOffset = MAX_PHYSICAL_FILESIZE;
> +
> + if (i == fileno)
> + newFile--;
> + }
>
> Here, shouldn't it be i <= fileno? Because we need to move back the
> curFile up to newFile whenever curFile is greater than newFile
>
> 2.
> + /*
> + * If the new location is smaller then the current location in file then
> + * we need to set the curFile and the curOffset to the new values and also
> + * reset the pos and nbytes.  Otherwise nothing to do.
> + */
> + else if ((newFile < file->curFile) ||
> + newOffset < file->curOffset + file->pos)
> + {
> + file->curFile = newFile;
> + file->curOffset = newOffset;
> + file->pos = 0;
> + file->nbytes = 0;
> + }
>
> Shouldn't there be && instead of || because if newFile is greater than
> curFile then there is no meaning to update it?
>

Wait, actually, it is not clear to me which case the second condition
(newOffset < file->curOffset + file->pos) is trying to cover, so I
can't recommend anything for this. Can you please explain why you have
added the second condition to the above check?


-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Right, I think this can happen if one has changed those by BufFileSeek
> > > before doing truncate. We should fix that case as well.
> >
> > Right.
> >
> > > >  I will work on those along with your other comments and
> > > > submit the updated patch.
> >
> > I have fixed this in the attached patch along with your other
> > comments.  I have also attached a contrib module that is just used for
> > testing the truncate API.
> >
>
> Few comments:
> ==============
> +void
> +BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
> {
> ..
> + if ((i != fileno || offset == 0) && i != 0)
> + {
> + SharedSegmentName(segment_name, file->name, i);
> + FileClose(file->files[i]);
> + if (!SharedFileSetDelete(file->fileset, segment_name, true))
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not delete shared fileset \"%s\": %m",
> + segment_name)));
> + numFiles--;
> + newOffset = MAX_PHYSICAL_FILESIZE;
> +
> + if (i == fileno)
> + newFile--;
> + }
>
> Here, shouldn't it be i <= fileno? Because we need to move back the
> curFile up to newFile whenever curFile is greater than newFile
>

I think now I have understood why you have added this condition, but
a comment along the lines of "This is required to indicate that we
have removed the given fileno" would be better for future readers.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Right, I think this can happen if one has changed those by BufFileSeek
> > > before doing truncate. We should fix that case as well.
> >
> > Right.
> >
> > > >  I will work on those along with your other comments and
> > > > submit the updated patch.
> >
> > I have fixed this in the attached patch along with your other
> > comments.  I have also attached a contrib module that is just used for
> > testing the truncate API.
> >
>
> Few comments:
> ==============
> +void
> +BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
> {
> ..
> + if ((i != fileno || offset == 0) && i != 0)
> + {
> + SharedSegmentName(segment_name, file->name, i);
> + FileClose(file->files[i]);
> + if (!SharedFileSetDelete(file->fileset, segment_name, true))
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not delete shared fileset \"%s\": %m",
> + segment_name)));
> + numFiles--;
> + newOffset = MAX_PHYSICAL_FILESIZE;
> +
> + if (i == fileno)
> + newFile--;
> + }
>
> Here, shouldn't it be i <= fileno? Because we need to move back the
> curFile up to newFile whenever curFile is greater than newFile

+/* Loop over all the files upto the fileno which we want to truncate. */
+for (i = file->numFiles - 1; i >= fileno; i--)

Because the above loop already stops at fileno, I feel there is no
point in that check or in any assert.

> 2.
> + /*
> + * If the new location is smaller then the current location in file then
> + * we need to set the curFile and the curOffset to the new values and also
> + * reset the pos and nbytes.  Otherwise nothing to do.
> + */
> + else if ((newFile < file->curFile) ||
> + newOffset < file->curOffset + file->pos)
> + {
> + file->curFile = newFile;
> + file->curOffset = newOffset;
> + file->pos = 0;
> + file->nbytes = 0;
> + }
>
> Shouldn't there be && instead of || because if newFile is greater than
> curFile then there is no meaning to update it?

I think this condition is wrong; it should be:

else if ((newFile < file->curFile) ||
         ((newFile == file->curFile) &&
          (newOffset < file->curOffset + file->pos)))

Basically, either the new file is smaller, or, if it is the same file,
then the new offset should be smaller.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Fri, Aug 21, 2020 at 10:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > >
> > > > Right, I think this can happen if one has changed those by BufFileSeek
> > > > before doing truncate. We should fix that case as well.
> > >
> > > Right.
> > >
> > > > >  I will work on those along with your other comments and
> > > > > submit the updated patch.
> > >
> > > I have fixed this in the attached patch along with your other
> > > comments.  I have also attached a contrib module that is just used for
> > > testing the truncate API.
> > >
> >
> > Few comments:
> > ==============
> > +void
> > +BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
> > {
> > ..
> > + if ((i != fileno || offset == 0) && i != 0)
> > + {
> > + SharedSegmentName(segment_name, file->name, i);
> > + FileClose(file->files[i]);
> > + if (!SharedFileSetDelete(file->fileset, segment_name, true))
> > + ereport(ERROR,
> > + (errcode_for_file_access(),
> > + errmsg("could not delete shared fileset \"%s\": %m",
> > + segment_name)));
> > + numFiles--;
> > + newOffset = MAX_PHYSICAL_FILESIZE;
> > +
> > + if (i == fileno)
> > + newFile--;
> > + }
> >
> > Here, shouldn't it be i <= fileno? Because we need to move back the
> > curFile up to newFile whenever curFile is greater than newFile
> >
>
> I think now I have understood why you have added this condition but
> probably a comment on the lines "This is required to indicate that we
> have removed the given fileno" would be better for future readers.

Okay.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Fri, Aug 21, 2020 at 10:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > 2.
> > + /*
> > + * If the new location is smaller then the current location in file then
> > + * we need to set the curFile and the curOffset to the new values and also
> > + * reset the pos and nbytes.  Otherwise nothing to do.
> > + */
> > + else if ((newFile < file->curFile) ||
> > + newOffset < file->curOffset + file->pos)
> > + {
> > + file->curFile = newFile;
> > + file->curOffset = newOffset;
> > + file->pos = 0;
> > + file->nbytes = 0;
> > + }
> >
> > Shouldn't there be && instead of || because if newFile is greater than
> > curFile then there is no meaning to update it?
>
> I think this condition is wrong it should be,
>
> else if ((newFile < file->curFile) || ((newFile == file->curFile) &&
> (newOffset < file->curOffset + file->pos)
>
> Basically, either new file is smaller otherwise if it is the same
> then-new offset should be smaller.
>

I think we don't need to use file->pos for that, as that is relevant
only for the current buffer; otherwise, such a condition should
suffice. However, I was not happy with the way the code and
conditions were arranged in BufFileTruncateShared, so I have
re-arranged them and changed quite a few comments in that API. Apart
from that, I have updated the docs and run pgindent on the first
patch. Do let me know if you have any more comments on the first
patch.
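
For reference, the position adjustment now looks roughly like the
following (a sketch only; the attached patch is authoritative):

/*
 * After physically truncating the segment files, pull the current
 * read/write position back if it now points past the truncation point,
 * and invalidate the buffer contents.
 */
if (file->curFile > newFile ||
    (file->curFile == newFile && file->curOffset > newOffset))
{
    file->curFile = newFile;
    file->curOffset = newOffset;
    file->pos = 0;
    file->nbytes = 0;
}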

-- 
With Regards,
Amit Kapila.

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Fri, Aug 21, 2020 at 3:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Aug 21, 2020 at 10:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > 2.
> > > + /*
> > > + * If the new location is smaller then the current location in file then
> > > + * we need to set the curFile and the curOffset to the new values and also
> > > + * reset the pos and nbytes.  Otherwise nothing to do.
> > > + */
> > > + else if ((newFile < file->curFile) ||
> > > + newOffset < file->curOffset + file->pos)
> > > + {
> > > + file->curFile = newFile;
> > > + file->curOffset = newOffset;
> > > + file->pos = 0;
> > > + file->nbytes = 0;
> > > + }
> > >
> > > Shouldn't there be && instead of || because if newFile is greater than
> > > curFile then there is no meaning to update it?
> >
> > I think this condition is wrong it should be,
> >
> > else if ((newFile < file->curFile) || ((newFile == file->curFile) &&
> > (newOffset < file->curOffset + file->pos)
> >
> > Basically, either new file is smaller otherwise if it is the same
> > then-new offset should be smaller.
> >
>
> I think we don't need to use file->pos for that as that is required
> only for the current buffer, otherwise, such a condition should
> suffice the need. However, I was not happy with the way code and
> conditions were arranged in BufFileTruncateShared, so I have
> re-arranged them and change quite a few comments in that API. Apart
> from that I have updated the docs and ran pgindent for the first
> patch. Do let me know if you have any more comments on the first
> patch?

I have reviewed and tested the patch and the changes look fine to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Fri, Aug 21, 2020 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have reviewed and tested the patch and the changes look fine to me.
>

Thanks, I will push the next patch early next week (by Tuesday) unless
you or someone else has any more comments on it. The summary of the
patch (v52-0001-Extend-the-BufFile-interface, attached with my
previous email) I am planning to push is: "It extends the BufFile
interface to support temporary files that can be used by a single
backend when the corresponding files need to survive across the
transaction and need to be opened and closed multiple times. Such
files need to be created as a member of a SharedFileSet. We have
implemented the interface for BufFileTruncate to allow files to be
truncated up to a particular offset and extended the BufFileSeek API
to support the SEEK_END case. We have also added an option to provide
a mode while opening the shared BufFiles instead of always opening in
read-only mode. These enhancements in the BufFile interface are
required for the upcoming patch to allow the replication apply worker
to properly handle streamed in-progress transactions."
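
To make the intended usage concrete, here is a rough sketch of how
these pieces fit together for a backend-local file (xid, change,
fileno and offset stand in for caller state; details may differ
slightly from the final commit):

SharedFileSet fileset;
BufFile    *file;
char        name[MAXPGPATH];

/* backend-local fileset: no DSM segment involved */
SharedFileSetInit(&fileset, NULL);

/* create, write and close; the file survives past this transaction */
snprintf(name, sizeof(name), "xid-%u-changes", xid);
file = BufFileCreateShared(&fileset, name);
BufFileWrite(file, &change, sizeof(change));
BufFileClose(file);

/* later (possibly in another transaction): reopen for writing */
file = BufFileOpenShared(&fileset, name, O_RDWR);

/* either append at the end ... */
BufFileSeek(file, 0, 0, SEEK_END);

/* ... or throw away everything after a previously remembered point */
BufFileTruncateShared(file, fileno, offset);
BufFileClose(file);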

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Sat, Aug 22, 2020 at 8:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Aug 21, 2020 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have reviewed and tested the patch and the changes look fine to me.
> >
>
> Thanks, I will push the next patch early next week (by Tuesday) unless
> you or someone else has any more comments on it. The summary of the
> patch (v52-0001-Extend-the-BufFile-interface, attached with my
> previous email) I am planning to push is: "It extends the BufFile
> interface to support temporary files that can be used by the single
> backend when the corresponding files need to be survived across the
> transaction and need to be opened and closed multiple times. Such
> files need to be created as a member of a SharedFileSet. We have
> implemented the interface for BufFileTruncate to allow files to be
> truncated up to a particular offset and extended the BufFileSeek API
> to support SEEK_END case. We have also added an option to provide a
> mode while opening the shared BufFiles instead of always opening in
> read-only mode. These enhancements in BufFile interface are required
> for the upcoming patch to allow the replication apply worker, to
> properly handle streamed in-progress transactions."

While reviewing 0002, I realized that instead of using an individual
shared fileset for each transaction, we can use just one common shared
fileset.  We can create an individual buffile per transaction under
that one shared fileset, and whenever a transaction commits/aborts we
can just delete its buffile while the shared fileset stays.

I have attached a POC patch for this idea (see the sketch below), and
if we agree with this approach then I will prepare a final patch in a
couple of days.
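
Roughly, the idea is (a sketch with made-up names; the POC patch is
authoritative):

/* one fileset for the whole apply worker, set up only once */
static SharedFileSet *worker_fileset = NULL;

/* when a streamed transaction first needs to be serialized */
char        path[MAXPGPATH];

snprintf(path, sizeof(path), "%u-changes", xid);
stream_fd = BufFileCreateShared(worker_fileset, path);

/*
 * On commit/abort of that transaction, drop only its buffile; the
 * shared fileset itself stays around for later transactions.
 */
BufFileDeleteShared(worker_fileset, path);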

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Aug 24, 2020 at 9:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Aug 22, 2020 at 8:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Aug 21, 2020 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > I have reviewed and tested the patch and the changes look fine to me.
> > >
> >
> > Thanks, I will push the next patch early next week (by Tuesday) unless
> > you or someone else has any more comments on it. The summary of the
> > patch (v52-0001-Extend-the-BufFile-interface, attached with my
> > previous email) I am planning to push is: "It extends the BufFile
> > interface to support temporary files that can be used by the single
> > backend when the corresponding files need to be survived across the
> > transaction and need to be opened and closed multiple times. Such
> > files need to be created as a member of a SharedFileSet. We have
> > implemented the interface for BufFileTruncate to allow files to be
> > truncated up to a particular offset and extended the BufFileSeek API
> > to support SEEK_END case. We have also added an option to provide a
> > mode while opening the shared BufFiles instead of always opening in
> > read-only mode. These enhancements in BufFile interface are required
> > for the upcoming patch to allow the replication apply worker, to
> > properly handle streamed in-progress transactions."
>
> While reviewing 0002, I realized that instead of using individual
> shared fileset for each transaction, we can use just one common shared
> file set.  We can create individual buffile under one shared fileset
> and whenever a transaction commits/aborts we can just delete its
> buffile and the shared fileset can stay.
>

I think the existing design is superior as it allows the flexibility
to create transaction files in different temp_tablespaces, which is
quite important to consider given that the files will be created only
for large transactions. Once we fix the sharedfileset for a worker,
all the files will be created in the temp_tablespaces chosen at the
time the apply worker first creates it, even if the setting is changed
at some later point of time (the user can change its value and then
reload the config, which I think will affect the worker settings as
well). This all happens because we set the tablespaces at the time of
SharedFileSetInit.

The other, relatively smaller, thing which I don't like is that we
would always need to create a buffile for the subxacts even though we
may not need it. We might be able to find some solution for this, but
I guess the previous point is what bothers me more.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, Aug 25, 2020 at 9:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Aug 24, 2020 at 9:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Sat, Aug 22, 2020 at 8:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Aug 21, 2020 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > I have reviewed and tested the patch and the changes look fine to me.
> > > >
> > >
> > > Thanks, I will push the next patch early next week (by Tuesday) unless
> > > you or someone else has any more comments on it. The summary of the
> > > patch (v52-0001-Extend-the-BufFile-interface, attached with my
> > > previous email) I am planning to push is: "It extends the BufFile
> > > interface to support temporary files that can be used by the single
> > > backend when the corresponding files need to be survived across the
> > > transaction and need to be opened and closed multiple times. Such
> > > files need to be created as a member of a SharedFileSet. We have
> > > implemented the interface for BufFileTruncate to allow files to be
> > > truncated up to a particular offset and extended the BufFileSeek API
> > > to support SEEK_END case. We have also added an option to provide a
> > > mode while opening the shared BufFiles instead of always opening in
> > > read-only mode. These enhancements in BufFile interface are required
> > > for the upcoming patch to allow the replication apply worker, to
> > > properly handle streamed in-progress transactions."
> >
> > While reviewing 0002, I realized that instead of using individual
> > shared fileset for each transaction, we can use just one common shared
> > file set.  We can create individual buffile under one shared fileset
> > and whenever a transaction commits/aborts we can just delete its
> > buffile and the shared fileset can stay.
> >
>
> I think the existing design is superior as it allows the flexibility
> to create transaction files in different temp_tablespaces which is
> quite important to consider as we know the files will be created only
> for large transactions. Once we fix the sharedfileset for a worker all
> the files will be created in the temp_tablespaces chosen for the first
> time apply worker creates it even if it got changed at some later
> point of time (user can change its value and then do reload config
> which I think will impact the worker settings as well). This all can
> happen because we set the tablespaces at the time of
> SharedFileSetInit.

Yeah, I agree with this point,  that if we use the single shared
fileset then it will always use the same tablespace for all the
streaming transactions.  And, we might get the benefit of concurrent
I/O if we use different tablespaces as we are not immediately flushing
the files to the disk.

> The other relatively smaller thing which I don't like is that we
> always need to create a buffile for subxact even though we don't need
> it. We might be able to find some solution for this but I guess the
> previous point is what bothers me more.

Yeah, if we go this way we might need to find some solution to this.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, Aug 25, 2020 at 10:41 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Aug 25, 2020 at 9:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > I think the existing design is superior as it allows the flexibility
> > to create transaction files in different temp_tablespaces which is
> > quite important to consider as we know the files will be created only
> > for large transactions. Once we fix the sharedfileset for a worker all
> > the files will be created in the temp_tablespaces chosen for the first
> > time apply worker creates it even if it got changed at some later
> > point of time (user can change its value and then do reload config
> > which I think will impact the worker settings as well). This all can
> > happen because we set the tablespaces at the time of
> > SharedFileSetInit.
>
> Yeah, I agree with this point,  that if we use the single shared
> fileset then it will always use the same tablespace for all the
> streaming transactions.  And, we might get the benefit of concurrent
> I/O if we use different tablespaces as we are not immediately flushing
> the files to the disk.
>

Okay, so let's retain the original approach then. I have made a few
cosmetic modifications in the first two patches, which include updating
docs and comments, slightly modifying the commit message, and changing
the code to match the nearby code. One change on which you might have a
different opinion is below:

+ case WAIT_EVENT_LOGICAL_CHANGES_READ:
+ event_name = "ReorderLogicalChangesRead";
+ break;
+ case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+ event_name = "ReorderLogicalChangesWrite";
+ break;
+ case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+ event_name = "ReorderLogicalSubxactRead";
+ break;
+ case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+ event_name = "ReorderLogicalSubxactWrite";
+ break;

Why do we want these event names to start with Reorder*? I think these
are used on the subscriber side, so there is no need to use the word
Reorder, and I have removed it in the attached patch. I am planning
to push the first patch (v53-0001-Extend-the-BufFile-interface) in
this series tomorrow unless you have any comments on the same.
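
Concretely, with the Reorder prefix dropped, the arms above would read
along these lines (sketch; the attached patch is authoritative):

case WAIT_EVENT_LOGICAL_CHANGES_READ:
    event_name = "LogicalChangesRead";
    break;
case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
    event_name = "LogicalChangesWrite";
    break;
case WAIT_EVENT_LOGICAL_SUBXACT_READ:
    event_name = "LogicalSubxactRead";
    break;
case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
    event_name = "LogicalSubxactWrite";
    break;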

-- 
With Regards,
Amit Kapila.

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, Aug 25, 2020 at 6:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Aug 25, 2020 at 10:41 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Aug 25, 2020 at 9:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > I think the existing design is superior as it allows the flexibility
> > > to create transaction files in different temp_tablespaces which is
> > > quite important to consider as we know the files will be created only
> > > for large transactions. Once we fix the sharedfileset for a worker all
> > > the files will be created in the temp_tablespaces chosen for the first
> > > time apply worker creates it even if it got changed at some later
> > > point of time (user can change its value and then do reload config
> > > which I think will impact the worker settings as well). This all can
> > > happen because we set the tablespaces at the time of
> > > SharedFileSetInit.
> >
> > Yeah, I agree with this point,  that if we use the single shared
> > fileset then it will always use the same tablespace for all the
> > streaming transactions.  And, we might get the benefit of concurrent
> > I/O if we use different tablespaces as we are not immediately flushing
> > the files to the disk.
> >
>
> Okay, so let's retain the original approach then. I have made a few
> cosmetic modifications in the first two patches which include updating
> docs, comments, slightly modify the commit message, and change the
> code to match the nearby code. One change which you might have a
> different opinion is below:
>
> + case WAIT_EVENT_LOGICAL_CHANGES_READ:
> + event_name = "ReorderLogicalChangesRead";
> + break;
> + case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
> + event_name = "ReorderLogicalChangesWrite";
> + break;
> + case WAIT_EVENT_LOGICAL_SUBXACT_READ:
> + event_name = "ReorderLogicalSubxactRead";
> + break;
> + case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
> + event_name = "ReorderLogicalSubxactWrite";
> + break;
>
> Why do we want to name these events starting with name as Reorder*? I
> think these are used in subscriber-side, so no need to use the word
> Reorder, so I have removed it from the attached patch. I am planning
> to push the first patch (v53-0001-Extend-the-BufFile-interface) in
> this series tomorrow unless you have any comments on the same.

Your changes in 0001 and 0002 look fine to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Jeff Janes
Дата:

On Tue, Aug 25, 2020 at 8:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
 
> I am planning
> to push the first patch (v53-0001-Extend-the-BufFile-interface) in
> this series tomorrow unless you have any comments on the same.


I'm getting compiler warnings now, src/backend/storage/file/sharedfileset.c line 288 needs to be:

bool        found PG_USED_FOR_ASSERTS_ONLY = false;

Cheers,

Jeff

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Wed, Aug 26, 2020 at 11:22 PM Jeff Janes <jeff.janes@gmail.com> wrote:
>
>
> On Tue, Aug 25, 2020 at 8:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>>
>>  I am planning
>> to push the first patch (v53-0001-Extend-the-BufFile-interface) in
>> this series tomorrow unless you have any comments on the same.
>
>
>
> I'm getting compiler warnings now, src/backend/storage/file/sharedfileset.c line 288 needs to be:
>
> bool        found PG_USED_FOR_ASSERTS_ONLY = false;
>

Thanks for the report. Tom Lane has already fixed this [1].

[1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e942af7b8261cd8070d0eeaf518dbc1a664859fd

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Thu, Aug 27, 2020 at 11:16 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Aug 26, 2020 at 11:22 PM Jeff Janes <jeff.janes@gmail.com> wrote:
> >
> >
> > On Tue, Aug 25, 2020 at 8:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >>
> >>  I am planning
> >> to push the first patch (v53-0001-Extend-the-BufFile-interface) in
> >> this series tomorrow unless you have any comments on the same.
> >
> >
> >
> > I'm getting compiler warnings now, src/backend/storage/file/sharedfileset.c line 288 needs to be:
> >
> > bool        found PG_USED_FOR_ASSERTS_ONLY = false;
> >
>
> Thanks for the report. Tom Lane has already fixed this [1].
>
> [1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e942af7b8261cd8070d0eeaf518dbc1a664859fd

As discussed, I have added another test case covering the out-of-order
subtransaction rollback scenario.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Neha Sharma
Дата:
Hi,

I have done a code coverage analysis on the latest patches (v53), and below is the report.
The files where the coverage changed across the three builds are the ones to note.

OS: Ubuntu 18.04
Patch applied on commit : 77c7267c37f7fa8e5e48abda4798afdbecb2b95a

Coverage is shown as %Line / %Function for three builds:
  (A) without the logical decoding patch
  (B) with the v53 (2,3,4,5) patches applied
  (C) with the v53 patches applied, excluding 0003

File Name | (A) %Line / %Function | (B) %Line / %Function | (C) %Line / %Function
src/backend/access/transam/xact.c | 86.2 / 92.9 | 86.2 / 92.9 | 86.2 / 92.9
src/backend/access/transam/xloginsert.c | 90.2 / 94.1 | 90.2 / 94.1 | 90.2 / 94.1
src/backend/access/transam/xlogreader.c | 73.3 / 93.3 | 73.8 / 93.3 | 73.8 / 93.3
src/backend/replication/logical/decode.c | 93.4 / 100 | 93.4 / 100 | 93.4 / 100
src/backend/access/rmgrdesc/xactdesc.c | 54.4 / 63.6 | 54.4 / 63.6 | 54.4 / 63.6
src/backend/replication/logical/reorderbuffer.c | 93.4 / 96.7 | 93.4 / 96.7 | 93.4 / 96.7
src/backend/utils/cache/inval.c | 98.1 / 100 | 98.1 / 100 | 98.1 / 100
contrib/test_decoding/test_decoding.c | 86.8 / 95.2 | 86.8 / 95.2 | 86.8 / 95.2
src/backend/replication/logical/logical.c | 90.9 / 93.5 | 90.9 / 93.5 | 91.8 / 93.5
src/backend/access/heap/heapam.c | 86.1 / 94.5 | 86.1 / 94.5 | 86.1 / 94.5
src/backend/access/index/genam.c | 90.7 / 91.7 | 91.2 / 91.7 | 91.2 / 91.7
src/backend/access/table/tableam.c | 90.6 / 100 | 90.6 / 100 | 90.6 / 100
src/backend/utils/time/snapmgr.c | 81.1 / 98.1 | 80.2 / 98.1 | 81.1 / 98.1
src/include/access/tableam.h | 92.5 / 100 | 92.5 / 100 | 92.5 / 100
src/backend/access/heap/heapam_visibility.c | 77.8 / 100 | 77.8 / 100 | 77.8 / 100
src/backend/replication/walsender.c | 90.5 / 97.8 | 90.5 / 97.8 | 90.9 / 100
src/backend/catalog/pg_subscription.c | 96 / 100 | 96 / 100 | 96 / 100
src/backend/commands/subscriptioncmds.c | 93.2 / 90 | 92.7 / 90 | 92.7 / 90
src/backend/postmaster/pgstat.c | 64.2 / 85.1 | 63.9 / 85.1 | 64.6 / 86.1
src/backend/replication/libpqwalreceiver/libpqwalreceiver.c | 82.4 / 95 | 82.5 / 95 | 83.6 / 95
src/backend/replication/logical/proto.c | 93.5 / 91.3 | 93.7 / 93.3 | 93.7 / 93.3
src/backend/replication/logical/worker.c | 91.6 / 96 | 91.5 / 97.4 | 91.9 / 97.4
src/backend/replication/pgoutput/pgoutput.c | 81.9 / 100 | 85.5 / 100 | 86.2 / 100
src/backend/replication/slotfuncs.c | 93 / 93.8 | 93 / 93.8 | 93 / 93.8
src/include/pgstat.h | 100 / - | 100 / - | 100 / -
src/backend/replication/logical/logicalfuncs.c | 87.1 / 90 | 87.1 / 90 | 87.1 / 90
src/backend/storage/file/buffile.c | 68.3 / 85 | 69.6 / 85 | 69.6 / 85
src/backend/storage/file/fd.c | 81.1 / 93 | 81.1 / 93 | 81.1 / 93
src/backend/storage/file/sharedfileset.c | 77.7 / 90.9 | 93.2 / 100 | 93.2 / 100
src/backend/utils/sort/logtape.c | 94.4 / 100 | 94.4 / 100 | 94.4 / 100
src/backend/utils/sort/sharedtuplestore.c | 90.1 / 90.9 | 90.1 / 90.9 | 90.1 / 90.9

Thanks.
--
Regards,
Neha Sharma


On Thu, Aug 27, 2020 at 11:16 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 26, 2020 at 11:22 PM Jeff Janes <jeff.janes@gmail.com> wrote:
>
>
> On Tue, Aug 25, 2020 at 8:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>>
>>  I am planning
>> to push the first patch (v53-0001-Extend-the-BufFile-interface) in
>> this series tomorrow unless you have any comments on the same.
>
>
>
> I'm getting compiler warnings now, src/backend/storage/file/sharedfileset.c line 288 needs to be:
>
> bool        found PG_USED_FOR_ASSERTS_ONLY = false;
>

Thanks for the report. Tom Lane has already fixed this [1].

[1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e942af7b8261cd8070d0eeaf518dbc1a664859fd

--
With Regards,
Amit Kapila.


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Fri, Aug 28, 2020 at 2:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> As discussed, I have added a another test case for covering the out of
> order subtransaction rollback scenario.
>

+# large (streamed) transaction with out of order subtransaction ROLLBACKs
+$node_publisher->safe_psql('postgres', q{

How about writing a comment as: "large (streamed) transaction with
subscriber receiving out of order subtransaction ROLLBACKs"?

I have reviewed and modified a number of things in the attached patch:
1. In apply_handle_origin, improved the check for streamed xacts.
2. In apply_handle_stream_commit() while applying changes in the loop,
added CHECK_FOR_INTERRUPTS.
3. In DEBUG messages, print the path with double-quotes as we are
doing in all other places.
4.
+ /*
+ * Exit if streaming option is changed. The launcher will start new
+ * worker.
+ */
+ if (newsub->stream != MySubscription->stream)
+ {
+ ereport(LOG,
+ (errmsg("logical replication apply worker for subscription \"%s\" will "
+ "restart because subscription's streaming option were changed",
+ MySubscription->name)));
+
+ proc_exit(0);
+ }
+
We don't need a separate check like this. I have merged this into one
of the existing checks.
5.
subxact_info_write()
{
..
+ if (subxact_data.nsubxacts == 0)
+ {
+ if (ent->subxact_fileset)
+ {
+ cleanup_subxact_info();
+ BufFileDeleteShared(ent->subxact_fileset, path);
+ pfree(ent->subxact_fileset);
+ ent->subxact_fileset = NULL;
+ }

I don't think it is right to use the BufFileDeleteShared interface here,
because it won't perform SharedFileSetUnregister. That means if the
server exits after the above code is executed, it will crash in
SharedFileSetDeleteOnProcExit, which will try to access the already
deleted fileset entry. Fixed this by calling SharedFileSetDeleteAll()
instead. Another related problem is that SharedFileSetDeleteOnProcExit
tries to delete the list element while traversing the list with the
'foreach' construct, which makes the behavior of the list traversal
unpredictable (see the sketch after these numbered comments). I have
fixed this in a separate patch, v54-0001-Fix-the-SharedFileSetUnregister-API;
if you are fine with this, I would like to commit it as it fixes a
problem in the existing commit 808e13b282.
6. Function stream_cleanup_files() contains a missing_ok argument
which is not used, so I removed it.
7. In pgoutput.c, change the ordering of functions to make them
consistent with their declaration.
8.
typedef struct RelationSyncEntry
 {
  Oid relid; /* relation oid */
+ TransactionId xid; /* transaction that created the record */

Removed above parameter as this doesn't seem to be required as per the
new design in the patch.

Apart from the above, I have added/changed quite a few comments and
made a few other cosmetic changes. Kindly review and let me know what
you think about the changes.
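
To be concrete about the SharedFileSetDeleteOnProcExit issue in comment
5 above, the shape of the fix I have in mind is roughly the following
sketch (v54-0001 is authoritative):

static void
SharedFileSetDeleteOnProcExit(int status, Datum arg)
{
    /*
     * SharedFileSetDeleteAll() ends up unregistering the fileset, which
     * removes it from filesetlist, so instead of walking the list with
     * foreach, keep consuming the head until the list is empty.
     */
    while (filesetlist != NIL)
    {
        SharedFileSet *fileset = (SharedFileSet *) linitial(filesetlist);

        SharedFileSetDeleteAll(fileset);
    }
}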

One more comment for which I haven't done anything yet.
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+ MemoryContext oldctx;
+
+ oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+ entry->streamed_txns = lappend_int(entry->streamed_txns, xid);

Is it a good idea to append an xid with lappend_int? Won't we need
something equivalent for uint32? If so, I think we have a couple of
options: (a) use the lcons method and append a pointer to the xid (I
think we would need to allocate memory for the xid if we want to use
this idea), or (b) use an array instead. What do you think?

-- 
With Regards,
Amit Kapila.

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Sat, Aug 29, 2020 at 5:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Aug 28, 2020 at 2:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > As discussed, I have added a another test case for covering the out of
> > order subtransaction rollback scenario.
> >
>
> +# large (streamed) transaction with out of order subtransaction ROLLBACKs
> +$node_publisher->safe_psql('postgres', q{
>
> How about writing a comment as: "large (streamed) transaction with
> subscriber receiving out of order subtransaction ROLLBACKs"?

I have fixed this and merged it with 0002.

> I have reviewed and modified the number of things in the attached patch:
> 1. In apply_handle_origin, improved the check streamed xacts.
> 2. In apply_handle_stream_commit() while applying changes in the loop,
> added CHECK_FOR_INTERRUPTS.
> 3. In DEBUG messages, print the path with double-quotes as we are
> doing in all other places.
> 4.
> + /*
> + * Exit if streaming option is changed. The launcher will start new
> + * worker.
> + */
> + if (newsub->stream != MySubscription->stream)
> + {
> + ereport(LOG,
> + (errmsg("logical replication apply worker for subscription \"%s\" will "
> + "restart because subscription's streaming option were changed",
> + MySubscription->name)));
> +
> + proc_exit(0);
> + }
> +
> We don't need a separate check like this. I have merged this into one
> of the existing checks.
> 5.
> subxact_info_write()
> {
> ..
> + if (subxact_data.nsubxacts == 0)
> + {
> + if (ent->subxact_fileset)
> + {
> + cleanup_subxact_info();
> + BufFileDeleteShared(ent->subxact_fileset, path);
> + pfree(ent->subxact_fileset);
> + ent->subxact_fileset = NULL;
> + }
>
> I don't think it is right to use BufFileDeleteShared interface here
> because it won't perform SharedFileSetUnregister which means if after
> above code execution is the server exits it will crash in
> SharedFileSetDeleteOnProcExit which will try to access already deleted
> fileset entry. Fixed this by calling SharedFileSetDeleteAll() instead.
> The another related problem is that in function
> SharedFileSetDeleteOnProcExit, it tries to delete the list element
> while traversing the list with 'foreach' construct which makes the
> behavior of list traversal unpredictable. I have fixed this in a
> separate patch v54-0001-Fix-the-SharedFileSetUnregister-API, if you
> are fine with this, I would like to commit this as this fixes a
> problem in the existing commit 808e13b282.
> 6. Function stream_cleanup_files() contains a missing_ok argument
> which is not used so removed it.
> 7. In pgoutput.c, change the ordering of functions to make them
> consistent with their declaration.
> 8.
> typedef struct RelationSyncEntry
>  {
>   Oid relid; /* relation oid */
> + TransactionId xid; /* transaction that created the record */
>
> Removed above parameter as this doesn't seem to be required as per the
> new design in the patch.
>
> Apart from above, I have added/changed quite a few comments and a few
> other cosmetic changes. Kindly review and let me know what do you
> think about the changes?

I have reviewed your changes and they look fine to me. And the bug fix
in 0001 also looks fine.

> One more comment for which I haven't done anything yet.
> +static void
> +set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
> +{
> + MemoryContext oldctx;
> +
> + oldctx = MemoryContextSwitchTo(CacheMemoryContext);
> +
> + entry->streamed_txns = lappend_int(entry->streamed_txns, xid);

> Is it a good idea to append xid with lappend_int? Won't we need
> something equivalent for uint32? If so, I think we have a couple of
> options (a) use lcons method and accordingly append the pointer to
> xid, I think we need to allocate memory for xid if we want to use this
> idea or (b) use an array instead. What do you think?

BTW, OID is internally mapped to uint32, but using lappend_oid might
not look good. So maybe we can provide a lappend_uint32 option?
Using an array is also not a bad idea, but providing a lappend_uint32
option looks more appealing to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Sun, Aug 30, 2020 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Aug 29, 2020 at 5:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > One more comment for which I haven't done anything yet.
> > +static void
> > +set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
> > +{
> > + MemoryContext oldctx;
> > +
> > + oldctx = MemoryContextSwitchTo(CacheMemoryContext);
> > +
> > + entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
>
> > Is it a good idea to append xid with lappend_int? Won't we need
> > something equivalent for uint32? If so, I think we have a couple of
> > options (a) use lcons method and accordingly append the pointer to
> > xid, I think we need to allocate memory for xid if we want to use this
> > idea or (b) use an array instead. What do you think?
>
> BTW, OID is internally mapped to uint32,  but using lappend_oid might
> not look good.  So maybe we can provide an option for lappend_uint32?
> Using an array is also not a bad idea.  Providing lappend_uint32
> option looks more appealing to me.
>

I thought about this again and I feel it might be okay to use it for
our case: after storing it in a T_IntList, we primarily fetch it for
comparison with TransactionId (uint32), so this shouldn't create any
problem. I feel we can just discuss this in a separate thread and
check the opinion of others; what do you think?
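
For example, the lookup counterpart (get_schema_sent_in_streamed_txn in
the patch) can simply cast back when comparing; a sketch, assuming
streamed_txns stays a plain integer list:

static bool
get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
{
    ListCell   *lc;

    foreach(lc, entry->streamed_txns)
    {
        /* stored via lappend_int(), so cast back to the xid type */
        if (xid == (uint32) lfirst_int(lc))
            return true;
    }

    return false;
}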

Another comment:

+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+ HASH_SEQ_STATUS hash_seq;
+ RelationSyncEntry *entry;
+
+ Assert(RelationSyncCache != NULL);
+
+ hash_seq_init(&hash_seq, RelationSyncCache);
+ while ((entry = hash_seq_search(&hash_seq)) != NULL)
+ {
+ if (is_commit)
+ entry->schema_sent = true;

How is it correct to set 'entry->schema_sent' for all the entries in
RelationSyncCache? Consider a case where due to invalidation in an
unrelated transaction we have set the flag schema_sent for a
particular relation 'r1' as 'false' and that transaction is executed
before the current streamed transaction for which we are performing
commit and called this function. It will set the flag for unrelated
entry in this case 'r1' which doesn't seem correct to me. Or, if this
is correct, it would be a good idea to write some comments about it.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Aug 31, 2020 at 10:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Aug 30, 2020 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
>
> Another comment:
>
> +cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
> +{
> + HASH_SEQ_STATUS hash_seq;
> + RelationSyncEntry *entry;
> +
> + Assert(RelationSyncCache != NULL);
> +
> + hash_seq_init(&hash_seq, RelationSyncCache);
> + while ((entry = hash_seq_search(&hash_seq)) != NULL)
> + {
> + if (is_commit)
> + entry->schema_sent = true;
>
> How is it correct to set 'entry->schema_sent' for all the entries in
> RelationSyncCache? Consider a case where due to invalidation in an
> unrelated transaction we have set the flag schema_sent for a
> particular relation 'r1' as 'false' and that transaction is executed
> before the current streamed transaction for which we are performing
> commit and called this function. It will set the flag for unrelated
> entry in this case 'r1' which doesn't seem correct to me. Or, if this
> is correct, it would be a good idea to write some comments about it.
>

Few more comments:
1.
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr
application_name=$appname' PUBLICATION tap_pub"
+);

In most of the tests, we are using the above statement to create a
subscription. Don't we need (streaming = 'on') parameter while
creating a subscription? Is there a reason for not doing so in this
patch itself?

2.
009_stream_simple.pl
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});

How much above this data is 64kB limit? I just wanted to see that it
should not be on borderline and then due to some alignment issues the
streaming doesn't happen on some machines? Also, how such a test
ensures that the streaming has happened because the way we are
checking results, won't it be the same for the non-streaming case as
well?

3.
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*),
count(extract(epoch from c) = 987654321), count(d = 999) FROM
test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally
changed data');

Again, how this test is relevant to streaming mode?

4. I have checked that in one of the previous patches, we have a test
v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case
quite similar to what we have in
v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort.
If there is any difference that can cover more scenarios then can we
consider merging them into one test?

Apart from the above, I have made a few changes in the attached patch
which are mainly to simplify the code at one place, added/edited few
comments, some other cosmetic changes, and renamed the test case files
as the initials of their name were matching other tests in the similar
directory.

-- 
With Regards,
Amit Kapila.

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Aug 31, 2020 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>
> 2.
> 009_stream_simple.pl
> +# Insert, update and delete enough rows to exceed the 64kB limit.
> +$node_publisher->safe_psql('postgres', q{
> +BEGIN;
> +INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
> +UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
> +DELETE FROM test_tab WHERE mod(a,3) = 0;
> +COMMIT;
> +});
>
> How much above this data is 64kB limit? I just wanted to see that it
> should not be on borderline and then due to some alignment issues the
> streaming doesn't happen on some machines?
>

I think we should find similar information for other tests added by
the patch as well.

Few other comments:
===================
+sub wait_for_caught_up
+{
+ my ($node, $appname) = @_;
+
+ $node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication
WHERE application_name = '$appname';"
+ ) or die "Timed ou

The patch has added this in all the test files. If it is used in so
many tests then we need to add it in some generic place
(PostgresNode.pm), but actually, I am not sure if we need it at all.
Why can't the existing wait_for_catchup in PostgresNode.pm serve the
same purpose?

2.
In system_views.sql,

-- All columns of pg_subscription except subconninfo are readable.
REVOKE ALL ON pg_subscription FROM public;
GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary,
subslotname, subpublications)
    ON pg_subscription TO public;

Here, we need to update for substream column as well.

3. Update describeSubscriptions() to show the 'substream' value in \dRs.

4. Also, let's add a few tests in subscription.sql, as we did when we
added the 'binary' option in commit 9de77b5453.

5. I think we can merge the pg_dump related changes (the last version
posted in the mail thread is v53-0005-Add-streaming-option-in-pg_dump)
into the main patch. One minor comment on the pg_dump related changes:
@@ -4358,6 +4369,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
  if (strcmp(subinfo->subbinary, "t") == 0)
  appendPQExpBuffer(query, ", binary = true");

+ if (strcmp(subinfo->substream, "f") != 0)
+ appendPQExpBuffer(query, ", streaming = on");
  if (strcmp(subinfo->subsynccommit, "off") != 0)
  appendPQExpBuffer(query, ", synchronous_commit = %s",
fmtId(subinfo->subsynccommit));

Keep one blank line between the substream and subsynccommit option code
to keep it consistent with the nearby code.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Aug 31, 2020 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Aug 31, 2020 at 10:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sun, Aug 30, 2020 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> >
> > Another comment:
> >
> > +cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
> > +{
> > + HASH_SEQ_STATUS hash_seq;
> > + RelationSyncEntry *entry;
> > +
> > + Assert(RelationSyncCache != NULL);
> > +
> > + hash_seq_init(&hash_seq, RelationSyncCache);
> > + while ((entry = hash_seq_search(&hash_seq)) != NULL)
> > + {
> > + if (is_commit)
> > + entry->schema_sent = true;
> >
> > How is it correct to set 'entry->schema_sent' for all the entries in
> > RelationSyncCache? Consider a case where due to invalidation in an
> > unrelated transaction we have set the flag schema_sent for a
> > particular relation 'r1' as 'false' and that transaction is executed
> > before the current streamed transaction for which we are performing
> > commit and called this function. It will set the flag for unrelated
> > entry in this case 'r1' which doesn't seem correct to me. Or, if this
> > is correct, it would be a good idea to write some comments about it.

Yeah, this is wrong. I have fixed this issue in the attached patch
and also added a new test for the same.
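
The shape of the fix is to touch only those entries whose streamed_txns
list actually contains the given xid, roughly as below (a sketch; the
attached patch is authoritative):

static void
cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
{
    HASH_SEQ_STATUS hash_seq;
    RelationSyncEntry *entry;
    ListCell   *lc;

    Assert(RelationSyncCache != NULL);

    hash_seq_init(&hash_seq, RelationSyncCache);
    while ((entry = hash_seq_search(&hash_seq)) != NULL)
    {
        foreach(lc, entry->streamed_txns)
        {
            if (xid == (uint32) lfirst_int(lc))
            {
                /* mark the schema as sent only if that xact committed */
                if (is_commit)
                    entry->schema_sent = true;

                /* either way, forget this xid for the relation entry */
                entry->streamed_txns =
                    foreach_delete_current(entry->streamed_txns, lc);
                break;
            }
        }
    }
}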

> Few more comments:
> 1.
> +my $appname = 'tap_sub';
> +$node_subscriber->safe_psql('postgres',
> +"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr
> application_name=$appname' PUBLICATION tap_pub"
> +);
>
> In most of the tests, we are using the above statement to create a
> subscription. Don't we need (streaming = 'on') parameter while
> creating a subscription? Is there a reason for not doing so in this
> patch itself?

I have changed this.

> 2.
> 009_stream_simple.pl
> +# Insert, update and delete enough rows to exceed the 64kB limit.
> +$node_publisher->safe_psql('postgres', q{
> +BEGIN;
> +INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
> +UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
> +DELETE FROM test_tab WHERE mod(a,3) = 0;
> +COMMIT;
> +});
>
> How much above this data is 64kB limit? I just wanted to see that it
> should not be on borderline and then due to some alignment issues the
> streaming doesn't happen on some machines? Also, how such a test
> ensures that the streaming has happened because the way we are
> checking results, won't it be the same for the non-streaming case as
> well?

Only for this case, or do you mean for all the tests?

> 3.
> +# Change the local values of the extra columns on the subscriber,
> +# update publisher, and check that subscriber retains the expected
> +# values
> +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
> 'epoch'::timestamptz + 987654321 * interval '1s'");
> +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
> +
> +wait_for_caught_up($node_publisher, $appname);
> +
> +$result =
> +  $node_subscriber->safe_psql('postgres', "SELECT count(*),
> count(extract(epoch from c) = 987654321), count(d = 999) FROM
> test_tab");
> +is($result, qq(3334|3334|3334), 'check extra columns contain locally
> changed data');
>
> Again, how this test is relevant to streaming mode?

I agree, it is not specific to the streaming.


> 4. I have checked that in one of the previous patches, we have a test
> v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case
> quite similar to what we have in
> v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort.
> If there is any difference that can cover more scenarios then can we
> consider merging them into one test?

I will have a look.

> Apart from the above, I have made a few changes in the attached patch
> which are mainly to simplify the code at one place, added/edited few
> comments, some other cosmetic changes, and renamed the test case files
> as the initials of their name were matching other tests in the similar
> directory.

The changes look fine to me except this:

+

+ /* the value must be on/off */
+ if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid streaming value")));
+
+ /* enable streaming if it's 'on' */
+ *enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);

I mean, for streaming, why do we need to handle it differently from the
other surrounding code, for example the "binary" option?
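
I would expect it to mirror the "binary" handling and just use
defGetBoolean(), something like the sketch below (streaming_given here
is a hypothetical local flag for duplicate detection):

else if (strcmp(defel->defname, "streaming") == 0)
{
    if (streaming_given)
        ereport(ERROR,
                (errcode(ERRCODE_SYNTAX_ERROR),
                 errmsg("conflicting or redundant options")));
    streaming_given = true;

    /* defGetBoolean() accepts on/off, true/false, etc. */
    *enable_streaming = defGetBoolean(defel);
}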

Apart from that, for testing 0001, I have added a new test in the
attached contrib module.



--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Neha Sharma
Дата:
Hi Amit/Dilip,

I have tested a few scenarios on top of the v56 patches, where the replication worker still had a few subtransactions in an uncommitted state and we restarted the publisher server.
No crash or data discrepancies were observed; the test scenarios verified are attached.

Data Setup:
Publication Server postgresql.conf:
wal_level = logical
max_wal_senders = 10
max_replication_slots = 15
wal_log_hints = on
hot_standby_feedback = on
wal_receiver_status_interval = 1
listen_addresses='*'
log_min_messages=debug1
wal_sender_timeout = 0
logical_decoding_work_mem=64kB

Subscription Server postgresql.conf:
wal_level = logical
max_wal_senders = 10
max_replication_slots = 15
wal_log_hints = on
hot_standby_feedback = on
wal_receiver_status_interval = 1
listen_addresses='*'
log_min_messages=debug1
wal_sender_timeout = 0
logical_decoding_work_mem=64kB
port=5433

Initial setup:
Publication Server:
create table t(a int PRIMARY KEY ,b text);
CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
create publication test_pub for table t with(PUBLISH='insert,delete,update,truncate');
alter table t replica identity FULL ;
insert into t values (generate_series(1,20),large_val()) ON CONFLICT (a) DO UPDATE SET a=EXCLUDED.a*300;

Subscription server:
 create table t(a int,b text);
 create subscription test_sub CONNECTION 'host=localhost port=5432 dbname=postgres user=edb' PUBLICATION test_pub WITH ( slot_name = test_slot_sub1,streaming=on);

Thanks.
--
Regards,
Neha Sharma


On Mon, Aug 31, 2020 at 1:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Aug 31, 2020 at 10:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Aug 30, 2020 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
>
> Another comment:
>
> +cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
> +{
> + HASH_SEQ_STATUS hash_seq;
> + RelationSyncEntry *entry;
> +
> + Assert(RelationSyncCache != NULL);
> +
> + hash_seq_init(&hash_seq, RelationSyncCache);
> + while ((entry = hash_seq_search(&hash_seq)) != NULL)
> + {
> + if (is_commit)
> + entry->schema_sent = true;
>
> How is it correct to set 'entry->schema_sent' for all the entries in
> RelationSyncCache? Consider a case where due to invalidation in an
> unrelated transaction we have set the flag schema_sent for a
> particular relation 'r1' as 'false' and that transaction is executed
> before the current streamed transaction for which we are performing
> commit and called this function. It will set the flag for unrelated
> entry in this case 'r1' which doesn't seem correct to me. Or, if this
> is correct, it would be a good idea to write some comments about it.
>

Few more comments:
1.
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr
application_name=$appname' PUBLICATION tap_pub"
+);

In most of the tests, we are using the above statement to create a
subscription. Don't we need (streaming = 'on') parameter while
creating a subscription? Is there a reason for not doing so in this
patch itself?

2.
009_stream_simple.pl
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});

How much above this data is 64kB limit? I just wanted to see that it
should not be on borderline and then due to some alignment issues the
streaming doesn't happen on some machines? Also, how such a test
ensures that the streaming has happened because the way we are
checking results, won't it be the same for the non-streaming case as
well?

3.
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*),
count(extract(epoch from c) = 987654321), count(d = 999) FROM
test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally
changed data');

Again, how this test is relevant to streaming mode?

4. I have checked that in one of the previous patches, we have a test
v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case
quite similar to what we have in
v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort.
If there is any difference that can cover more scenarios then can we
consider merging them into one test?

Apart from the above, I have made a few changes in the attached patch
which are mainly to simplify the code at one place, added/edited few
comments, some other cosmetic changes, and renamed the test case files
as the initials of their name were matching other tests in the similar
directory.

--
With Regards,
Amit Kapila.
Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Aug 31, 2020 at 7:28 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Aug 31, 2020 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Aug 31, 2020 at 10:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Sun, Aug 30, 2020 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > >
> > > Another comment:
> > >
> > > +cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
> > > +{
> > > + HASH_SEQ_STATUS hash_seq;
> > > + RelationSyncEntry *entry;
> > > +
> > > + Assert(RelationSyncCache != NULL);
> > > +
> > > + hash_seq_init(&hash_seq, RelationSyncCache);
> > > + while ((entry = hash_seq_search(&hash_seq)) != NULL)
> > > + {
> > > + if (is_commit)
> > > + entry->schema_sent = true;
> > >
> > > How is it correct to set 'entry->schema_sent' for all the entries in
> > > RelationSyncCache? Consider a case where due to invalidation in an
> > > unrelated transaction we have set the flag schema_sent for a
> > > particular relation 'r1' as 'false' and that transaction is executed
> > > before the current streamed transaction for which we are performing
> > > commit and called this function. It will set the flag for unrelated
> > > entry in this case 'r1' which doesn't seem correct to me. Or, if this
> > > is correct, it would be a good idea to write some comments about it.
>
> Yeah, this is wrong,  I have fixed this issue in the attached patch
> and also added a new test for the same.
>

In functions cleanup_rel_sync_cache and
get_schema_sent_in_streamed_txn, let's cast the result of lfirst_int to
uint32 as suggested by Tom [1]. Also, let's keep the way we compare
xids consistent in both functions, i.e., if (xid == lfirst_int(lc)).
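
To make this concrete, here is a rough sketch (not the exact patch code) of
how the two functions can track per-transaction schema state with that cast
and a consistent comparison, so that only entries whose schema was sent as
part of the given streamed transaction are touched; the entry fields and
function names are the ones discussed above, and the list helpers are assumed
to be the usual pg_list.h ones:

static bool
get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
{
    ListCell   *lc;

    foreach(lc, entry->streamed_txns)
    {
        /* cast to uint32 so the comparison matches TransactionId */
        if (xid == (uint32) lfirst_int(lc))
            return true;
    }

    return false;
}

static void
cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
{
    HASH_SEQ_STATUS hash_seq;
    RelationSyncEntry *entry;
    ListCell   *lc;

    Assert(RelationSyncCache != NULL);

    hash_seq_init(&hash_seq, RelationSyncCache);
    while ((entry = hash_seq_search(&hash_seq)) != NULL)
    {
        /* only entries whose schema was sent in this streamed transaction */
        foreach(lc, entry->streamed_txns)
        {
            if (xid == (uint32) lfirst_int(lc))
            {
                if (is_commit)
                    entry->schema_sent = true;

                entry->streamed_txns =
                    foreach_delete_current(entry->streamed_txns, lc);
                break;
            }
        }
    }
}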

The behavior tested by the test case added for this is not clear,
primarily because the comments don't explain it.

+++ b/src/test/subscription/t/021_stream_schema.pl
@@ -0,0 +1,80 @@
+# Test behavior with streaming transaction exceeding logical_decoding_work_mem
...
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM
generate_series(3,3000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+COMMIT;
+});
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM
generate_series(3001,3005) s(i);
+COMMIT;
+});
+wait_for_caught_up($node_publisher, $appname);

I understand how this test exercises the functionality related to the
schema_sent stuff, but neither the comments atop the file nor those
atop the test case explain it clearly.

> > Few more comments:

>
> > 2.
> > 009_stream_simple.pl
> > +# Insert, update and delete enough rows to exceed the 64kB limit.
> > +$node_publisher->safe_psql('postgres', q{
> > +BEGIN;
> > +INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
> > +UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
> > +DELETE FROM test_tab WHERE mod(a,3) = 0;
> > +COMMIT;
> > +});
> >
> > How far above the 64kB limit is this data? I just wanted to make sure
> > it is not borderline, such that due to some alignment issues the
> > streaming doesn't happen on some machines. Also, how does such a test
> > ensure that the streaming has happened? Given the way we are checking
> > results, won't they be the same for the non-streaming case as well?
>
> Only for this case, or you mean for all the tests?
>

It is better to do it for all tests and I have clarified this in my
next email sent yesterday [2] where I have raised a few more comments
as well. I hope you have not missed that email.

> > 3.
> > +# Change the local values of the extra columns on the subscriber,
> > +# update publisher, and check that subscriber retains the expected
> > +# values
> > +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
> > 'epoch'::timestamptz + 987654321 * interval '1s'");
> > +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
> > +
> > +wait_for_caught_up($node_publisher, $appname);
> > +
> > +$result =
> > +  $node_subscriber->safe_psql('postgres', "SELECT count(*),
> > count(extract(epoch from c) = 987654321), count(d = 999) FROM
> > test_tab");
> > +is($result, qq(3334|3334|3334), 'check extra columns contain locally
> > changed data');
> >
> > Again, how is this test relevant to streaming mode?
>
> I agree, it is not specific to the streaming.
>

> > Apart from the above, I have made a few changes in the attached patch,
> > mainly to simplify the code in one place; I have also added/edited a few
> > comments, made some other cosmetic changes, and renamed the test case
> > files because the initials of their names matched other tests in the
> > same directory.
>
> Changes look fine to me except this
>
> +
>
> + /* the value must be on/off */
> + if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("invalid streaming value")));
> +
> + /* enable streaming if it's 'on' */
> + *enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
>
> I mean for streaming why we need to handle differently than the other
> surrounding code for example "binary" option.
>

Hmm, I think the code changed by me is to make it look similar to the
binary option. The code you have quoted above is from the patch
version prior to what I have sent. See the code snippet after my
changes:
@@ -182,6 +222,16 @@ parse_output_parameters(List *options, uint32
*protocol_version,

  *binary = defGetBoolean(defel);
  }
+ else if (strcmp(defel->defname, "streaming") == 0)
+ {
+ if (streaming_given)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("conflicting or redundant options")));
+ streaming_given = true;
+
+ *enable_streaming = defGetBoolean(defel);
+ }

This looks exactly similar to the binary option. Can you please check
it once again and confirm back?

[1] - https://www.postgresql.org/message-id/3955127.1598880523%40sss.pgh.pa.us
[2] - https://www.postgresql.org/message-id/CAA4eK1JjrcK6bk%2Bur3J%2BkLsfz4%2BipJFN7VcRd3cXr4gG5ZWWig%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Aug 31, 2020 at 10:27 PM Neha Sharma
<neha.sharma@enterprisedb.com> wrote:
>
> Hi Amit/Dilip,
>
> I have tested a few scenarios on top of the v56 patches, where the replication worker still had a few
> subtransactions in an uncommitted state and we restarted the publisher server.
>
> No crash or data discrepancies were observed; attached are the test scenarios verified.
>

Thanks, I have pushed the fix
(https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=4ab77697f67aa5b90b032b9175b46901859da6d7).

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Sep 1, 2020 at 9:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Aug 31, 2020 at 7:28 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Aug 31, 2020 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> In functions cleanup_rel_sync_cache and
> get_schema_sent_in_streamed_txn, let's cast the result of lfirst_int to
> uint32 as suggested by Tom [1]. Also, let's keep the way we compare
> xids consistent in both functions, i.e., if (xid == lfirst_int(lc)).
>

Fixed this in the attached patch.

> The behavior tested by the test case added for this is not clear,
> primarily because the comments don't explain it.
>
> +++ b/src/test/subscription/t/021_stream_schema.pl
> @@ -0,0 +1,80 @@
> +# Test behavior with streaming transaction exceeding logical_decoding_work_mem
> ...
> +# large (streamed) transaction with DDL, DML and ROLLBACKs
> +$node_publisher->safe_psql('postgres', q{
> +BEGIN;
> +ALTER TABLE test_tab ADD COLUMN c INT;
> +INSERT INTO test_tab SELECT i, md5(i::text), i FROM
> generate_series(3,3000) s(i);
> +ALTER TABLE test_tab ADD COLUMN d INT;
> +COMMIT;
> +});
> +
> +# large (streamed) transaction with DDL, DML and ROLLBACKs
> +$node_publisher->safe_psql('postgres', q{
> +BEGIN;
> +INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM
> generate_series(3001,3005) s(i);
> +COMMIT;
> +});
> +wait_for_caught_up($node_publisher, $appname);
>
> I understand how this test exercises the functionality related to the
> schema_sent stuff, but neither the comments atop the file nor those
> atop the test case explain it clearly.
>

Added comments for this test.

> > > Few more comments:
>
> >
> > > 2.
> > > 009_stream_simple.pl
> > > +# Insert, update and delete enough rows to exceed the 64kB limit.
> > > +$node_publisher->safe_psql('postgres', q{
> > > +BEGIN;
> > > +INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
> > > +UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
> > > +DELETE FROM test_tab WHERE mod(a,3) = 0;
> > > +COMMIT;
> > > +});
> > >
> > > How far above the 64kB limit is this data? I just wanted to make sure
> > > it is not borderline, such that due to some alignment issues the
> > > streaming doesn't happen on some machines. Also, how does such a test
> > > ensure that the streaming has happened? Given the way we are checking
> > > results, won't they be the same for the non-streaming case as well?
> >
> > Only for this case, or you mean for all the tests?
> >
>

I have not done this yet.

> It is better to do it for all tests and I have clarified this in my
> next email sent yesterday [2] where I have raised a few more comments
> as well. I hope you have not missed that email.
>
> > > 3.
> > > +# Change the local values of the extra columns on the subscriber,
> > > +# update publisher, and check that subscriber retains the expected
> > > +# values
> > > +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
> > > 'epoch'::timestamptz + 987654321 * interval '1s'");
> > > +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
> > > +
> > > +wait_for_caught_up($node_publisher, $appname);
> > > +
> > > +$result =
> > > +  $node_subscriber->safe_psql('postgres', "SELECT count(*),
> > > count(extract(epoch from c) = 987654321), count(d = 999) FROM
> > > test_tab");
> > > +is($result, qq(3334|3334|3334), 'check extra columns contain locally
> > > changed data');
> > >
> > > Again, how is this test relevant to streaming mode?
> >
> > I agree, it is not specific to the streaming.
> >

I think we can leave this as of now. After committing the stats
patches by Sawada-San and Ajin, we might be able to improve this test.

> +sub wait_for_caught_up
> +{
> + my ($node, $appname) = @_;
> +
> + $node->poll_query_until('postgres',
> +"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication
> WHERE application_name = '$appname';"
> + ) or die "Timed ou
>
> The patch has added this in all the test files. If it is used in so
> many tests then we need to add it in some generic place
> (PostgresNode.pm), but actually, I am not sure if we need this at all.
> Why can't the existing wait_for_catchup in PostgresNode.pm serve the
> same purpose?
>

Changed as per this suggestion.

> 2.
> In system_views.sql,
>
> -- All columns of pg_subscription except subconninfo are readable.
> REVOKE ALL ON pg_subscription FROM public;
> GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary,
> subslotname, subpublications)
>     ON pg_subscription TO public;
>
> Here, we need to update for substream column as well.
>

Fixed.

> 3. Update describeSubscriptions() to show the 'substream' value in \dRs.
>
> 4. Also, let's add a few tests in subscription.sql, as we have added
> the 'binary' option in commit 9de77b5453.
>

Fixed both the above comments.

> 5. I think we can merge pg_dump related changes (the last version
> posted in mail thread is v53-0005-Add-streaming-option-in-pg_dump) in
> the main patch, one minor comment on pg_dump related changes
> @@ -4358,6 +4369,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
>   if (strcmp(subinfo->subbinary, "t") == 0)
>   appendPQExpBuffer(query, ", binary = true");
>
> + if (strcmp(subinfo->substream, "f") != 0)
> + appendPQExpBuffer(query, ", streaming = on");
>   if (strcmp(subinfo->subsynccommit, "off") != 0)
>   appendPQExpBuffer(query, ", synchronous_commit = %s",
> fmtId(subinfo->subsynccommit));
>
> Keep one line space between substream and subsynccommit option code to
> keep it consistent with nearby code.
>

Changed as per this suggestion.

I have fixed all the comments except the below comments.
1. verify the size of various tests to ensure that it is above
logical_decoding_work_mem.
2. I have checked that in one of the previous patches, we have a test
v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case
quite similar to what we have in
v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort.
If there is any difference that can cover more scenarios then can we
consider merging them into one test?
3. +# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*),
count(extract(epoch from c) = 987654321), count(d = 999) FROM
test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally
changed data');

Again, how is this test relevant to streaming mode?
4. Apart from the above, I think we should think of minimizing the
test cases which can be committed with the base patch. We can later
add more tests.

Kindly verify the changes.

-- 
With Regards,
Amit Kapila.

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I have fixed all the comments except
..
> 3. +# Change the local values of the extra columns on the subscriber,
> +# update publisher, and check that subscriber retains the expected
> +# values
> +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
> 'epoch'::timestamptz + 987654321 * interval '1s'");
> +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
> +
> +wait_for_caught_up($node_publisher, $appname);
> +
> +$result =
> +  $node_subscriber->safe_psql('postgres', "SELECT count(*),
> count(extract(epoch from c) = 987654321), count(d = 999) FROM
> test_tab");
> +is($result, qq(3334|3334|3334), 'check extra columns contain locally
> changed data');
>
> Again, how is this test relevant to streaming mode?
>

I think we can keep this test in one of the newly added tests, say in
015_stream_simple.pl, to ensure that after a streaming transaction the
non-streaming one behaves as expected. So we can change the comment to
"Change the local values of the extra columns on the subscriber,
update publisher, and check that subscriber retains the expected
values. This is to ensure that non-streaming transactions behave
properly after a streaming transaction."

We can remove this test from the other two places
016_stream_subxact.pl and 020_stream_binary.pl.

> 4. Apart from the above, I think we should think of minimizing the
> test cases which can be committed with the base patch. We can later
> add more tests.
>

We can combine the tests in 015_stream_simple.pl and
020_stream_binary.pl as I can't see a good reason to keep them
separate. Then, I think we can keep only this part with the main patch
and extract other tests into a separate patch. Basically, we can
commit the basic tests with the main patch and then keep the advanced
tests separately. I am afraid that there are some tests that don't add
much value so we can review them separately.

One minor comment for option 'streaming = on', spacing-wise it should
be consistent in all the tests.

Similarly, we can combine 017_stream_ddl.pl and 021_stream_schema.pl
as both contain similar tests. As per the above suggestion, this will
be in a separate patch though.

If you agree with the above suggestions then kindly make these
adjustments and send the updated patch.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Sep 2, 2020 at 10:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I have fixed all the comments except
> ..
> > 3. +# Change the local values of the extra columns on the subscriber,
> > +# update publisher, and check that subscriber retains the expected
> > +# values
> > +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
> > 'epoch'::timestamptz + 987654321 * interval '1s'");
> > +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
> > +
> > +wait_for_caught_up($node_publisher, $appname);
> > +
> > +$result =
> > +  $node_subscriber->safe_psql('postgres', "SELECT count(*),
> > count(extract(epoch from c) = 987654321), count(d = 999) FROM
> > test_tab");
> > +is($result, qq(3334|3334|3334), 'check extra columns contain locally
> > changed data');
> >
> > Again, how is this test relevant to streaming mode?
> >
>
> I think we can keep this test in one of the newly added tests, say in
> 015_stream_simple.pl, to ensure that after a streaming transaction the
> non-streaming one behaves as expected. So we can change the comment to
> "Change the local values of the extra columns on the subscriber,
> update publisher, and check that subscriber retains the expected
> values. This is to ensure that non-streaming transactions behave
> properly after a streaming transaction."
>
> We can remove this test from the other two places
> 016_stream_subxact.pl and 020_stream_binary.pl.
>
> > 4. Apart from the above, I think we should think of minimizing the
> > test cases which can be committed with the base patch. We can later
> > add more tests.
> >
>
> We can combine the tests in 015_stream_simple.pl and
> 020_stream_binary.pl as I can't see a good reason to keep them
> separate. Then, I think we can keep only this part with the main patch
> and extract other tests into a separate patch. Basically, we can
> commit the basic tests with the main patch and then keep the advanced
> tests separately. I am afraid that there are some tests that don't add
> much value so we can review them separately.

Fixed

> One minor comment for option 'streaming = on', spacing-wise it should
> be consistent in all the tests.
>
> Similarly, we can combine 017_stream_ddl.pl and 021_stream_schema.pl
> as both contain similar tests. As per the above suggestion, this will
> be in a separate patch though.
>
> If you agree with the above suggestions then kindly make these
> adjustments and send the updated patch.

Done that way.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 1, 2020 at 9:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Aug 31, 2020 at 7:28 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Mon, Aug 31, 2020 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > In functions cleanup_rel_sync_cache and
> > get_schema_sent_in_streamed_txn, let's cast the result of lfirst_int to
> > uint32 as suggested by Tom [1]. Also, let's keep the way we compare
> > xids consistent in both functions, i.e., if (xid == lfirst_int(lc)).
> >
>
> Fixed this in the attached patch.
>
> > The behavior tested by the test case added for this is not clear,
> > primarily because the comments don't explain it.
> >
> > +++ b/src/test/subscription/t/021_stream_schema.pl
> > @@ -0,0 +1,80 @@
> > +# Test behavior with streaming transaction exceeding logical_decoding_work_mem
> > ...
> > +# large (streamed) transaction with DDL, DML and ROLLBACKs
> > +$node_publisher->safe_psql('postgres', q{
> > +BEGIN;
> > +ALTER TABLE test_tab ADD COLUMN c INT;
> > +INSERT INTO test_tab SELECT i, md5(i::text), i FROM
> > generate_series(3,3000) s(i);
> > +ALTER TABLE test_tab ADD COLUMN d INT;
> > +COMMIT;
> > +});
> > +
> > +# large (streamed) transaction with DDL, DML and ROLLBACKs
> > +$node_publisher->safe_psql('postgres', q{
> > +BEGIN;
> > +INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM
> > generate_series(3001,3005) s(i);
> > +COMMIT;
> > +});
> > +wait_for_caught_up($node_publisher, $appname);
> >
> > I understand how this test exercises the functionality related to the
> > schema_sent stuff, but neither the comments atop the file nor those
> > atop the test case explain it clearly.
> >
>
> Added comments for this test.
>
> > > > Few more comments:
> >
> > >
> > > > 2.
> > > > 009_stream_simple.pl
> > > > +# Insert, update and delete enough rows to exceed the 64kB limit.
> > > > +$node_publisher->safe_psql('postgres', q{
> > > > +BEGIN;
> > > > +INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
> > > > +UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
> > > > +DELETE FROM test_tab WHERE mod(a,3) = 0;
> > > > +COMMIT;
> > > > +});
> > > >
> > > > How far above the 64kB limit is this data? I just wanted to make sure
> > > > it is not borderline, such that due to some alignment issues the
> > > > streaming doesn't happen on some machines. Also, how does such a test
> > > > ensure that the streaming has happened? Given the way we are checking
> > > > results, won't they be the same for the non-streaming case as well?
> > >
> > > Only for this case, or you mean for all the tests?
> > >
> >
>
> I have not done this yet.
Most of the test cases generate above 100kB of data and a few are
around 72kB. The per-test-case sizes are:

015 - 200kB
016 - 150kB
017 - 72kB
018 - 72kB before the first rollback to savepoint, ~100kB in total
019 - 76kB before the first rollback to savepoint, ~100kB in total
020 - 150kB
021 - 100kB

> > It is better to do it for all tests and I have clarified this in my
> > next email sent yesterday [2] where I have raised a few more comments
> > as well. I hope you have not missed that email.

I saw that; I think I replied to this before seeing that email.

> > > > 3.
> > > > +# Change the local values of the extra columns on the subscriber,
> > > > +# update publisher, and check that subscriber retains the expected
> > > > +# values
> > > > +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
> > > > 'epoch'::timestamptz + 987654321 * interval '1s'");
> > > > +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
> > > > +
> > > > +wait_for_caught_up($node_publisher, $appname);
> > > > +
> > > > +$result =
> > > > +  $node_subscriber->safe_psql('postgres', "SELECT count(*),
> > > > count(extract(epoch from c) = 987654321), count(d = 999) FROM
> > > > test_tab");
> > > > +is($result, qq(3334|3334|3334), 'check extra columns contain locally
> > > > changed data');
> > > >
> > > > Again, how is this test relevant to streaming mode?
> > >
> > > I agree, it is not specific to the streaming.
> > >
>
> I think we can leave this as of now. After committing the stats
> patches by Sawada-San and Ajin, we might be able to improve this test.

Makes sense to me.

> > +sub wait_for_caught_up
> > +{
> > + my ($node, $appname) = @_;
> > +
> > + $node->poll_query_until('postgres',
> > +"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication
> > WHERE application_name = '$appname';"
> > + ) or die "Timed ou
> >
> > The patch has added this in all the test files. If it is used in so
> > many tests then we need to add it in some generic place
> > (PostgresNode.pm), but actually, I am not sure if we need this at all.
> > Why can't the existing wait_for_catchup in PostgresNode.pm serve the
> > same purpose?
> >
>
> Changed as per this suggestion.

Okay.

> > 2.
> > In system_views.sql,
> >
> > -- All columns of pg_subscription except subconninfo are readable.
> > REVOKE ALL ON pg_subscription FROM public;
> > GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary,
> > subslotname, subpublications)
> >     ON pg_subscription TO public;
> >
> > Here, we need to update for substream column as well.
> >
>
> Fixed.

 LGTM

> > 3. Update describeSubscriptions() to show the 'substream' value in \dRs.
> >
> > 4. Also, let's add a few tests in subscription.sql, as we have added
> > the 'binary' option in commit 9de77b5453.
> >
>
> Fixed both the above comments.

Ok

> > 5. I think we can merge pg_dump related changes (the last version
> > posted in mail thread is v53-0005-Add-streaming-option-in-pg_dump) in
> > the main patch, one minor comment on pg_dump related changes
> > @@ -4358,6 +4369,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
> >   if (strcmp(subinfo->subbinary, "t") == 0)
> >   appendPQExpBuffer(query, ", binary = true");
> >
> > + if (strcmp(subinfo->substream, "f") != 0)
> > + appendPQExpBuffer(query, ", streaming = on");
> >   if (strcmp(subinfo->subsynccommit, "off") != 0)
> >   appendPQExpBuffer(query, ", synchronous_commit = %s",
> > fmtId(subinfo->subsynccommit));
> >
> > Keep one line space between substream and subsynccommit option code to
> > keep it consistent with nearby code.
> >
>
> Changed as per this suggestion.

Ok


> I have fixed all the comments except the below comments.
> 1. verify the size of various tests to ensure that it is above
> logical_decoding_work_mem.
> 2. I have checked that in one of the previous patches, we have a test
> v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case
> quite similar to what we have in
> v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort.
> If there is any difference that can cover more scenarios then can we
> consider merging them into one test?
> 3. +# Change the local values of the extra columns on the subscriber,
> +# update publisher, and check that subscriber retains the expected
> +# values
> +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
> 'epoch'::timestamptz + 987654321 * interval '1s'");
> +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
> +
> +wait_for_caught_up($node_publisher, $appname);
> +
> +$result =
> +  $node_subscriber->safe_psql('postgres', "SELECT count(*),
> count(extract(epoch from c) = 987654321), count(d = 999) FROM
> test_tab");
> +is($result, qq(3334|3334|3334), 'check extra columns contain locally
> changed data');
>
> Again, how is this test relevant to streaming mode?
> 4. Apart from the above, I think we should think of minimizing the
> test cases which can be committed with the base patch. We can later
> add more tests.



--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Sep 2, 2020 at 3:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Sep 2, 2020 at 10:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > >
> >
> > We can combine the tests in 015_stream_simple.pl and
> > 020_stream_binary.pl as I can't see a good reason to keep them
> > separate. Then, I think we can keep only this part with the main patch
> > and extract other tests into a separate patch. Basically, we can
> > commit the basic tests with the main patch and then keep the advanced
> > tests separately. I am afraid that there are some tests that don't add
> > much value so we can review them separately.
>
> Fixed
>

I have slightly adjusted this test and ran pgindent on the patch. I am
planning to push this tomorrow unless you have more comments.

-- 
With Regards,
Amit Kapila.

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Sep 2, 2020 at 7:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Sep 2, 2020 at 3:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Sep 2, 2020 at 10:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > >
> >
> > We can combine the tests in 015_stream_simple.pl and
> > 020_stream_binary.pl as I can't see a good reason to keep them
> > separate. Then, I think we can keep only this part with the main patch
> > and extract other tests into a separate patch. Basically, we can
> > commit the basic tests with the main patch and then keep the advanced
> > tests separately. I am afraid that there are some tests that don't add
> > much value so we can review them separately.
>
> Fixed
>

I have slightly adjusted this test and ran pgindent on the patch. I am
planning to push this tomorrow unless you have more comments.

Looks good to me. 

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
"Bossart, Nathan"
Date:
I noticed a small compiler warning for this.

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 812aca8011..88d3444c39 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -199,7 +199,7 @@ typedef struct ApplySubXactData
 static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};

 static void subxact_filename(char *path, Oid subid, TransactionId xid);
-static void changes_filename(char *path, Oid subid, TransactionId xid);
+static inline void changes_filename(char *path, Oid subid, TransactionId xid);

 /*
  * Information about subtransactions of a given toplevel transaction.

Nathan


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Sep 4, 2020 at 3:10 AM Bossart, Nathan <bossartn@amazon.com> wrote:
>
> I noticed a small compiler warning for this.
>
> diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
> index 812aca8011..88d3444c39 100644
> --- a/src/backend/replication/logical/worker.c
> +++ b/src/backend/replication/logical/worker.c
> @@ -199,7 +199,7 @@ typedef struct ApplySubXactData
>  static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
>
>  static void subxact_filename(char *path, Oid subid, TransactionId xid);
> -static void changes_filename(char *path, Oid subid, TransactionId xid);
> +static inline void changes_filename(char *path, Oid subid, TransactionId xid);
>

Thanks for the report, I'll take care of this. I think the nearby
similar function subxact_filename() should also be inline.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 1, 2020 at 9:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I have fixed all the comments except the below comments.
> 1. verify the size of various tests to ensure that it is above
> logical_decoding_work_mem.
> 2. I have checked that in one of the previous patches, we have a test
> v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case
> quite similar to what we have in
> v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort.
> If there is any difference that can cover more scenarios then can we
> consider merging them into one test?
>

I have compared these two tests and found that the only additional
thing in the test case present in
v53-0004-Add-TAP-test-for-streaming-vs.-DDL was that it performed a
few savepoints and DMLs after doing the first rollback to savepoint,
and I have included that in one of the existing tests,
018_stream_subxact_abort.pl. I have added one test for Rollback,
changed a few messages, and removed one test case which was not making
any sense in the patch. See the attached and let me know what you
think about it.

-- 
With Regards,
Amit Kapila.

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:


On Sat, 5 Sep 2020 at 4:02 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Sep 1, 2020 at 9:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I have fixed all the comments except the below comments.
> > 1. verify the size of various tests to ensure that it is above
> > logical_decoding_work_mem.
> > 2. I have checked that in one of the previous patches, we have a test
> > v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case
> > quite similar to what we have in
> > v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort.
> > If there is any difference that can cover more scenarios then can we
> > consider merging them into one test?
> >
>
> I have compared these two tests and found that the only additional
> thing in the test case present in
> v53-0004-Add-TAP-test-for-streaming-vs.-DDL was that it performed a
> few savepoints and DMLs after doing the first rollback to savepoint,
> and I have included that in one of the existing tests,
> 018_stream_subxact_abort.pl. I have added one test for Rollback,
> changed a few messages, and removed one test case which was not making
> any sense in the patch. See the attached and let me know what you
> think about it.

I have reviewed the changes and they look fine to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Sat, Sep 5, 2020 at 8:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>
> I have reviewed the changes and they look fine to me.
>

Thanks, I have pushed the last patch. Let's wait for a day or so to
see the buildfarm reports and then we can probably close this CF
entry. I am aware that we have one patch related to stats still
pending but I think we can tackle it along with the spill stats patch
which is being discussed in a different thread [1]. Do let me know if
I have missed anything?

[1] - https://www.postgresql.org/message-id/CAA4eK1JBqQh9cBKjO-nKOOE%3D7f6ONDCZp0TJZfn4VsQqRZ%2BuYA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, Sep 7, 2020 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Sep 5, 2020 at 8:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>
> I have reviewed the changes and they look fine to me.
>

Thanks, I have pushed the last patch. Let's wait for a day or so to
see the buildfarm reports and then we can probably close this CF
entry.

Thanks.
 
I am aware that we have one patch related to stats still
pending but I think we can tackle it along with the spill stats patch
which is being discussed in a different thread [1]. Do let me know if
I have missed anything?

[1] - https://www.postgresql.org/message-id/CAA4eK1JBqQh9cBKjO-nKOOE%3D7f6ONDCZp0TJZfn4VsQqRZ%2BuYA%40mail.gmail.com

Sounds good to me.
 
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Sep 7, 2020 at 12:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Sep 7, 2020 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Sat, Sep 5, 2020 at 8:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> >
>> >
>> > I have reviewed the changes and they look fine to me.
>> >
>>
>> Thanks, I have pushed the last patch. Let's wait for a day or so to
>> see the buildfarm reports and then we can probably close this CF
>> entry.
>
>
> Thanks.
>

I have updated the status of the CF entry to committed now.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
Hi,

while looking at the streaming code I noticed two minor issues:

1) logicalrep_read_stream_stop is never defined/called, so the prototype
in logicalproto.h is unnecessary

2) minor typo in one of the comments

Patch attached.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Sep 9, 2020 at 2:13 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi,

while looking at the streaming code I noticed two minor issues:

1) logicalrep_read_stream_stop is never defined/called, so the prototype
in logicalproto.h is unnecessary


Yeah, right.
 
2) minor typo in one of the comments

Patch attached.

 Looks good to me.
 
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Sep 9, 2020 at 2:13 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> Hi,
>
> while looking at the streaming code I noticed two minor issues:
>
> 1) logicalrep_read_stream_stop is never defined/called, so the prototype
> in logicalproto.h is unnecessary
>
> 2) minor typo in one of the comments
>
> Patch attached.
>

LGTM.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Sep 9, 2020 at 2:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Sep 9, 2020 at 2:13 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> > Hi,
> >
> > while looking at the streaming code I noticed two minor issues:
> >
> > 1) logicalrep_read_stream_stop is never defined/called, so the prototype
> > in logicalproto.h is unnecessary
> >
> > 2) minor typo in one of the comments
> >
> > Patch attached.
> >
>
> LGTM.
>

Pushed.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:
> Pushed.

Observe the following reports:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2020-09-13%2016%3A54%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2020-09-10%2009%3A08%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2020-09-05%2020%3A22%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-04%2001%3A52%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-03%2020%3A54%3A04

These are all on HEAD, and all within the last ten days, and I see
nothing comparable in any branch before that.  So it's hard to avoid
the conclusion that somebody broke something about ten days ago.

None of these animals provided gdb backtraces; but we do have a built-in
trace from several, and they all look like pgoutput.so is trying to
list_free() garbage, somewhere inside a relcache invalidation/rebuild
scenario:

TRAP: FailedAssertion("list->length > 0", File:
"/home/bf/build/buildfarm-idiacanthus/HEAD/pgsql.build/../pgsql/src/backend/nodes/list.c",Line: 68) 
postgres: publisher: walsender bf [local] idle(ExceptionalCondition+0x57)[0x9081f7]
postgres: publisher: walsender bf [local] idle[0x6bcc70]
postgres: publisher: walsender bf [local] idle(list_free+0x11)[0x6bdc01]

/home/bf/build/buildfarm-idiacanthus/HEAD/pgsql.build/tmp_install/home/bf/build/buildfarm-idiacanthus/HEAD/inst/lib/postgresql/pgoutput.so(+0x35d8)[0x7fa4c5a6f5d8]
postgres: publisher: walsender bf [local] idle(LocalExecuteInvalidationMessage+0x15b)[0x8f0cdb]
postgres: publisher: walsender bf [local] idle(ReceiveSharedInvalidMessages+0x4b)[0x7bca0b]
postgres: publisher: walsender bf [local] idle(LockRelationOid+0x56)[0x7c19e6]
postgres: publisher: walsender bf [local] idle(relation_open+0x1c)[0x4a2d0c]
postgres: publisher: walsender bf [local] idle(table_open+0x6)[0x524486]
postgres: publisher: walsender bf [local] idle[0x9017f2]
postgres: publisher: walsender bf [local] idle[0x8fabd4]
postgres: publisher: walsender bf [local] idle[0x8fa58a]
postgres: publisher: walsender bf [local] idle(RelationCacheInvalidateEntry+0xaf)[0x8fbdbf]
postgres: publisher: walsender bf [local] idle(LocalExecuteInvalidationMessage+0xec)[0x8f0c6c]
postgres: publisher: walsender bf [local] idle(ReceiveSharedInvalidMessages+0xcb)[0x7bca8b]
postgres: publisher: walsender bf [local] idle(LockRelationOid+0x56)[0x7c19e6]
postgres: publisher: walsender bf [local] idle(relation_open+0x1c)[0x4a2d0c]
postgres: publisher: walsender bf [local] idle(table_open+0x6)[0x524486]
postgres: publisher: walsender bf [local] idle[0x8ee8b0]

010_truncate.pl itself hasn't changed meaningfully in a good long time.
However, I see that 464824323 added a whole boatload of code to
pgoutput.c, and the timing is right for that commit to be the culprit,
so that's what I'm betting on.

Probably this requires a relcache inval at the wrong time;
although we have recent passes from CLOBBER_CACHE_ALWAYS animals,
so that can't be the whole triggering condition.  I wonder whether
it is relevant that all of the complaining animals are JIT-enabled.

            regards, tom lane



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tom Lane
Date:
I wrote:
> Probably this requires a relcache inval at the wrong time;
> although we have recent passes from CLOBBER_CACHE_ALWAYS animals,
> so that can't be the whole triggering condition.  I wonder whether
> it is relevant that all of the complaining animals are JIT-enabled.

Hmmm ... I take that back.  hyrax has indeed passed since this went
in, but *it doesn't run any TAP tests*.  So the buildfarm offers no
information about whether the replication tests work under
CLOBBER_CACHE_ALWAYS.

Realizing that, I built an installation that way and tried to run
the subscription tests.  Results so far:

* Running 010_truncate.pl by itself passed for me.  So there's still
some unexplained factor needed to trigger the buildfarm failures.
(I'm wondering about concurrent autovacuum activity now...)

* Starting over, it appears that 001_rep_changes.pl almost immediately
gets into an infinite loop.  It does not complete the third test step,
rather infinitely waiting for progress to be made.  The publisher log
shows a repeating loop like

2020-09-13 21:16:05.734 EDT [928529] tap_sub LOG:  could not send data to client: Broken pipe
2020-09-13 21:16:05.734 EDT [928529] tap_sub CONTEXT:  slot "tap_sub", output plugin "pgoutput", in the commit
callback, associated LSN 0/1660628
2020-09-13 21:16:05.843 EDT [928581] 001_rep_changes.pl LOG:  statement: SELECT pg_current_wal_lsn() <= replay_lsn AND
state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-09-13 21:16:05.861 EDT [928582] tap_sub LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:16:05.929 EDT [928582] tap_sub LOG:  received replication command: IDENTIFY_SYSTEM
2020-09-13 21:16:05.930 EDT [928582] tap_sub LOG:  received replication command: START_REPLICATION SLOT "tap_sub"
LOGICAL 0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:16:05.930 EDT [928582] tap_sub LOG:  starting logical decoding for slot "tap_sub"
2020-09-13 21:16:05.930 EDT [928582] tap_sub DETAIL:  Streaming transactions committing after 0/1652820, reading WAL
from 0/1651B20.
2020-09-13 21:16:05.930 EDT [928582] tap_sub LOG:  logical decoding found consistent point at 0/1651B20
2020-09-13 21:16:05.930 EDT [928582] tap_sub DETAIL:  There are no running transactions.
2020-09-13 21:16:21.560 EDT [928600] 001_rep_changes.pl LOG:  statement: SELECT pg_current_wal_lsn() <= replay_lsn AND
state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-09-13 21:16:37.291 EDT [928610] 001_rep_changes.pl LOG:  statement: SELECT pg_current_wal_lsn() <= replay_lsn AND
state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-09-13 21:16:52.959 EDT [928627] 001_rep_changes.pl LOG:  statement: SELECT pg_current_wal_lsn() <= replay_lsn AND
state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-09-13 21:17:06.866 EDT [928636] tap_sub LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:17:06.934 EDT [928636] tap_sub LOG:  received replication command: IDENTIFY_SYSTEM
2020-09-13 21:17:06.934 EDT [928636] tap_sub LOG:  received replication command: START_REPLICATION SLOT "tap_sub"
LOGICAL 0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:17:06.934 EDT [928636] tap_sub ERROR:  replication slot "tap_sub" is active for PID 928582
2020-09-13 21:17:07.811 EDT [928638] tap_sub LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:17:07.880 EDT [928638] tap_sub LOG:  received replication command: IDENTIFY_SYSTEM
2020-09-13 21:17:07.881 EDT [928638] tap_sub LOG:  received replication command: START_REPLICATION SLOT "tap_sub"
LOGICAL 0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:17:07.881 EDT [928638] tap_sub ERROR:  replication slot "tap_sub" is active for PID 928582
2020-09-13 21:17:08.618 EDT [928641] 001_rep_changes.pl LOG:  statement: SELECT pg_current_wal_lsn() <= replay_lsn AND
state= 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub'; 
2020-09-13 21:17:08.753 EDT [928642] tap_sub LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:17:08.821 EDT [928642] tap_sub LOG:  received replication command: IDENTIFY_SYSTEM
2020-09-13 21:17:08.821 EDT [928642] tap_sub LOG:  received replication command: START_REPLICATION SLOT "tap_sub"
LOGICAL 0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:17:08.821 EDT [928642] tap_sub ERROR:  replication slot "tap_sub" is active for PID 928582
2020-09-13 21:17:09.689 EDT [928645] tap_sub LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:17:09.756 EDT [928645] tap_sub LOG:  received replication command: IDENTIFY_SYSTEM
2020-09-13 21:17:09.757 EDT [928645] tap_sub LOG:  received replication command: START_REPLICATION SLOT "tap_sub"
LOGICAL 0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:17:09.757 EDT [928645] tap_sub ERROR:  replication slot "tap_sub" is active for PID 928582
2020-09-13 21:17:09.841 EDT [928582] tap_sub LOG:  could not send data to client: Broken pipe
2020-09-13 21:17:09.841 EDT [928582] tap_sub CONTEXT:  slot "tap_sub", output plugin "pgoutput", in the commit
callback, associated LSN 0/1660628

while the subscriber is repeating

2020-09-13 21:15:01.598 EDT [928528] LOG:  logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:16:02.178 EDT [928528] ERROR:  terminating logical replication worker due to timeout
2020-09-13 21:16:02.179 EDT [920797] LOG:  background worker "logical replication worker" (PID 928528) exited with exit
code 1
2020-09-13 21:16:02.606 EDT [928571] LOG:  logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:16:03.117 EDT [928571] ERROR:  could not start WAL streaming: ERROR:  replication slot "tap_sub" is
active for PID 928529
2020-09-13 21:16:03.118 EDT [920797] LOG:  background worker "logical replication worker" (PID 928571) exited with exit
code 1
2020-09-13 21:16:03.544 EDT [928574] LOG:  logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:16:04.053 EDT [928574] ERROR:  could not start WAL streaming: ERROR:  replication slot "tap_sub" is
active for PID 928529
2020-09-13 21:16:04.054 EDT [920797] LOG:  background worker "logical replication worker" (PID 928574) exited with exit
code 1
2020-09-13 21:16:04.479 EDT [928576] LOG:  logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:16:04.990 EDT [928576] ERROR:  could not start WAL streaming: ERROR:  replication slot "tap_sub" is
active for PID 928529
2020-09-13 21:16:04.990 EDT [920797] LOG:  background worker "logical replication worker" (PID 928576) exited with exit
code 1
2020-09-13 21:16:05.415 EDT [928579] LOG:  logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:17:05.994 EDT [928579] ERROR:  terminating logical replication worker due to timeout

I'm out of patience to investigate this for tonight, but there is
something extremely broken here; maybe more than one something.

            regards, tom lane



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Sep 14, 2020 at 3:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > Pushed.
>
> Observe the following reports:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2020-09-13%2016%3A54%3A03
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2020-09-10%2009%3A08%3A03
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2020-09-05%2020%3A22%3A02
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-04%2001%3A52%3A03
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-03%2020%3A54%3A04
>
> These are all on HEAD, and all within the last ten days, and I see
> nothing comparable in any branch before that.  So it's hard to avoid
> the conclusion that somebody broke something about ten days ago.
>

I'll analyze these reports.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tom Lane
Date:
I wrote:
> * Starting over, it appears that 001_rep_changes.pl almost immediately
> gets into an infinite loop.  It does not complete the third test step,
> rather infinitely waiting for progress to be made.

Ah, looking closer, the problem is that wal_receiver_timeout = 60s
is too short when the sender is using CCA.  It times out before we
can get through the needed data transmission.

            regards, tom lane



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Sep 14, 2020 at 3:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > Pushed.
>
> Observe the following reports:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2020-09-13%2016%3A54%3A03
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2020-09-10%2009%3A08%3A03
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2020-09-05%2020%3A22%3A02
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-04%2001%3A52%3A03
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-03%2020%3A54%3A04
>
> These are all on HEAD, and all within the last ten days, and I see
> nothing comparable in any branch before that.  So it's hard to avoid
> the conclusion that somebody broke something about ten days ago.
>
> None of these animals provided gdb backtraces; but we do have a built-in
> trace from several, and they all look like pgoutput.so is trying to
> list_free() garbage, somewhere inside a relcache invalidation/rebuild
> scenario:
>

Yeah, this is right, and here is some initial analysis. It seems to be
failing in below code:
rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..}

This list can have elements only in 'streaming' mode (need to enable
'streaming' with Create Subscription command) whereas none of the
tests in 010_truncate.pl is using 'streaming', so this list should be
empty (NULL). The two different assertion failures shown in BF reports
in list_free code are as below:
Assert(list->length > 0);
Assert(list->length <= list->max_length);

It seems to me that this list is not initialized properly when it is
not used or maybe that is true in some special circumstances because
we initialize it in get_rel_sync_entry(). I am not sure if CCI build
is impacting this in some way.
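
For illustration, a simplified sketch of the callback in question (details of
the real callback omitted) shows why an uninitialized field is fatal here:
list_free() returns early only for a genuine NIL pointer, so whatever garbage
happens to be in streamed_txns gets walked and trips the List assertions
quoted above.

static void
rel_sync_cache_relation_cb(Datum arg, Oid relid)
{
    RelationSyncEntry *entry;

    /* the cache may not have been created yet */
    if (RelationSyncCache == NULL)
        return;

    entry = (RelationSyncEntry *) hash_search(RelationSyncCache,
                                              (void *) &relid,
                                              HASH_FIND, NULL);

    if (entry != NULL)
    {
        /* force the schema to be resent and forget streamed transactions */
        entry->schema_sent = false;
        list_free(entry->streamed_txns);    /* asserts if the field is garbage */
        entry->streamed_txns = NIL;
    }
}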

--
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, Sep 14, 2020 at 8:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Sep 14, 2020 at 3:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Amit Kapila <amit.kapila16@gmail.com> writes:
> > > Pushed.
> >
> > Observe the following reports:
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2020-09-13%2016%3A54%3A03
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2020-09-10%2009%3A08%3A03
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2020-09-05%2020%3A22%3A02
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-04%2001%3A52%3A03
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-03%2020%3A54%3A04
> >
> > These are all on HEAD, and all within the last ten days, and I see
> > nothing comparable in any branch before that.  So it's hard to avoid
> > the conclusion that somebody broke something about ten days ago.
> >
> > None of these animals provided gdb backtraces; but we do have a built-in
> > trace from several, and they all look like pgoutput.so is trying to
> > list_free() garbage, somewhere inside a relcache invalidation/rebuild
> > scenario:
> >
>
> Yeah, this is right, and here is some initial analysis. It seems to be
> failing in below code:
> rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..}
>
> This list can have elements only in 'streaming' mode (need to enable
> 'streaming' with Create Subscription command) whereas none of the
> tests in 010_truncate.pl is using 'streaming', so this list should be
> empty (NULL). The two different assertion failures shown in BF reports
> in list_free code are as below:
> Assert(list->length > 0);
> Assert(list->length <= list->max_length);
>
> It seems to me that this list is not initialized properly when it is
> not used or maybe that is true in some special circumstances because
> we initialize it in get_rel_sync_entry(). I am not sure if CCI build
> is impacting this in some way.


I have also analyzed this but did not find any reason why the
streamed_txns list should be anything other than NULL. The only thing
is that we are initializing entry->streamed_txns to NULL, and
list_free() checks "if (list == NIL)" and returns. However, IMHO that
should not be an issue because NIL is defined as (List *) NULL. I am
doing further testing and investigation.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Sep 14, 2020 at 1:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Sep 14, 2020 at 8:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Yeah, this is right, and here is some initial analysis. It seems to be
> > failing in below code:
> > rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..}
> >
> > This list can have elements only in 'streaming' mode (need to enable
> > 'streaming' with Create Subscription command) whereas none of the
> > tests in 010_truncate.pl is using 'streaming', so this list should be
> > empty (NULL). The two different assertion failures shown in BF reports
> > in list_free code are as below:
> > Assert(list->length > 0);
> > Assert(list->length <= list->max_length);
> >
> > It seems to me that this list is not initialized properly when it is
> > not used or maybe that is true in some special circumstances because
> > we initialize it in get_rel_sync_entry(). I am not sure if CCI build
> > is impacting this in some way.
>
>
> I have also analyzed this but did not find any reason why the
> streamed_txns list should be anything other than NULL. The only thing
> is that we are initializing entry->streamed_txns to NULL, and
> list_free() checks "if (list == NIL)" and returns. However, IMHO that
> should not be an issue because NIL is defined as (List *) NULL.
>

Yeah, that is not the issue but it is better to initialize it with NIL
for the sake of consistency. The basic issue here was that we were trying
to open/lock the relation(s) before initializing this list. Now, when
we process the invalidations during open relation, we try to access
this list in rel_sync_cache_relation_cb and that leads to assertion
failure. I have reproduced the exact scenario of 010_truncate.pl via
debugger. Basically, the backend on publisher has sent the
invalidation after truncating the relation 'tab1' and while processing
the truncate message if WALSender receives that message exactly after
creating the RelSyncEntry for 'tab1', the Assertion shown in BF can be
reproduced.

The attached patch will fix the issue. What do you think?

-- 
With Regards,
Amit Kapila.

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Sep 14, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Sep 14, 2020 at 1:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Sep 14, 2020 at 8:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Yeah, this is right, and here is some initial analysis. It seems to be
> > > failing in below code:
> > > rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..}
> > >
> > > This list can have elements only in 'streaming' mode (need to enable
> > > 'streaming' with Create Subscription command) whereas none of the
> > > tests in 010_truncate.pl is using 'streaming', so this list should be
> > > empty (NULL). The two different assertion failures shown in BF reports
> > > in list_free code are as below:
> > > Assert(list->length > 0);
> > > Assert(list->length <= list->max_length);
> > >
> > > It seems to me that this list is not initialized properly when it is
> > > not used or maybe that is true in some special circumstances because
> > > we initialize it in get_rel_sync_entry(). I am not sure if CCI build
> > > is impacting this in some way.
> >
> >
> > Even I have analyzed this but did not find any reason why the
> > streamed_txns list should be anything other than NULL.  The only thing
> > is we are initializing the entry->streamed_txns to NULL and the list
> > free is checking  "if (list == NIL)" then return. However IMHO, that
> > should not be an issue because NIL is defined as (List*) NULL.
> >
>
> Yeah, that is not the issue but it is better to initialize it with NIL
> for the sake of consistency. The basic issue here was we were trying
> to open/lock the relation(s) before initializing this list. Now, when
> we process the invalidations during open relation, we try to access
> this list in rel_sync_cache_relation_cb and that leads to assertion
> failure. I have reproduced the exact scenario of 010_truncate.pl via
> debugger. Basically, the backend on publisher has sent the
> invalidation after truncating the relation 'tab1' and while processing
> the truncate message if WALSender receives that message exactly after
> creating the RelSyncEntry for 'tab1', the Assertion shown in BF can be
> reproduced.

Yeah, this is an issue and I am also able to reproduce it manually
using gdb.  Basically, I inserted some data into the publication table
and then stopped in get_rel_sync_entry after creating the entry and
before calling GetRelationPublications.  Meanwhile, I truncated the
table, and that hit the same issue you pointed out here.

> The attached patch will fix the issue. What do you think?

The patch looks good to me and fixes the reported issue.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Tom Lane
Дата:
Amit Kapila <amit.kapila16@gmail.com> writes:
> The attached patch will fix the issue. What do you think?

I think it'd be cleaner to separate the initialization of a new entry from
validation altogether, along the lines of

    /* Find cached function info, creating if not found */
    oldctx = MemoryContextSwitchTo(CacheMemoryContext);
    entry = (RelationSyncEntry *) hash_search(RelationSyncCache,
                                              (void *) &relid,
                                              HASH_ENTER, &found);
    MemoryContextSwitchTo(oldctx);
    Assert(entry != NULL);

    if (!found)
    {
        /* immediately make a new entry valid enough to satisfy callbacks */
        entry->schema_sent = false;
        entry->streamed_txns = NIL;
        entry->replicate_valid = false;
        /* are there any other fields we should clear here for safety??? */
    }

    /* Fill it in if not valid */
    if (!entry->replicate_valid)
    {
        List       *pubids = GetRelationPublications(relid);
        ...

BTW, unless someone has changed the behavior of dynahash when I
wasn't looking, those MemoryContextSwitchTos shown above are useless.
Also, why does the comment refer to a "function" entry?

            regards, tom lane



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Sep 14, 2020 at 9:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > The attached patch will fix the issue. What do you think?
>
> I think it'd be cleaner to separate the initialization of a new entry from
> validation altogether, along the lines of
>
>     /* Find cached function info, creating if not found */
>     oldctx = MemoryContextSwitchTo(CacheMemoryContext);
>     entry = (RelationSyncEntry *) hash_search(RelationSyncCache,
>                                               (void *) &relid,
>                                               HASH_ENTER, &found);
>     MemoryContextSwitchTo(oldctx);
>     Assert(entry != NULL);
>
>     if (!found)
>     {
>         /* immediately make a new entry valid enough to satisfy callbacks */
>         entry->schema_sent = false;
>         entry->streamed_txns = NIL;
>         entry->replicate_valid = false;
>         /* are there any other fields we should clear here for safety??? */
>     }
>

If we want to separate validation then we need to initialize other
fields like 'pubactions' and 'publish_as_relid' as well. I think it
will be better to arrange it the way you are suggesting, so I will
change it along with the other fields that require initialization.

>     /* Fill it in if not valid */
>     if (!entry->replicate_valid)
>     {
>         List       *pubids = GetRelationPublications(relid);
>         ...
>
> BTW, unless someone has changed the behavior of dynahash when I
> wasn't looking, those MemoryContextSwitchTos shown above are useless.
>

As far as I can see they are useless in this case but I think they
might be required in case the user provides its own allocator function
(using HASH_ALLOC). So, we can probably remove those from here?

> Also, why does the comment refer to a "function" entry?
>

It should be "relation" instead. I'll take care of changing this as well.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Tom Lane
Дата:
Amit Kapila <amit.kapila16@gmail.com> writes:
> On Mon, Sep 14, 2020 at 9:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> BTW, unless someone has changed the behavior of dynahash when I
>> wasn't looking, those MemoryContextSwitchTos shown above are useless.

> As far as I can see they are useless in this case but I think they
> might be required in case the user provides its own allocator function
> (using HASH_ALLOC). So, we can probably remove those from here?

You could imagine writing a HASH_ALLOC allocator whose behavior
varies depending on CurrentMemoryContext, but it seems like a
pretty foolish/fragile way to do it.  In any case I can think of,
the hash table lives in one specific context and you really
really do not want parts of it spread across other contexts.
dynahash.c is not going to look kindly on pieces of what it
is managing disappearing from under it.

(To be clear, objects that the hash entries contain pointers to
are a different question.  But the hash entries themselves have
to have exactly the same lifespan as the hash table.)
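
For illustration, the supported way to tie a dynahash table to a specific
context is the HASH_CONTEXT flag rather than switching CurrentMemoryContext
around the calls; a minimal sketch (MyCacheEntry is a hypothetical entry
type):

    HASHCTL     ctl;
    HTAB       *htab;

    memset(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(Oid);
    ctl.entrysize = sizeof(MyCacheEntry);   /* hypothetical entry struct */
    ctl.hcxt = CacheMemoryContext;          /* context the table should live under */

    htab = hash_create("my cache", 128, &ctl,
                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

With HASH_CONTEXT, the table's storage is allocated under ctl.hcxt, so its
lifespan follows that context and no MemoryContextSwitchTo() is needed
around hash_create() or hash_search().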

            regards, tom lane



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, Sep 15, 2020 at 8:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > On Mon, Sep 14, 2020 at 9:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> BTW, unless someone has changed the behavior of dynahash when I
> >> wasn't looking, those MemoryContextSwitchTos shown above are useless.
>
> > As far as I can see they are useless in this case but I think they
> > might be required in case the user provides its own allocator function
> > (using HASH_ALLOC). So, we can probably remove those from here?
>
> You could imagine writing a HASH_ALLOC allocator whose behavior
> varies depending on CurrentMemoryContext, but it seems like a
> pretty foolish/fragile way to do it.  In any case I can think of,
> the hash table lives in one specific context and you really
> really do not want parts of it spread across other contexts.
> dynahash.c is not going to look kindly on pieces of what it
> is managing disappearing from under it.
>

I agree that doesn't make sense. I have addressed all the comments
discussed above in the attached patch.

-- 
With Regards,
Amit Kapila.

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, Sep 15, 2020 at 10:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 15, 2020 at 8:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > As far as I can see they are useless in this case but I think they
> > > might be required in case the user provides its own allocator function
> > > (using HASH_ALLOC). So, we can probably remove those from here?
> >
> > You could imagine writing a HASH_ALLOC allocator whose behavior
> > varies depending on CurrentMemoryContext, but it seems like a
> > pretty foolish/fragile way to do it.  In any case I can think of,
> > the hash table lives in one specific context and you really
> > really do not want parts of it spread across other contexts.
> > dynahash.c is not going to look kindly on pieces of what it
> > is managing disappearing from under it.
> >
>
> I agree that doesn't make sense. I have fixed all the comments
> discussed in the attached patch.
>

Pushed.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Noah Misch
Дата:
On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:
> Thanks, I have pushed the last patch. Let's wait for a day or so to
> see the buildfarm reports

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
failed the new 015_stream.pl test with the subscriber looping like this:

2020-09-08 11:22:49.848 UTC [13959252:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
2020-09-08 11:22:54.045 UTC [13959252:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
2020-09-08 11:22:54.055 UTC [7602182:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
2020-09-08 11:22:54.101 UTC [31785284:4] LOG:  background worker "logical replication worker" (PID 13959252) exited with exit code 1
2020-09-08 11:23:01.142 UTC [7602182:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
...

What happened there?



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote:
>
> On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:
> > Thanks, I have pushed the last patch. Let's wait for a day or so to
> > see the buildfarm reports
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
> failed the new 015_stream.pl test with the subscriber looping like this:
>

I will look into this.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote:
>
> On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:
> > Thanks, I have pushed the last patch. Let's wait for a day or so to
> > see the buildfarm reports
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
> failed the new 015_stream.pl test with the subscriber looping like this:
>
> 2020-09-08 11:22:49.848 UTC [13959252:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
> 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> 2020-09-08 11:22:54.055 UTC [7602182:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
> 2020-09-08 11:22:54.101 UTC [31785284:4] LOG:  background worker "logical replication worker" (PID 13959252) exited with exit code 1
> 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> ...
>
> What happened there?
>

What is going on here is that the expected streaming file is missing.
Normally, the first time we send a stream of changes (some percentage
of the transaction's changes) we create the streaming file, and then in
subsequent streams we keep appending to that file the changes we
receive from the publisher; on commit, we read that file and apply all
the changes.
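
For illustration, the apply-side behaviour described above can be sketched
roughly as follows; the helper names and the plain FILE-based spooling are
purely illustrative (the real code uses BufFile/SharedFileSet in worker.c):

    #include <stdio.h>
    #include <stdint.h>

    static FILE *changes_file;      /* stand-in for the per-transaction spool file */

    /* First stream for an xid creates the spool file; later streams append. */
    static void
    on_stream_start(uint32_t xid, int first_segment)
    {
        char        path[64];

        snprintf(path, sizeof(path), "%u.changes", xid);
        changes_file = fopen(path, first_segment ? "wb" : "ab");
    }

    /* Each change received within a stream is written to the spool file. */
    static void
    on_stream_change(const char *change, size_t len)
    {
        fwrite(change, 1, len, changes_file);
    }

    /* On commit, the spooled changes are read back, applied, and discarded. */
    static void
    on_stream_commit(void)
    {
        fclose(changes_file);
        /* ... reopen for reading, apply every change, then remove the file ... */
    }

The error in the log above corresponds to the subscriber expecting the spool
file to already exist when it does not, which is what the two failure
reasons described next would produce.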

The above kind of error can happen for the following reasons: (a) the
first time we sent the stream we created the file, but it got removed
before the second stream reached the subscriber; (b) the publisher
never sent the indication that it is the first stream, so the
subscriber directly tries to open the file, thinking it is already
there.

Now, the publisher and subscriber logs don't directly indicate either
of the above problems, but I have some observations.

The subscriber log indicates that a new apply worker gets started
before the old apply worker exits due to the error. We delete the
streaming-related temporary files on proc_exit, so one possibility
could have been that the old apply worker removed a streaming file the
new apply worker had just created, but that is not possible because we
always include the procid in the path of these temp files.

The other thing I observed in the code is that we can mark a
transaction as streamed (via ReorderBufferTruncateTxn) if, the first
time we try to stream it, it has no changes. That would lead to
symptom (b), because the second time, when there are more changes, we
would stream them as though it were not the first stream. However,
this shouldn't happen because we never pick a transaction to stream
that has no changes. I can try to fix the code so that we don't mark a
transaction as streamed unless we have streamed at least one change,
but I don't see how that relates to this particular test failure.

I am not sure why this failure has not recurred since it occurred a few
months back; it's probably a timing issue. I have fixed a few timing
issues related to this feature in the last month or so, but I cannot
come up with a theory for whether any of those would have fixed this
problem.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote:
> >
> > On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:
> > > Thanks, I have pushed the last patch. Let's wait for a day or so to
> > > see the buildfarm reports
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
> > failed the new 015_stream.pl test with the subscriber looping like this:
> >
> > 2020-09-08 11:22:49.848 UTC [13959252:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
> > 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> > 2020-09-08 11:22:54.055 UTC [7602182:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
> > 2020-09-08 11:22:54.101 UTC [31785284:4] LOG:  background worker "logical replication worker" (PID 13959252) exited with exit code 1
> > 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> > ...
> >
> > What happened there?
> >
>
> What is going on here is that the expected streaming file is missing.
> Normally, the first time we send a stream of changes (some percentage
> of transaction changes) we create the streaming file, and then in
> respective streams we just keep on writing in that file the changes we
> receive from the publisher, and on commit, we read that file and apply
> all the changes.
>
> The above kind of error can happen due to the following reasons: (a)
> the first time we sent the stream and created the file and that got
> removed before the second stream reached the subscriber. (b) from the
> publisher-side, we never sent the indication that it is the first
> stream and the subscriber directly tries to open the file thinking it
> is already there.
>
> Now, the publisher and subscriber log doesn't directly indicate any of
> the above problems but I have some observations.
>
> The subscriber log indicates that before the apply worker exits due to
> an error the new apply worker gets started. We delete the
> streaming-related temporary files on proc_exit, so one possibility
> could have been that the new apply worker has created the streaming
> file which the old apply worker has removed but that is not possible
> because we always create these temp-files by having procid in the
> path.

Yeah, and I have tried to test along this line: basically, after the
streaming had started I set binary=on.  Then, using gdb, I made the
worker wait before it deletes the temp file; meanwhile, the new worker
started and everything worked properly, as expected.

> The other thing I observed in the code is that we can mark the
> transaction as streamed (via ReorderBufferTruncateTxn) if we try to
> stream a transaction that has no changes the first time we try to
> stream the transaction. This would lead to symptom (b) because the
> second-time when there are more changes we would stream the changes as
> it is not the first time. However, this shouldn't happen because we
> never pick-up a transaction to stream which has no changes. I can try
> to fix the code here such that we don't mark the transaction as
> streamed unless we have streamed at least one change but I don't see
> how it is related to this particular test failure.

Yeah, this can be improved, but as you mentioned, we never select an
empty transaction for streaming, so this case should not occur.  I will
perform some testing/review around this and report back.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, Dec 1, 2020 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > What is going on here is that the expected streaming file is missing.
> > Normally, the first time we send a stream of changes (some percentage
> > of transaction changes) we create the streaming file, and then in
> > respective streams we just keep on writing in that file the changes we
> > receive from the publisher, and on commit, we read that file and apply
> > all the changes.
> >
> > The above kind of error can happen due to the following reasons: (a)
> > the first time we sent the stream and created the file and that got
> > removed before the second stream reached the subscriber. (b) from the
> > publisher-side, we never sent the indication that it is the first
> > stream and the subscriber directly tries to open the file thinking it
> > is already there.
> >
> > Now, the publisher and subscriber log doesn't directly indicate any of
> > the above problems but I have some observations.
> >
> > The subscriber log indicates that before the apply worker exits due to
> > an error the new apply worker gets started. We delete the
> > streaming-related temporary files on proc_exit, so one possibility
> > could have been that the new apply worker has created the streaming
> > file which the old apply worker has removed but that is not possible
> > because we always create these temp-files by having procid in the
> > path.
>
> Yeah, and I have tried to test on this line, basically, after the
> streaming has started I have set the binary=on.  Now using gdb I have
> made the worker wait before it deletes the temp file and meanwhile the
> new worker started and it worked properly as expected.
>
> > The other thing I observed in the code is that we can mark the
> > transaction as streamed (via ReorderBufferTruncateTxn) if we try to
> > stream a transaction that has no changes the first time we try to
> > stream the transaction. This would lead to symptom (b) because the
> > second-time when there are more changes we would stream the changes as
> > it is not the first time. However, this shouldn't happen because we
> > never pick-up a transaction to stream which has no changes. I can try
> > to fix the code here such that we don't mark the transaction as
> > streamed unless we have streamed at least one change but I don't see
> > how it is related to this particular test failure.
>
> Yeah, this can be improved but as you mentioned that we never select
> an empty transaction for streaming so this case should not occur.  I
> will perform some testing/review around this and report.
>

Thinking further about this point, I think the message seen on the
subscriber [1] won't occur if we missed the first stream. This is
because we always look up the fileset in the stream hash table
(xidhash), and it won't be there if we directly send the second
stream; that would have led to a different kind of problem (probably a
crash). So this symptom seems to be due to reason (a) mentioned above,
unless we are missing something else. Now, I am not sure how the file
can be removed while the corresponding entry in the hash table
(xidhash) is still present. The only reasons that come to mind are
that some other process cleaned the pgsql_tmp directory thinking these
temporary files are not required, or that someone removed it manually;
neither of those seems plausible.

[1] - ERROR: could not open temporary file "16393-510.changes.0" from
BufFile "16393-510.changes": No such file or directory

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, Dec 1, 2020 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote:
> > >
> > > On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:
> > > > Thanks, I have pushed the last patch. Let's wait for a day or so to
> > > > see the buildfarm reports
> > >
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
> > > failed the new 015_stream.pl test with the subscriber looping like this:
> > >
> > > 2020-09-08 11:22:49.848 UTC [13959252:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
> > > 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> > > 2020-09-08 11:22:54.055 UTC [7602182:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
> > > 2020-09-08 11:22:54.101 UTC [31785284:4] LOG:  background worker "logical replication worker" (PID 13959252) exited with exit code 1
> > > 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> > > ...
> > >
> > > What happened there?
> > >
> >
> > What is going on here is that the expected streaming file is missing.
> > Normally, the first time we send a stream of changes (some percentage
> > of transaction changes) we create the streaming file, and then in
> > respective streams we just keep on writing in that file the changes we
> > receive from the publisher, and on commit, we read that file and apply
> > all the changes.
> >
> > The above kind of error can happen due to the following reasons: (a)
> > the first time we sent the stream and created the file and that got
> > removed before the second stream reached the subscriber. (b) from the
> > publisher-side, we never sent the indication that it is the first
> > stream and the subscriber directly tries to open the file thinking it
> > is already there.
> >
> > Now, the publisher and subscriber log doesn't directly indicate any of
> > the above problems but I have some observations.
> >
> > The subscriber log indicates that before the apply worker exits due to
> > an error the new apply worker gets started. We delete the
> > streaming-related temporary files on proc_exit, so one possibility
> > could have been that the new apply worker has created the streaming
> > file which the old apply worker has removed but that is not possible
> > because we always create these temp-files by having procid in the
> > path.
>
> Yeah, and I have tried to test on this line, basically, after the
> streaming has started I have set the binary=on.  Now using gdb I have
> made the worker wait before it deletes the temp file and meanwhile the
> new worker started and it worked properly as expected.
>
> > The other thing I observed in the code is that we can mark the
> > transaction as streamed (via ReorderBufferTruncateTxn) if we try to
> > stream a transaction that has no changes the first time we try to
> > stream the transaction. This would lead to symptom (b) because the
> > second-time when there are more changes we would stream the changes as
> > it is not the first time. However, this shouldn't happen because we
> > never pick-up a transaction to stream which has no changes. I can try
> > to fix the code here such that we don't mark the transaction as
> > streamed unless we have streamed at least one change but I don't see
> > how it is related to this particular test failure.
>
> Yeah, this can be improved but as you mentioned that we never select
> an empty transaction for streaming so this case should not occur.  I
> will perform some testing/review around this and report.

I have executed "make check" in the loop with only this file.  I have
repeated it 5000 times but no failure, I am wondering shall we try to
execute in the same machine in a loop where it failed once?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Wed, Dec 2, 2020 at 1:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Dec 1, 2020 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote:
> > > >
> > > > On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:
> > > > > Thanks, I have pushed the last patch. Let's wait for a day or so to
> > > > > see the buildfarm reports
> > > >
> > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
> > > > failed the new 015_stream.pl test with the subscriber looping like this:
> > > >
> > > > 2020-09-08 11:22:49.848 UTC [13959252:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
> > > > 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> > > > 2020-09-08 11:22:54.055 UTC [7602182:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
> > > > 2020-09-08 11:22:54.101 UTC [31785284:4] LOG:  background worker "logical replication worker" (PID 13959252) exited with exit code 1
> > > > 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> > > > ...
> > > >
> > > > What happened there?
> > > >
> > >
> > > What is going on here is that the expected streaming file is missing.
> > > Normally, the first time we send a stream of changes (some percentage
> > > of transaction changes) we create the streaming file, and then in
> > > respective streams we just keep on writing in that file the changes we
> > > receive from the publisher, and on commit, we read that file and apply
> > > all the changes.
> > >
> > > The above kind of error can happen due to the following reasons: (a)
> > > the first time we sent the stream and created the file and that got
> > > removed before the second stream reached the subscriber. (b) from the
> > > publisher-side, we never sent the indication that it is the first
> > > stream and the subscriber directly tries to open the file thinking it
> > > is already there.
> > >
>
> I have executed "make check" in the loop with only this file.  I have
> repeated it 5000 times but no failure, I am wondering shall we try to
> execute in the same machine in a loop where it failed once?
>

Yes, that might help. Noah, would it be possible for you to try that
out, and if it fails, to get a stack trace of the subscriber?
If we are able to reproduce it then we can add elogs in the functions
SharedFileSetInit, BufFileCreateShared, BufFileOpenShared, and
SharedFileSetDeleteAll to print the paths, to see if we are sometimes
unintentionally removing some files. I have checked the code and there
don't appear to be any such problems, but I might be missing
something.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Noah Misch
Дата:
On Wed, Dec 02, 2020 at 01:50:25PM +0530, Amit Kapila wrote:
> On Wed, Dec 2, 2020 at 1:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote:
> > > > > On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:
> > > > > > Thanks, I have pushed the last patch. Let's wait for a day or so to
> > > > > > see the buildfarm reports
> > > > >
> > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
> > > > > failed the new 015_stream.pl test with the subscriber looping like this:
> > > > >
> > > > > 2020-09-08 11:22:49.848 UTC [13959252:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
> > > > > 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> > > > > 2020-09-08 11:22:54.055 UTC [7602182:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
> > > > > 2020-09-08 11:22:54.101 UTC [31785284:4] LOG:  background worker "logical replication worker" (PID 13959252) exited with exit code 1
> > > > > 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> > > > > ...

> > > > The above kind of error can happen due to the following reasons: (a)
> > > > the first time we sent the stream and created the file and that got
> > > > removed before the second stream reached the subscriber. (b) from the
> > > > publisher-side, we never sent the indication that it is the first
> > > > stream and the subscriber directly tries to open the file thinking it
> > > > is already there.

Further testing showed it was a file location problem, not a deletion problem.
The worker tried to open
base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset/16393-510.changes.0, but these
were the files actually existing:

[nm@power-aix 0:2 2020-12-08T13:56:35 64gcc 0]$ ls -la $(find src/test/subscription/tmp_check -name '*sharedfileset*')
src/test/subscription/tmp_check/t_015_stream_subscriber_data/pgdata/base/pgsql_tmp/pgsql_tmp9896408.0.sharedfileset:
total 408
drwx------    2 nm       usr             256 Dec 08 03:20 .
drwx------    4 nm       usr             256 Dec 08 03:20 ..
-rw-------    1 nm       usr          207806 Dec 08 03:20 16393-510.changes.0

src/test/subscription/tmp_check/t_015_stream_subscriber_data/pgdata/base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset:
total 0
drwx------    2 nm       usr             256 Dec 08 03:20 .
drwx------    4 nm       usr             256 Dec 08 03:20 ..
-rw-------    1 nm       usr               0 Dec 08 03:20 16393-511.changes.0

> > I have executed "make check" in the loop with only this file.  I have
> > repeated it 5000 times but no failure, I am wondering shall we try to
> > execute in the same machine in a loop where it failed once?
> 
> Yes, that might help. Noah, would it be possible for you to try that

The problem is xidhash using strcmp() to compare keys; it needs memcmp().  For
this to matter, xidhash must contain more than one element.  Existing tests
rarely exercise the multi-element scenario.  Under heavy load, on this system,
the test publisher can have two active transactions at once, in which case it
does exercise multi-element xidhash.  (The publisher is sensitive to timing,
but the subscriber is not; once WAL contains interleaved records of two XIDs,
the subscriber fails every time.)  This would be much harder to reproduce on a
little-endian system, where strcmp(&xid, &xid_plus_one)!=0.  On big-endian,
every small XID has zero in the first octet; they all look like empty strings.
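
A tiny standalone demonstration of that point (illustrative only, outside
of dynahash):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int
    main(void)
    {
        uint32_t    xid_a = 510;
        uint32_t    xid_b = 511;

        /*
         * Treating the 4-byte XIDs as C strings is effectively what the
         * string comparator does when HASH_BLOBS is omitted.  On big-endian
         * both values begin with a zero byte, so they compare equal as empty
         * strings; memcmp() distinguishes them on any architecture.
         */
        printf("strcmp: %d\n", strcmp((const char *) &xid_a, (const char *) &xid_b));
        printf("memcmp: %d\n", memcmp(&xid_a, &xid_b, sizeof(uint32_t)));
        return 0;
    }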

The attached patch has the one-line fix and some test suite changes that make
this reproduce frequently on any big-endian system.  I'm currently planning to
drop the test suite changes from the commit, but I could keep them if folks
like them.  (They'd need more comments and timeout handling.)
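
The attachment itself is not reproduced here, but the kind of one-line fix
being described is simply to tell dynahash that the TransactionId key is
raw binary rather than a C string, roughly along these lines (a sketch;
the entry type and context names are approximate, not necessarily the
committed code):

    HASHCTL     hash_ctl;

    memset(&hash_ctl, 0, sizeof(hash_ctl));
    hash_ctl.keysize = sizeof(TransactionId);
    hash_ctl.entrysize = sizeof(StreamXidHash);     /* approximate entry type */
    hash_ctl.hcxt = ApplyContext;                   /* approximate context */

    /* HASH_BLOBS is the missing flag: keys are hashed/compared as bytes */
    xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
                          HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);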

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Wed, Dec 9, 2020 at 2:56 PM Noah Misch <noah@leadboat.com> wrote:
>
> Further testing showed it was a file location problem, not a deletion problem.
> The worker tried to open
> base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset/16393-510.changes.0, but these
> were the files actually existing:
>
> [nm@power-aix 0:2 2020-12-08T13:56:35 64gcc 0]$ ls -la $(find src/test/subscription/tmp_check -name '*sharedfileset*')

> src/test/subscription/tmp_check/t_015_stream_subscriber_data/pgdata/base/pgsql_tmp/pgsql_tmp9896408.0.sharedfileset:
> total 408
> drwx------    2 nm       usr             256 Dec 08 03:20 .
> drwx------    4 nm       usr             256 Dec 08 03:20 ..
> -rw-------    1 nm       usr          207806 Dec 08 03:20 16393-510.changes.0
>
> src/test/subscription/tmp_check/t_015_stream_subscriber_data/pgdata/base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset:
> total 0
> drwx------    2 nm       usr             256 Dec 08 03:20 .
> drwx------    4 nm       usr             256 Dec 08 03:20 ..
> -rw-------    1 nm       usr               0 Dec 08 03:20 16393-511.changes.0
>
> > > I have executed "make check" in the loop with only this file.  I have
> > > repeated it 5000 times but no failure, I am wondering shall we try to
> > > execute in the same machine in a loop where it failed once?
> >
> > Yes, that might help. Noah, would it be possible for you to try that
>
> The problem is xidhash using strcmp() to compare keys; it needs memcmp().  For
> this to matter, xidhash must contain more than one element.  Existing tests
> rarely exercise the multi-element scenario.  Under heavy load, on this system,
> the test publisher can have two active transactions at once, in which case it
> does exercise multi-element xidhash.  (The publisher is sensitive to timing,
> but the subscriber is not; once WAL contains interleaved records of two XIDs,
> the subscriber fails every time.)  This would be much harder to reproduce on a
> little-endian system, where strcmp(&xid, &xid_plus_one)!=0.  On big-endian,
> every small XID has zero in the first octet; they all look like empty strings.
>

Your analysis is correct.

> The attached patch has the one-line fix and some test suite changes that make
> this reproduce frequently on any big-endian system.  I'm currently planning to
> drop the test suite changes from the commit, but I could keep them if folks
> like them.  (They'd need more comments and timeout handling.)
>

I think it is better to keep this test, as it always exercises
multiple streams on the subscriber.

Thanks for working on this.

-- 
With Regards,
Amit Kapila.



Amit Kapila <amit.kapila16@gmail.com> writes:
> On Wed, Dec 9, 2020 at 2:56 PM Noah Misch <noah@leadboat.com> wrote:
>> The problem is xidhash using strcmp() to compare keys; it needs memcmp().

> Your analysis is correct.

Sorry for not having noticed this thread before.  Noah's fix is
clearly correct, and I have no objection to the added test case.
But what jumps out at me here is that this sort of error seems way
too easy to make, and evidently way too hard to detect.  What can we
do to make it more obvious if one has incorrectly used or omitted
HASH_BLOBS?  Both directions of error might easily escape notice on
little-endian hardware.

I thought of a few ideas, all of which have drawbacks:

1. Invert the sense of the flag, ie HASH_BLOBS becomes the default.
This seems to just move the problem somewhere else, besides which
it'd require touching an awful lot of callers, and would silently
break third-party callers.

2. Don't allow a default: invent a new HASH_STRING flag, and
require that hash_create() calls specify exactly one of HASH_BLOBS,
HASH_STRING, or HASH_FUNCTION.  This doesn't completely fix the
hazard of mindless-copy-and-paste, but I think it might make it
a little more obvious.  Still requires touching a lot of calls.

3. Add some sort of heuristic restriction on keysize.  A keysize
that's only 4 or 8 bytes almost certainly is not a string.
This doesn't give us much traction for larger keysizes, though.

4. Disallow empty string keys, ie something like "Assert(s_len > 0)"
in string_hash().  I think we could get away with that given that
SQL disallows empty identifiers.  However, it would only help to
catch one direction of error (omitting HASH_BLOBS), and it would
only help on big-endian hardware, which is getting harder to find.
Still, we could hope that the buildfarm would detect errors.

There might be some more options.  Also, some of these ideas
could be applied in combination.

A quick count of grep hits suggest that the large majority of
existing hash_create() calls use HASH_BLOBS, and there might be
only order-of-ten calls that would need to be touched if we
required an explicit HASH_STRING flag.  So option #2 is seeming
kind of attractive.  Maybe that together with an assertion that
string keys have to exceed 8 or 16 bytes would be enough protection.
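
For illustration, the calling convention option #2 implies would look
roughly like this (sketch only; the entry structs are hypothetical):

    HASHCTL     ctl;
    HTAB       *blob_tab;
    HTAB       *string_tab;

    /* Fixed-size binary key (e.g. an Oid): must say HASH_BLOBS. */
    memset(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(Oid);
    ctl.entrysize = sizeof(MyBlobEntry);        /* hypothetical */
    blob_tab = hash_create("blob-keyed table", 128, &ctl,
                           HASH_ELEM | HASH_BLOBS);

    /* NUL-terminated string key: would now have to say HASH_STRINGS explicitly. */
    memset(&ctl, 0, sizeof(ctl));
    ctl.keysize = NAMEDATALEN;
    ctl.entrysize = sizeof(MyStringEntry);      /* hypothetical */
    string_tab = hash_create("string-keyed table", 128, &ctl,
                             HASH_ELEM | HASH_STRINGS);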

Also, this census now suggests to me that the opposite problem
(copy-and-paste HASH_BLOBS when you meant string keys) might be
a real hazard, since so many of the existing prototypes that you
might copy have HASH_BLOBS.  I'm not sure if there's much to be
done for this case though.  A small saving grace is that it seems
relatively likely that you'd notice a functional problem pretty
quickly with this type of mistake, since lookups would tend to
fail due to trailing garbage after your lookup string.

A different angle we could think about is that the name "HASH_BLOBS"
is kind of un-obvious.  Maybe we should deprecate that spelling in
favor of something like "HASH_BINARY".

Thoughts?

            regards, tom lane



On Sun, Dec 13, 2020 at 11:49:31AM -0500, Tom Lane wrote:
> But what jumps out at me here is that this sort of error seems way
> too easy to make, and evidently way too hard to detect.  What can we
> do to make it more obvious if one has incorrectly used or omitted
> HASH_BLOBS?  Both directions of error might easily escape notice on
> little-endian hardware.
> 
> I thought of a few ideas, all of which have drawbacks:
> 
> 1. Invert the sense of the flag, ie HASH_BLOBS becomes the default.
> This seems to just move the problem somewhere else, besides which
> it'd require touching an awful lot of callers, and would silently
> break third-party callers.
> 
> 2. Don't allow a default: invent a new HASH_STRING flag, and
> require that hash_create() calls specify exactly one of HASH_BLOBS,
> HASH_STRING, or HASH_FUNCTION.  This doesn't completely fix the
> hazard of mindless-copy-and-paste, but I think it might make it
> a little more obvious.  Still requires touching a lot of calls.

I like (2), for making the bug harder and for greppability.  Probably
pluralize it to HASH_STRINGS, for the parallel with HASH_BLOBS.

> 3. Add some sort of heuristic restriction on keysize.  A keysize
> that's only 4 or 8 bytes almost certainly is not a string.
> This doesn't give us much traction for larger keysizes, though.
> 
> 4. Disallow empty string keys, ie something like "Assert(s_len > 0)"
> in string_hash().  I think we could get away with that given that
> SQL disallows empty identifiers.  However, it would only help to
> catch one direction of error (omitting HASH_BLOBS), and it would
> only help on big-endian hardware, which is getting harder to find.
> Still, we could hope that the buildfarm would detect errors.

It's nontrivial to confirm that the empty-string key can't happen for a given
hash table.  (In contrast, what (3) asserts on is usually a compile-time
constant.)  I would stop short of adding (4), though it could be okay.

> A quick count of grep hits suggest that the large majority of
> existing hash_create() calls use HASH_BLOBS, and there might be
> only order-of-ten calls that would need to be touched if we
> required an explicit HASH_STRING flag.  So option #2 is seeming
> kind of attractive.  Maybe that together with an assertion that
> string keys have to exceed 8 or 16 bytes would be enough protection.

Agreed.  I expect (2) gives most of the benefit.  Requiring 8-byte capacity
should be harmless, and most architectures can zero 8 bytes in one
instruction.  Requiring more bytes trades specificity for sensitivity.

> A different angle we could think about is that the name "HASH_BLOBS"
> is kind of un-obvious.  Maybe we should deprecate that spelling in
> favor of something like "HASH_BINARY".

With (2) in place, I wouldn't worry about renaming HASH_BLOBS.  It's hard to
confuse with HASH_STRINGS or HASH_FUNCTION.  If anything, HASH_BLOBS conveys
something more specific.  HASH_FUNCTION cases see binary data, but that data
has structure that promotes it out of "blob" status.



On 2020-12-13 17:49, Tom Lane wrote:
> 2. Don't allow a default: invent a new HASH_STRING flag, and
> require that hash_create() calls specify exactly one of HASH_BLOBS,
> HASH_STRING, or HASH_FUNCTION.  This doesn't completely fix the
> hazard of mindless-copy-and-paste, but I think it might make it
> a little more obvious.  Still requires touching a lot of calls.

I think this sounds best; we should also expand the documentation of
these flags a bit.

-- 
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/



On Mon, Dec 14, 2020 at 1:36 AM Noah Misch <noah@leadboat.com> wrote:
>
> On Sun, Dec 13, 2020 at 11:49:31AM -0500, Tom Lane wrote:
> > But what jumps out at me here is that this sort of error seems way
> > too easy to make, and evidently way too hard to detect.  What can we
> > do to make it more obvious if one has incorrectly used or omitted
> > HASH_BLOBS?  Both directions of error might easily escape notice on
> > little-endian hardware.
> >
> > I thought of a few ideas, all of which have drawbacks:
> >
> > 1. Invert the sense of the flag, ie HASH_BLOBS becomes the default.
> > This seems to just move the problem somewhere else, besides which
> > it'd require touching an awful lot of callers, and would silently
> > break third-party callers.
> >
> > 2. Don't allow a default: invent a new HASH_STRING flag, and
> > require that hash_create() calls specify exactly one of HASH_BLOBS,
> > HASH_STRING, or HASH_FUNCTION.  This doesn't completely fix the
> > hazard of mindless-copy-and-paste, but I think it might make it
> > a little more obvious.  Still requires touching a lot of calls.
>
> I like (2), for making the bug harder and for greppability.  Probably
> pluralize it to HASH_STRINGS, for the parallel with HASH_BLOBS.
>
> > 3. Add some sort of heuristic restriction on keysize.  A keysize
> > that's only 4 or 8 bytes almost certainly is not a string.
> > This doesn't give us much traction for larger keysizes, though.
> >
> > 4. Disallow empty string keys, ie something like "Assert(s_len > 0)"
> > in string_hash().  I think we could get away with that given that
> > SQL disallows empty identifiers.  However, it would only help to
> > catch one direction of error (omitting HASH_BLOBS), and it would
> > only help on big-endian hardware, which is getting harder to find.
> > Still, we could hope that the buildfarm would detect errors.
>
> It's nontrivial to confirm that the empty-string key can't happen for a given
> hash table.  (In contrast, what (3) asserts on is usually a compile-time
> constant.)  I would stop short of adding (4), though it could be okay.
>
> > A quick count of grep hits suggest that the large majority of
> > existing hash_create() calls use HASH_BLOBS, and there might be
> > only order-of-ten calls that would need to be touched if we
> > required an explicit HASH_STRING flag.  So option #2 is seeming
> > kind of attractive.  Maybe that together with an assertion that
> > string keys have to exceed 8 or 16 bytes would be enough protection.
>
> Agreed.  I expect (2) gives most of the benefit.  Requiring 8-byte capacity
> should be harmless, and most architectures can zero 8 bytes in one
> instruction.  Requiring more bytes trades specificity for sensitivity.
>

+1. I also think that in most cases (2) would be sufficient to avoid
such bugs. Adding a restriction on string size might annoy some
out-of-core users who are already using small strings. However, an
8-byte restriction on string size would still be okay.

-- 
With Regards,
Amit Kapila.



Noah Misch <noah@leadboat.com> writes:
> On Sun, Dec 13, 2020 at 11:49:31AM -0500, Tom Lane wrote:
>> A quick count of grep hits suggest that the large majority of
>> existing hash_create() calls use HASH_BLOBS, and there might be
>> only order-of-ten calls that would need to be touched if we
>> required an explicit HASH_STRING flag.  So option #2 is seeming
>> kind of attractive.  Maybe that together with an assertion that
>> string keys have to exceed 8 or 16 bytes would be enough protection.

> Agreed.  I expect (2) gives most of the benefit.  Requiring 8-byte capacity
> should be harmless, and most architectures can zero 8 bytes in one
> instruction.  Requiring more bytes trades specificity for sensitivity.

Attached is a proposed patch that requires HASH_STRINGS to be stated
explicitly (in the event, there are 13 callers needing that) and insists
on keysize > 8 for string keys.  In examining the now-easily-visible uses
of string keys, almost all of them are using NAMEDATALEN-sized keys, or
in a few places larger values.  Only two are smaller:

1. ShmemIndex uses SHMEM_INDEX_KEYSIZE, which is only set to 48.

2. ResetUnloggedRelationsInDbspaceDir is using OIDCHARS + 1, because
it stores relfilenode OIDs as strings.  That seems pretty damfool
to me, so I'm inclined to change it to store binary OIDs instead;
those'd be a third the size (or probably a quarter the size after
alignment padding) and likely faster to hash or compare.  But I
didn't do that here, since it's still more than 8.  (I did whack
it upside the head to the extent of not storing its temporary
hash table in CacheMemoryContext.)

So it seems to me that insisting on keysize > 8 is fine.

There are a couple of other API oddities that maybe we should think
about while we're here:

* Should we just have a blanket insistence that all callers supply
HASH_ELEM?  The default sizes that dynahash.c uses without that are
undocumented and basically useless.  We're already asserting that
in the HASH_BLOBS path, which is the majority use-case, and this
patch now asserts it for HASH_STRINGS too.

* The coding convention that the HASHCTL argument struct should be
pre-zeroed seems to have been ignored at a lot of call sites.
I added a memset call to a couple of callers that I was touching
in this patch, but I'm having second thoughts about that.  Maybe
we should just rip out all those memsets as pointless, since there's
basically no case where you'd use the memset to fill a field that
you meant to pass as zero.  The fact that hash_create() doesn't
read fields it's not told to by a flag means we should not need
the memsets to avoid uninitialized-memory reads.

            regards, tom lane

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 2dc9e44ae6..8b17fb06eb 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -2604,10 +2604,12 @@ createConnHash(void)
 {
     HASHCTL        ctl;

+    memset(&ctl, 0, sizeof(ctl));
     ctl.keysize = NAMEDATALEN;
     ctl.entrysize = sizeof(remoteConnHashEnt);

-    return hash_create("Remote Con hash", NUMCONN, &ctl, HASH_ELEM);
+    return hash_create("Remote Con hash", NUMCONN, &ctl,
+                       HASH_ELEM | HASH_STRINGS);
 }

 static void
diff --git a/contrib/tablefunc/tablefunc.c b/contrib/tablefunc/tablefunc.c
index 85986ec24a..ec7819ca77 100644
--- a/contrib/tablefunc/tablefunc.c
+++ b/contrib/tablefunc/tablefunc.c
@@ -726,7 +726,7 @@ load_categories_hash(char *cats_sql, MemoryContext per_query_ctx)
     crosstab_hash = hash_create("crosstab hash",
                                 INIT_CATS,
                                 &ctl,
-                                HASH_ELEM | HASH_CONTEXT);
+                                HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);

     /* Connect to SPI manager */
     if ((ret = SPI_connect()) < 0)
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 4b18be5b27..5ba7c2eb3c 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -414,7 +414,7 @@ InitQueryHashTable(void)
     prepared_queries = hash_create("Prepared Queries",
                                    32,
                                    &hash_ctl,
-                                   HASH_ELEM);
+                                   HASH_ELEM | HASH_STRINGS);
 }

 /*
diff --git a/src/backend/nodes/extensible.c b/src/backend/nodes/extensible.c
index ab04459c55..2fe89fd361 100644
--- a/src/backend/nodes/extensible.c
+++ b/src/backend/nodes/extensible.c
@@ -51,7 +51,8 @@ RegisterExtensibleNodeEntry(HTAB **p_htable, const char *htable_label,
         ctl.keysize = EXTNODENAME_MAX_LEN;
         ctl.entrysize = sizeof(ExtensibleNodeEntry);

-        *p_htable = hash_create(htable_label, 100, &ctl, HASH_ELEM);
+        *p_htable = hash_create(htable_label, 100, &ctl,
+                                HASH_ELEM | HASH_STRINGS);
     }

     if (strlen(extnodename) >= EXTNODENAME_MAX_LEN)
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 0c2094f766..f21ab67ae4 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -175,7 +175,9 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
         memset(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(unlogged_relation_entry);
         ctl.entrysize = sizeof(unlogged_relation_entry);
-        hash = hash_create("unlogged hash", 32, &ctl, HASH_ELEM);
+        ctl.hcxt = CurrentMemoryContext;
+        hash = hash_create("unlogged hash", 32, &ctl,
+                           HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);

         /* Scan the directory. */
         dbspace_dir = AllocateDir(dbspacedirname);
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 97716f6aef..0afd87e075 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -292,7 +292,6 @@ void
 InitShmemIndex(void)
 {
     HASHCTL        info;
-    int            hash_flags;

     /*
      * Create the shared memory shmem index.
@@ -302,13 +301,14 @@ InitShmemIndex(void)
      * initializing the ShmemIndex itself.  The special "ShmemIndex" hash
      * table name will tell ShmemInitStruct to fake it.
      */
+    memset(&info, 0, sizeof(info));
     info.keysize = SHMEM_INDEX_KEYSIZE;
     info.entrysize = sizeof(ShmemIndexEnt);
-    hash_flags = HASH_ELEM;

     ShmemIndex = ShmemInitHash("ShmemIndex",
                                SHMEM_INDEX_SIZE, SHMEM_INDEX_SIZE,
-                               &info, hash_flags);
+                               &info,
+                               HASH_ELEM | HASH_STRINGS);
 }

 /*
@@ -329,6 +329,10 @@ InitShmemIndex(void)
  * whose maximum size is certain, this should be equal to max_size; that
  * ensures that no run-time out-of-shared-memory failures can occur.
  *
+ * *infoP and hash_flags should specify at least the entry sizes and key
+ * comparison semantics (see hash_create()).  Flag bits specific to
+ * shared-memory hash tables are added here.
+ *
  * Note: before Postgres 9.0, this function returned NULL for some failure
  * cases.  Now, it always throws error instead, so callers need not check
  * for NULL.
diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c
index 12557ce3af..be0a45b55e 100644
--- a/src/backend/utils/adt/jsonfuncs.c
+++ b/src/backend/utils/adt/jsonfuncs.c
@@ -3446,7 +3446,7 @@ get_json_object_as_hash(char *json, int len, const char *funcname)
     tab = hash_create("json object hashtable",
                       100,
                       &ctl,
-                      HASH_ELEM | HASH_CONTEXT);
+                      HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);

     state = palloc0(sizeof(JHashState));
     sem = palloc0(sizeof(JsonSemAction));
@@ -3838,7 +3838,7 @@ populate_recordset_object_start(void *state)
     _state->json_hash = hash_create("json object hashtable",
                                     100,
                                     &ctl,
-                                    HASH_ELEM | HASH_CONTEXT);
+                                    HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
 }

 static void
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index ad582f99a5..87a3154c1a 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -3471,7 +3471,7 @@ set_rtable_names(deparse_namespace *dpns, List *parent_namespaces,
     names_hash = hash_create("set_rtable_names names",
                              list_length(dpns->rtable),
                              &hash_ctl,
-                             HASH_ELEM | HASH_CONTEXT);
+                             HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
     /* Preload the hash table with names appearing in parent_namespaces */
     foreach(lc, parent_namespaces)
     {
diff --git a/src/backend/utils/fmgr/dfmgr.c b/src/backend/utils/fmgr/dfmgr.c
index bd779fdaf7..e83e30defe 100644
--- a/src/backend/utils/fmgr/dfmgr.c
+++ b/src/backend/utils/fmgr/dfmgr.c
@@ -686,7 +686,7 @@ find_rendezvous_variable(const char *varName)
         rendezvousHash = hash_create("Rendezvous variable hash",
                                      16,
                                      &ctl,
-                                     HASH_ELEM);
+                                     HASH_ELEM | HASH_STRINGS);
     }

     /* Find or create the hashtable entry for this varName */
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index d14d875c93..07cae638df 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -30,11 +30,12 @@
  * dynahash.c provides support for these types of lookup keys:
  *
  * 1. Null-terminated C strings (truncated if necessary to fit in keysize),
- * compared as though by strcmp().  This is the default behavior.
+ * compared as though by strcmp().  This is selected by specifying the
+ * HASH_STRINGS flag to hash_create.
  *
  * 2. Arbitrary binary data of size keysize, compared as though by memcmp().
  * (Caller must ensure there are no undefined padding bits in the keys!)
- * This is selected by specifying HASH_BLOBS flag to hash_create.
+ * This is selected by specifying the HASH_BLOBS flag to hash_create.
  *
  * 3. More complex key behavior can be selected by specifying user-supplied
  * hashing, comparison, and/or key-copying functions.  At least a hashing
@@ -47,8 +48,8 @@
  *   locks.
  * - Shared memory hashes are allocated in a fixed size area at startup and
  *   are discoverable by name from other processes.
- * - Because entries don't need to be moved in the case of hash conflicts, has
- *   better performance for large entries
+ * - Because entries don't need to be moved in the case of hash conflicts,
+ *   dynahash has better performance for large entries.
  * - Guarantees stable pointers to entries.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
@@ -316,6 +317,12 @@ string_compare(const char *key1, const char *key2, Size keysize)
  *    *info: additional table parameters, as indicated by flags
  *    flags: bitmask indicating which parameters to take from *info
  *
+ * The flags value must include exactly one of HASH_STRINGS, HASH_BLOBS,
+ * or HASH_FUNCTION, to define the key hashing semantics (C strings,
+ * binary blobs, or custom, respectively).  Callers specifying a custom
+ * hash function will likely also want to use HASH_COMPARE, and perhaps
+ * also HASH_KEYCOPY, to control key comparison and copying.
+ *
  * Note: for a shared-memory hashtable, nelem needs to be a pretty good
  * estimate, since we can't expand the table on the fly.  But an unshared
  * hashtable can be expanded on-the-fly, so it's better for nelem to be
@@ -370,9 +377,13 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
      * Select the appropriate hash function (see comments at head of file).
      */
     if (flags & HASH_FUNCTION)
+    {
+        Assert(!(flags & (HASH_BLOBS | HASH_STRINGS)));
         hashp->hash = info->hash;
+    }
     else if (flags & HASH_BLOBS)
     {
+        Assert(!(flags & HASH_STRINGS));
         /* We can optimize hashing for common key sizes */
         Assert(flags & HASH_ELEM);
         if (info->keysize == sizeof(uint32))
@@ -381,17 +392,30 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
             hashp->hash = tag_hash;
     }
     else
-        hashp->hash = string_hash;    /* default hash function */
+    {
+        /*
+         * string_hash used to be considered the default hash method, and in a
+         * non-assert build it effectively still is.  But we now consider it
+         * an assertion error to not say HASH_STRINGS explicitly.  To help
+         * catch mistaken usage of HASH_STRINGS, we also insist on a
+         * reasonably long string length: if the keysize is only 4 or 8 bytes,
+         * it's almost certainly an integer or pointer not a string.
+         */
+        Assert(flags & HASH_STRINGS);
+        Assert(flags & HASH_ELEM);
+        Assert(info->keysize > 8);
+
+        hashp->hash = string_hash;
+    }

     /*
      * If you don't specify a match function, it defaults to string_compare if
-     * you used string_hash (either explicitly or by default) and to memcmp
-     * otherwise.
+     * you used string_hash, and to memcmp otherwise.
      *
      * Note: explicitly specifying string_hash is deprecated, because this
      * might not work for callers in loadable modules on some platforms due to
      * referencing a trampoline instead of the string_hash function proper.
-     * Just let it default, eh?
+     * Specify HASH_STRINGS instead.
      */
     if (flags & HASH_COMPARE)
         hashp->match = info->match;
diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index ec6f80ee99..a382c4219b 100644
--- a/src/backend/utils/mmgr/portalmem.c
+++ b/src/backend/utils/mmgr/portalmem.c
@@ -111,6 +111,7 @@ EnablePortalManager(void)
                                              "TopPortalContext",
                                              ALLOCSET_DEFAULT_SIZES);

+    memset(&ctl, 0, sizeof(ctl));
     ctl.keysize = MAX_PORTALNAME_LEN;
     ctl.entrysize = sizeof(PortalHashEnt);

@@ -119,7 +120,7 @@ EnablePortalManager(void)
      * create, initially
      */
     PortalHashTable = hash_create("Portal hash", PORTALS_PER_USER,
-                                  &ctl, HASH_ELEM);
+                                  &ctl, HASH_ELEM | HASH_STRINGS);
 }

 /*
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index bebf89b3c4..666ad33567 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -82,7 +82,8 @@ typedef struct HASHCTL
 #define HASH_PARTITION    0x0001    /* Hashtable is used w/partitioned locking */
 #define HASH_SEGMENT    0x0002    /* Set segment size */
 #define HASH_DIRSIZE    0x0004    /* Set directory size (initial and max) */
-#define HASH_ELEM        0x0010    /* Set keysize and entrysize */
+#define HASH_ELEM        0x0008    /* Set keysize and entrysize */
+#define HASH_STRINGS    0x0010    /* Select support functions for string keys */
 #define HASH_BLOBS        0x0020    /* Select support functions for binary keys */
 #define HASH_FUNCTION    0x0040    /* Set user defined hash function */
 #define HASH_COMPARE    0x0080    /* Set user defined comparison function */
@@ -119,7 +120,8 @@ typedef struct
  *
  * Note: It is deprecated for callers of hash_create to explicitly specify
  * string_hash, tag_hash, uint32_hash, or oid_hash.  Just set HASH_BLOBS or
- * not.  Use HASH_FUNCTION only when you want something other than those.
+ * HASH_STRINGS.  Use HASH_FUNCTION only when you want something other than
+ * one of these.
  */
 extern HTAB *hash_create(const char *tabname, long nelem,
                          HASHCTL *info, int flags);
diff --git a/src/pl/plperl/plperl.c b/src/pl/plperl/plperl.c
index 4de756455d..60f5d66264 100644
--- a/src/pl/plperl/plperl.c
+++ b/src/pl/plperl/plperl.c
@@ -586,7 +586,7 @@ select_perl_context(bool trusted)
         interp_desc->query_hash = hash_create("PL/Perl queries",
                                               32,
                                               &hash_ctl,
-                                              HASH_ELEM);
+                                              HASH_ELEM | HASH_STRINGS);
     }

     /*
diff --git a/src/timezone/pgtz.c b/src/timezone/pgtz.c
index 3f0fb51e91..5240cab022 100644
--- a/src/timezone/pgtz.c
+++ b/src/timezone/pgtz.c
@@ -211,7 +211,7 @@ init_timezone_hashtable(void)
     timezone_cache = hash_create("Timezones",
                                  4,
                                  &hash_ctl,
-                                 HASH_ELEM);
+                                 HASH_ELEM | HASH_STRINGS);
     if (!timezone_cache)
         return false;


I wrote:
> There are a couple of other API oddities that maybe we should think
> about while we're here:

> * Should we just have a blanket insistence that all callers supply
> HASH_ELEM?  The default sizes that dynahash.c uses without that are
> undocumented and basically useless.  We're already asserting that
> in the HASH_BLOBS path, which is the majority use-case, and this
> patch now asserts it for HASH_STRINGS too.

Here's a follow-up patch for that part, which also tries to respond
a bit to Heikki's complaint about skimpy documentation.  While at it,
I const-ified the HASHCTL argument, since there's no need for
hash_create to modify that.

            regards, tom lane

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 07cae638df..49f21b77bb 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -317,11 +317,20 @@ string_compare(const char *key1, const char *key2, Size keysize)
  *    *info: additional table parameters, as indicated by flags
  *    flags: bitmask indicating which parameters to take from *info
  *
- * The flags value must include exactly one of HASH_STRINGS, HASH_BLOBS,
+ * The flags value *must* include HASH_ELEM.  (Formerly, this was nominally
+ * optional, but the default keysize and entrysize values were useless.)
+ * The flags value must also include exactly one of HASH_STRINGS, HASH_BLOBS,
  * or HASH_FUNCTION, to define the key hashing semantics (C strings,
  * binary blobs, or custom, respectively).  Callers specifying a custom
  * hash function will likely also want to use HASH_COMPARE, and perhaps
  * also HASH_KEYCOPY, to control key comparison and copying.
+ * Another often-used flag is HASH_CONTEXT, to allocate the hash table
+ * under info->hcxt rather than under TopMemoryContext; the default
+ * behavior is only suitable for session-lifespan hash tables.
+ * Other flags bits are special-purpose and seldom used.
+ *
+ * Fields in *info are read only when the associated flags bit is set.
+ * It is not necessary to initialize other fields of *info.
  *
  * Note: for a shared-memory hashtable, nelem needs to be a pretty good
  * estimate, since we can't expand the table on the fly.  But an unshared
@@ -330,11 +339,19 @@ string_compare(const char *key1, const char *key2, Size keysize)
  * large nelem will penalize hash_seq_search speed without buying much.
  */
 HTAB *
-hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
+hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
     HTAB       *hashp;
     HASHHDR    *hctl;

+    /*
+     * Hash tables now allocate space for key and data, but you have to say
+     * how much space to allocate.
+     */
+    Assert(flags & HASH_ELEM);
+    Assert(info->keysize > 0);
+    Assert(info->entrysize >= info->keysize);
+
     /*
      * For shared hash tables, we have a local hash header (HTAB struct) that
      * we allocate in TopMemoryContext; all else is in shared memory.
@@ -385,7 +402,6 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
     {
         Assert(!(flags & HASH_STRINGS));
         /* We can optimize hashing for common key sizes */
-        Assert(flags & HASH_ELEM);
         if (info->keysize == sizeof(uint32))
             hashp->hash = uint32_hash;
         else
@@ -402,7 +418,6 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
          * it's almost certainly an integer or pointer not a string.
          */
         Assert(flags & HASH_STRINGS);
-        Assert(flags & HASH_ELEM);
         Assert(info->keysize > 8);

         hashp->hash = string_hash;
@@ -529,16 +544,9 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
         hctl->dsize = info->dsize;
     }

-    /*
-     * hash table now allocates space for key and data but you have to say how
-     * much space to allocate
-     */
-    if (flags & HASH_ELEM)
-    {
-        Assert(info->entrysize >= info->keysize);
-        hctl->keysize = info->keysize;
-        hctl->entrysize = info->entrysize;
-    }
+    /* remember the entry sizes, too */
+    hctl->keysize = info->keysize;
+    hctl->entrysize = info->entrysize;

     /* make local copies of heavily-used constant fields */
     hashp->keysize = hctl->keysize;
@@ -617,10 +625,6 @@ hdefault(HTAB *hashp)
     hctl->dsize = DEF_DIRSIZE;
     hctl->nsegs = 0;

-    /* rather pointless defaults for key & entry size */
-    hctl->keysize = sizeof(char *);
-    hctl->entrysize = 2 * sizeof(char *);
-
     hctl->num_partitions = 0;    /* not partitioned */

     /* table has no fixed maximum size */
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 666ad33567..c3daaae92b 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -124,7 +124,7 @@ typedef struct
  * one of these.
  */
 extern HTAB *hash_create(const char *tabname, long nelem,
-                         HASHCTL *info, int flags);
+                         const HASHCTL *info, int flags);
 extern void hash_destroy(HTAB *hashp);
 extern void hash_stats(const char *where, HTAB *hashp);
 extern void *hash_search(HTAB *hashp, const void *keyPtr, HASHACTION action,

Here's a rolled-up patch that does some further documentation work
and gets rid of the unnecessary memsets as well.
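A hedged sketch of what the converted call sites roughly look like after this change (the entry type and names are invented, not taken from the patch): no MemSet of the HASHCTL, only the flag-selected fields filled in, string keys saying HASH_STRINGS explicitly, and HASH_CONTEXT when the table should live in a specific memory context.

#include "postgres.h"
#include "utils/hsearch.h"

typedef struct DemoNameEntry
{
	char		name[NAMEDATALEN];	/* hash key; must be first */
	int			refcount;
} DemoNameEntry;

static HTAB *
create_demo_name_hash(MemoryContext cxt)
{
	HASHCTL		ctl;

	/* no memset needed: only flag-selected fields are read */
	ctl.keysize = NAMEDATALEN;
	ctl.entrysize = sizeof(DemoNameEntry);
	ctl.hcxt = cxt;

	return hash_create("demo name hash", 32, &ctl,
					   HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
}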

            regards, tom lane

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 2dc9e44ae6..651227f510 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -2607,7 +2607,8 @@ createConnHash(void)
     ctl.keysize = NAMEDATALEN;
     ctl.entrysize = sizeof(remoteConnHashEnt);

-    return hash_create("Remote Con hash", NUMCONN, &ctl, HASH_ELEM);
+    return hash_create("Remote Con hash", NUMCONN, &ctl,
+                       HASH_ELEM | HASH_STRINGS);
 }

 static void
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 70cfdb2c9d..2f00344b7f 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -567,7 +567,6 @@ pgss_shmem_startup(void)
         pgss->stats.dealloc = 0;
     }

-    memset(&info, 0, sizeof(info));
     info.keysize = sizeof(pgssHashKey);
     info.entrysize = sizeof(pgssEntry);
     pgss_hash = ShmemInitHash("pg_stat_statements hash",
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index ab3226287d..66581e5414 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -119,14 +119,11 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
     {
         HASHCTL        ctl;

-        MemSet(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(ConnCacheKey);
         ctl.entrysize = sizeof(ConnCacheEntry);
-        /* allocate ConnectionHash in the cache context */
-        ctl.hcxt = CacheMemoryContext;
         ConnectionHash = hash_create("postgres_fdw connections", 8,
                                      &ctl,
-                                     HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+                                     HASH_ELEM | HASH_BLOBS);

         /*
          * Register some callback functions that manage connection cleanup.
diff --git a/contrib/postgres_fdw/shippable.c b/contrib/postgres_fdw/shippable.c
index 3433c19712..b4766dc5ff 100644
--- a/contrib/postgres_fdw/shippable.c
+++ b/contrib/postgres_fdw/shippable.c
@@ -93,7 +93,6 @@ InitializeShippableCache(void)
     HASHCTL        ctl;

     /* Create the hash table. */
-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(ShippableCacheKey);
     ctl.entrysize = sizeof(ShippableCacheEntry);
     ShippableCacheHash =
diff --git a/contrib/tablefunc/tablefunc.c b/contrib/tablefunc/tablefunc.c
index 85986ec24a..e9a9741154 100644
--- a/contrib/tablefunc/tablefunc.c
+++ b/contrib/tablefunc/tablefunc.c
@@ -714,7 +714,6 @@ load_categories_hash(char *cats_sql, MemoryContext per_query_ctx)
     MemoryContext SPIcontext;

     /* initialize the category hash table */
-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = MAX_CATNAME_LEN;
     ctl.entrysize = sizeof(crosstab_HashEnt);
     ctl.hcxt = per_query_ctx;
@@ -726,7 +725,7 @@ load_categories_hash(char *cats_sql, MemoryContext per_query_ctx)
     crosstab_hash = hash_create("crosstab hash",
                                 INIT_CATS,
                                 &ctl,
-                                HASH_ELEM | HASH_CONTEXT);
+                                HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);

     /* Connect to SPI manager */
     if ((ret = SPI_connect()) < 0)
diff --git a/src/backend/access/gist/gistbuildbuffers.c b/src/backend/access/gist/gistbuildbuffers.c
index 4ad67c88b4..217c199a14 100644
--- a/src/backend/access/gist/gistbuildbuffers.c
+++ b/src/backend/access/gist/gistbuildbuffers.c
@@ -76,7 +76,6 @@ gistInitBuildBuffers(int pagesPerBuffer, int levelStep, int maxLevel)
      * nodeBuffersTab hash is association between index blocks and it's
      * buffers.
      */
-    memset(&hashCtl, 0, sizeof(hashCtl));
     hashCtl.keysize = sizeof(BlockNumber);
     hashCtl.entrysize = sizeof(GISTNodeBuffer);
     hashCtl.hcxt = CurrentMemoryContext;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index a664ecf494..c77a189907 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -1363,7 +1363,6 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
     bool        found;

     /* Initialize hash tables used to track TIDs */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(ItemPointerData);
     hash_ctl.entrysize = sizeof(ItemPointerData);
     hash_ctl.hcxt = CurrentMemoryContext;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 39e33763df..65942cc428 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -266,7 +266,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
     state->rs_cxt = rw_cxt;

     /* Initialize hash tables used to track update chains */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(TidHashKey);
     hash_ctl.entrysize = sizeof(UnresolvedTupData);
     hash_ctl.hcxt = state->rs_cxt;
@@ -824,7 +823,6 @@ logical_begin_heap_rewrite(RewriteState state)
     state->rs_begin_lsn = GetXLogInsertRecPtr();
     state->rs_num_rewrite_mappings = 0;

-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(TransactionId);
     hash_ctl.entrysize = sizeof(RewriteMappingFile);
     hash_ctl.hcxt = state->rs_cxt;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 32a3099c1f..e0ca3859a9 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -113,7 +113,6 @@ log_invalid_page(RelFileNode node, ForkNumber forkno, BlockNumber blkno,
         /* create hash table when first needed */
         HASHCTL        ctl;

-        memset(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(xl_invalid_page_key);
         ctl.entrysize = sizeof(xl_invalid_page);

diff --git a/src/backend/catalog/pg_enum.c b/src/backend/catalog/pg_enum.c
index 6a2c6685a0..f2e7bab62a 100644
--- a/src/backend/catalog/pg_enum.c
+++ b/src/backend/catalog/pg_enum.c
@@ -188,7 +188,6 @@ init_enum_blacklist(void)
 {
     HASHCTL        hash_ctl;

-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(Oid);
     hash_ctl.entrysize = sizeof(Oid);
     hash_ctl.hcxt = TopTransactionContext;
diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index 17f37eb39f..5c3c78a0e6 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -171,7 +171,6 @@ find_all_inheritors(Oid parentrelId, LOCKMODE lockmode, List **numparents)
                *rel_numparents;
     ListCell   *l;

-    memset(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(SeenRelsEntry);
     ctl.hcxt = CurrentMemoryContext;
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index c0763c63e2..e04afd9963 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2375,7 +2375,6 @@ AddEventToPendingNotifies(Notification *n)
         ListCell   *l;

         /* Create the hash table */
-        MemSet(&hash_ctl, 0, sizeof(hash_ctl));
         hash_ctl.keysize = sizeof(Notification *);
         hash_ctl.entrysize = sizeof(NotificationHash);
         hash_ctl.hash = notification_hash;
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 4b18be5b27..89087a7be3 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -406,15 +406,13 @@ InitQueryHashTable(void)
 {
     HASHCTL        hash_ctl;

-    MemSet(&hash_ctl, 0, sizeof(hash_ctl));
-
     hash_ctl.keysize = NAMEDATALEN;
     hash_ctl.entrysize = sizeof(PreparedStatement);

     prepared_queries = hash_create("Prepared Queries",
                                    32,
                                    &hash_ctl,
-                                   HASH_ELEM);
+                                   HASH_ELEM | HASH_STRINGS);
 }

 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 632b34af61..fa2eea8af2 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -1087,7 +1087,6 @@ create_seq_hashtable(void)
 {
     HASHCTL        ctl;

-    memset(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(SeqTableData);

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 86594bd056..97bfc8bd71 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -521,7 +521,6 @@ ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
     HTAB       *htab;
     int            i;

-    memset(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(SubplanResultRelHashElem);
     ctl.hcxt = CurrentMemoryContext;
diff --git a/src/backend/nodes/extensible.c b/src/backend/nodes/extensible.c
index ab04459c55..3a6cfc44d3 100644
--- a/src/backend/nodes/extensible.c
+++ b/src/backend/nodes/extensible.c
@@ -47,11 +47,11 @@ RegisterExtensibleNodeEntry(HTAB **p_htable, const char *htable_label,
     {
         HASHCTL        ctl;

-        memset(&ctl, 0, sizeof(HASHCTL));
         ctl.keysize = EXTNODENAME_MAX_LEN;
         ctl.entrysize = sizeof(ExtensibleNodeEntry);

-        *p_htable = hash_create(htable_label, 100, &ctl, HASH_ELEM);
+        *p_htable = hash_create(htable_label, 100, &ctl,
+                                HASH_ELEM | HASH_STRINGS);
     }

     if (strlen(extnodename) >= EXTNODENAME_MAX_LEN)
diff --git a/src/backend/optimizer/util/predtest.c b/src/backend/optimizer/util/predtest.c
index 0edd873dca..d6e83e5f8e 100644
--- a/src/backend/optimizer/util/predtest.c
+++ b/src/backend/optimizer/util/predtest.c
@@ -1982,7 +1982,6 @@ lookup_proof_cache(Oid pred_op, Oid clause_op, bool refute_it)
         /* First time through: initialize the hash table */
         HASHCTL        ctl;

-        MemSet(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(OprProofCacheKey);
         ctl.entrysize = sizeof(OprProofCacheEntry);
         OprProofCacheHash = hash_create("Btree proof lookup cache", 256,
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 76245c1ff3..9c9a738c80 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -400,7 +400,6 @@ build_join_rel_hash(PlannerInfo *root)
     ListCell   *l;

     /* Create the hash table */
-    MemSet(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(Relids);
     hash_ctl.entrysize = sizeof(JoinHashEntry);
     hash_ctl.hash = bitmap_hash;
diff --git a/src/backend/parser/parse_oper.c b/src/backend/parser/parse_oper.c
index 6613a3a8f8..e72d3676f1 100644
--- a/src/backend/parser/parse_oper.c
+++ b/src/backend/parser/parse_oper.c
@@ -999,7 +999,6 @@ find_oper_cache_entry(OprCacheKey *key)
         /* First time through: initialize the hash table */
         HASHCTL        ctl;

-        MemSet(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(OprCacheKey);
         ctl.entrysize = sizeof(OprCacheEntry);
         OprCacheHash = hash_create("Operator lookup cache", 256,
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index 9a292290ed..5b0a15ac0b 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -286,13 +286,13 @@ CreatePartitionDirectory(MemoryContext mcxt)
     PartitionDirectory pdir;
     HASHCTL        ctl;

-    MemSet(&ctl, 0, sizeof(HASHCTL));
+    pdir = palloc(sizeof(PartitionDirectoryData));
+    pdir->pdir_mcxt = mcxt;
+
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(PartitionDirectoryEntry);
     ctl.hcxt = mcxt;

-    pdir = palloc(sizeof(PartitionDirectoryData));
-    pdir->pdir_mcxt = mcxt;
     pdir->pdir_hash = hash_create("partition directory", 256, &ctl,
                                   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 7e28944d2f..ed127a1032 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2043,7 +2043,6 @@ do_autovacuum(void)
     pg_class_desc = CreateTupleDescCopy(RelationGetDescr(classRel));

     /* create hash table for toast <-> main relid mapping */
-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(av_relation);

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 429c8010ef..a62c6d4d0a 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1161,7 +1161,6 @@ CompactCheckpointerRequestQueue(void)
     skip_slot = palloc0(sizeof(bool) * CheckpointerShmem->num_requests);

     /* Initialize temporary hash table */
-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(CheckpointerRequest);
     ctl.entrysize = sizeof(struct CheckpointerSlotMapping);
     ctl.hcxt = CurrentMemoryContext;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7c75a25d21..6b60f293e9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1265,7 +1265,6 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     HeapTuple    tup;
     Snapshot    snapshot;

-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(Oid);
     hash_ctl.entrysize = sizeof(Oid);
     hash_ctl.hcxt = CurrentMemoryContext;
@@ -1815,7 +1814,6 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         /* First time through - initialize function stat table */
         HASHCTL        hash_ctl;

-        memset(&hash_ctl, 0, sizeof(hash_ctl));
         hash_ctl.keysize = sizeof(Oid);
         hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
         pgStatFunctions = hash_create("Function stat entries",
@@ -1975,7 +1973,6 @@ get_tabstat_entry(Oid rel_id, bool isshared)
     {
         HASHCTL        ctl;

-        memset(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(Oid);
         ctl.entrysize = sizeof(TabStatHashEntry);

@@ -4994,7 +4991,6 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
     dbentry->stat_reset_timestamp = GetCurrentTimestamp();
     dbentry->stats_timestamp = 0;

-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(Oid);
     hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
     dbentry->tables = hash_create("Per-database table",
@@ -5423,7 +5419,6 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Create the DB hashtable
      */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(Oid);
     hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
     hash_ctl.hcxt = pgStatLocalContext;
@@ -5608,7 +5603,6 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                         break;
                 }

-                memset(&hash_ctl, 0, sizeof(hash_ctl));
                 hash_ctl.keysize = sizeof(Oid);
                 hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
                 hash_ctl.hcxt = pgStatLocalContext;
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 07aa52977f..f4dbbbe2dd 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -111,7 +111,6 @@ logicalrep_relmap_init(void)
                                   ALLOCSET_DEFAULT_SIZES);

     /* Initialize the relation hash table. */
-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(LogicalRepRelId);
     ctl.entrysize = sizeof(LogicalRepRelMapEntry);
     ctl.hcxt = LogicalRepRelMapContext;
@@ -120,7 +119,6 @@ logicalrep_relmap_init(void)
                                    HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

     /* Initialize the type hash table. */
-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(LogicalRepTyp);
     ctl.hcxt = LogicalRepRelMapContext;
@@ -606,7 +604,6 @@ logicalrep_partmap_init(void)
                                   ALLOCSET_DEFAULT_SIZES);

     /* Initialize the relation hash table. */
-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);    /* partition OID */
     ctl.entrysize = sizeof(LogicalRepPartMapEntry);
     ctl.hcxt = LogicalRepPartMapContext;
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 15dc51a94d..7359fa9df2 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1619,8 +1619,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
     if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
         return;

-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-
     hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
     hash_ctl.entrysize = sizeof(ReorderBufferTupleCidEnt);
     hash_ctl.hcxt = rb->context;
@@ -4116,7 +4114,6 @@ ReorderBufferToastInitHash(ReorderBuffer *rb, ReorderBufferTXN *txn)

     Assert(txn->toast_hash == NULL);

-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(Oid);
     hash_ctl.entrysize = sizeof(ReorderBufferToastEnt);
     hash_ctl.hcxt = rb->context;
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 1904f3471c..6259606537 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -372,7 +372,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
     {
         HASHCTL        ctl;

-        memset(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(Oid);
         ctl.entrysize = sizeof(struct tablesync_start_time_mapping);
         last_start_times = hash_create("Logical replication table sync worker start times",
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997aed83..49d25b02d7 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -867,22 +867,18 @@ static void
 init_rel_sync_cache(MemoryContext cachectx)
 {
     HASHCTL        ctl;
-    MemoryContext old_ctxt;

     if (RelationSyncCache != NULL)
         return;

     /* Make a new hash table for the cache */
-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(RelationSyncEntry);
     ctl.hcxt = cachectx;

-    old_ctxt = MemoryContextSwitchTo(cachectx);
     RelationSyncCache = hash_create("logical replication output relation cache",
                                     128, &ctl,
                                     HASH_ELEM | HASH_CONTEXT | HASH_BLOBS);
-    (void) MemoryContextSwitchTo(old_ctxt);

     Assert(RelationSyncCache != NULL);

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ad0d1a9abc..c5e8707151 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2505,7 +2505,6 @@ InitBufferPoolAccess(void)

     memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));

-    MemSet(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(int32);
     hash_ctl.entrysize = sizeof(PrivateRefCountEntry);

diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 6ffd7b3306..cd3475e9e1 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -465,7 +465,6 @@ InitLocalBuffers(void)
     }

     /* Create the lookup hash table */
-    MemSet(&info, 0, sizeof(info));
     info.keysize = sizeof(BufferTag);
     info.entrysize = sizeof(LocalBufferLookupEnt);

diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 0c2094f766..8700f7f19a 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -30,7 +30,7 @@ static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,

 typedef struct
 {
-    char        oid[OIDCHARS + 1];
+    Oid            reloid;            /* hash key */
 } unlogged_relation_entry;

 /*
@@ -172,10 +172,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
          * need to be reset.  Otherwise, this cleanup operation would be
          * O(n^2).
          */
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(unlogged_relation_entry);
+        ctl.keysize = sizeof(Oid);
         ctl.entrysize = sizeof(unlogged_relation_entry);
-        hash = hash_create("unlogged hash", 32, &ctl, HASH_ELEM);
+        ctl.hcxt = CurrentMemoryContext;
+        hash = hash_create("unlogged relation OIDs", 32, &ctl,
+                           HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

         /* Scan the directory. */
         dbspace_dir = AllocateDir(dbspacedirname);
@@ -198,9 +199,8 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
              * Put the OID portion of the name into the hash table, if it
              * isn't already.
              */
-            memset(ent.oid, 0, sizeof(ent.oid));
-            memcpy(ent.oid, de->d_name, oidchars);
-            hash_search(hash, &ent, HASH_ENTER, NULL);
+            ent.reloid = atooid(de->d_name);
+            (void) hash_search(hash, &ent, HASH_ENTER, NULL);
         }

         /* Done with the first pass. */
@@ -224,7 +224,6 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
         {
             ForkNumber    forkNum;
             int            oidchars;
-            bool        found;
             unlogged_relation_entry ent;

             /* Skip anything that doesn't look like a relation data file. */
@@ -238,14 +237,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)

             /*
              * See whether the OID portion of the name shows up in the hash
-             * table.
+             * table.  If so, nuke it!
              */
-            memset(ent.oid, 0, sizeof(ent.oid));
-            memcpy(ent.oid, de->d_name, oidchars);
-            hash_search(hash, &ent, HASH_FIND, &found);
-
-            /* If so, nuke it! */
-            if (found)
+            ent.reloid = atooid(de->d_name);
+            if (hash_search(hash, &ent, HASH_FIND, NULL))
             {
                 snprintf(rm_path, sizeof(rm_path), "%s/%s",
                          dbspacedirname, de->d_name);
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 97716f6aef..b0fc9f160d 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -292,7 +292,6 @@ void
 InitShmemIndex(void)
 {
     HASHCTL        info;
-    int            hash_flags;

     /*
      * Create the shared memory shmem index.
@@ -304,11 +303,11 @@ InitShmemIndex(void)
      */
     info.keysize = SHMEM_INDEX_KEYSIZE;
     info.entrysize = sizeof(ShmemIndexEnt);
-    hash_flags = HASH_ELEM;

     ShmemIndex = ShmemInitHash("ShmemIndex",
                                SHMEM_INDEX_SIZE, SHMEM_INDEX_SIZE,
-                               &info, hash_flags);
+                               &info,
+                               HASH_ELEM | HASH_STRINGS);
 }

 /*
@@ -329,6 +328,11 @@ InitShmemIndex(void)
  * whose maximum size is certain, this should be equal to max_size; that
  * ensures that no run-time out-of-shared-memory failures can occur.
  *
+ * *infoP and hash_flags should specify at least the entry sizes and key
+ * comparison semantics (see hash_create()).  Flag bits and values specific
+ * to shared-memory hash tables are added here, except that callers may
+ * choose to specify HASH_PARTITION and/or HASH_FIXED_SIZE.
+ *
  * Note: before Postgres 9.0, this function returned NULL for some failure
  * cases.  Now, it always throws error instead, so callers need not check
  * for NULL.
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 52b2809dac..4ea3cf1f5c 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -81,7 +81,6 @@ InitRecoveryTransactionEnvironment(void)
      * Initialize the hash table for tracking the list of locks held by each
      * transaction.
      */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(TransactionId);
     hash_ctl.entrysize = sizeof(RecoveryLockListsEntry);
     RecoveryLockLists = hash_create("RecoveryLockLists",
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index d86566f455..53472dd21e 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -419,7 +419,6 @@ InitLocks(void)
      * Allocate hash table for LOCK structs.  This stores per-locked-object
      * information.
      */
-    MemSet(&info, 0, sizeof(info));
     info.keysize = sizeof(LOCKTAG);
     info.entrysize = sizeof(LOCK);
     info.num_partitions = NUM_LOCK_PARTITIONS;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 108e652179..26bcce9735 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -342,7 +342,6 @@ init_lwlock_stats(void)
                                              ALLOCSET_DEFAULT_SIZES);
     MemoryContextAllowInCriticalSection(lwlock_stats_cxt, true);

-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(lwlock_stats_key);
     ctl.entrysize = sizeof(lwlock_stats);
     ctl.hcxt = lwlock_stats_cxt;
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 8a365b400c..e42e131543 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1096,7 +1096,6 @@ InitPredicateLocks(void)
      * Allocate hash table for PREDICATELOCKTARGET structs.  This stores
      * per-predicate-lock-target information.
      */
-    MemSet(&info, 0, sizeof(info));
     info.keysize = sizeof(PREDICATELOCKTARGETTAG);
     info.entrysize = sizeof(PREDICATELOCKTARGET);
     info.num_partitions = NUM_PREDICATELOCK_PARTITIONS;
@@ -1129,7 +1128,6 @@ InitPredicateLocks(void)
      * Allocate hash table for PREDICATELOCK structs.  This stores per
      * xact-lock-of-a-target information.
      */
-    MemSet(&info, 0, sizeof(info));
     info.keysize = sizeof(PREDICATELOCKTAG);
     info.entrysize = sizeof(PREDICATELOCK);
     info.hash = predicatelock_hash;
@@ -1212,7 +1210,6 @@ InitPredicateLocks(void)
      * Allocate hash table for SERIALIZABLEXID structs.  This stores per-xid
      * information for serializable transactions which have accessed data.
      */
-    MemSet(&info, 0, sizeof(info));
     info.keysize = sizeof(SERIALIZABLEXIDTAG);
     info.entrysize = sizeof(SERIALIZABLEXID);

@@ -1853,7 +1850,6 @@ CreateLocalPredicateLockHash(void)

     /* Initialize the backend-local hash table of parent locks */
     Assert(LocalPredicateLockHash == NULL);
-    MemSet(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(PREDICATELOCKTARGETTAG);
     hash_ctl.entrysize = sizeof(LOCALPREDICATELOCK);
     LocalPredicateLockHash = hash_create("Local predicate lock",
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dcc09df0c7..072bdd118f 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -154,7 +154,6 @@ smgropen(RelFileNode rnode, BackendId backend)
         /* First time through: initialize the hash table */
         HASHCTL        ctl;

-        MemSet(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(RelFileNodeBackend);
         ctl.entrysize = sizeof(SMgrRelationData);
         SMgrRelationHash = hash_create("smgr relation table", 400,
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 1d635d596c..a49588f6b9 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -150,7 +150,6 @@ InitSync(void)
                                               ALLOCSET_DEFAULT_SIZES);
         MemoryContextAllowInCriticalSection(pendingOpsCxt, true);

-        MemSet(&hash_ctl, 0, sizeof(hash_ctl));
         hash_ctl.keysize = sizeof(FileTag);
         hash_ctl.entrysize = sizeof(PendingFsyncEntry);
         hash_ctl.hcxt = pendingOpsCxt;
diff --git a/src/backend/tsearch/ts_typanalyze.c b/src/backend/tsearch/ts_typanalyze.c
index 2eed0cd137..19e9611a3a 100644
--- a/src/backend/tsearch/ts_typanalyze.c
+++ b/src/backend/tsearch/ts_typanalyze.c
@@ -180,7 +180,6 @@ compute_tsvector_stats(VacAttrStats *stats,
      * worry about overflowing the initial size. Also we don't need to pay any
      * attention to locking and memory management.
      */
-    MemSet(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(LexemeHashKey);
     hash_ctl.entrysize = sizeof(TrackItem);
     hash_ctl.hash = lexeme_hash;
diff --git a/src/backend/utils/adt/array_typanalyze.c b/src/backend/utils/adt/array_typanalyze.c
index 4912cabc61..cb2a834193 100644
--- a/src/backend/utils/adt/array_typanalyze.c
+++ b/src/backend/utils/adt/array_typanalyze.c
@@ -277,7 +277,6 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
      * worry about overflowing the initial size. Also we don't need to pay any
      * attention to locking and memory management.
      */
-    MemSet(&elem_hash_ctl, 0, sizeof(elem_hash_ctl));
     elem_hash_ctl.keysize = sizeof(Datum);
     elem_hash_ctl.entrysize = sizeof(TrackItem);
     elem_hash_ctl.hash = element_hash;
@@ -289,7 +288,6 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
                                HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);

     /* hashtable for array distinct elements counts */
-    MemSet(&count_hash_ctl, 0, sizeof(count_hash_ctl));
     count_hash_ctl.keysize = sizeof(int);
     count_hash_ctl.entrysize = sizeof(DECountItem);
     count_hash_ctl.hcxt = CurrentMemoryContext;
diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c
index 12557ce3af..7a25415078 100644
--- a/src/backend/utils/adt/jsonfuncs.c
+++ b/src/backend/utils/adt/jsonfuncs.c
@@ -3439,14 +3439,13 @@ get_json_object_as_hash(char *json, int len, const char *funcname)
     JsonLexContext *lex = makeJsonLexContextCstringLen(json, len, GetDatabaseEncoding(), true);
     JsonSemAction *sem;

-    memset(&ctl, 0, sizeof(ctl));
     ctl.keysize = NAMEDATALEN;
     ctl.entrysize = sizeof(JsonHashEntry);
     ctl.hcxt = CurrentMemoryContext;
     tab = hash_create("json object hashtable",
                       100,
                       &ctl,
-                      HASH_ELEM | HASH_CONTEXT);
+                      HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);

     state = palloc0(sizeof(JHashState));
     sem = palloc0(sizeof(JsonSemAction));
@@ -3831,14 +3830,13 @@ populate_recordset_object_start(void *state)
         return;

     /* Object at level 1: set up a new hash table for this object */
-    memset(&ctl, 0, sizeof(ctl));
     ctl.keysize = NAMEDATALEN;
     ctl.entrysize = sizeof(JsonHashEntry);
     ctl.hcxt = CurrentMemoryContext;
     _state->json_hash = hash_create("json object hashtable",
                                     100,
                                     &ctl,
-                                    HASH_ELEM | HASH_CONTEXT);
+                                    HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
 }

 static void
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index b6d05ac98d..c39d67645c 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1297,7 +1297,6 @@ lookup_collation_cache(Oid collation, bool set_flags)
         /* First time through, initialize the hash table */
         HASHCTL        ctl;

-        memset(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(Oid);
         ctl.entrysize = sizeof(collation_cache_entry);
         collation_cache = hash_create("Collation cache", 100, &ctl,
diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 02b1a3868f..5ab134a853 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -2540,7 +2540,6 @@ ri_InitHashTables(void)
 {
     HASHCTL        ctl;

-    memset(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(RI_ConstraintInfo);
     ri_constraint_cache = hash_create("RI constraint cache",
@@ -2552,14 +2551,12 @@ ri_InitHashTables(void)
                                   InvalidateConstraintCacheCallBack,
                                   (Datum) 0);

-    memset(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(RI_QueryKey);
     ctl.entrysize = sizeof(RI_QueryHashEntry);
     ri_query_cache = hash_create("RI query cache",
                                  RI_INIT_QUERYHASHSIZE,
                                  &ctl, HASH_ELEM | HASH_BLOBS);

-    memset(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(RI_CompareKey);
     ctl.entrysize = sizeof(RI_CompareHashEntry);
     ri_compare_cache = hash_create("RI compare cache",
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index ad582f99a5..7d4443e807 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -3464,14 +3464,14 @@ set_rtable_names(deparse_namespace *dpns, List *parent_namespaces,
      * We use a hash table to hold known names, so that this process is O(N)
      * not O(N^2) for N names.
      */
-    MemSet(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = NAMEDATALEN;
     hash_ctl.entrysize = sizeof(NameHashEntry);
     hash_ctl.hcxt = CurrentMemoryContext;
     names_hash = hash_create("set_rtable_names names",
                              list_length(dpns->rtable),
                              &hash_ctl,
-                             HASH_ELEM | HASH_CONTEXT);
+                             HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
+
     /* Preload the hash table with names appearing in parent_namespaces */
     foreach(lc, parent_namespaces)
     {
diff --git a/src/backend/utils/cache/attoptcache.c b/src/backend/utils/cache/attoptcache.c
index 05ac366b40..934a84e03f 100644
--- a/src/backend/utils/cache/attoptcache.c
+++ b/src/backend/utils/cache/attoptcache.c
@@ -79,7 +79,6 @@ InitializeAttoptCache(void)
     HASHCTL        ctl;

     /* Initialize the hash table. */
-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(AttoptCacheKey);
     ctl.entrysize = sizeof(AttoptCacheEntry);
     AttoptCacheHash =
diff --git a/src/backend/utils/cache/evtcache.c b/src/backend/utils/cache/evtcache.c
index 0427795395..0877bc7e0e 100644
--- a/src/backend/utils/cache/evtcache.c
+++ b/src/backend/utils/cache/evtcache.c
@@ -118,7 +118,6 @@ BuildEventTriggerCache(void)
     EventTriggerCacheState = ETCS_REBUILD_STARTED;

     /* Create new hash table. */
-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(EventTriggerEvent);
     ctl.entrysize = sizeof(EventTriggerCacheEntry);
     ctl.hcxt = EventTriggerCacheContext;
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 66393becfb..3bd5e18042 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1607,7 +1607,6 @@ LookupOpclassInfo(Oid operatorClassOid,
         /* First time through: initialize the opclass cache */
         HASHCTL        ctl;

-        MemSet(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(Oid);
         ctl.entrysize = sizeof(OpClassCacheEnt);
         OpClassCache = hash_create("Operator class cache", 64,
@@ -3775,7 +3774,6 @@ RelationCacheInitialize(void)
     /*
      * create hashtable that indexes the relcache
      */
-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(RelIdCacheEnt);
     RelationIdCache = hash_create("Relcache by OID", INITRELCACHESIZE,
diff --git a/src/backend/utils/cache/relfilenodemap.c b/src/backend/utils/cache/relfilenodemap.c
index 0dbdbff603..38e6379974 100644
--- a/src/backend/utils/cache/relfilenodemap.c
+++ b/src/backend/utils/cache/relfilenodemap.c
@@ -110,17 +110,15 @@ InitializeRelfilenodeMap(void)
     relfilenode_skey[0].sk_attno = Anum_pg_class_reltablespace;
     relfilenode_skey[1].sk_attno = Anum_pg_class_relfilenode;

-    /* Initialize the hash table. */
-    MemSet(&ctl, 0, sizeof(ctl));
-    ctl.keysize = sizeof(RelfilenodeMapKey);
-    ctl.entrysize = sizeof(RelfilenodeMapEntry);
-    ctl.hcxt = CacheMemoryContext;
-
     /*
      * Only create the RelfilenodeMapHash now, so we don't end up partially
      * initialized when fmgr_info_cxt() above ERRORs out with an out of memory
      * error.
      */
+    ctl.keysize = sizeof(RelfilenodeMapKey);
+    ctl.entrysize = sizeof(RelfilenodeMapEntry);
+    ctl.hcxt = CacheMemoryContext;
+
     RelfilenodeMapHash =
         hash_create("RelfilenodeMap cache", 64, &ctl,
                     HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
diff --git a/src/backend/utils/cache/spccache.c b/src/backend/utils/cache/spccache.c
index e0c3c1b1c1..c8387e2541 100644
--- a/src/backend/utils/cache/spccache.c
+++ b/src/backend/utils/cache/spccache.c
@@ -79,7 +79,6 @@ InitializeTableSpaceCache(void)
     HASHCTL        ctl;

     /* Initialize the hash table. */
-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(TableSpaceCacheEntry);
     TableSpaceCacheHash =
diff --git a/src/backend/utils/cache/ts_cache.c b/src/backend/utils/cache/ts_cache.c
index f9f7912cb8..a2867fac7d 100644
--- a/src/backend/utils/cache/ts_cache.c
+++ b/src/backend/utils/cache/ts_cache.c
@@ -117,7 +117,6 @@ lookup_ts_parser_cache(Oid prsId)
         /* First time through: initialize the hash table */
         HASHCTL        ctl;

-        MemSet(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(Oid);
         ctl.entrysize = sizeof(TSParserCacheEntry);
         TSParserCacheHash = hash_create("Tsearch parser cache", 4,
@@ -215,7 +214,6 @@ lookup_ts_dictionary_cache(Oid dictId)
         /* First time through: initialize the hash table */
         HASHCTL        ctl;

-        MemSet(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(Oid);
         ctl.entrysize = sizeof(TSDictionaryCacheEntry);
         TSDictionaryCacheHash = hash_create("Tsearch dictionary cache", 8,
@@ -365,7 +363,6 @@ init_ts_config_cache(void)
 {
     HASHCTL        ctl;

-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(TSConfigCacheEntry);
     TSConfigCacheHash = hash_create("Tsearch configuration cache", 16,
diff --git a/src/backend/utils/cache/typcache.c b/src/backend/utils/cache/typcache.c
index 5883fde367..1e331098c0 100644
--- a/src/backend/utils/cache/typcache.c
+++ b/src/backend/utils/cache/typcache.c
@@ -341,7 +341,6 @@ lookup_type_cache(Oid type_id, int flags)
         /* First time through: initialize the hash table */
         HASHCTL        ctl;

-        MemSet(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(Oid);
         ctl.entrysize = sizeof(TypeCacheEntry);
         TypeCacheHash = hash_create("Type information cache", 64,
@@ -1874,7 +1873,6 @@ assign_record_type_typmod(TupleDesc tupDesc)
         /* First time through: initialize the hash table */
         HASHCTL        ctl;

-        MemSet(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(TupleDesc);    /* just the pointer */
         ctl.entrysize = sizeof(RecordCacheEntry);
         ctl.hash = record_type_typmod_hash;
diff --git a/src/backend/utils/fmgr/dfmgr.c b/src/backend/utils/fmgr/dfmgr.c
index bd779fdaf7..adb31e109f 100644
--- a/src/backend/utils/fmgr/dfmgr.c
+++ b/src/backend/utils/fmgr/dfmgr.c
@@ -680,13 +680,12 @@ find_rendezvous_variable(const char *varName)
     {
         HASHCTL        ctl;

-        MemSet(&ctl, 0, sizeof(ctl));
         ctl.keysize = NAMEDATALEN;
         ctl.entrysize = sizeof(rendezvousHashEntry);
         rendezvousHash = hash_create("Rendezvous variable hash",
                                      16,
                                      &ctl,
-                                     HASH_ELEM);
+                                     HASH_ELEM | HASH_STRINGS);
     }

     /* Find or create the hashtable entry for this varName */
diff --git a/src/backend/utils/fmgr/fmgr.c b/src/backend/utils/fmgr/fmgr.c
index 2681b7fbc6..fa5f7ac615 100644
--- a/src/backend/utils/fmgr/fmgr.c
+++ b/src/backend/utils/fmgr/fmgr.c
@@ -565,7 +565,6 @@ record_C_func(HeapTuple procedureTuple,
     {
         HASHCTL        hash_ctl;

-        MemSet(&hash_ctl, 0, sizeof(hash_ctl));
         hash_ctl.keysize = sizeof(Oid);
         hash_ctl.entrysize = sizeof(CFuncHashTabEntry);
         CFuncHash = hash_create("CFuncHash",
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index d14d875c93..fbd849b8f7 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -30,11 +30,12 @@
  * dynahash.c provides support for these types of lookup keys:
  *
  * 1. Null-terminated C strings (truncated if necessary to fit in keysize),
- * compared as though by strcmp().  This is the default behavior.
+ * compared as though by strcmp().  This is selected by specifying the
+ * HASH_STRINGS flag to hash_create.
  *
  * 2. Arbitrary binary data of size keysize, compared as though by memcmp().
  * (Caller must ensure there are no undefined padding bits in the keys!)
- * This is selected by specifying HASH_BLOBS flag to hash_create.
+ * This is selected by specifying the HASH_BLOBS flag to hash_create.
  *
  * 3. More complex key behavior can be selected by specifying user-supplied
  * hashing, comparison, and/or key-copying functions.  At least a hashing
@@ -47,8 +48,8 @@
  *   locks.
  * - Shared memory hashes are allocated in a fixed size area at startup and
  *   are discoverable by name from other processes.
- * - Because entries don't need to be moved in the case of hash conflicts, has
- *   better performance for large entries
+ * - Because entries don't need to be moved in the case of hash conflicts,
+ *   dynahash has better performance for large entries.
  * - Guarantees stable pointers to entries.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
@@ -316,6 +317,28 @@ string_compare(const char *key1, const char *key2, Size keysize)
  *    *info: additional table parameters, as indicated by flags
  *    flags: bitmask indicating which parameters to take from *info
  *
+ * The flags value *must* include HASH_ELEM.  (Formerly, this was nominally
+ * optional, but the default keysize and entrysize values were useless.)
+ * The flags value must also include exactly one of HASH_STRINGS, HASH_BLOBS,
+ * or HASH_FUNCTION, to define the key hashing semantics (C strings,
+ * binary blobs, or custom, respectively).  Callers specifying a custom
+ * hash function will likely also want to use HASH_COMPARE, and perhaps
+ * also HASH_KEYCOPY, to control key comparison and copying.
+ * Another often-used flag is HASH_CONTEXT, to allocate the hash table
+ * under info->hcxt rather than under TopMemoryContext; the default
+ * behavior is only suitable for session-lifespan hash tables.
+ * Other flags bits are special-purpose and seldom used, except for those
+ * associated with shared-memory hash tables, for which see ShmemInitHash().
+ *
+ * Fields in *info are read only when the associated flags bit is set.
+ * It is not necessary to initialize other fields of *info.
+ * Neither tabname nor *info need persist after the hash_create() call.
+ *
+ * Note: It is deprecated for callers of hash_create() to explicitly specify
+ * string_hash, tag_hash, uint32_hash, or oid_hash.  Just set HASH_BLOBS or
+ * HASH_STRINGS.  Use HASH_FUNCTION only when you want something other than
+ * one of these.
+ *
  * Note: for a shared-memory hashtable, nelem needs to be a pretty good
  * estimate, since we can't expand the table on the fly.  But an unshared
  * hashtable can be expanded on-the-fly, so it's better for nelem to be
@@ -323,11 +346,19 @@ string_compare(const char *key1, const char *key2, Size keysize)
  * large nelem will penalize hash_seq_search speed without buying much.
  */
 HTAB *
-hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
+hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
     HTAB       *hashp;
     HASHHDR    *hctl;

+    /*
+     * Hash tables now allocate space for key and data, but you have to say
+     * how much space to allocate.
+     */
+    Assert(flags & HASH_ELEM);
+    Assert(info->keysize > 0);
+    Assert(info->entrysize >= info->keysize);
+
     /*
      * For shared hash tables, we have a local hash header (HTAB struct) that
      * we allocate in TopMemoryContext; all else is in shared memory.
@@ -370,28 +401,43 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
      * Select the appropriate hash function (see comments at head of file).
      */
     if (flags & HASH_FUNCTION)
+    {
+        Assert(!(flags & (HASH_BLOBS | HASH_STRINGS)));
         hashp->hash = info->hash;
+    }
     else if (flags & HASH_BLOBS)
     {
+        Assert(!(flags & HASH_STRINGS));
         /* We can optimize hashing for common key sizes */
-        Assert(flags & HASH_ELEM);
         if (info->keysize == sizeof(uint32))
             hashp->hash = uint32_hash;
         else
             hashp->hash = tag_hash;
     }
     else
-        hashp->hash = string_hash;    /* default hash function */
+    {
+        /*
+         * string_hash used to be considered the default hash method, and in a
+         * non-assert build it effectively still is.  But we now consider it
+         * an assertion error to not say HASH_STRINGS explicitly.  To help
+         * catch mistaken usage of HASH_STRINGS, we also insist on a
+         * reasonably long string length: if the keysize is only 4 or 8 bytes,
+         * it's almost certainly an integer or pointer not a string.
+         */
+        Assert(flags & HASH_STRINGS);
+        Assert(info->keysize > 8);
+
+        hashp->hash = string_hash;
+    }

     /*
      * If you don't specify a match function, it defaults to string_compare if
-     * you used string_hash (either explicitly or by default) and to memcmp
-     * otherwise.
+     * you used string_hash, and to memcmp otherwise.
      *
      * Note: explicitly specifying string_hash is deprecated, because this
      * might not work for callers in loadable modules on some platforms due to
      * referencing a trampoline instead of the string_hash function proper.
-     * Just let it default, eh?
+     * Specify HASH_STRINGS instead.
      */
     if (flags & HASH_COMPARE)
         hashp->match = info->match;
@@ -505,16 +551,9 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
         hctl->dsize = info->dsize;
     }

-    /*
-     * hash table now allocates space for key and data but you have to say how
-     * much space to allocate
-     */
-    if (flags & HASH_ELEM)
-    {
-        Assert(info->entrysize >= info->keysize);
-        hctl->keysize = info->keysize;
-        hctl->entrysize = info->entrysize;
-    }
+    /* remember the entry sizes, too */
+    hctl->keysize = info->keysize;
+    hctl->entrysize = info->entrysize;

     /* make local copies of heavily-used constant fields */
     hashp->keysize = hctl->keysize;
@@ -593,10 +632,6 @@ hdefault(HTAB *hashp)
     hctl->dsize = DEF_DIRSIZE;
     hctl->nsegs = 0;

-    /* rather pointless defaults for key & entry size */
-    hctl->keysize = sizeof(char *);
-    hctl->entrysize = 2 * sizeof(char *);
-
     hctl->num_partitions = 0;    /* not partitioned */

     /* table has no fixed maximum size */
diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index ec6f80ee99..283dfe2d9e 100644
--- a/src/backend/utils/mmgr/portalmem.c
+++ b/src/backend/utils/mmgr/portalmem.c
@@ -119,7 +119,7 @@ EnablePortalManager(void)
      * create, initially
      */
     PortalHashTable = hash_create("Portal hash", PORTALS_PER_USER,
-                                  &ctl, HASH_ELEM);
+                                  &ctl, HASH_ELEM | HASH_STRINGS);
 }

 /*
diff --git a/src/backend/utils/time/combocid.c b/src/backend/utils/time/combocid.c
index 4ee9ef0ffe..9626f98100 100644
--- a/src/backend/utils/time/combocid.c
+++ b/src/backend/utils/time/combocid.c
@@ -223,7 +223,6 @@ GetComboCommandId(CommandId cmin, CommandId cmax)
         sizeComboCids = CCID_ARRAY_SIZE;
         usedComboCids = 0;

-        memset(&hash_ctl, 0, sizeof(hash_ctl));
         hash_ctl.keysize = sizeof(ComboCidKeyData);
         hash_ctl.entrysize = sizeof(ComboCidEntryData);
         hash_ctl.hcxt = TopTransactionContext;
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index bebf89b3c4..13c6602217 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -64,25 +64,36 @@ typedef struct HTAB HTAB;
 /* Only those fields indicated by hash_flags need be set */
 typedef struct HASHCTL
 {
+    /* Used if HASH_PARTITION flag is set: */
     long        num_partitions; /* # partitions (must be power of 2) */
+    /* Used if HASH_SEGMENT flag is set: */
     long        ssize;            /* segment size */
+    /* Used if HASH_DIRSIZE flag is set: */
     long        dsize;            /* (initial) directory size */
     long        max_dsize;        /* limit to dsize if dir size is limited */
+    /* Used if HASH_ELEM flag is set (which is now required): */
     Size        keysize;        /* hash key length in bytes */
     Size        entrysize;        /* total user element size in bytes */
+    /* Used if HASH_FUNCTION flag is set: */
     HashValueFunc hash;            /* hash function */
+    /* Used if HASH_COMPARE flag is set: */
     HashCompareFunc match;        /* key comparison function */
+    /* Used if HASH_KEYCOPY flag is set: */
     HashCopyFunc keycopy;        /* key copying function */
+    /* Used if HASH_ALLOC flag is set: */
     HashAllocFunc alloc;        /* memory allocator */
+    /* Used if HASH_CONTEXT flag is set: */
     MemoryContext hcxt;            /* memory context to use for allocations */
+    /* Used if HASH_SHARED_MEM flag is set: */
     HASHHDR    *hctl;            /* location of header in shared mem */
 } HASHCTL;

-/* Flags to indicate which parameters are supplied */
+/* Flag bits for hash_create; most indicate which parameters are supplied */
 #define HASH_PARTITION    0x0001    /* Hashtable is used w/partitioned locking */
 #define HASH_SEGMENT    0x0002    /* Set segment size */
 #define HASH_DIRSIZE    0x0004    /* Set directory size (initial and max) */
-#define HASH_ELEM        0x0010    /* Set keysize and entrysize */
+#define HASH_ELEM        0x0008    /* Set keysize and entrysize (now required!) */
+#define HASH_STRINGS    0x0010    /* Select support functions for string keys */
 #define HASH_BLOBS        0x0020    /* Select support functions for binary keys */
 #define HASH_FUNCTION    0x0040    /* Set user defined hash function */
 #define HASH_COMPARE    0x0080    /* Set user defined comparison function */
@@ -93,7 +104,6 @@ typedef struct HASHCTL
 #define HASH_ATTACH        0x1000    /* Do not initialize hctl */
 #define HASH_FIXED_SIZE 0x2000    /* Initial size is a hard limit */

-
 /* max_dsize value to indicate expansible directory */
 #define NO_MAX_DSIZE            (-1)

@@ -116,13 +126,9 @@ typedef struct

 /*
  * prototypes for functions in dynahash.c
- *
- * Note: It is deprecated for callers of hash_create to explicitly specify
- * string_hash, tag_hash, uint32_hash, or oid_hash.  Just set HASH_BLOBS or
- * not.  Use HASH_FUNCTION only when you want something other than those.
  */
 extern HTAB *hash_create(const char *tabname, long nelem,
-                         HASHCTL *info, int flags);
+                         const HASHCTL *info, int flags);
 extern void hash_destroy(HTAB *hashp);
 extern void hash_stats(const char *where, HTAB *hashp);
 extern void *hash_search(HTAB *hashp, const void *keyPtr, HASHACTION action,
diff --git a/src/pl/plperl/plperl.c b/src/pl/plperl/plperl.c
index 4de756455d..6299adf71a 100644
--- a/src/pl/plperl/plperl.c
+++ b/src/pl/plperl/plperl.c
@@ -458,7 +458,6 @@ _PG_init(void)
     /*
      * Create hash tables.
      */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(Oid);
     hash_ctl.entrysize = sizeof(plperl_interp_desc);
     plperl_interp_hash = hash_create("PL/Perl interpreters",
@@ -466,7 +465,6 @@ _PG_init(void)
                                      &hash_ctl,
                                      HASH_ELEM | HASH_BLOBS);

-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(plperl_proc_key);
     hash_ctl.entrysize = sizeof(plperl_proc_ptr);
     plperl_proc_hash = hash_create("PL/Perl procedures",
@@ -580,13 +578,12 @@ select_perl_context(bool trusted)
     {
         HASHCTL        hash_ctl;

-        memset(&hash_ctl, 0, sizeof(hash_ctl));
         hash_ctl.keysize = NAMEDATALEN;
         hash_ctl.entrysize = sizeof(plperl_query_entry);
         interp_desc->query_hash = hash_create("PL/Perl queries",
                                               32,
                                               &hash_ctl,
-                                              HASH_ELEM);
+                                              HASH_ELEM | HASH_STRINGS);
     }

     /*
diff --git a/src/pl/plpgsql/src/pl_comp.c b/src/pl/plpgsql/src/pl_comp.c
index b610b28d70..555da952e1 100644
--- a/src/pl/plpgsql/src/pl_comp.c
+++ b/src/pl/plpgsql/src/pl_comp.c
@@ -2567,7 +2567,6 @@ plpgsql_HashTableInit(void)
     /* don't allow double-initialization */
     Assert(plpgsql_HashTable == NULL);

-    memset(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(PLpgSQL_func_hashkey);
     ctl.entrysize = sizeof(plpgsql_HashEnt);
     plpgsql_HashTable = hash_create("PLpgSQL function hash",
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index ccbc50fc45..112f6ab0ae 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -4058,7 +4058,6 @@ plpgsql_estate_setup(PLpgSQL_execstate *estate,
     {
         estate->simple_eval_estate = simple_eval_estate;
         /* Private cast hash just lives in function's main context */
-        memset(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(plpgsql_CastHashKey);
         ctl.entrysize = sizeof(plpgsql_CastHashEntry);
         ctl.hcxt = CurrentMemoryContext;
@@ -4077,7 +4076,6 @@ plpgsql_estate_setup(PLpgSQL_execstate *estate,
             shared_cast_context = AllocSetContextCreate(TopMemoryContext,
                                                         "PLpgSQL cast info",
                                                         ALLOCSET_DEFAULT_SIZES);
-            memset(&ctl, 0, sizeof(ctl));
             ctl.keysize = sizeof(plpgsql_CastHashKey);
             ctl.entrysize = sizeof(plpgsql_CastHashEntry);
             ctl.hcxt = shared_cast_context;
diff --git a/src/pl/plpython/plpy_plpymodule.c b/src/pl/plpython/plpy_plpymodule.c
index 7f54d093ac..0365acc95b 100644
--- a/src/pl/plpython/plpy_plpymodule.c
+++ b/src/pl/plpython/plpy_plpymodule.c
@@ -214,7 +214,6 @@ PLy_add_exceptions(PyObject *plpy)
     PLy_exc_spi_error = PLy_create_exception("plpy.SPIError", NULL, NULL,
                                              "SPIError", plpy);

-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(int);
     hash_ctl.entrysize = sizeof(PLyExceptionEntry);
     PLy_spi_exceptions = hash_create("PL/Python SPI exceptions", 256,
diff --git a/src/pl/plpython/plpy_procedure.c b/src/pl/plpython/plpy_procedure.c
index 1f05c633ef..b7c0b5cebe 100644
--- a/src/pl/plpython/plpy_procedure.c
+++ b/src/pl/plpython/plpy_procedure.c
@@ -34,7 +34,6 @@ init_procedure_caches(void)
 {
     HASHCTL        hash_ctl;

-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(PLyProcedureKey);
     hash_ctl.entrysize = sizeof(PLyProcedureEntry);
     PLy_procedure_cache = hash_create("PL/Python procedures", 32, &hash_ctl,
diff --git a/src/pl/tcl/pltcl.c b/src/pl/tcl/pltcl.c
index a3a2dc8e89..e11837559d 100644
--- a/src/pl/tcl/pltcl.c
+++ b/src/pl/tcl/pltcl.c
@@ -439,7 +439,6 @@ _PG_init(void)
     /************************************************************
      * Create the hash table for working interpreters
      ************************************************************/
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(Oid);
     hash_ctl.entrysize = sizeof(pltcl_interp_desc);
     pltcl_interp_htab = hash_create("PL/Tcl interpreters",
@@ -450,7 +449,6 @@ _PG_init(void)
     /************************************************************
      * Create the hash table for function lookup
      ************************************************************/
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(pltcl_proc_key);
     hash_ctl.entrysize = sizeof(pltcl_proc_ptr);
     pltcl_proc_htab = hash_create("PL/Tcl functions",
diff --git a/src/timezone/pgtz.c b/src/timezone/pgtz.c
index 3f0fb51e91..4a360f5077 100644
--- a/src/timezone/pgtz.c
+++ b/src/timezone/pgtz.c
@@ -203,15 +203,13 @@ init_timezone_hashtable(void)
 {
     HASHCTL        hash_ctl;

-    MemSet(&hash_ctl, 0, sizeof(hash_ctl));
-
     hash_ctl.keysize = TZ_STRLEN_MAX + 1;
     hash_ctl.entrysize = sizeof(pg_tz_cache);

     timezone_cache = hash_create("Timezones",
                                  4,
                                  &hash_ctl,
-                                 HASH_ELEM);
+                                 HASH_ELEM | HASH_STRINGS);
     if (!timezone_cache)
         return false;
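
(For illustration only, not part of the patch above: under the new convention a
hash_create() call always passes HASH_ELEM, picks exactly one of HASH_STRINGS /
HASH_BLOBS / HASH_FUNCTION, and no longer needs to memset() the HASHCTL, since
hash_create() reads only the fields selected by the flags.  The entry type and
names below are made up for the example.)

typedef struct my_entry
{
    Oid         key;            /* hash key; must be the first field */
    int         counter;        /* payload */
} my_entry;

static HTAB *my_hash = NULL;

static void
create_my_hash(void)
{
    HASHCTL     ctl;            /* no memset needed anymore */

    ctl.keysize = sizeof(Oid);
    ctl.entrysize = sizeof(my_entry);
    ctl.hcxt = CurrentMemoryContext;

    my_hash = hash_create("my example hash",
                          128,      /* initial size hint */
                          &ctl,
                          HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
}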


On Mon, Dec 14, 2020 at 01:59:03PM -0500, Tom Lane wrote:
> * Should we just have a blanket insistence that all callers supply
> HASH_ELEM?  The default sizes that dynahash.c uses without that are
> undocumented and basically useless.

+1

> we should just rip out all those memsets as pointless, since there's
> basically no case where you'd use the memset to fill a field that
> you meant to pass as zero.  The fact that hash_create() doesn't
> read fields it's not told to by a flag means we should not need
> the memsets to avoid uninitialized-memory reads.

On Mon, Dec 14, 2020 at 06:55:20PM -0500, Tom Lane wrote:
> Here's a rolled-up patch that does some further documentation work
> and gets rid of the unnecessary memset's as well.

+1 on removing the memset() calls.  That said, it's not a big deal if more
creep in over time; it doesn't qualify as a project policy violation.

> @@ -329,6 +328,11 @@ InitShmemIndex(void)
>   * whose maximum size is certain, this should be equal to max_size; that
>   * ensures that no run-time out-of-shared-memory failures can occur.
>   *
> + * *infoP and hash_flags should specify at least the entry sizes and key

s/should/must/



Noah Misch <noah@leadboat.com> writes:
> On Mon, Dec 14, 2020 at 01:59:03PM -0500, Tom Lane wrote:
>> Here's a rolled-up patch that does some further documentation work
>> and gets rid of the unnecessary memset's as well.

> +1 on removing the memset() calls.  That said, it's not a big deal if more
> creep in over time; it doesn't qualify as a project policy violation.

Right, that part is just neatnik-ism.  Neither the calls with memset
nor the ones without are buggy.

>> + * *infoP and hash_flags should specify at least the entry sizes and key

> s/should/must/

OK; thanks for reviewing!

            regards, tom lane



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
Tom Lane has raised a complaint on pgsql-committers [1] about one of
the commits related to this work [2]. The new buildfarm member wrasse
is showing a warning:

"/export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/reorderbuffer.c",
line 2510: Warning: Likely null pointer dereference (*(curtxn+272)):
ReorderBufferProcessTXN

The warning is for this line:
curtxn->concurrent_abort = true;

Now, we can simply fix this warning by adding an if check like:
if (curtxn)
    curtxn->concurrent_abort = true;

However, on further discussion, it seems that is not sufficient here,
because the callbacks can throw the surrounding error code
(ERRCODE_TRANSACTION_ROLLBACK), under which we set the concurrent_abort
flag, for a completely different scenario. I think we need a stronger
check here to ensure that we set the concurrent_abort flag and do the
other things in that branch only when we are decoding non-committed
xacts. The idea I have is to additionally check that we are decoding a
streaming or prepared transaction (the same check as we have for
setting curtxn), or we can check whether CheckXidAlive is a valid
transaction id. What do you think?
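
To make that concrete, here is a rough sketch of the kind of stronger
check being discussed (illustration only, not an actual patch), reusing
the stream_started / rbtxn_prepared() condition under which curtxn is
set in ReorderBufferProcessTXN():

/* Sketch only -- inside the PG_CATCH() block, where errdata is the
 * ErrorData captured via CopyErrorData().  Treat
 * ERRCODE_TRANSACTION_ROLLBACK as a concurrent abort only while decoding
 * an in-progress (streamed) or prepared transaction. */
if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK &&
    (stream_started || rbtxn_prepared(txn)))
{
    /* curtxn is set under the same condition, so it cannot be NULL here */
    Assert(curtxn);
    curtxn->concurrent_abort = true;

    /* ... swallow the error and stop/finish the streamed transaction ... */
}
else
{
    /* any other error: clean up and re-throw as before */
    ReorderBufferCleanupTXN(rb, txn);
    PG_RE_THROW();
}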

[1] - https://www.postgresql.org/message-id/2752962.1619568098%40sss.pgh.pa.us
[2] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=7259736a6e5b7c7588fff9578370736a6648acbb

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Apr 28, 2021 at 11:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Tom Lane has raised a complaint on pgsql-committers [1] about one of
> the commits related to this work [2]. The new buildfarm member wrasse
> is showing a warning:
>
> "/export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/reorderbuffer.c",
> line 2510: Warning: Likely null pointer dereference (*(curtxn+272)):
> ReorderBufferProcessTXN
>
> The warning is for this line:
> curtxn->concurrent_abort = true;
>
> Now, we can simply fix this warning by adding an if check like:
> if (curtxn)
>     curtxn->concurrent_abort = true;
>
> However, on further discussion, it seems that is not sufficient here,
> because the callbacks can throw the surrounding error code
> (ERRCODE_TRANSACTION_ROLLBACK), under which we set the concurrent_abort
> flag, for a completely different scenario. I think we need a stronger
> check here to ensure that we set the concurrent_abort flag and do the
> other things in that branch only when we are decoding non-committed
> xacts.

That makes sense.

> The idea I have is to additionally check that we are decoding a
> streaming or prepared transaction (the same check as we have for
> setting curtxn), or we can check whether CheckXidAlive is a valid
> transaction id. What do you think?

I think a check based on CheckXidAlive looks good to me.  This will
protect against cases where a similar error is raised from any other
path, as you mentioned above.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Apr 28, 2021 at 11:03 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Apr 28, 2021 at 11:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > The idea I have is to additionally check that we are decoding a
> > streaming or prepared transaction (the same check as we have for
> > setting curtxn), or we can check whether CheckXidAlive is a valid
> > transaction id. What do you think?
>
> I think a check based on CheckXidAlive looks good to me.  This will
> protect against cases where a similar error is raised from any other
> path, as you mentioned above.
>

We can't use CheckXidAlive because it is reset by that time. So, I
used the other approach which led to the attached.

-- 
With Regards,
Amit Kapila.

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Fri, Apr 30, 2021 at 3:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Apr 28, 2021 at 11:03 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Apr 28, 2021 at 11:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > > The idea I have is to additionally check that we are decoding a
> > > streaming or prepared transaction (the same check as we have for
> > > setting curtxn), or we can check whether CheckXidAlive is a valid
> > > transaction id. What do you think?
> >
> > I think a check based on CheckXidAlive looks good to me.  This will
> > protect against cases where a similar error is raised from any other
> > path, as you mentioned above.
> >
>
> We can't use CheckXidAlive because it is reset by that time.

Right.

> So, I used the other approach which led to the attached.

The patch looks fine to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Apr 30, 2021 at 7:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > So, I used the other approach which led to the attached.
>
> The patch looks fine to me.
>

Thanks, pushed!

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, May 6, 2021 at 9:01 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Apr 30, 2021 at 7:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > > So, I used the other approach which led to the attached.
> >
> > The patch looks fine to me.
> >
>
> Thanks, pushed!

Thanks!



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com