Discussion: PATCH: logical_work_mem and logical streaming of large in-progress transactions

PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
Hi all,

Attached is a patch series that adds two features to logical
replication - the ability to define a memory limit for the reorderbuffer
(responsible for building the decoded transactions), and the ability to
stream large in-progress transactions (exceeding the memory limit).

I'm submitting those two changes together, because one builds on the
other, and it's beneficial to discuss them together.


PART 1: adding logical_work_mem memory limit (0001)
---------------------------------------------------

Currently, limiting the amount of memory consumed by logical decoding is
tricky (or you might say impossible) for several reasons:

* The value is hard-coded, so it's not possible to customize it.

* The amount of decoded changes to keep in memory is restricted by the
number of changes. It's not very clear how this relates to memory
consumption, as the change size depends on table structure, etc.

* The number is "per (sub)transaction", so a transaction with many
subtransactions may easily consume a significant amount of memory without
actually hitting the limit.

So the patch does two things. Firstly, it introduces logical_work_mem, a
GUC restricting memory consumed by all transactions currently kept in
the reorder buffer.

Secondly, it adds simple memory accounting, tracking the amount of
memory used in total (for the whole reorder buffer, to compare against
logical_work_mem) and per transaction (so that we can quickly pick a
transaction to spill to disk).
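
To make the accounting concrete, here is a minimal sketch of the idea;
the function name is made up and the "size" fields are the counters
described above, so this is an illustration rather than the patch's
actual code:

    /*
     * Sketch of the accounting: a per-transaction counter plus a
     * buffer-wide counter, both updated whenever a change is added or
     * removed.  Assumes the "size" fields described above are added to
     * ReorderBuffer and ReorderBufferTXN; illustrative only.
     */
    #include "postgres.h"
    #include "replication/reorderbuffer.h"

    static void
    update_memory_accounting(ReorderBuffer *rb, ReorderBufferTXN *txn,
                             Size sz, bool addition)
    {
        if (addition)
        {
            txn->size += sz;    /* per-transaction: used to pick a victim */
            rb->size += sz;     /* total: compared against logical_work_mem */
        }
        else
        {
            Assert(txn->size >= sz && rb->size >= sz);
            txn->size -= sz;    /* e.g. after spilling the change to disk */
            rb->size -= sz;
        }
    }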

The one wrinkle in the patch is that the memory limit can't be enforced
when reading changes spilled to disk - with multiple subtransactions, we
can't easily predict how many changes to pre-read for each of them. At
that point we still use the existing max_changes_in_memory limit.

Luckily, changes introduced in the other parts of the patch should allow
addressing this deficiency.


PART 2: streaming of large in-progress transactions (0002-0006)
---------------------------------------------------------------

Note: This part is split into multiple smaller chunks, addressing
different parts of the logical decoding infrastructure. That's mostly to
allow easier reviews, though. Ultimately, it's just one patch.

Processing large transactions often results in significant apply lag,
for a couple of reasons. One reason is network bandwidth - while we do
decode the changes incrementally (as we read the WAL), we keep them
locally, either in memory, or spilled to files. Then at commit time, all
the changes get sent to the downstream (and applied) at the same time.
For large transactions the time to do the network transfer may be
significant, causing apply lag.

This patch extends the logical replication infrastructure (output plugin
API, reorder buffer, pgoutput, replication protocol etc.) to allow
streaming of in-progress transactions instead of spilling them to local
files.

The extensions to the API are pretty straightforward. Aside from adding
methods to stream changes/messages and commit a streamed transaction,
the API needs a function to abort a streamed (sub)transaction, and
functions to demarcate a block of streamed changes.
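
Roughly, the API additions amount to a set of callbacks along these
lines (the names and exact signatures are illustrative, not necessarily
what the patch defines):

    /*
     * Sketch of the output plugin API additions.  The callback names
     * and exact signatures here are illustrative; the patch's
     * definitions may differ.
     */
    #include "postgres.h"
    #include "replication/reorderbuffer.h"

    /* demarcate a block of streamed changes of one in-progress transaction */
    typedef void (*LogicalDecodeStreamStartCB) (struct LogicalDecodingContext *ctx,
                                                ReorderBufferTXN *txn);
    typedef void (*LogicalDecodeStreamStopCB) (struct LogicalDecodingContext *ctx,
                                               ReorderBufferTXN *txn);

    /* stream one change of an in-progress transaction */
    typedef void (*LogicalDecodeStreamChangeCB) (struct LogicalDecodingContext *ctx,
                                                 ReorderBufferTXN *txn,
                                                 Relation relation,
                                                 ReorderBufferChange *change);

    /* commit a previously streamed transaction */
    typedef void (*LogicalDecodeStreamCommitCB) (struct LogicalDecodingContext *ctx,
                                                 ReorderBufferTXN *txn,
                                                 XLogRecPtr commit_lsn);

    /* abort a streamed (sub)transaction */
    typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx,
                                                ReorderBufferTXN *txn,
                                                XLogRecPtr abort_lsn);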

To decode a transaction, we need to know all its subtransactions and
invalidations. Currently, those are only known at commit time - some
assignments may be known earlier, but invalidations are only ever
written in the commit record.

So far that was fine, because we only decode/replay transactions at
commit time, when all of this is known (because it's either in commit
record, or written before it).

But for in-progress transactions (i.e. the subject of interest here),
that is not the case. So the patch modifies WAL-logging to ensure those
two bits of information are written immediately (for wal_level=logical).

For assignments that was fairly simple, thanks to existing caching. For
invalidations, it requires a new WAL record type and a couple of changes
in inval.c.
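
For illustration, such a standalone invalidation record might carry
little more than the messages themselves (a sketch; the actual record
added by the patch may differ):

    /*
     * Sketch of a WAL record carrying invalidation messages on their
     * own, instead of only as part of the commit record.  The struct
     * name is illustrative only.
     */
    #include "postgres.h"
    #include "storage/sinval.h"

    typedef struct xl_xact_invalidations
    {
        int         nmsgs;      /* number of shared inval messages */
        SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
    } xl_xact_invalidations;

    #define MinSizeOfXactInvalidations offsetof(xl_xact_invalidations, msgs)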

On the apply side, we simply receive the streamed changes and write them
into a file (one file per toplevel transaction, which is possible thanks
to the assignments being known immediately). Then at commit time the
changes are replayed locally, without having to copy a large chunk of
data over the network.
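
A rough sketch of that apply-side flow, with all helpers being
hypothetical placeholders rather than actual worker code:

    /*
     * Rough sketch of the apply-side flow described above.  All the
     * helpers declared below are hypothetical (shown only as
     * prototypes); only the overall flow mirrors the description.
     */
    #include "postgres.h"
    #include "lib/stringinfo.h"
    #include "storage/buffile.h"

    /* hypothetical helpers, not part of the patch or PostgreSQL */
    static BufFile *open_changes_file(TransactionId xid);
    static void append_change_to_file(BufFile *file, StringInfo change);
    static bool read_change_from_file(BufFile *file, StringInfo change);
    static void remove_changes_file(TransactionId xid);
    static void apply_change(StringInfo change);

    /* a streamed change arrives: just append it to the per-xact file */
    static void
    handle_stream_change(TransactionId xid, StringInfo change)
    {
        /* one file per toplevel transaction, keyed by the (known) xid */
        BufFile    *file = open_changes_file(xid);

        append_change_to_file(file, change);
    }

    /* the streamed transaction commits: replay the buffered changes locally */
    static void
    handle_stream_commit(TransactionId xid)
    {
        BufFile    *file = open_changes_file(xid);
        StringInfoData change;

        while (read_change_from_file(file, &change))
            apply_change(&change);

        remove_changes_file(xid);
    }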


WAL overhead
------------

Of course, these changes to WAL logging are not for free - logging
assignments individually (instead of multiple subtransactions at once)
means higher xlog record overhead. Similarly, (sub)transactions doing a
lot of DDL may result in a lot of invalidations written to WAL (again,
with full xlog record overhead per invalidation).

I've done a number of tests to measure the impact, and for extreme
corner cases the additional amount of WAL is about 40% in both cases.

By an "extreme corner case" I mean a workloads intentionally triggering
many assignments/invalidations, without doing a lot of meaningful work.

For assignments, imagine a single-row table (no indexes), and a
transaction like this one:

    BEGIN;
    UPDATE t SET v = v + 1;
    SAVEPOINT s1;
    UPDATE t SET v = v + 1;
    SAVEPOINT s2;
    UPDATE t SET v = v + 1;
    SAVEPOINT s3;
    ...
    UPDATE t SET v = v + 1;
    SAVEPOINT s10;
    UPDATE t SET v = v + 1;
    COMMIT;

For invalidations, add a CREATE TEMPORARY TABLE to each subtransaction.

For more realistic workloads (large table with indexes, runs long enough
to generate FPIs, etc.) the overhead drops below 5%, which is much more
acceptable, of course, although not perfect.

In both cases, there was pretty much no measurable impact on performance
(as measured by tps).

I do not think there's a way around this requirement (having assignments
and invalidations), if we want to decode in-progress transactions. But
perhaps it would be possible to do some sort of caching (say, at command
level), to reduce the xlog record overhead? Not sure.

All ideas are welcome, of course. In the worst case, I think we can add
a GUC enabling this additional logging - when disabled, streaming of
in-progress transactions would not be possible.


Simplifying ReorderBuffer
-------------------------

One interesting consequence of having assignments is that we could get
rid of the ReorderBuffer iterator, used to merge changes from subxacts.
The assignments allow us to keep changes for each toplevel transaction
in a single list, in LSN order, and just walk it. Abort can be performed
by remembering the position of the first change in each subxact, and just
discarding the tail.
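
For illustration, the abort handling could then be as simple as this
sketch (a real implementation would jump straight to the remembered
position and also free the discarded changes; the function name and the
"first LSN" parameter are assumptions, not the patch's code):

    /*
     * Sketch of subxact abort with a single LSN-ordered change list per
     * toplevel transaction: everything from the subxact's first change
     * onwards is the tail to discard.  Illustrative only.
     */
    #include "postgres.h"
    #include "replication/reorderbuffer.h"

    static void
    discard_subxact_changes(ReorderBufferTXN *toptxn, XLogRecPtr subxact_first_lsn)
    {
        dlist_mutable_iter iter;

        dlist_foreach_modify(iter, &toptxn->changes)
        {
            ReorderBufferChange *change =
                dlist_container(ReorderBufferChange, node, iter.cur);

            /* everything at or after the subxact's first change is its tail */
            if (change->lsn >= subxact_first_lsn)
                dlist_delete(&change->node);
        }
    }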

This is what the apply worker does with the streamed changes and aborts.

It would also allow us to enforce the memory limit while restoring
transactions spilled to disk, because we would not have the problem with
restoring changes for many subtransactions.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Erikjan Rijkers
Date:
On 2017-12-23 05:57, Tomas Vondra wrote:
> Hi all,
> 
> Attached is a patch series that implements two features to the logical
> replication - ability to define a memory limit for the reorderbuffer
> (responsible for building the decoded transactions), and ability to
> stream large in-progress transactions (exceeding the memory limit).
> 

logical replication of 2 instances is OK but 3 and up fail with:

TRAP: FailedAssertion("!(last_lsn < change->lsn)", File: 
"reorderbuffer.c", Line: 1773)

I can cobble up a script but I hope you have enough from the assertion 
to see what's going wrong...


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:

On 12/23/2017 03:03 PM, Erikjan Rijkers wrote:
> On 2017-12-23 05:57, Tomas Vondra wrote:
>> Hi all,
>>
>> Attached is a patch series that implements two features to the logical
>> replication - ability to define a memory limit for the reorderbuffer
>> (responsible for building the decoded transactions), and ability to
>> stream large in-progress transactions (exceeding the memory limit).
>>
> 
> logical replication of 2 instances is OK but 3 and up fail with:
> 
> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
> "reorderbuffer.c", Line: 1773)
> 
> I can cobble up a script but I hope you have enough from the assertion
> to see what's going wrong...

The assertion says that the iterator produces changes in an order that does
not correlate with LSN. But I have a hard time understanding how that
could happen, particularly because according to the line number this
happens in ReorderBufferCommit(), i.e. the current (non-streaming) case.

So instructions to reproduce the issue would be very helpful.

Attached is v2 of the patch series, fixing two bugs I discovered today.
I don't think either of these is related to your issue, though.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Erik Rijkers
Date:
On 2017-12-23 21:06, Tomas Vondra wrote:
> On 12/23/2017 03:03 PM, Erikjan Rijkers wrote:
>> On 2017-12-23 05:57, Tomas Vondra wrote:
>>> Hi all,
>>> 
>>> Attached is a patch series that implements two features to the 
>>> logical
>>> replication - ability to define a memory limit for the reorderbuffer
>>> (responsible for building the decoded transactions), and ability to
>>> stream large in-progress transactions (exceeding the memory limit).
>>> 
>> 
>> logical replication of 2 instances is OK but 3 and up fail with:
>> 
>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
>> "reorderbuffer.c", Line: 1773)
>> 
>> I can cobble up a script but I hope you have enough from the assertion
>> to see what's going wrong...
> 
> The assertion says that the iterator produces changes in order that 
> does
> not correlate with LSN. But I have a hard time understanding how that
> could happen, particularly because according to the line number this
> happens in ReorderBufferCommit(), i.e. the current (non-streaming) 
> case.
> 
> So instructions to reproduce the issue would be very helpful.

Using:

0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
0006-Add-support-for-streaming-to-built-in-replication-v2.patch

As you expected the problem is the same with these new patches.

I have now tested more, and seen that it does not always fail.  I guess
it fails here about 3 times out of 4.  But the laptop I'm using at the
moment is old and slow -- that may well be a factor, as we've seen before [1].

Attached is the bash script that I put together.  I tested with
NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails 
often.  This same program run with HEAD never seems to fail (I tried a 
few dozen times).

thanks,

Erik Rijkers


[1] 
https://www.postgresql.org/message-id/3897361c7010c4ac03f358173adbcd60%40xs4all.nl


Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:

On 12/23/2017 11:23 PM, Erik Rijkers wrote:
> On 2017-12-23 21:06, Tomas Vondra wrote:
>> On 12/23/2017 03:03 PM, Erikjan Rijkers wrote:
>>> On 2017-12-23 05:57, Tomas Vondra wrote:
>>>> Hi all,
>>>>
>>>> Attached is a patch series that implements two features to the logical
>>>> replication - ability to define a memory limit for the reorderbuffer
>>>> (responsible for building the decoded transactions), and ability to
>>>> stream large in-progress transactions (exceeding the memory limit).
>>>>
>>>
>>> logical replication of 2 instances is OK but 3 and up fail with:
>>>
>>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
>>> "reorderbuffer.c", Line: 1773)
>>>
>>> I can cobble up a script but I hope you have enough from the assertion
>>> to see what's going wrong...
>>
>> The assertion says that the iterator produces changes in order that does
>> not correlate with LSN. But I have a hard time understanding how that
>> could happen, particularly because according to the line number this
>> happens in ReorderBufferCommit(), i.e. the current (non-streaming) case.
>>
>> So instructions to reproduce the issue would be very helpful.
> 
> Using:
> 
> 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
> 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
> 0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
> 0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
> 0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
> 0006-Add-support-for-streaming-to-built-in-replication-v2.patch
> 
> As you expected the problem is the same with these new patches.
> 
> I have now tested more, and seen that it not always fails.  I guess that
> it here fails 3 times out of 4.  But the laptop I'm using at the moment
> is old and slow -- it may well be a factor as we've seen before [1].
> 
> Attached is the bash that I put together.  I tested with
> NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails
> often.  This same program run with HEAD never seems to fail (I tried a
> few dozen times).
> 

Thanks. Unfortunately I still can't reproduce the issue. I even tried
running it in valgrind, to see if there are some memory access issues
(which should also slow it down significantly).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Craig Ringer
Date:
On 23 December 2017 at 12:57, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi all,

Attached is a patch series that implements two features to the logical
replication - ability to define a memory limit for the reorderbuffer
(responsible for building the decoded transactions), and ability to
stream large in-progress transactions (exceeding the memory limit).

I'm submitting those two changes together, because one builds on the
other, and it's beneficial to discuss them together.


PART 1: adding logical_work_mem memory limit (0001)
---------------------------------------------------

Currently, limiting the amount of memory consumed by logical decoding is
tricky (or you might say impossible) for several reasons:

* The value is hard-coded, so it's not quite possible to customize it.

* The amount of decoded changes to keep in memory is restricted by
number of changes. It's not very unclear how this relates to memory
consumption, as the change size depends on table structure, etc.

* The number is "per (sub)transaction", so a transaction with many
subtransactions may easily consume significant amount of memory without
actually hitting the limit.

Also, even without subtransactions, we assemble a ReorderBufferTXN per transaction. Since transactions usually occur concurrently, systems with many concurrent txns can face lots of memory use.

We can't exclude tables that won't actually be replicated at the reorder buffering phase either. So txns use memory whether or not they do anything interesting as far as a given logical decoding session is concerned. Even if we'll throw all the data away we must buffer and assemble it first so we can make that decision.

Because logical decoding considers snapshots and cid increments even from other DBs (at least when the txn makes catalog changes) the memory use can get BIG too. I was recently working with a system that had accumulated 2GB of snapshots ... on each slot. With 7 slots, one for each DB.

So there's lots of room for difficulty with unpredictable memory use.

So the patch does two things. Firstly, it introduces logical_work_mem, a
GUC restricting memory consumed by all transactions currently kept in
the reorder buffer

Does this consider the (currently high, IIRC) overhead of tracking serialized changes?
 
--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Erik Rijkers
Date:
>>>> 
>>>> logical replication of 2 instances is OK but 3 and up fail with:
>>>> 
>>>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
>>>> "reorderbuffer.c", Line: 1773)
>>>> 
>>>> I can cobble up a script but I hope you have enough from the 
>>>> assertion
>>>> to see what's going wrong...
>>> 
>>> The assertion says that the iterator produces changes in order that 
>>> does
>>> not correlate with LSN. But I have a hard time understanding how that
>>> could happen, particularly because according to the line number this
>>> happens in ReorderBufferCommit(), i.e. the current (non-streaming) 
>>> case.
>>> 
>>> So instructions to reproduce the issue would be very helpful.
>> 
>> Using:
>> 
>> 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
>> 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
>> 0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
>> 0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
>> 0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
>> 0006-Add-support-for-streaming-to-built-in-replication-v2.patch
>> 
>> As you expected the problem is the same with these new patches.
>> 
>> I have now tested more, and seen that it not always fails.  I guess 
>> that
>> it here fails 3 times out of 4.  But the laptop I'm using at the 
>> moment
>> is old and slow -- it may well be a factor as we've seen before [1].
>> 
>> Attached is the bash that I put together.  I tested with
>> NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which 
>> fails
>> often.  This same program run with HEAD never seems to fail (I tried a
>> few dozen times).
>> 
> 
> Thanks. Unfortunately I still can't reproduce the issue. I even tried
> running it in valgrind, to see if there are some memory access issues
> (which should also slow it down significantly).

One wonders again if 2ndquadrant shouldn't invest in some old hardware 
;)

Another Good Thing would be if there was a provision in the buildfarm to 
test patches like these.

But I'm probably not the first one to suggest that; no doubt it'll be
possible someday.  In the meantime I'll try to repeat this crash on 
other machines (but that will be after the holidays).


Erik Rijkers


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:

On 12/24/2017 05:51 AM, Craig Ringer wrote:
> On 23 December 2017 at 12:57, Tomas Vondra <tomas.vondra@2ndquadrant.com
> <mailto:tomas.vondra@2ndquadrant.com>> wrote:
> 
>     Hi all,
> 
>     Attached is a patch series that implements two features to the logical
>     replication - ability to define a memory limit for the reorderbuffer
>     (responsible for building the decoded transactions), and ability to
>     stream large in-progress transactions (exceeding the memory limit).
> 
>     I'm submitting those two changes together, because one builds on the
>     other, and it's beneficial to discuss them together.
> 
> 
>     PART 1: adding logical_work_mem memory limit (0001)
>     ---------------------------------------------------
> 
>     Currently, limiting the amount of memory consumed by logical decoding is
>     tricky (or you might say impossible) for several reasons:
> 
>     * The value is hard-coded, so it's not quite possible to customize it.
> 
>     * The amount of decoded changes to keep in memory is restricted by
>     number of changes. It's not very unclear how this relates to memory
>     consumption, as the change size depends on table structure, etc.
> 
>     * The number is "per (sub)transaction", so a transaction with many
>     subtransactions may easily consume significant amount of memory without
>     actually hitting the limit.
> 
> 
> Also, even without subtransactions, we assemble a ReorderBufferTXN
> per transaction. Since transactions usually occur concurrently,
> systems with many concurrent txns can face lots of memory use.
> 

I don't see how that could be a problem, considering the number of
toplevel transactions is rather limited (to max_connections or so).

> We can't exclude tables that won't actually be replicated at the reorder
> buffering phase either. So txns use memory whether or not they do
> anything interesting as far as a given logical decoding session is
> concerned. Even if we'll throw all the data away we must buffer and
> assemble it first so we can make that decision.

Yep.

> Because logical decoding considers snapshots and cid increments even
> from other DBs (at least when the txn makes catalog changes) the memory
> use can get BIG too. I was recently working with a system that had
> accumulated 2GB of snapshots ... on each slot. With 7 slots, one for
> each DB.
> 
> So there's lots of room for difficulty with unpredictable memory use.
> 

Yep.

>     So the patch does two things. Firstly, it introduces logical_work_mem, a
>     GUC restricting memory consumed by all transactions currently kept in
>     the reorder buffer
> 
> 
> Does this consider the (currently high, IIRC) overhead of tracking
> serialized changes?
>  

Consider in what sense?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:

On 12/24/2017 10:00 AM, Erik Rijkers wrote:
>>>>>
>>>>> logical replication of 2 instances is OK but 3 and up fail with:
>>>>>
>>>>> TRAP: FailedAssertion("!(last_lsn < change->lsn)", File:
>>>>> "reorderbuffer.c", Line: 1773)
>>>>>
>>>>> I can cobble up a script but I hope you have enough from the assertion
>>>>> to see what's going wrong...
>>>>
>>>> The assertion says that the iterator produces changes in order that
>>>> does
>>>> not correlate with LSN. But I have a hard time understanding how that
>>>> could happen, particularly because according to the line number this
>>>> happens in ReorderBufferCommit(), i.e. the current (non-streaming)
>>>> case.
>>>>
>>>> So instructions to reproduce the issue would be very helpful.
>>>
>>> Using:
>>>
>>> 0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-v2.patch
>>> 0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical-v2.patch
>>> 0003-Issue-individual-invalidations-with-wal_level-log-v2.patch
>>> 0004-Extend-the-output-plugin-API-with-stream-methods-v2.patch
>>> 0005-Implement-streaming-mode-in-ReorderBuffer-v2.patch
>>> 0006-Add-support-for-streaming-to-built-in-replication-v2.patch
>>>
>>> As you expected the problem is the same with these new patches.
>>>
>>> I have now tested more, and seen that it not always fails.  I guess that
>>> it here fails 3 times out of 4.  But the laptop I'm using at the moment
>>> is old and slow -- it may well be a factor as we've seen before [1].
>>>
>>> Attached is the bash that I put together.  I tested with
>>> NUM_INSTANCES=2, which yields success, and NUM_INSTANCES=3, which fails
>>> often.  This same program run with HEAD never seems to fail (I tried a
>>> few dozen times).
>>>
>>
>> Thanks. Unfortunately I still can't reproduce the issue. I even tried
>> running it in valgrind, to see if there are some memory access issues
>> (which should also slow it down significantly).
> 
> One wonders again if 2ndquadrant shouldn't invest in some old hardware ;)
> 

Well, I've done tests on various machines, including some really slow
ones, and I still haven't managed to reproduce the failures using your
script. So I don't think that would really help. But I have reproduced
it by using a custom stress test script.

Turns out the asserts are overly strict - instead of

  Assert(prev_lsn < current_lsn);

it should have been

  Assert(prev_lsn <= current_lsn);

because some XLOG records may contain multiple rows (e.g. MULTI_INSERT).

The attached v3 fixes this issue, and also a couple of other thinkos:

1) The AssertChangeLsnOrder assert check was somewhat broken.

2) We've been sending aborts for all subtransactions, even those not yet
streamed. So downstream got confused and fell over because of an assert.

3) The streamed transactions were written to /tmp, using filenames based
on the subscription OID and the XID of the toplevel transaction. That's
fine, as long as there's just a single replica running - if there are
more, the filenames will clash, causing really strange failures. So the
files are moved to base/pgsql_tmp, where regular temporary files are
written. I'm not claiming this is perfect, perhaps we need to invent
another location.

FWIW I believe the relation sync cache is somewhat broken by the
streaming. I thought resetting it would be good enough, but it's more
complicated (and trickier) than that. I'm aware of it, and I'll look
into that next - but probably not before 2018.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Erik Rijkers
Date:
That indeed fixed the problem: running that same pgbench test, I see no 
crashes anymore (on any of 3 different machines, and with several 
pgbench parameters).

Thank you,

Erik Rijkers


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dmitry Dolgov
Date:
> On 25 December 2017 at 18:40, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> The attached v3 fixes this issue, and also a couple of other thinkos

Thank you for the patch; it looks quite interesting. After a quick look at it
(mostly the first one so far, but I'm going to continue) I have a few questions:

> + * XXX With many subtransactions this might be quite slow, because we'll have
> + * to walk through all of them. There are some options how we could improve
> + * that: (a) maintain some secondary structure with transactions sorted by
> + * amount of changes, (b) not looking for the entirely largest transaction,
> + * but e.g. for transaction using at least some fraction of the memory limit,
> + * and (c) evicting multiple transactions at once, e.g. to free a given portion
> + * of the memory limit (e.g. 50%).

Do you want to address these possible alternatives in this patch, or leave
them for later? Maybe it makes sense to apply some combination of
them, e.g. maintain a secondary structure with relatively large transactions,
and then start evicting them. If that's somehow not enough, then start evicting
multiple transactions at once (option "c").

> + /*
> + * We clamp manually-set values to at least 64kB. The maintenance_work_mem
> + * uses a higher minimum value (1MB), so this is OK.
> + */
> + if (*newval < 64)
> + *newval = 64;
> +

I'm not sure what's recommended practice here, but maybe it makes sense to
emit a warning here when the value is clamped to 64kB? Otherwise it can be
unexpected.

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Peter Eisentraut
Date:
On 12/22/17 23:57, Tomas Vondra wrote:
> PART 1: adding logical_work_mem memory limit (0001)
> ---------------------------------------------------

The documentation in this patch contains some references to later
features (streaming).  Perhaps that could be separated so that the
patches can be applied independently.

I don't see the need to tie this setting to maintenance_work_mem.
maintenance_work_mem is often set to very large values, which could then
have undesirable side effects on this use.

Moreover, the name logical_work_mem makes it sound like it's a logical
version of work_mem.  Maybe we could think of another name.

I think we need a way to report on how much memory is actually used, so
the setting can be tuned.  Something analogous to log_temp_files perhaps.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:

On 01/02/2018 04:07 PM, Peter Eisentraut wrote:
> On 12/22/17 23:57, Tomas Vondra wrote:
>> PART 1: adding logical_work_mem memory limit (0001)
>> ---------------------------------------------------
> 
> The documentation in this patch contains some references to later
> features (streaming).  Perhaps that could be separated so that the
> patches can be applied independently.
> 

Yeah, that's probably a good idea. But now that you mention it, I wonder
if "streaming" is really a good term. We already use it for "streaming
replication" and it may be quite confusing to use it for another feature
(particularly when it's streaming within logical streaming replication).

But I can't really think of a better name ...

> I don't see the need to tie this setting to maintenance_work_mem. 
> maintenance_work_mem is often set to very large values, which could
> then have undesirable side effects on this use.
> 

Well, we need to pick some default value, and we can either use a fixed
value (not sure what would be a good default) or tie it to an existing
GUC. We only really have work_mem and maintenance_work_mem, and the
walsender process will never use more than one such buffer. Which seems
to be closer to maintenance_work_mem.

Pretty much any default value can have undesirable side effects.

> Moreover, the name logical_work_mem makes it sound like it's a logical
> version of work_mem.  Maybe we could think of another name.
> 

I won't object to a better name, of course. Any proposals?

> I think we need a way to report on how much memory is actually used,
> so the setting can be tuned. Something analogous to log_temp_files
> perhaps.
> 

Yes, I agree. I'm just about to submit an updated version of the patch
series, which also introduces a few columns into pg_stat_replication, tracking
this type of stats (amount of data spilled to disk or streamed, etc.).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
Hi,

attached is v4 of the patch series, with a couple of changes:

1) Fixes a bunch of bugs I discovered during stress testing.

I'm not going to go into details, but the main fixes are related to
properly updating progress from the worker, and not streaming when
creating the logical replication slot.

2) Introduces columns into pg_stat_replication.

The new columns track various kinds of statistics (number of xacts,
bytes, ...) about spill-to-disk/streaming. This will be useful when
tuning the GUC memory limit.

3) Two temporary bugfixes that make the patch series work.

The first one (0008) makes sure is_known_subxact is set properly for all
subtransactions, and there's a separate fix in the CF. So this will
eventually go away.

The second one (0009) fixes an issue that is specific to streaming. It
does fix the issue, but I need a bit more time to think about it before
merging it into 0005.

This does pass extensive stress testing with a workload mixing DML, DDL,
subtransactions, aborts, etc. under valgrind. I'm working on extending
the test coverage, and introducing various error conditions (e.g.
walsender/walreceiver timeouts, failures on both ends, etc.).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:

On 01/03/2018 09:06 PM, Tomas Vondra wrote:
> Hi,
> 
> attached is v4 of the patch series, with a couple of changes:
> 
> 1) Fixes a bunch of bugs I discovered during stress testing.
> 
> I'm not going to go into details, but the main fixes are related to
> properly updating progress from the worker, and not streaming when
> creating the logical replication slot.
> 
> 2) Introduces columns into pg_stat_replication.
> 
> The new columns track various kinds of statistics (number of xacts,
> bytes, ...) about spill-to-disk/streaming. This will be useful when
> tuning the GUC memory limit.
> 
> 3) Two temporary bugfixes that make the patch series work.
> 

Forgot to mention that v4 also extends CREATE SUBSCRIPTION to
allow customizing the streaming and memory limit. So you can do

    CREATE SUBSCRIPTION ... WITH (streaming=on, work_mem=1024)

and this subscription will allow streaming, and logical_work_mem (on the
provider) will be set to 1MB.

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Peter Eisentraut
Date:
On 1/3/18 14:53, Tomas Vondra wrote:
>> I don't see the need to tie this setting to maintenance_work_mem. 
>> maintenance_work_mem is often set to very large values, which could
>> then have undesirable side effects on this use.
> 
> Well, we need to pick some default value, and we can either use a fixed
> value (not sure what would be a good default) or tie it to an existing
> GUC. We only really have work_mem and maintenance_work_mem, and the
> walsender process will never use more than one such buffer. Which seems
> to be closer to maintenance_work_mem.
> 
> Pretty much any default value can have undesirable side effects.

Let's just make it an independent setting unless we know any better.  We
don't have a lot of settings that depend on other settings, and the ones
we do have involve a very specific relationship.

>> Moreover, the name logical_work_mem makes it sound like it's a logical
>> version of work_mem.  Maybe we could think of another name.
> 
> I won't object to a better name, of course. Any proposals?

logical_decoding_[work_]mem?

>> I think we need a way to report on how much memory is actually used,
>> so the setting can be tuned. Something analogous to log_temp_files
>> perhaps.
> 
> Yes, I agree. I'm just about to submit an updated version of the patch
> series, that also introduces a few columns pg_stat_replication, tracking
> this type of stats (amount of data spilled to disk or streamed, etc.).

That seems OK.  Perhaps we could bring forward the part of that patch
that applies to this feature.

That would also help with testing *this* feature and determining what
appropriate settings are.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Peter Eisentraut
Date:
On 1/3/18 15:13, Tomas Vondra wrote:
> Forgot to mention that the v4 also extends the CREATE SUBSCRIPTION to
> allow customizing the streaming and memory limit. So you can do
> 
>     CREATE SUBSCRIPTION ... WITH (streaming=on, work_mem=1024)
> 
> and this subscription will allow streaming, and the logica_work_mem (on
> provider) will be set to 1MB.

I was wondering already during PG10 development whether we should give
subscriptions a generic configuration array, like databases and roles
have, so we don't have to hardcode a bunch of similar stuff every time
we add an option like this.  At the time we only had synchronous_commit,
but now we're adding more.

Also, instead of sticking this into the START_REPLICATION command, could
we just run a SET command?  That should work over replication
connections as well.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Peter Eisentraut
Date:
On 12/22/17 23:57, Tomas Vondra wrote:
> PART 1: adding logical_work_mem memory limit (0001)
> ---------------------------------------------------
> 
> Currently, limiting the amount of memory consumed by logical decoding is
> tricky (or you might say impossible) for several reasons:

I would like to see some more discussion on this, but I think not a lot
of people understand the details, so I'll try to write up an explanation
here.  This code is also somewhat new to me, so please correct me if
there are inaccuracies, while keeping in mind that I'm trying to simplify.

The data in the WAL is written as it happens, so the changes belonging
to different transactions are all mixed together.  One of the jobs of
logical decoding is to reassemble the changes belonging to each
transaction.  The top-level data structure for that is the infamous
ReorderBuffer.  So as it reads the WAL and sees something about a
transaction, it keeps a copy of that change in memory, indexed by
transaction ID (ReorderBufferChange).  When the transaction commits, the
accumulated changes are passed to the output plugin and then freed.  If
the transaction aborts, then changes are just thrown away.

So when logical decoding is active, a copy of the changes for each
active transaction is kept in memory (once per walsender).

More precisely, the above happens for each subtransaction.  When the
top-level transaction commits, it finds all its subtransactions in the
ReorderBuffer, reassembles everything in the right order, then invokes
the output plugin.

All this could end up using an unbounded amount of memory, so there is a
mechanism to spill changes to disk.  The way this currently works is
hardcoded, and this patch proposes to change that.

Currently, when a transaction or subtransaction has accumulated 4096
changes, it is spilled to disk.  When the top-level transaction commits,
things are read back from disk to do the final processing mentioned above.

This all works mostly fine, but you can construct some more extreme
cases where this can blow up.

Here is a mundane example.  Let's say a change entry takes 100 bytes (it
might contain a new row, or an update key and some new column values,
for example).  If you have 100 concurrent active sessions and no
subtransactions, then logical decoding memory is bounded by 4096 * 100 *
100 = 40 MB (per walsender) before things spill to disk.

Now let's say you are using a lot of subtransactions, because you are
using PL functions, exception handling, triggers, doing batch updates.
If you have 200 subtransactions on average per concurrent session, the
memory usage bound in that case would be 4096 * 100 * 100 * 200 = 8 GB
(per walsender).  And so on.  If you have more concurrent sessions or
larger changes or more subtransactions, you'll use much more than those
8 GB.  And if you don't have those 8 GB, then you're stuck at this point.
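
To spell out the assumptions behind those numbers, here is the same
arithmetic with the inputs as variables (illustrative only; all values
are the ones used in the examples above):

    /* Back-of-the-envelope check of the bounds above (illustrative only). */
    #include <stdio.h>

    int
    main(void)
    {
        long long changes_per_xact = 4096;  /* hardcoded spill threshold */
        long long change_size = 100;        /* assumed bytes per change */
        long long sessions = 100;           /* concurrent active sessions */
        long long subxacts = 200;           /* subtransactions per session */

        long long flat = changes_per_xact * change_size * sessions;
        long long nested = flat * subxacts;

        printf("no subxacts:  %lld bytes (~%lld MB)\n", flat, flat / 1000000);
        printf("200 subxacts: %lld bytes (~%lld GB)\n", nested, nested / 1000000000);
        return 0;
    }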

That is the consideration when we record changes, but we also need
memory when we do the final processing at commit time.  That is slightly
less problematic because we only process one top-level transaction at a
time, so the formula is only 4096 * avg_size_of_changes * nr_subxacts
(without the concurrent sessions factor).

So, this patch proposes to improve this as follows:

- We compute the actual size of each ReorderBufferChange and keep a
running tally for each transaction, instead of just counting the number
of changes.

- We have a configuration setting that allows us to change the limit
instead of the hardcoded 4096.  The configuration setting is also in
terms of memory, not in number of changes.

- The configuration setting is for the total memory usage per decoding
session, not per subtransaction.  (So we also keep a running tally for
the entire ReorderBuffer.)

There are two open issues with this patch:

One, this mechanism only applies when recording changes.  The processing
at commit time still uses the previous hardcoded mechanism.  The reason
for this is, AFAIU, that as things currently work, you have to have all
subtransactions in memory to do the final processing.  There are some
proposals to change this as well, but they are more involved.  Arguably,
per my explanation above, memory use at commit time is less likely to be
a problem.

Two, what to do when the memory limit is reached.  With the old
accounting, this was easy, because we'd decide for each subtransaction
independently whether to spill it to disk, when it has reached its 4096
limit.  Now, we are looking at a global limit, so we have to find a
transaction to spill in some other way.  The proposed patch searches
through the entire list of transactions to find the largest one.  But as
the patch says:

"XXX With many subtransactions this might be quite slow, because we'll
have to walk through all of them. There are some options how we could
improve that: (a) maintain some secondary structure with transactions
sorted by amount of changes, (b) not looking for the entirely largest
transaction, but e.g. for transaction using at least some fraction of
the memory limit, and (c) evicting multiple transactions at once, e.g.
to free a given portion of the memory limit (e.g. 50%)."

(a) would create more overhead for the case where everything fits into
memory, so it seems unattractive.  Some combination of (b) and (c) seems
useful, but we'd have to come up with something concrete.
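
For reference, the victim selection currently proposed is essentially a
linear scan along these lines (a sketch that assumes the per-transaction
size counter added by the patch; it is not the patch's actual code):

    /*
     * Sketch of the naive victim selection: walk all toplevel
     * transactions and pick the one using the most memory.
     * Subtransactions are ignored here for simplicity, and the "size"
     * field is the patch's accounting counter.  Illustrative only.
     */
    #include "postgres.h"
    #include "replication/reorderbuffer.h"

    static ReorderBufferTXN *
    pick_largest_transaction(ReorderBuffer *rb)
    {
        dlist_iter  iter;
        ReorderBufferTXN *largest = NULL;

        dlist_foreach(iter, &rb->toplevel_by_lsn)
        {
            ReorderBufferTXN *txn =
                dlist_container(ReorderBufferTXN, node, iter.cur);

            if (largest == NULL || txn->size > largest->size)
                largest = txn;
        }

        return largest;
    }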

Thoughts?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Greg Stark
Date:
On 11 January 2018 at 19:41, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

> Two, what to do when the memory limit is reached.  With the old
> accounting, this was easy, because we'd decide for each subtransaction
> independently whether to spill it to disk, when it has reached its 4096
> limit.  Now, we are looking at a global limit, so we have to find a
> transaction to spill in some other way.  The proposed patch searches
> through the entire list of transactions to find the largest one.  But as
> the patch says:
>
> "XXX With many subtransactions this might be quite slow, because we'll
> have to walk through all of them. There are some options how we could
> improve that: (a) maintain some secondary structure with transactions
> sorted by amount of changes, (b) not looking for the entirely largest
> transaction, but e.g. for transaction using at least some fraction of
> the memory limit, and (c) evicting multiple transactions at once, e.g.
> to free a given portion of the memory limit (e.g. 50%)."

AIUI spilling to disk doesn't affect absorbing future updates; we
would just keep accumulating them in memory, right? We won't need to
unspill until it comes time to commit.

Is there any actual advantage to picking the largest transaction? It
means fewer spills and fewer unspills at commit time, but that's just a
bigger spike of I/O and more of a chance of spilling more than
necessary to get by. In the end it'll be more or less the same amount
of data read back, just all in one big spike when spilling and one big
spike when committing. If you spilled smaller transactions you would
have a small amount of I/O more frequently and have to read back small
amounts for many commits. But it would add up to the same amount of
I/O (or less, if you avoid spilling more than necessary).

The real aim should be to try to pick the transaction that will be
committed furthest in the future. That gives you the most memory to
use for live transactions for the longest time and could let you
process the maximum amount of transactions without spilling them. So
either the oldest transaction (in the expectation that it's been open
a while and appears to be a long-lived batch job that will stay open
for a long time) or the youngest transaction (in the expectation that
all transactions are more or less equally long-lived) might make
sense.



-- 
greg


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Peter Eisentraut
Date:
On 1/11/18 18:23, Greg Stark wrote:
> AIUI spilling to disk doesn't affect absorbing future updates, we
> would just keep accumulating them in memory right? We won't need to
> unspill until it comes time to commit.

Once a transaction has been serialized, future updates keep accumulating
in memory, until perhaps it gets serialized again.  But then at commit
time, if a transaction has been partially serialized at all, all the
remaining changes are also serialized before the whole thing is read
back in (see reorderbuffer.c line 855).

So one optimization would be to specially keep track of all transactions
that have been serialized already and pick those first for further
serialization, because it will be done eventually anyway.

But this is only a secondary optimization, because it doesn't help in
the extreme cases that either no (or few) transactions have been
serialized or all (or most) transactions have been serialized.

> The real aim should be to try to pick the transaction that will be
> committed furthest in the future. That gives you the most memory to
> use for live transactions for the longest time and could let you
> process the maximum amount of transactions without spilling them. So
> either the oldest transaction (in the expectation that it's been open
> a while and appears to be a long-lived batch job that will stay open
> for a long time) or the youngest transaction (in the expectation that
> all transactions are more or less equally long-lived) might make
> sense.

Yes, that makes sense.  We'd still need to keep a separate ordered list
of transactions somewhere, but that might be easier if we just order
them in the order we see them.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:

On 01/11/2018 08:41 PM, Peter Eisentraut wrote:
> On 12/22/17 23:57, Tomas Vondra wrote:
>> PART 1: adding logical_work_mem memory limit (0001)
>> ---------------------------------------------------
>>
>> Currently, limiting the amount of memory consumed by logical decoding is
>> tricky (or you might say impossible) for several reasons:
> 
> I would like to see some more discussion on this, but I think not a lot
> of people understand the details, so I'll try to write up an explanation
> here.  This code is also somewhat new to me, so please correct me if
> there are inaccuracies, while keeping in mind that I'm trying to simplify.
> 
> ... snip ...

Thanks for a comprehensive summary of the patch!

> 
> "XXX With many subtransactions this might be quite slow, because we'll
> have to walk through all of them. There are some options how we could
> improve that: (a) maintain some secondary structure with transactions
> sorted by amount of changes, (b) not looking for the entirely largest
> transaction, but e.g. for transaction using at least some fraction of
> the memory limit, and (c) evicting multiple transactions at once, e.g.
> to free a given portion of the memory limit (e.g. 50%)."
> 
> (a) would create more overhead for the case where everything fits into
> memory, so it seems unattractive.  Some combination of (b) and (c) seems
> useful, but we'd have to come up with something concrete.
> 

Yeah, when writing that comment I was worried that (a) might get rather
expensive. I was thinking about maintaining a dlist of transactions
sorted by size (ReorderBuffer now only has a hash table), so that we
could evict transactions from the beginning of the list.

But while that speeds up the choice of transactions to evict, the added
cost is rather high, particularly when most transactions are roughly of
the same size. Because in that case we probably have to move the nodes
around in the list quite often. So it seems wiser to just walk the list
once when looking for a victim.

What I'm thinking about instead is tracking just some approximated
version of this - it does not really matter whether we evict the really
largest transaction or one that is a couple of kilobytes smaller. What
we care about is an answer to this question:

    Is there some very large transaction that we could evict to free
    a lot of memory, or are all transactions fairly small?

So perhaps we can define some "size classes" and track to which of them
each transaction belongs. For example, we could split the memory limit
into 100 buckets, each representing a 1% size increment.

A transaction would not switch the class very often, and it would be
trivial to pick the largest transaction. When all the transactions are
squashed in the smallest classes, we may switch to some alternative
strategy. Not sure.
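
A minimal sketch of that bucketing idea (illustrative only, not part of
the patch; a transaction would only move between per-class lists when
its class changes, and picking a victim would mean scanning from the top
class down to the first non-empty one):

    /*
     * Sketch of the "size classes" idea: map a transaction's current
     * size to one of 100 buckets covering the memory limit.
     * Illustrative only.
     */
    #include "postgres.h"

    #define NUM_SIZE_CLASSES 100

    static int
    txn_size_class(Size txn_size, Size memory_limit)
    {
        Size    cls;

        Assert(memory_limit > 0);
        cls = (txn_size * NUM_SIZE_CLASSES) / memory_limit;

        /* transactions at or above the limit all land in the top class */
        return (int) Min(cls, NUM_SIZE_CLASSES - 1);
    }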

In any case, I don't really know how expensive the selection actually
is, and if it's an issue. I'll do some measurements.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:

On 01/12/2018 05:35 PM, Peter Eisentraut wrote:
> On 1/11/18 18:23, Greg Stark wrote:
>> AIUI spilling to disk doesn't affect absorbing future updates, we
>> would just keep accumulating them in memory right? We won't need to
>> unspill until it comes time to commit.
> 
> Once a transaction has been serialized, future updates keep accumulating
> in memory, until perhaps it gets serialized again.  But then at commit
> time, if a transaction has been partially serialized at all, all the
> remaining changes are also serialized before the whole thing is read
> back in (see reorderbuffer.c line 855).
> 
> So one optimization would be to specially keep track of all transactions
> that have been serialized already and pick those first for further
> serialization, because it will be done eventually anyway.
> 
> But this is only a secondary optimization, because it doesn't help in
> the extreme cases that either no (or few) transactions have been
> serialized or all (or most) transactions have been serialized.
> 
>> The real aim should be to try to pick the transaction that will be
>> committed furthest in the future. That gives you the most memory to
>> use for live transactions for the longest time and could let you
>> process the maximum amount of transactions without spilling them. So
>> either the oldest transaction (in the expectation that it's been open
>> a while and appears to be a long-lived batch job that will stay open
>> for a long time) or the youngest transaction (in the expectation that
>> all transactions are more or less equally long-lived) might make
>> sense.
> 
> Yes, that makes sense.  We'd still need to keep a separate ordered list
> of transactions somewhere, but that might be easier if we just order
> them in the order we see them.
> 

Wouldn't the 'toplevel_by_lsn' be suitable for this? Subtransactions
don't really commit independently, but as part of the toplevel xact. And
that list is ordered by LSN, which is pretty much exactly the order in
which we see the transactions.

I feel somewhat uncomfortable about evicting the oldest (or youngest)
transactions based on some assumed correlation with the commit
order. I'm pretty sure that will bite us badly for some workloads.

Another somewhat non-intuitive detail is that because ReorderBuffer
switched to Generation allocator for changes (which usually represent
99% of the memory used during decoding), it does not reuse memory the
way AllocSet does. Actually, it does not reuse memory at all, aiming to
eventually give the memory back to libc (which AllocSet can't do).

Because of this, evicting the youngest transactions seems like quite a
bad idea, because those chunks will not be reused and there may be other
chunks on the blocks, preventing their release.

Yeah, complicated stuff.

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Peter Eisentraut
Date:
On 1/12/18 23:19, Tomas Vondra wrote:
> Wouldn't the 'toplevel_by_lsn' be suitable for this? Subtransactions
> don't really commit independently, but as part of the toplevel xact. And
> that list is ordered by LSN, which is pretty much exactly the order in
> which we see the transactions.

Yes indeed.  There is even ReorderBufferGetOldestTXN().

> Another somewhat non-intuitive detail is that because ReorderBuffer
> switched to Generation allocator for changes (which usually represent
> 99% of the memory used during decoding), it does not reuse memory the
> way AllocSet does. Actually, it does not reuse memory at all, aiming to
> eventually give the memory back to libc (which AllocSet can't do).
> 
> Because of this evicting the youngest transactions seems like a quite
> bad idea, because those chunks will not be reused and there may be other
> chunks on the blocks, preventing their release.

Right.  But this raises the question of whether we are doing the memory
accounting at the right level.  If we are doing all this tracking based
on ReorderBufferChanges, but serializing changes possibly doesn't
actually free any memory in the operating system, that's no good.  Can
we get some usage statistics out of the memory context?  It seems like
we need to keep serializing transactions until we actually see the
memory context size drop.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
On 01/19/2018 03:34 PM, Tomas Vondra wrote:
> Attached is v5, fixing a silly bug in part 0006, causing segfault when
> creating a subscription.
> 

Meh, there was a bug in the sgml docs (<variable> vs. <varname>),
causing another failure. Hopefully v6 will pass the CI build; it does
pass a build with the same parameters on my system.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Masahiko Sawada
Date:
On Sat, Jan 20, 2018 at 7:08 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 01/19/2018 03:34 PM, Tomas Vondra wrote:
>> Attached is v5, fixing a silly bug in part 0006, causing segfault when
>> creating a subscription.
>>
>
> Meh, there was a bug in the sgml docs (<variable> vs. <varname>),
> causing another failure. Hopefully v6 will pass the CI build, it does
> pass a build with the same parameters on my system.

Thank you for working on this. This patch would be helpful for
synchronous replication.

I haven't looked at the code deeply yet, but I've reviewed the v6
patch set, especially on the subscriber side. All of the patches apply
cleanly to current HEAD. Here are my review comments.

----
CREATE SUBSCRIPTION commands accept work_mem < 64, but it leads to an ERROR
on the publisher side when starting replication. We should probably check
the value on the subscriber side as well.

----
When streaming = on, if we drop the subscription in the middle of
receiving streamed changes, DROP SUBSCRIPTION could leak tmp files
(the .changes file and the .subxacts file). The same also happens when a
transaction on the upstream aborts without an abort record.

----
Since we can change both the streaming option and the work_mem option with
ALTER SUBSCRIPTION, the documentation of ALTER SUBSCRIPTION needs to be updated.

----
If we create a subscription without any options, both
pg_subscription.substream and pg_subscription.subworkmem are set to
null. However, since GetSubscription isn't aware of NULL, we start the
replication with invalid options, as follows.
LOG:  received replication command: START_REPLICATION SLOT "hoge_sub"
LOGICAL 0/0 (proto_version '2', work_mem '893219954', streaming 'on',
publication_names '"hoge_pub"')

I think we can set substream to false and subworkmem to -1 instead of
null, and then make libpqrcv_startstreaming not set the streaming option
if stream is -1.

----
Some WARNING messages appeared. Maybe these are for debugging purposes?

WARNING:  updating stream stats 0x1c12ef8 4 3 65604
WARNING: UpdateSpillStats: updating stats 0x1c12ef8 0 0 0 39 41 2632080

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Peter Eisentraut
Date:
To close out this commit fest, I'm setting both of these patches as
returned with feedback, as there are apparently significant issues to be
addressed.  Feel free to move them to the next commit fest when you
think they are ready to be continued.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
On 01/31/2018 07:53 AM, Masahiko Sawada wrote:
> On Sat, Jan 20, 2018 at 7:08 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 01/19/2018 03:34 PM, Tomas Vondra wrote:
>>> Attached is v5, fixing a silly bug in part 0006, causing segfault when
>>> creating a subscription.
>>>
>>
>> Meh, there was a bug in the sgml docs (<variable> vs. <varname>),
>> causing another failure. Hopefully v6 will pass the CI build, it does
>> pass a build with the same parameters on my system.
> 
> Thank you for working on this. This patch would be helpful for
> synchronous replication.
> 
> I haven't looked at the code deeply yet, but I've reviewed the v6
> patch set especially on subscriber side. All of the patches can be
> applied to current HEAD cleanly. Here is review comment.
> 
> ----
> CREATE SUBSCRIPTION commands accept work_mem < 64, but it leads to an ERROR
> on publisher side when starting replication. Probably we should check
> the value on the subscriber side as well.
> 
> ----
> When streaming = on, if we drop subscription in the middle of
> receiving stream changes, DROP SUBSCRIPTION could leak tmp files
> (.changes file and .subxacts file). It also happens when a
> transaction on the upstream aborts without an abort record.
> 
> ----
> Since we can change both streaming option and work_mem option by ALTER
> SUBSCRIPTION, documentation of ALTER SUBSCRIPTION needs to be updated.
> 
> ----
> If we create a subscription without any options, both
> pg_subscription.substream and pg_subscription.subworkmem are set to
> null. However, since GetSubscription isn't aware of NULLs, we start the
> replication with invalid options like follows.
> LOG:  received replication command: START_REPLICATION SLOT "hoge_sub"
> LOGICAL 0/0 (proto_version '2', work_mem '893219954', streaming 'on',
> publication_names '"hoge_pub"')
> 
> I think we can set substream to false and subworkmem to -1 instead of
> null, and then makes libpqrcv_startstreaming not set streaming option
> if stream is -1.
> 
> ----
> Some WARNING messages appeared. Maybe these are for debug purpose?
> 
> WARNING:  updating stream stats 0x1c12ef8 4 3 65604
> WARNING: UpdateSpillStats: updating stats 0x1c12ef8 0 0 0 39 41 2632080
> 
> Regards,
> 

Thanks for the review! I'll address the issues in the next version of
the patch.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
On 02/01/2018 03:51 PM, Peter Eisentraut wrote:
> To close out this commit fest, I'm setting both of these patches as
> returned with feedback, as there are apparently significant issues to be
> addressed.  Feel free to move them to the next commit fest when you
> think they are ready to be continued.
> 

Will do. Thanks for the feedback.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Andres Freund
Date:
On 2018-02-01 23:51:55 +0100, Tomas Vondra wrote:
> On 02/01/2018 03:51 PM, Peter Eisentraut wrote:
> > To close out this commit fest, I'm setting both of these patches as
> > returned with feedback, as there are apparently significant issues to be
> > addressed.  Feel free to move them to the next commit fest when you
> > think they are ready to be continued.
> > 
> 
> Will do. Thanks for the feedback.

Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48,
but I don't see a newer version posted?

Greetings,

Andres Freund


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
On 03/02/2018 02:12 AM, Andres Freund wrote:
> On 2018-02-01 23:51:55 +0100, Tomas Vondra wrote:
>> On 02/01/2018 03:51 PM, Peter Eisentraut wrote:
>>> To close out this commit fest, I'm setting both of these patches as
>>> returned with feedback, as there are apparently significant issues to be
>>> addressed.  Feel free to move them to the next commit fest when you
>>> think they are ready to be continued.
>>>
>>
>> Will do. Thanks for the feedback.
> 
> Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48,
> but I don't see a newer version posted?
> 

Ah, apologies - that's due to moving the patch from the last CF (it was
marked as RWF so I had to reopen it before moving it). I'll submit a new
version of the patch shortly, please mark it as WOA until then.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
David Steele
Date:
Hi Tomas.

On 3/1/18 9:33 PM, Tomas Vondra wrote:
> On 03/02/2018 02:12 AM, Andres Freund wrote:
>> On 2018-02-01 23:51:55 +0100, Tomas Vondra wrote:
>>> On 02/01/2018 03:51 PM, Peter Eisentraut wrote:
>>>> To close out this commit fest, I'm setting both of these patches as
>>>> returned with feedback, as there are apparently significant issues to be
>>>> addressed.  Feel free to move them to the next commit fest when you
>>>> think they are ready to be continued.
>>>>
>>>
>>> Will do. Thanks for the feedback.
>>
>> Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48,
>> but I don't see a newer version posted?
>>
> 
> Ah, apologies - that's due to moving the patch from the last CF (it was
> marked as RWF so I had to reopen it before moving it). I'll submit a new
> version of the patch shortly, please mark it as WOA until then.

Marked as Waiting on Author.

-- 
-David
david@pgmasters.net


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Andres Freund
Date:
Hi,

On 2018-03-01 21:39:36 -0500, David Steele wrote:
> On 3/1/18 9:33 PM, Tomas Vondra wrote:
> > On 03/02/2018 02:12 AM, Andres Freund wrote:
> > > Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48,
> > > but I don't see a newer version posted?
> > > 
> > 
> > Ah, apologies - that's due to moving the patch from the last CF (it was
> > marked as RWF so I had to reopen it before moving it). I'll submit a new
> > version of the patch shortly, please mark it as WOA until then.
> 
> Marked as Waiting on Author.

Sorry to be the hard-ass, but given this patch hasn't been moved forward
since 2018-01-19, I'm not sure why it's eligible to be in this CF in the
first place?

Greetings,

Andres Freund


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Robert Haas
Date:
On Thu, Mar 1, 2018 at 9:33 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Ah, apologies - that's due to moving the patch from the last CF (it was
> marked as RWF so I had to reopen it before moving it). I'll submit a new
> version of the patch shortly, please mark it as WOA until then.

So, the way it's supposed to work is you resubmit the patch first and
then re-activate the CF entry.  If you get to re-activate the CF entry
without actually updating the patch, and then submit the patch
afterwards, then the CF deadline becomes largely meaningless.  I think
a new patch should be rejected as untimely.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
David Steele
Date:
On 3/2/18 3:06 PM, Robert Haas wrote:
> On Thu, Mar 1, 2018 at 9:33 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Ah, apologies - that's due to moving the patch from the last CF (it was
>> marked as RWF so I had to reopen it before moving it). I'll submit a new
>> version of the patch shortly, please mark it as WOA until then.
> 
> So, the way it's supposed to work is you resubmit the patch first and
> then re-activate the CF entry.  If you get to re-activate the CF entry
> without actually updating the patch, and then submit the patch
> afterwards, then the CF deadline becomes largely meaningless.  I think
> a new patch should be rejected as untimely.

Hmmm, I missed that implication last night.  I'll mark this Returned
with Feedback.

Tomas, please move to the next CF once you have an updated patch.

Thanks,
-- 
-David
david@pgmasters.net


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
On 03/02/2018 09:21 PM, David Steele wrote:
> On 3/2/18 3:06 PM, Robert Haas wrote:
>> On Thu, Mar 1, 2018 at 9:33 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> Ah, apologies - that's due to moving the patch from the last CF (it was
>>> marked as RWF so I had to reopen it before moving it). I'll submit a new
>>> version of the patch shortly, please mark it as WOA until then.
>>
>> So, the way it's supposed to work is you resubmit the patch first and
>> then re-activate the CF entry.  If you get to re-activate the CF entry
>> without actually updating the patch, and then submit the patch
>> afterwards, then the CF deadline becomes largely meaningless.  I think
>> a new patch should be rejected as untimely.
> 
> Hmmm, I missed that implication last night.  I'll mark this Returned
> with Feedback.
> 
> Tomas, please move to the next CF once you have an updated patch.
> 

Can you guys please point me to the CF rules that say this? Because my
understanding (and not just mine, AFAICS) was obviously different.
Clearly there's a disconnect somewhere.

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
Hi there,

attached is an updated patch fixing all the reported issues (a bit more
about those below).

The main change in this patch version is reworked logging of subxact
assignments, which needs to be done immediately for incremental decoding
to work properly.

The previous patch versions did that by logging a separate xlog record,
which however had rather noticeable space overhead (~40% on a worst-case
test - tiny table, no FPWs, ...). While in practice the overhead would
be much closer to 0%, it still seemed unacceptable.

Andres proposed doing something like we do with replication origins in
XLogRecordAssemble, i.e. inventing a special block, and embedding the
assignment info into that (in the next xlog record). This turned out to
work quite well, and the worst-case space overhead dropped to ~5%.
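
For illustration, a rough sketch of that part of XLogRecordAssemble(), where
scratch is the record-assembly buffer cursor; the block id and variable names
are assumptions, not necessarily what the patch uses:

/* attach the top-level XID to the next record, like replication origins */
if (include_toplevel_xid && TransactionIdIsValid(toplevel_xid))
{
    *(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
    memcpy(scratch, &toplevel_xid, sizeof(TransactionId));
    scratch += sizeof(TransactionId);
}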

I have attempted to do something like that with the invalidations, which
is the other thing that needs to be logged immediately for incremental
decoding to work correctly. The plan was to use the same approach as for
assignments, i.e. embed the invalidations into the next xlog record and
stop sending them in the commit message. That however turned out to be
much more complicated - the embedding is fairly trivial, of course, but
unlike assignments the invalidations are needed for hot standbys. If we
only send them incrementally, I think the standby would have to collect
them from the WAL records, and store them in a way that survives restarts.

So for invalidations the patch uses the original approach with a new
xlog record type (ignored by the standby), and still logs the
invalidations in the commit record (which is what the standby relies on).


On 02/01/2018 11:50 PM, Tomas Vondra wrote:
> On 01/31/2018 07:53 AM, Masahiko Sawada wrote:
> ...
>> ----
>> CREATE SUBSCRIPTION commands accept work_mem < 64, but it leads to an ERROR
>> on publisher side when starting replication. Probably we should check
>> the value on the subscriber side as well.
>>

Added.
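
(For illustration only, a sketch of what that check might look like on the
subscriber side; the variable names are assumptions, not the actual patch code:)

if (work_mem_given && work_mem < 64)
    ereport(ERROR,
            (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
             errmsg("work_mem must be at least 64 kB")));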

>> ----
>> When streaming = on, if we drop subscription in the middle of
>> receiving stream changes, DROP SUBSCRIPTION could leak tmp files
>> (.changes file and .subxacts file). It also happens when a
>> transaction on the upstream aborts without an abort record.
>>

Right. The files would get cleaned up eventually during restart (just
like other temporary files), but leaking them after DROP SUBSCRIPTION is
not cool. So I've added a simple tracking of files (or rather streamed
XIDs) in the worker, and clean them explicitly on exit.
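
Conceptually something like this sketch (stream_cleanup_files comes from the
patch, but its signature and the surrounding names here are assumptions):

static TransactionId *streamed_xids = NULL;
static int  nstreamed_xids = 0;

/* drop temporary files of all streamed transactions on worker exit */
static void
stream_files_on_exit(int code, Datum arg)
{
    for (int i = 0; i < nstreamed_xids; i++)
        stream_cleanup_files(MyLogicalRepWorker->subid, streamed_xids[i]);
}

/* registered once during apply worker startup */
static void
register_stream_cleanup(void)
{
    before_shmem_exit(stream_files_on_exit, (Datum) 0);
}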

>> ----
>> Since we can change both streaming option and work_mem option by ALTER
>> SUBSCRIPTION, documentation of ALTER SUBSCRIPTION needs to be updated.
>>

Yep, I've added a note that work_mem and streaming can also be changed.
Those changes won't be applied to the already running worker, though.

>> ----
>> If we create a subscription without any options, both
>> pg_subscription.substream and pg_subscription.subworkmem are set to
>> null. However, since GetSubscription isn't aware of NULLs, we start the
>> replication with invalid options like follows.
>> LOG:  received replication command: START_REPLICATION SLOT "hoge_sub"
>> LOGICAL 0/0 (proto_version '2', work_mem '893219954', streaming 'on',
>> publication_names '"hoge_pub"')
>>
>> I think we can set substream to false and subworkmem to -1 instead of
>> null, and then makes libpqrcv_startstreaming not set streaming option
>> if stream is -1.
>>

Good catch! I've done pretty much what you suggested here, i.e. store
-1/false instead and then handle that in libpqrcv_startstreaming.

>> ----
>> Some WARNING messages appeared. Maybe these are for debug purpose?
>>
>> WARNING:  updating stream stats 0x1c12ef8 4 3 65604
>> WARNING: UpdateSpillStats: updating stats 0x1c12ef8 0 0 0 39 41 2632080
>>

Yeah, those should be removed.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
On 03/02/2018 09:05 PM, Andres Freund wrote:
> Hi,
> 
> On 2018-03-01 21:39:36 -0500, David Steele wrote:
>> On 3/1/18 9:33 PM, Tomas Vondra wrote:
>>> On 03/02/2018 02:12 AM, Andres Freund wrote:
>>>> Hm, this CF entry is marked as needs review as of 2018-03-01 12:54:48,
>>>> but I don't see a newer version posted?
>>>>
>>>
>>> Ah, apologies - that's due to moving the patch from the last CF (it was
>>> marked as RWF so I had to reopen it before moving it). I'll submit a new
>>> version of the patch shortly, please mark it as WOA until then.
>>
>> Marked as Waiting on Author.
> 
> Sorry to be the hard-ass, but given this patch hasn't been moved forward
> since 2018-01-19, I'm not sure why it's eligible to be in this CF in the
> first place?
> 

That is somewhat misleading, I think. You're right the last version was
submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e.
right at the end of the CF. So it's not like the patch was sitting there
with unresolved issues. Based on that review the patch was marked as RWF
and thus not moved to 2018-03 automatically.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Andres Freund
Date:
On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:
> That is somewhat misleading, I think. You're right the last version was
> submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e.
> right at the end of the CF. So it's not like the patch was sitting there
> with unresolved issues. Based on that review the patch was marked as RWF
> and thus not moved to 2018-03 automatically.

I don't see how this changes anything.

- Andres


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
On 03/03/2018 02:01 AM, Andres Freund wrote:
> On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:
>> That is somewhat misleading, I think. You're right the last version
>> was submitted on 2018-01-19, but the next review arrived on
>> 2018-01-31, i.e. right at the end of the CF. So it's not like the
>> patch was sitting there with unresolved issues. Based on that
>> review the patch was marked as RWF and thus not moved to 2018-03
>> automatically.
> 
> I don't see how this changes anything.
> 

You've used "The patch hasn't moved forward since 2018-01-19," as an
argument why the patch is not eligible for 2018-03. I suggest that
argument is misleading, because patches generally do not move without
reviews, and it's difficult to respond to a review that arrives on the
last day of a commitfest.

Consider that without the review, the patch would end up with NR status,
and would be moved to the next CF automatically. Isn't that a bit weird?


kind regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Andres Freund
Date:
On 2018-03-03 02:34:06 +0100, Tomas Vondra wrote:
> On 03/03/2018 02:01 AM, Andres Freund wrote:
> > On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:
> >> That is somewhat misleading, I think. You're right the last version
> >> was submitted on 2018-01-19, but the next review arrived on
> >> 2018-01-31, i.e. right at the end of the CF. So it's not like the
> >> patch was sitting there with unresolved issues. Based on that
> >> review the patch was marked as RWF and thus not moved to 2018-03
> >> automatically.
> > 
> > I don't see how this changes anything.
> > 
> 
> You've used "The patch hasn't moved forward since 2018-01-19," as an
> argument why the patch is not eligible for 2018-03. I suggest that
> argument is misleading, because patches generally do not move without
> reviews, and it's difficult to respond to a review that arrives on the
> last day of a commitfest.
> 
> Consider that without the review, the patch would end up with NR status,
> and would be moved to the next CF automatically. Isn't that a bit weird?

Not sure I follow. The point is that nobody would have complained if
you'd moved the patch into this fest if you'd updated it *before* it
started?

Greetings,

Andres Freund


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
David Steele
Date:
On 3/2/18 8:01 PM, Andres Freund wrote:
> On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:
>> That is somewhat misleading, I think. You're right the last version was
>> submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e.
>> right at the end of the CF. So it's not like the patch was sitting there
>> with unresolved issues. Based on that review the patch was marked as RWF
>> and thus not moved to 2018-03 automatically.
> 
> I don't see how this changes anything.

I agree that things could be clearer, and Andres has produced a great
document that we can build on.  The old one had gotten a bit stale.

However, I think it's pretty obvious that a CF entry should be
accompanied with a patch.  It sounds like the timing was awkward but you
still had 28 days to produce a new patch.

I also notice that you submitted 7 patches in this CF but are reviewing
zero.

-- 
-David
david@pgmasters.net


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
On 03/03/2018 02:37 AM, David Steele wrote:
> On 3/2/18 8:01 PM, Andres Freund wrote:
>> On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:
>>> That is somewhat misleading, I think. You're right the last version was
>>> submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e.
>>> right at the end of the CF. So it's not like the patch was sitting there
>>> with unresolved issues. Based on that review the patch was marked as RWF
>>> and thus not moved to 2018-03 automatically.
>>
>> I don't see how this changes anything.
> 
> I agree that things could be clearer, and Andres has produced a great
> document that we can build on.  The old one had gotten a bit stale.
> 
> However, I think it's pretty obvious that a CF entry should be 
> accompanied with a patch. It sounds like the timing was awkward but
> you still had 28 days to produce a new patch.
> 

Based on internal discussion I'm not so sure about the "pretty obvious"
part. It certainly wasn't that obvious to me, otherwise I'd submit the
revised patch earlier - hindsight is 20/20.

> I also notice that you submitted 7 patches in this CF but are
> reviewing zero.
> 

I've volunteered to review a couple of patches at the FOSDEM Developer
Meeting - I thought Stephen was entering that into the CF app, not sure
where it got lost.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
David Steele
Date:
On 3/2/18 8:54 PM, Tomas Vondra wrote:
> On 03/03/2018 02:37 AM, David Steele wrote:
>> On 3/2/18 8:01 PM, Andres Freund wrote:
>>> On 2018-03-03 02:00:46 +0100, Tomas Vondra wrote:
>>>> That is somewhat misleading, I think. You're right the last version was
>>>> submitted on 2018-01-19, but the next review arrived on 2018-01-31, i.e.
>>>> right at the end of the CF. So it's not like the patch was sitting there
>>>> with unresolved issues. Based on that review the patch was marked as RWF
>>>> and thus not moved to 2018-03 automatically.
>>>
>>> I don't see how this changes anything.
>>
>> I agree that things could be clearer, and Andres has produced a great
>> document that we can build on.  The old one had gotten a bit stale.
>>
>> However, I think it's pretty obvious that a CF entry should be 
>> accompanied with a patch. It sounds like the timing was awkward but
>> you still had 28 days to produce a new patch.
> 
> Based on internal discussion I'm not so sure about the "pretty obvious"
> part. It certainly wasn't that obvious to me, otherwise I'd submit the
> revised patch earlier - hindsight is 20/20.

Indeed it is.  Be assured that nobody takes pleasure in pushing patches,
but we have limited resources and must make some choices.

>> I also notice that you submitted 7 patches in this CF but are
>> reviewing zero.
> 
> I've volunteered to review a couple of patches at the FOSDEM Developer
> Meeting - I thought Stephen was entering that into the CF app, not sure
> where it got lost.

There are plenty of patches that need review, so go for it.

Regards,
-- 
-David
david@pgmasters.net


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Erik Rijkers
Date:
On 2018-03-03 01:55, Tomas Vondra wrote:
> Hi there,
> 
> attached is an updated patch fixing all the reported issues (a bit more
> about those below).

Hi,

0007-Track-statistics-for-streaming-spilling.patch  won't apply.  All 
the other patches apply ok.

patch complains with:

patching file doc/src/sgml/monitoring.sgml
patching file src/backend/catalog/system_views.sql
Hunk #1 succeeded at 734 (offset 2 lines).
patching file src/backend/replication/logical/reorderbuffer.c
patching file src/backend/replication/walsender.c
patching file src/include/catalog/pg_proc.h
Hunk #1 FAILED at 2903.
1 out of 1 hunk FAILED -- saving rejects to file 
src/include/catalog/pg_proc.h.rej
patching file src/include/replication/reorderbuffer.h
patching file src/include/replication/walsender_private.h
patching file src/test/regress/expected/rules.out
Hunk #1 succeeded at 1861 (offset 2 lines).

Attached the produced reject file.


thanks,

Erik Rijkers
Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:

On 03/03/2018 06:19 AM, Erik Rijkers wrote:
> On 2018-03-03 01:55, Tomas Vondra wrote:
>> Hi there,
>>
>> attached is an updated patch fixing all the reported issues (a bit more
>> about those below).
> 
> Hi,
> 
> 0007-Track-statistics-for-streaming-spilling.patch  won't apply.  All
> the other patches apply ok.
> 
> patch complains with:
> 
> patching file doc/src/sgml/monitoring.sgml
> patching file src/backend/catalog/system_views.sql
> Hunk #1 succeeded at 734 (offset 2 lines).
> patching file src/backend/replication/logical/reorderbuffer.c
> patching file src/backend/replication/walsender.c
> patching file src/include/catalog/pg_proc.h
> Hunk #1 FAILED at 2903.
> 1 out of 1 hunk FAILED -- saving rejects to file
> src/include/catalog/pg_proc.h.rej
> patching file src/include/replication/reorderbuffer.h
> patching file src/include/replication/walsender_private.h
> patching file src/test/regress/expected/rules.out
> Hunk #1 succeeded at 1861 (offset 2 lines).
> 
> Attached the produced reject file.
> 

Yeah, that's due to fd1a421fe66 which changed columns in pg_proc.h.
Attached is a rebased patch, fixing this.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Peter Eisentraut
Date:
I think this patch is not going to be ready for PG11.

- It depends on some work in the thread "logical decoding of two-phase
transactions", which is still in progress.

- Various details in the logical_work_mem patch (0001) are unresolved.

- This being partially a performance feature, we haven't seen any
performance tests (e.g., which settings result in which latencies under
which workloads).

That said, the feature seems useful and desirable, and the
implementation makes sense.  There are documentation and tests.  But
there is a significant amount of design and coding work still necessary.

Attached is a fixup patch that I needed to make it compile.

The last two patches in your series (0008, 0009) are labeled as bug
fixes.  Would you like to argue that they should be applied
independently of the rest of the feature?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Konstantin Knizhnik
Date:

On 11.01.2018 22:41, Peter Eisentraut wrote:
> On 12/22/17 23:57, Tomas Vondra wrote:
>> PART 1: adding logical_work_mem memory limit (0001)
>> ---------------------------------------------------
>>
>> Currently, limiting the amount of memory consumed by logical decoding is
>> tricky (or you might say impossible) for several reasons:
> I would like to see some more discussion on this, but I think not a lot
> of people understand the details, so I'll try to write up an explanation
> here.  This code is also somewhat new to me, so please correct me if
> there are inaccuracies, while keeping in mind that I'm trying to simplify.
>
> The data in the WAL is written as it happens, so the changes belonging
> to different transactions are all mixed together.  One of the jobs of
> logical decoding is to reassemble the changes belonging to each
> transaction.  The top-level data structure for that is the infamous
> ReorderBuffer.  So as it reads the WAL and sees something about a
> transaction, it keeps a copy of that change in memory, indexed by
> transaction ID (ReorderBufferChange).  When the transaction commits, the
> accumulated changes are passed to the output plugin and then freed.  If
> the transaction aborts, then changes are just thrown away.
>
> So when logical decoding is active, a copy of the changes for each
> active transaction is kept in memory (once per walsender).
>
> More precisely, the above happens for each subtransaction.  When the
> top-level transaction commits, it finds all its subtransactions in the
> ReorderBuffer, reassembles everything in the right order, then invokes
> the output plugin.
>
> All this could end up using an unbounded amount of memory, so there is a
> mechanism to spill changes to disk.  The way this currently works is
> hardcoded, and this patch proposes to change that.
>
> Currently, when a transaction or subtransaction has accumulated 4096
> changes, it is spilled to disk.  When the top-level transaction commits,
> things are read back from disk to do the final processing mentioned above.
>
> This all works mostly fine, but you can construct some more extreme
> cases where this can blow up.
>
> Here is a mundane example.  Let's say a change entry takes 100 bytes (it
> might contain a new row, or an update key and some new column values,
> for example).  If you have 100 concurrent active sessions and no
> subtransactions, then logical decoding memory is bounded by 4096 * 100 *
> 100 = 40 MB (per walsender) before things spill to disk.
>
> Now let's say you are using a lot of subtransactions, because you are
> using PL functions, exception handling, triggers, doing batch updates.
> If you have 200 subtransactions on average per concurrent session, the
> memory usage bound in that case would be 4096 * 100 * 100 * 200 = 8 GB
> (per walsender).  And so on.  If you have more concurrent sessions or
> larger changes or more subtransactions, you'll use much more than those
> 8 GB.  And if you don't have those 8 GB, then you're stuck at this point.
>
> That is the consideration when we record changes, but we also need
> memory when we do the final processing at commit time.  That is slightly
> less problematic because we only process one top-level transaction at a
> time, so the formula is only 4096 * avg_size_of_changes * nr_subxacts
> (without the concurrent sessions factor).
>
> So, this patch proposes to improve this as follows:
>
> - We compute the actual size of each ReorderBufferChange and keep a
> running tally for each transaction, instead of just counting the number
> of changes.
>
> - We have a configuration setting that allows us to change the limit
> instead of the hardcoded 4096.  The configuration setting is also in
> terms of memory, not in number of changes.
>
> - The configuration setting is for the total memory usage per decoding
> session, not per subtransaction.  (So we also keep a running tally for
> the entire ReorderBuffer.)
>
> There are two open issues with this patch:
>
> One, this mechanism only applies when recording changes.  The processing
> at commit time still uses the previous hardcoded mechanism.  The reason
> for this is, AFAIU, that as things currently work, you have to have all
> subtransactions in memory to do the final processing.  There are some
> proposals to change this as well, but they are more involved.  Arguably,
> per my explanation above, memory use at commit time is less likely to be
> a problem.
>
> Two, what to do when the memory limit is reached.  With the old
> accounting, this was easy, because we'd decide for each subtransaction
> independently whether to spill it to disk, when it has reached its 4096
> limit.  Now, we are looking at a global limit, so we have to find a
> transaction to spill in some other way.  The proposed patch searches
> through the entire list of transactions to find the largest one.  But as
> the patch says:
>
> "XXX With many subtransactions this might be quite slow, because we'll
> have to walk through all of them. There are some options how we could
> improve that: (a) maintain some secondary structure with transactions
> sorted by amount of changes, (b) not looking for the entirely largest
> transaction, but e.g. for transaction using at least some fraction of
> the memory limit, and (c) evicting multiple transactions at once, e.g.
> to free a given portion of the memory limit (e.g. 50%)."
>
> (a) would create more overhead for the case where everything fits into
> memory, so it seems unattractive.  Some combination of (b) and (c) seems
> useful, but we'd have to come up with something concrete.
>
> Thoughts?
>

I am very sorry that I did not notice this thread before.
Spilling to files in the reorder buffer is the main factor limiting the
speed of importing data in multimaster and shardman (sharding based on
FDW with redundancy provided by LR).
This is why we think a lot about possible ways of addressing this issue.
Right now the data of a huge transaction is written to disk three times
before it is applied at the replica, and obviously it is also read three
times. First it is saved in the WAL, then spilled to disk by the reorder
buffer, and once again spilled to disk at the replica before being
assigned to a particular apply worker (the last step is specific to
multimaster, which can apply received transactions concurrently).

We considered three different approaches:
1. Streaming. It is similar to the proposed patch; the main difference
is that we do not want to spill the transaction to a temporary file at
the replica, but apply it immediately in a separate backend and abort
that transaction if it is aborted at the master. Certainly it will only
work with 2PC.
2. Elimination of spilling by rescanning the WAL.
3. Bypassing the WAL: add hooks to heapam to buffer and propagate changes
immediately to the replica and apply them in a dedicated backend.
I have implemented a prototype of such replication. With one replica it
shows about a 1.5x slowdown compared with standalone/async LR and about a
2-3x improvement compared with sync LR. For two replicas the result is 2x
slower than async LR and 2-8 times faster than sync LR (depending on the
number of concurrent connections).

Approach 3) seems to be specific to multimaster/shardman, so most likely
it cannot be considered for general LR.
So I want to compare 1 and 2. Did you ever think about something like 2?

Right now the proposed patch just moves the spilling to files from the
master to the replica. It can still make sense to avoid memory overflow
and reduce disk IO at the master. But if we have just one huge
transaction (COPY) importing gigabytes of data into the database, then
performance will be almost the same with or without your patch. The only
difference is where we serialize the transaction: at the master or at the
replica side. In this sense the patch doesn't solve the problem of slow
loading of large bulks of data through LR.

Alternatively (approach 2) we can have a small in-memory buffer for
decoding the transaction and remember the LSN and snapshot of this
transaction's start. In case of buffer overflow we just continue the WAL
traversal until we reach the end of the transaction. After that we
restart scanning the WAL from the beginning of this transaction and, in
this second pass, send the changes directly to the output plugin. So we
have to scan the WAL several times, but we do not need to spill anything
to disk, neither at the publisher nor at the subscriber side.
Certainly this approach will be inefficient if we have several long
interleaving transactions. But in most customer use cases we have
observed until now there is just one huge transaction performing a bulk
load.
Maybe I missed something, but this approach seems easier to implement
than transaction streaming, and it doesn't require any changes to the
output plugin API.
I realize that it is a little bit late to ask this question once your
patch is almost ready, but what do you think about it? Are there some
pitfalls with this approach?

There is one more aspect and performance problem with LR we have faced
with shardman: if there are several publications for different subsets of
tables at one instance, then the WAL senders have to do a lot of useless
work. They are decoding transactions which have no relation to their
publication, but a WAL sender doesn't know that until it reaches the end
of the transaction. What is worse: if the transaction is huge, then all
WAL senders will spill it to disk even though only one of them actually
needs it. So the data of a huge transaction is written not three times,
but N times, where N is the number of publications. The only solution to
the problem we can imagine is to let the backend somehow inform the WAL
sender (through a shared message queue?) about the LSNs it should
consider. In this case the WAL sender can skip large portions of WAL
without decoding. We would also like to know 2ndQuadrant's opinion about
this idea.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Peter Eisentraut
Date:
This patch set was not updated for the 2018-07 commitfest, so moved to -09.


On 09.03.18 17:07, Peter Eisentraut wrote:
> I think this patch is not going to be ready for PG11.
> 
> - It depends on some work in the thread "logical decoding of two-phase
> transactions", which is still in progress.
> 
> - Various details in the logical_work_mem patch (0001) are unresolved.
> 
> - This being partially a performance feature, we haven't seen any
> performance tests (e.g., which settings result in which latencies under
> which workloads).
> 
> That said, the feature seems useful and desirable, and the
> implementation makes sense.  There are documentation and tests.  But
> there is a significant amount of design and coding work still necessary.
> 
> Attached is a fixup patch that I needed to make it compile.
> 
> The last two patches in your series (0008, 0009) are labeled as bug
> fixes.  Would you like to argue that they should be applied
> independently of the rest of the feature?
> 


-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Michael Paquier
Date:
On Sat, Mar 03, 2018 at 03:52:40PM +0100, Tomas Vondra wrote:
> Yeah, that's due to fd1a421fe66 which changed columns in pg_proc.h.
> Attached is a rebased patch, fixing this.

The latest patch set does not apply anymore, and had no activity for the
last two months, so I am marking it as returned with feedback.
--
Michael

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
Hi,

Attached is an updated version of this patch series. It's meant to be
applied on top of the 2pc decoding patch [1], because streaming of
in-progress transactions requires handling of concurrent aborts. So it
may or may not apply directly to master, I'm not sure - unfortunately
that's likely to confuse the cputube thing, but I don't want to include
the 2pc decoding bits here because that would be just confusing.

If needed, the part introducing logical_work_mem limit for ReorderBuffer
can be separated and committed independently, but I do expect this to be
committed after the 2pc decoding patch so I've left it like this.

This new version is mostly just a rebase to current master (or almost,
because 2pc decoding only applies to 29180e5d78 due to minor bitrot),
but it also addresses the new stuff committed since last version (most
importantly decoding of TRUNCATE). It also fixes a bug in WAL-logging of
subxact assignments, where the assignment was included in records with
XID=0, essentially failing to track the subxact properly.

For the logical_work_mem part, I think this is quite solid. The main
question is how to pick transactions for eviction. For now it uses the
same approach as master (i.e. picking the largest top-level transaction,
although measured by amount of memory and not just number of changes).

But I've realized that may not work that great with the Generation
context, because unlike AllocSet it does not reuse the memory. That's
nice as it allows freeing old blocks (which AllocSet can't), but it means
a small transaction can have a change on an old block, preventing its
free(). That is something we have in pg11 already, because that's where
the Generation context got introduced - I haven't seen this issue in
practice, but we might need to do something about it.

In any case, I'm thinking we may need to pick a different eviction
algorithm - say using a transaction with the oldest change (and loop
until we release at least one block in the Generation context), or maybe
look for block mixing changes from the smallest number of transactions,
or something like that. Other ideas are welcome. I don't think the exact
algorithm is particularly critical, because it's meant to be triggered
only very rarely (i.e. pick logical_work_mem high enough).
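
To make that concrete, a sketch of the eviction loop as described above
(function and field names are modeled on the patch, so treat them as
assumptions); the alternative algorithms would only change how the victim
transaction is picked:

static void
ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
{
    /* keep evicting until the accounted size drops below the limit */
    while (rb->size >= (Size) logical_work_mem * 1024)
    {
        ReorderBufferTXN *txn = ReorderBufferLargestTXN(rb);

        if (txn == NULL)
            break;              /* nothing left to evict */

        /* spill the victim's changes; this reduces the accounted size */
        ReorderBufferSerializeTXN(rb, txn);
    }
}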

The in-progress streaming is mostly mechanical extension of existing
functionality (new methods in various APIs, ...) and refactoring of
ReorderBuffer to handle incremental decoding. I'm sure it'd benefit from
reviews, of course.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
FWIW the original CF entry in 2018-07 [1] was marked as RWF. I'm not
sure what's the right way to resubmit such patches, so I've created a
new entry in 2019-01 [2] referencing the same hackers thread (and with
the same authors/reviewers metadata).

[1] https://commitfest.postgresql.org/19/1429/
[2] https://commitfest.postgresql.org/21/1927/

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Alexey Kondratov
Date:
Hi Tomas,

> This new version is mostly just a rebase to current master (or almost,
> because 2pc decoding only applies to 29180e5d78 due to minor bitrot),
> but it also addresses the new stuff committed since last version (most
> importantly decoding of TRUNCATE). It also fixes a bug in WAL-logging of
> subxact assignments, where the assignment was included in records with
> XID=0, essentially failing to track the subxact properly.

I started reviewing your patch about a month ago and tried to do an 
in-depth review, since I am very interested in this patch too. The new 
version is not applicable to master 29180e5d78, but everything is OK 
after applying 2pc patch before. Anyway, I guess it may complicate 
further testing and review, since any potential reviewer has to take 
into account both patches at once. Previous version was applicable to 
master and was working fine for me separately (excepting a few 
patch-specific issues, which I try to explain below).


Patch review
========

First of all, I want to say thank you for such a huge work done. Here 
are some problems, which I have found and hopefully fixed with my 
additional patch (please, find attached, it should be applicable to the 
last commit of your newest patch version):

1) The most important issue is that your tap tests were broken: the
option "WITH (streaming=true)" was missing in the subscription creation
statement. Therefore, the spilling mechanism has been tested rather than
streaming.

2) After fixing the tests, the first one with simple streaming
immediately fails because of a logical replication worker segmentation
fault. It happens since the worker tries to call stream_cleanup_files
inside stream_open_file at the stream start while nxids is zero; it then
goes negative and everything crashes. Something similar may happen with
the xids array, so I added two checks there.

3) The next problem is much more critical and concerns historic MVCC
visibility rules. Previously, the walsender started to decode a
transaction at commit and we were able to resolve all xmin, xmax,
combocids to cmin/cmax, build the tuplecids hash and so on, but now we
start doing all these things on the fly.

Thus, a rather difficult situation arises: HeapTupleSatisfiesHistoricMVCC
is trying to validate catalog tuples which are currently in the future
relative to the current decoder position inside the transaction, e.g. we
may want to resolve cmin/cmax of a tuple, which was created with cid 3 
and deleted with cid 5, while we are currently at cid 4, so our 
tuplecids hash is not complete to handle such a case.

I have updated HeapTupleSatisfiesHistoricMVCC visibility rules with two 
options:

/*
 * If we accidentally see a tuple from our transaction but cannot resolve
 * its cmin, it is probably from the future, thus drop it.
 */
if (!resolved)
    return false;

and

/*
 * If we accidentally see a tuple from our transaction but cannot resolve
 * its cmax, or cmax == InvalidCommandId, it is probably still valid,
 * thus accept it.
 */
if (!resolved || cmax == InvalidCommandId)
    return true;

4) There was a problem with marking top-level transaction as having 
catalog changes if one of its subtransactions has. It was causing a 
problem with DDL statements just after subtransaction start (savepoint), 
so data from new columns is not replicated.

5) Similar issue with schema send. You send schema only once per each 
sub/transaction (IIRC), while we have to update schema on each catalog 
change: invalidation execution, snapshot rebuild, adding new tuple cids. 
So I ended up with adding is_schema_send flag to ReorderBufferTXN, since 
it is easy to set it inside RB and read in the output plugin. Probably, 
we have to choose a better place for this flag.

6) To better handle all these tricky cases I added a new tap test,
014_stream_tough_ddl.pl, which consists of a really tough combination
of DDL, DML, savepoints and ROLLBACK/RELEASE in a single transaction.

I marked all my fixes and every questionable place with a comment and a
"TOCHECK:" label for easy search. Removing pretty much any of these fixes
leads to test failures due to segmentation faults or replication
mismatches. Though I mostly read and tested the old version of the patch,
after a quick look it seems that all these fixes are applicable to the
new version of the patch as well.


Performance
========

I have also performed a series of performance tests, and found that 
patch adds a huge overhead in the case of a large transaction consisting 
of many small rows, e.g.:

CREATE TABLE large_test (num1 bigint, num2 double precision, num3 double 
precision);

EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3)
SELECT round(random()*10), random(), random()*142
FROM generate_series(1, 1000000) s(i);

Execution Time: 2407.709 ms
Total Time: 11494,238 ms (00:11,494)

With synchronous_standby_names and 64 MB logical_work_mem it takes up to
5x longer, while without the patch it is about 2x. Thus, logical
replication streaming is approximately 4x slower for similar
transactions.

However, dealing with large transactions consisting of a small number of 
large rows is much better:

CREATE TABLE large_text (t TEXT);

EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 1000000)) FROM generate_series(1, 125);

Execution Time: 3545.642 ms
Total Time: 7678,617 ms (00:07,679)

It is around the same 2x as without the patch. If someone is interested,
I have also added flame graphs of the walsender and the logical
replication worker during the processing of the first (numeric)
transaction.


Regards

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company


Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Tomas Vondra
Date:
Hi Alexey,

Thanks for the thorough and extremely valuable review!

On 12/17/18 5:23 PM, Alexey Kondratov wrote:
> Hi Tomas,
> 
>> This new version is mostly just a rebase to current master (or almost,
>> because 2pc decoding only applies to 29180e5d78 due to minor bitrot),
>> but it also addresses the new stuff committed since last version (most
>> importantly decoding of TRUNCATE). It also fixes a bug in WAL-logging of
>> subxact assignments, where the assignment was included in records with
>> XID=0, essentially failing to track the subxact properly.
> 
> I started reviewing your patch about a month ago and tried to do an
> in-depth review, since I am very interested in this patch too. The new
> version is not applicable to master 29180e5d78, but everything is OK
> after applying 2pc patch before. Anyway, I guess it may complicate
> further testing and review, since any potential reviewer has to take
> into account both patches at once. Previous version was applicable to
> master and was working fine for me separately (excepting a few
> patch-specific issues, which I try to explain below).
> 

I agree it's somewhat annoying, but I don't think there's a better way,
unfortunately. Decoding in-progress transactions does require safe
handling of concurrent aborts, so it has to be committed after the 2pc
decoding patch (which makes that possible). But the 2pc patch also
touches the same places as this patch series (it reworks the reorder
buffer for example).

> 
> Patch review
> ========
> 
> First of all, I want to say thank you for such a huge work done. Here
> are some problems, which I have found and hopefully fixed with my
> additional patch (please, find attached, it should be applicable to the
> last commit of your newest patch version):
> 
> 1) The most important issue is that your tap tests were broken—there was
> missing option "WITH (streaming=true)" in the subscription creating
> statement. Therefore, spilling mechanism has been tested rather than
> streaming.
> 

D'oh!

> 2) After fixing tests the first one with simple streaming is immediately
> failed, because of logical replication worker segmentation fault. It
> happens, since worker tries to call stream_cleanup_files inside
> stream_open_file at the stream start, while nxids is zero, then it goes
> to the negative value and everything crashes. Something similar may
> happen with xids array, so I added two checks there.
> 
> 3) The next problem is much more critical and is dedicated to historic
> MVCC visibility rules. Previously, walsender was starting to decode
> transaction on commit and we were able to resolve all xmin, xmax,
> combocids to cmin/cmax, build tuplecids hash and so on, but now we start
> doing all these things on the fly.
> 
> Thus, rather difficult situation arises: HeapTupleSatisfiesHistoricMVCC
> is trying to validate catalog tuples, which are currently in the future
> relatively to the current decoder position inside transaction, e.g. we
> may want to resolve cmin/cmax of a tuple, which was created with cid 3
> and deleted with cid 5, while we are currently at cid 4, so our
> tuplecids hash is not complete to handle such a case.
> 

Damn it! I ran into those two issues some time ago and I fixed it, but
I've forgotten to merge that fix into the patch. I'll merge those fixes
and compare them to your proposed fix, and send a new version tomorrow.

> 
> 4) There was a problem with marking top-level transaction as having
> catalog changes if one of its subtransactions has. It was causing a
> problem with DDL statements just after subtransaction start (savepoint),
> so data from new columns is not replicated.
> 
> 5) Similar issue with schema send. You send schema only once per each
> sub/transaction (IIRC), while we have to update schema on each catalog
> change: invalidation execution, snapshot rebuild, adding new tuple cids.
> So I ended up with adding is_schema_send flag to ReorderBufferTXN, since
> it is easy to set it inside RB and read in the output plugin. Probably,
> we have to choose a better place for this flag.
> 

Hmm. Can you share an example how to trigger these issues?

> 6) To better handle all these tricky cases I added new tap
> test—014_stream_tough_ddl.pl—which consist of really tough combination
> of DDL, DML, savepoints and ROLLBACK/RELEASE in a single transaction.
> 

Thanks!

> I marked all my fixes and every questionable place with comment and
> "TOCHECK:" label for easy search. Removing of pretty any of these fixes
> leads to the tests fail due to the segmentation fault or replication
> mismatch. Though I mostly read and tested old version of patch, but
> after a quick look it seems that all these fixes are applicable to the
> new version of patch as well.
> 

Thanks. I'll go through your patch tomorrow.

> 
> Performance
> ========
> 
> I have also performed a series of performance tests, and found that
> patch adds a huge overhead in the case of a large transaction consisting
> of many small rows, e.g.:
> 
> CREATE TABLE large_test (num1 bigint, num2 double precision, num3 double
> precision);
> 
> EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3)
> SELECT round(random()*10), random(), random()*142
> FROM generate_series(1, 1000000) s(i);
> 
> Execution Time: 2407.709 ms
> Total Time: 11494,238 ms (00:11,494)
> 
> With synchronous_standby_names and 64 MB logical_work_mem it takes up to
> x5 longer, while without patch it is about x2. Thus, logical replication
> streaming is approximately x4 as slower for similar transactions.
> 
> However, dealing with large transactions consisting of a small number of
> large rows is much better:
> 
> CREATE TABLE large_text (t TEXT);
> 
> EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_text
> SELECT (SELECT string_agg('x', ',')
> FROM generate_series(1, 1000000)) FROM generate_series(1, 125);
> 
> Execution Time: 3545.642 ms
> Total Time: 7678,617 ms (00:07,679)
> 
> It is around the same x2 as without patch. If someone is interested I
> also added flamegraphs of walsender and logical replication worker
> during first numerical transaction processing.
> 

Interesting. Any idea where does the extra overhead in this particular
case come from? It's hard to deduce that from the single flame graph,
when I don't have anything to compare it with (i.e. the flame graph for
the "normal" case).

I'll investigate this (probably not this week), but in general it's good
to keep in mind a couple of things:

1) Some overhead is expected, due to doing things incrementally.

2) The memory limit should be set to sufficiently high value to be hit
only infrequently.

3) And when the limit is actually hit, it's an alternative to spilling
large amounts of data locally (to disk) or incurring significant
replication lag later.

So I'm not particularly worried, but I'll look into that. I'd be much
more worried if there was measurable overhead in cases when there's no
streaming happening (either because it's disabled or the memory limit
was not hit).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Alexey Kondratov
Date:
On 18.12.2018 1:28, Tomas Vondra wrote:
>> 4) There was a problem with marking top-level transaction as having
>> catalog changes if one of its subtransactions has. It was causing a
>> problem with DDL statements just after subtransaction start (savepoint),
>> so data from new columns is not replicated.
>>
>> 5) Similar issue with schema send. You send schema only once per each
>> sub/transaction (IIRC), while we have to update schema on each catalog
>> change: invalidation execution, snapshot rebuild, adding new tuple cids.
>> So I ended up with adding is_schema_send flag to ReorderBufferTXN, since
>> it is easy to set it inside RB and read in the output plugin. Probably,
>> we have to choose a better place for this flag.
>>
> Hmm. Can you share an example how to trigger these issues?

Test cases inside 014_stream_tough_ddl.pl and old ones (with 
streaming=true option added) should reproduce all these issues. In 
general, it happens in a txn like:

INSERT
SAVEPOINT
ALTER TABLE ... ADD COLUMN
INSERT

then the second insert may discover old version of catalog.

> Interesting. Any idea where does the extra overhead in this particular
> case come from? It's hard to deduce that from the single flame graph,
> when I don't have anything to compare it with (i.e. the flame graph for
> the "normal" case).

I guess the bottleneck is in disk operations. You can check the
logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
writes (~26%) take around 35% of CPU time in total. For comparison,
please see the attached flame graph for the following transaction:

INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);

Execution Time: 44519.816 ms
Time: 98333,642 ms (01:38,334)

where disk IO is only ~7-8% in total. So we get very roughly the same
~4-5x performance drop here. JFYI, I am using a machine with an SSD for
the tests.

Therefore, probably you may write changes on receiver in bigger chunks, 
not each change separately.

> So I'm not particularly worried, but I'll look into that. I'd be much
> more worried if there was measurable overhead in cases when there's no
> streaming happening (either because it's disabled or the memory limit
> was not hit).

What I have also just found is that if a table row is large enough to 
be TOASTed, e.g.:

INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);

then the logical_work_mem limit is not hit and we neither stream nor 
spill this transaction to disk, even though it is still large. In 
contrast, the transaction above (with 1000000 smaller rows), which is 
comparable in size, is streamed. I am not sure that it is easy to add 
proper accounting of TOAST-able columns, but it seems worth it.
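
For reference, each value generated by that query is roughly 2 MB (1000000 
'x' characters plus separators), i.e. far above the TOAST threshold of 
about 2 kB, so it is stored out of line. Something like the following can 
be used to confirm the per-row size (the stored size may be much smaller 
because of compression):

SELECT octet_length(t) AS datum_size,
       pg_column_size(t) AS stored_size
FROM large_text
LIMIT 1;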

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company


Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Tomas Vondra
Дата:
Hi Alexey,

Attached is an updated version of the patches, with all the fixes I've
done in the past. I believe it should fix at least some of the issues
you reported - certainly the problem with stream_cleanup_files, but
perhaps some of the other issues too.

I'm a bit confused by the changes to TAP tests. Per the patch summary,
some .pl files get renamed (not sure why), a new one is added, etc. So
I've instead enabled streaming subscriptions in all tests, which with
this patch produces two failures:

Test Summary Report
-------------------
t/004_sync.pl                    (Wstat: 7424 Tests: 1 Failed: 0)
  Non-zero exit status: 29
  Parse errors: Bad plan.  You planned 7 tests but ran 1.
t/011_stream_ddl.pl              (Wstat: 256 Tests: 2 Failed: 1)
  Failed test:  2
  Non-zero exit status: 1

So yeah, there's more stuff to fix. But I can't directly apply your
fixes because the updated patches are somewhat different.


On 12/18/18 3:07 PM, Alexey Kondratov wrote:
> On 18.12.2018 1:28, Tomas Vondra wrote:
>>> 4) There was a problem with marking top-level transaction as having
>>> catalog changes if one of its subtransactions has. It was causing a
>>> problem with DDL statements just after subtransaction start (savepoint),
>>> so data from new columns is not replicated.
>>>
>>> 5) Similar issue with schema send. You send schema only once per each
>>> sub/transaction (IIRC), while we have to update schema on each catalog
>>> change: invalidation execution, snapshot rebuild, adding new tuple cids.
>>> So I ended up with adding is_schema_send flag to ReorderBufferTXN, since
>>> it is easy to set it inside RB and read in the output plugin. Probably,
>>> we have to choose a better place for this flag.
>>>
>> Hmm. Can you share an example how to trigger these issues?
> 
> Test cases inside 014_stream_tough_ddl.pl and old ones (with
> streaming=true option added) should reproduce all these issues. In
> general, it happens in a txn like:
> 
> INSERT
> SAVEPOINT
> ALTER TABLE ... ADD COLUMN
> INSERT
> 
> then the second insert may discover old version of catalog.
> 

Yeah, that's the issue I've discovered before and thought it got fixed.

>> Interesting. Any idea where does the extra overhead in this particular
>> case come from? It's hard to deduce that from the single flame graph,
>> when I don't have anything to compare it with (i.e. the flame graph for
>> the "normal" case).
> 
> I guess that bottleneck is in disk operations. You can check
> logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
> writes (~26%) take around 35% of CPU time in summary. To compare,
> please, see attached flame graph for the following transaction:
> 
> INSERT INTO large_text
> SELECT (SELECT string_agg('x', ',')
> FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
> 
> Execution Time: 44519.816 ms
> Time: 98333,642 ms (01:38,334)
> 
> where disk IO is only ~7-8% in total. So we get very roughly the same
> ~x4-5 performance drop here. JFYI, I am using a machine with SSD for tests.
> 
> Therefore, probably you may write changes on receiver in bigger chunks,
> not each change separately.
> 

Possibly - I/O certainly is a possible culprit, although we should be
using buffered I/O and there certainly are no fsyncs here. So I'm
not sure why it would be cheaper to do the writes in batches.

BTW does this mean you see the overhead on the apply side? Or are you
running this on a single machine, and it's difficult to decide?

>> So I'm not particularly worried, but I'll look into that. I'd be much
>> more worried if there was measurable overhead in cases when there's no
>> streaming happening (either because it's disabled or the memory limit
>> was not hit).
> 
> What I have also just found, is that if a table row is large enough to
> be TOASTed, e.g.:
> 
> INSERT INTO large_text
> SELECT (SELECT string_agg('x', ',')
> FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
> 
> then logical_work_mem limit is not hit and we neither stream, nor spill
> to disk this transaction, while it is still large. In contrast, the
> transaction above (with 1000000 smaller rows) being comparable in size
> is streamed. Not sure, that it is easy to add proper accounting of
> TOAST-able columns, but it worth it.
> 

That's certainly strange and possibly a bug in the memory accounting
code. I'm not sure why that would happen, though, because TOAST data
look just like regular INSERT changes. Interesting. I wonder if it's
already fixed in this updated version, but it's a bit too late to
investigate that today.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Alexey Kondratov
Дата:
Hi Tomas,

> I'm a bit confused by the changes to TAP tests. Per the patch summary,
> some .pl files get renamed (nor sure why), a new one is added, etc.

I added a new TAP test case, added the streaming=true option to the old 
stream_* ones, and incremented the streaming test numbers (+2) because of 
the collision between 009_matviews.pl / 009_stream_simple.pl and 
010_truncate.pl / 010_stream_subxact.pl. At least in the previous 
version of the patch they were under the same numbers. Nothing special, 
but for simplicity please find my new TAP test attached separately.

>   So
> I've instead enabled streaming subscriptions in all tests, which with
> this patch produces two failures:
>
> Test Summary Report
> -------------------
> t/004_sync.pl                    (Wstat: 7424 Tests: 1 Failed: 0)
>    Non-zero exit status: 29
>    Parse errors: Bad plan.  You planned 7 tests but ran 1.
> t/011_stream_ddl.pl              (Wstat: 256 Tests: 2 Failed: 1)
>    Failed test:  2
>    Non-zero exit status: 1
>
> So yeah, there's more stuff to fix. But I can't directly apply your
> fixes because the updated patches are somewhat different.

The fixes should apply cleanly to the previous version of your patch. 
Also, I am not sure that it is a good idea to simply enable streaming 
subscriptions in all tests (e.g. the pre-streaming t/004_sync.pl), 
since then they no longer exercise the non-streaming code path.

>>> Interesting. Any idea where does the extra overhead in this particular
>>> case come from? It's hard to deduce that from the single flame graph,
>>> when I don't have anything to compare it with (i.e. the flame graph for
>>> the "normal" case).
>> I guess that bottleneck is in disk operations. You can check
>> logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
>> writes (~26%) take around 35% of CPU time in summary. To compare,
>> please, see attached flame graph for the following transaction:
>>
>> INSERT INTO large_text
>> SELECT (SELECT string_agg('x', ',')
>> FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
>>
>> Execution Time: 44519.816 ms
>> Time: 98333,642 ms (01:38,334)
>>
>> where disk IO is only ~7-8% in total. So we get very roughly the same
>> ~x4-5 performance drop here. JFYI, I am using a machine with SSD for tests.
>>
>> Therefore, probably you may write changes on receiver in bigger chunks,
>> not each change separately.
>>
> Possibly, I/O is certainly a possible culprit, although we should be
> using buffered I/O and there certainly are not any fsyncs here. So I'm
> not sure why would it be cheaper to do the writes in batches.
>
> BTW does this mean you see the overhead on the apply side? Or are you
> running this on a single machine, and it's difficult to decide?

I run this on a single machine, but the walsender and the apply worker are 
each utilizing almost 100% of a CPU all the time, and on the apply side 
I/O syscalls take about 1/3 of the CPU time. I am still not sure, but to 
me this result links the performance drop to problems on the receiver 
side.

Writing in batches was just a hypothesis, and to validate it I performed a 
test with a large txn consisting of a smaller number of wide rows. That 
test does not exhibit any significant performance drop, even though it was 
streamed too. So the hypothesis seems to hold. Anyway, I do not have other 
reasonable ideas besides that right now.


Regards

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company


Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Tomas Vondra
Дата:
Hi,

Attached is an updated patch series, merging fixes and changes to TAP
tests proposed by Alexey. I've merged the fixes into the appropriate
patches, and I've kept the TAP changes / new tests as separate patches
towards the end of the series.

I'm a bit unhappy with two aspects of the current patch series:

1) We now track schema changes in two ways - using the pre-existing
schema_sent flag in RelationSyncEntry, and the (newly added) flag in
ReorderBuffer. While those options are used for regular vs. streamed
transactions, fundamentally it's the same thing and so having two
competing ways seems like a bad idea. Not sure what's the best way to
resolve this, though.

2) We've removed quite a few asserts, particularly those ensuring sanity of
cmin/cmax values. To some extent that's expected, because allowing
decoding of in-progress transactions relaxes some of those rules. But
I'd be much happier if some of those asserts could be reinstated, even
if only in a weaker form.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Michael Paquier
Дата:
On Mon, Jan 14, 2019 at 07:23:31PM +0100, Tomas Vondra wrote:
> Attached is an updated patch series, merging fixes and changes to TAP
> tests proposed by Alexey. I've merged the fixes into the appropriate
> patches, and I've kept the TAP changes / new tests as separate patches
> towards the end of the series.

Patch 4 of the latest set fails to apply, so I have moved the patch to
next CF, waiting on author.
--
Michael

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Alexey Kondratov
Дата:
Hi Tomas,

On 14.01.2019 21:23, Tomas Vondra wrote:
> Attached is an updated patch series, merging fixes and changes to TAP
> tests proposed by Alexey. I've merged the fixes into the appropriate
> patches, and I've kept the TAP changes / new tests as separate patches
> towards the end of the series.

I had problems applying this patch along with the 2PC streaming one to the 
current master, but everything applied cleanly on 97c39498e5. Regression 
tests pass. What I personally do not like about the current TAP test set 
is that you have added "WITH (streaming=on)" to all tests, including the 
old non-streaming ones. It is unclear which mechanism is tested there: 
streaming, but those transactions probably do not hit the memory limit, so 
it depends on default server parameters; or non-streaming, but then what 
is the need for (streaming=on)? I would prefer to add (streaming=on) 
only to the new tests, where it is clearly necessary.

> I'm a bit unhappy with two aspects of the current patch series:
>
> 1) We now track schema changes in two ways - using the pre-existing
> schema_sent flag in RelationSyncEntry, and the (newly added) flag in
> ReorderBuffer. While those options are used for regular vs. streamed
> transactions, fundamentally it's the same thing and so having two
> competing ways seems like a bad idea. Not sure what's the best way to
> resolve this, though.

Yes, sure - when I found problems with streaming of extensive DDL, I 
added the new flag in the simplest way, and it worked. Now, the old 
schema_sent flag is per relation, while the new one - is_schema_sent - is 
per top-level transaction. If I get it correctly, the former seems to 
be more economical, since a new schema is sent only if we are streaming a 
change for a relation whose schema is outdated. In contrast, in the 
latter case we will send a new schema even if there are no new changes 
belonging to that relation.

I guess it would be better to stick to the old behavior. I will try to 
investigate how to use it properly in streaming mode as well.

> 2) We've removed quite a few asserts, particularly ensuring sanity of
> cmin/cmax values. To some extent that's expected, because by allowing
> decoding of in-progress transactions relaxes some of those rules. But
> I'd be much happier if some of those asserts could be reinstated, even
> if only in a weaker form.


Asserts have been removed from two places: (1) 
HeapTupleSatisfiesHistoricMVCC, which seems inevitable, since we are 
touching the essence of the MVCC visibility rules when trying to decode 
an in-progress transaction, and (2) ReorderBufferBuildTupleCidHash, 
which is probably not directly related to the topic of this patch, 
since Arseny Sher recently faced the same issue with simple repetitive 
DDL decoding [1].

Not many, but I agree that replacing them with some weaker asserts 
would be better than just removing them, especially for point (1).


[1] https://www.postgresql.org/message-id/flat/874l9p8hyw.fsf%40ars-thinkpad


Regards

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Alexey Kondratov
Дата:
Hi Tomas,

>>>> Interesting. Any idea where does the extra overhead in this particular
>>>> case come from? It's hard to deduce that from the single flame graph,
>>>> when I don't have anything to compare it with (i.e. the flame graph 
>>>> for
>>>> the "normal" case).
>>> I guess that bottleneck is in disk operations. You can check
>>> logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
>>> writes (~26%) take around 35% of CPU time in summary. To compare,
>>> please, see attached flame graph for the following transaction:
>>>
>>> INSERT INTO large_text
>>> SELECT (SELECT string_agg('x', ',')
>>> FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
>>>
>>> Execution Time: 44519.816 ms
>>> Time: 98333,642 ms (01:38,334)
>>>
>>> where disk IO is only ~7-8% in total. So we get very roughly the same
>>> ~x4-5 performance drop here. JFYI, I am using a machine with SSD for 
>>> tests.
>>>
>>> Therefore, probably you may write changes on receiver in bigger chunks,
>>> not each change separately.
>>>
>> Possibly, I/O is certainly a possible culprit, although we should be
>> using buffered I/O and there certainly are not any fsyncs here. So I'm
>> not sure why would it be cheaper to do the writes in batches.
>>
>> BTW does this mean you see the overhead on the apply side? Or are you
>> running this on a single machine, and it's difficult to decide?
>
> I run this on a single machine, but walsender and worker are utilizing 
> almost 100% of CPU per each process all the time, and at apply side 
> I/O syscalls take about 1/3 of CPU time. Though I am still not sure, 
> but for me this result somehow links performance drop with problems at 
> receiver side.
>
> Writing in batches was just a hypothesis and to validate it I have 
> performed test with large txn, but consisting of a smaller number of 
> wide rows. This test does not exhibit any significant performance 
> drop, while it was streamed too. So it seems to be valid. Anyway, I do 
> not have other reasonable ideas beside that right now.

I've recently checked this patch again and tried to improve it in 
terms of performance. As a result I've implemented a new POC version of 
the applier (attached). Almost everything in the streaming logic stayed 
intact, but the apply worker is significantly different.

As I wrote earlier, I still claim that spilling changes to disk on the 
applier side adds extra overhead, but it is possible to get rid of it. 
In my additional patch I do the following:

1) Maintain a pool of additional background workers (bgworkers) that 
are connected to the main logical apply worker via shm_mq's. Each worker 
is dedicated to processing a specific streamed transaction.

2) When we receive a streamed change for some transaction, we check 
whether there is an existing dedicated bgworker in the HTAB (xid -> 
bgworker), or whether there is one in the idle list; otherwise we spawn 
a new one.

3) We pass all changes (between STREAM START/STOP) to that bgworker via 
shm_mq_send without intermediate waiting. However, we wait for the 
bgworker to apply the entire chunk of changes at STREAM STOP, since we 
don't want transaction reordering.

4) When a transaction is committed/aborted, the worker is added to the 
idle list and waits for a reassignment message.

5) I have reused the same apply_dispatch machinery in the bgworkers, 
since most of the actions are practically identical.

Thus, we do not spill anything on the applier side, so transaction 
changes are processed by bgworkers just as normal backends do. At the 
same time, change processing is strictly serial, which prevents 
transaction reordering and possible conflicts/anomalies. Even though we 
trade off performance in favor of stability, the result is rather 
impressive. I have used a similar query for testing as before:

EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3)
     SELECT round(random()*10), random(), random()*142
     FROM generate_series(1, 1000000) s(i);

with 1kk (1000000), 3kk and 5kk rows; logical_work_mem = 64MB and 
synchronous_standby_names = 'FIRST 1 (large_sub)'. The table schema is 
as follows:

CREATE TABLE large_test (
     id serial primary key,
     num1 bigint,
     num2 double precision,
     num3 double precision
);
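
To make the measurement setup explicit, the relevant publisher-side
settings are roughly the following (the publication name is illustrative;
the subscription is the large_sub one referenced by
synchronous_standby_names, created with streaming = on). Because the
commit waits for the synchronous subscriber, the "Total xact time" column
includes the apply on the replica:

-- publisher
ALTER SYSTEM SET logical_work_mem = '64MB';
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (large_sub)';
SELECT pg_reload_conf();
CREATE PUBLICATION large_pub FOR TABLE large_test;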

Here are the results:

-------------------------------------------------------------
| Rows | Time on master, sec | Total xact time, sec | Ratio |
-------------------------------------------------------------
|                  On commit (master, v13)                  |
-------------------------------------------------------------
| 1kk  | 6.5                 | 17.6                 | x2.74 |
-------------------------------------------------------------
| 3kk  | 21                  | 55.4                 | x2.64 |
-------------------------------------------------------------
| 5kk  | 38.3                | 91.5                 | x2.39 |
-------------------------------------------------------------
|                       Stream + spill                      |
-------------------------------------------------------------
| 1kk  | 5.9                 | 18                   | x3    |
-------------------------------------------------------------
| 3kk  | 19.5                | 52.4                 | x2.7  |
-------------------------------------------------------------
| 5kk  | 33.3                | 86.7                 | x2.86 |
-------------------------------------------------------------
|                     Stream + BGW pool                     |
-------------------------------------------------------------
| 1kk  | 6                   | 12                   | x2    |
-------------------------------------------------------------
| 3kk  | 18.5                | 30.5                 | x1.65 |
-------------------------------------------------------------
| 5kk  | 35.6                | 53.9                 | x1.51 |
-------------------------------------------------------------

It seems that the overhead added by the synchronous replica is 2-3 times 
lower compared with Postgres master and with streaming + spilling. 
Therefore, the original patch eliminates the delay before the sender 
starts processing a large transaction, while this additional patch 
speeds up the applier side.

Although the overall speed-up is surely measurable, there is still room 
for improvement:

1) Currently bgworkers are only spawned on demand, without any initial 
pool, and are never stopped. Maybe we should create a small pool at 
replication start and release some of the idle bgworkers if they exceed 
some limit?

2) Probably we can somehow track whether an incoming change conflicts 
with some of the xacts being processed, so we only wait for specific 
bgworkers in that case?

3) Since the communication between the main logical apply worker and each 
bgworker from the pool is a 'single producer --- single consumer' 
problem, it is probably possible to wait and set/check flags 
without locks, using just atomics.

What do you think about this concept in general? Any concerns and 
criticism are welcome!


Regards

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

P.S. This patch should apply on top of your last patch set. I would rebase it against master, but it depends on the 2PC
patch, which I don't know well enough.


Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Tomas Vondra
Дата:
On Wed, Aug 28, 2019 at 08:17:47PM +0300, Alexey Kondratov wrote:
>Hi Tomas,
>
>>>>>Interesting. Any idea where does the extra overhead in this particular
>>>>>case come from? It's hard to deduce that from the single flame graph,
>>>>>when I don't have anything to compare it with (i.e. the flame 
>>>>>graph for
>>>>>the "normal" case).
>>>>I guess that bottleneck is in disk operations. You can check
>>>>logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
>>>>writes (~26%) take around 35% of CPU time in summary. To compare,
>>>>please, see attached flame graph for the following transaction:
>>>>
>>>>INSERT INTO large_text
>>>>SELECT (SELECT string_agg('x', ',')
>>>>FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
>>>>
>>>>Execution Time: 44519.816 ms
>>>>Time: 98333,642 ms (01:38,334)
>>>>
>>>>where disk IO is only ~7-8% in total. So we get very roughly the same
>>>>~x4-5 performance drop here. JFYI, I am using a machine with SSD 
>>>>for tests.
>>>>
>>>>Therefore, probably you may write changes on receiver in bigger chunks,
>>>>not each change separately.
>>>>
>>>Possibly, I/O is certainly a possible culprit, although we should be
>>>using buffered I/O and there certainly are not any fsyncs here. So I'm
>>>not sure why would it be cheaper to do the writes in batches.
>>>
>>>BTW does this mean you see the overhead on the apply side? Or are you
>>>running this on a single machine, and it's difficult to decide?
>>
>>I run this on a single machine, but walsender and worker are 
>>utilizing almost 100% of CPU per each process all the time, and at 
>>apply side I/O syscalls take about 1/3 of CPU time. Though I am 
>>still not sure, but for me this result somehow links performance 
>>drop with problems at receiver side.
>>
>>Writing in batches was just a hypothesis and to validate it I have 
>>performed test with large txn, but consisting of a smaller number of 
>>wide rows. This test does not exhibit any significant performance 
>>drop, while it was streamed too. So it seems to be valid. Anyway, I 
>>do not have other reasonable ideas beside that right now.
>
>I've checked recently this patch again and tried to elaborate it in 
>terms of performance. As a result I've implemented a new POC version 
>of the applier (attached). Almost everything in streaming logic stayed 
>intact, but apply worker is significantly different.
>
>As I wrote earlier I still claim, that spilling changes on disk at the 
>applier side adds additional overhead, but it is possible to get rid 
>of it. In my additional patch I do the following:
>
>1) Maintain a pool of additional background workers (bgworkers), that 
>are connected with main logical apply worker via shm_mq's. Each worker 
>is dedicated to the processing of specific streamed transaction.
>
>2) When we receive a streamed change for some transaction, we check 
>whether there is an existing dedicated bgworker in HTAB (xid -> 
>bgworker), or there are some in the idle list, or spawn a new one.
>
>3) We pass all changes (between STREAM START/STOP) to that bgworker 
>via shm_mq_send without intermediate waiting. However, we wait for 
>bgworker to apply the entire changes chunk at STREAM STOP, since we 
>don't want transactions reordering.
>
>4) When transaction is commited/aborted worker is being added to the 
>idle list and is waiting for reassigning message.
>
>5) I have used the same machinery with apply_dispatch in bgworkers, 
>since most of actions are practically very similar.
>
>Thus, we do not spill anything at the applier side, so transaction 
>changes are processed by bgworkers as normal backends do. In the same 
>time, changes processing is strictly serial, which prevents 
>transactions reordering and possible conflicts/anomalies. Even though 
>we trade off performance in favor of stability the result is rather 
>impressive. I have used a similar query for testing as before:
>
>EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3)
>    SELECT round(random()*10), random(), random()*142
>    FROM generate_series(1, 1000000) s(i);
>
>with 1kk (1000000), 3kk and 5kk rows; logical_work_mem = 64MB and 
>synchronous_standby_names = 'FIRST 1 (large_sub)'. Table schema is 
>following:
>
>CREATE TABLE large_test (
>    id serial primary key,
>    num1 bigint,
>    num2 double precision,
>    num3 double precision
>);
>
>Here are the results:
>
>-------------------------------------------------------------------
>| N | Time on master, sec | Total xact time, sec |     Ratio      |
>-------------------------------------------------------------------
>|                        On commit (master, v13)                  |
>-------------------------------------------------------------------
>| 1kk | 6.5               | 17.6                 | x2.74          |
>-------------------------------------------------------------------
>| 3kk | 21                | 55.4                 | x2.64          |
>-------------------------------------------------------------------
>| 5kk | 38.3              | 91.5                 | x2.39          |
>-------------------------------------------------------------------
>|                        Stream + spill                           |
>-------------------------------------------------------------------
>| 1kk | 5.9               | 18                   | x3             |
>-------------------------------------------------------------------
>| 3kk | 19.5              | 52.4                 | x2.7           |
>-------------------------------------------------------------------
>| 5kk | 33.3              | 86.7                 | x2.86          |
>-------------------------------------------------------------------
>|                        Stream + BGW pool                        |
>-------------------------------------------------------------------
>| 1kk | 6                 | 12                   | x2             |
>-------------------------------------------------------------------
>| 3kk | 18.5              | 30.5                 | x1.65          |
>-------------------------------------------------------------------
>| 5kk | 35.6              | 53.9                 | x1.51          |
>-------------------------------------------------------------------
>
>It seems that overhead added by synchronous replica is lower by 2-3 
>times compared with Postgres master and streaming with spilling. 
>Therefore, the original patch eliminated delay before large 
>transaction processing start by sender, while this additional patch 
>speeds up the applier side.
>
>Although the overall speed up is surely measurable, there is a room 
>for improvements yet:
>
>1) Currently bgworkers are only spawned on demand without some initial 
>pool and never stopped. Maybe we should create a small pool on 
>replication start and offload some of idle bgworkers if they exceed 
>some limit?
>
>2) Probably we can track somehow that incoming change has conflicts 
>with some of being processed xacts, so we can wait for specific 
>bgworkers only in that case?
>
>3) Since the communication between main logical apply worker and each 
>bgworker from the pool is a 'single producer --- single consumer' 
>problem, then probably it is possible to wait and set/check flags 
>without locks, but using just atomics.
>
>What do you think about this concept in general? Any concerns and 
>criticism are welcome!
>

Hi Alexey,

I'm unable to do any in-depth review of the patch over the next two weeks
or so, but I think the idea of having a pool of apply workers is sound and
can be quite beneficial for some workloads.

I don't think it matters very much whether the workers are started at the
beginning or allocated ad hoc, that's IMO a minor implementation detail.

There's one huge challenge that I however don't see mentioned in your
message or in the patch (after a cursory reading) - ensuring the same
commit order, and the risk of introducing deadlocks that would not exist
in single-process apply.

Surely, we want to end up with the same commit order as on the upstream,
otherwise we might easily get different data on the subscriber. So when we
pass the large transaction to a separate process, that process has
to wait for the other processes applying transactions that committed
first. And similarly, other processes have to wait for this process,
depending on the commit order. I might have missed something, but I don't
see anything like that in your patch.

Essentially, this means there needs to be some sort of wait between those
apply processes, enforcing the commit order.

That however means we can easily introduce deadlocks into workloads where
serial apply would not have that issue - imagine multiple large
transactions touching the same set of rows. We may ship them to different
bgworkers, and those processes may deadlock.

Of course, the deadlock detector will come around (assuming the wait is
done in a way visible to the detector) and will abort one of the
processes. But we don't know it'll abort the right one - it may easily
abort the apply process that needs to commit first, while everyone else is
waiting for it. Which stalls the apply forever.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Alexey Kondratov
Дата:
On 28.08.2019 22:06, Tomas Vondra wrote:
>
>>
>>>>>> Interesting. Any idea where does the extra overhead in this 
>>>>>> particular
>>>>>> case come from? It's hard to deduce that from the single flame 
>>>>>> graph,
>>>>>> when I don't have anything to compare it with (i.e. the flame 
>>>>>> graph for
>>>>>> the "normal" case).
>>>>> I guess that bottleneck is in disk operations. You can check
>>>>> logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
>>>>> writes (~26%) take around 35% of CPU time in summary. To compare,
>>>>> please, see attached flame graph for the following transaction:
>>>>>
>>>>> INSERT INTO large_text
>>>>> SELECT (SELECT string_agg('x', ',')
>>>>> FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
>>>>>
>>>>> Execution Time: 44519.816 ms
>>>>> Time: 98333,642 ms (01:38,334)
>>>>>
>>>>> where disk IO is only ~7-8% in total. So we get very roughly the same
>>>>> ~x4-5 performance drop here. JFYI, I am using a machine with SSD 
>>>>> for tests.
>>>>>
>>>>> Therefore, probably you may write changes on receiver in bigger 
>>>>> chunks,
>>>>> not each change separately.
>>>>>
>>>> Possibly, I/O is certainly a possible culprit, although we should be
>>>> using buffered I/O and there certainly are not any fsyncs here. So I'm
>>>> not sure why would it be cheaper to do the writes in batches.
>>>>
>>>> BTW does this mean you see the overhead on the apply side? Or are you
>>>> running this on a single machine, and it's difficult to decide?
>>>
>>> I run this on a single machine, but walsender and worker are 
>>> utilizing almost 100% of CPU per each process all the time, and at 
>>> apply side I/O syscalls take about 1/3 of CPU time. Though I am 
>>> still not sure, but for me this result somehow links performance 
>>> drop with problems at receiver side.
>>>
>>> Writing in batches was just a hypothesis and to validate it I have 
>>> performed test with large txn, but consisting of a smaller number of 
>>> wide rows. This test does not exhibit any significant performance 
>>> drop, while it was streamed too. So it seems to be valid. Anyway, I 
>>> do not have other reasonable ideas beside that right now.
>>
>> It seems that overhead added by synchronous replica is lower by 2-3 
>> times compared with Postgres master and streaming with spilling. 
>> Therefore, the original patch eliminated delay before large 
>> transaction processing start by sender, while this additional patch 
>> speeds up the applier side.
>>
>> Although the overall speed up is surely measurable, there is a room 
>> for improvements yet:
>>
>> 1) Currently bgworkers are only spawned on demand without some 
>> initial pool and never stopped. Maybe we should create a small pool 
>> on replication start and offload some of idle bgworkers if they 
>> exceed some limit?
>>
>> 2) Probably we can track somehow that incoming change has conflicts 
>> with some of being processed xacts, so we can wait for specific 
>> bgworkers only in that case?
>>
>> 3) Since the communication between main logical apply worker and each 
>> bgworker from the pool is a 'single producer --- single consumer' 
>> problem, then probably it is possible to wait and set/check flags 
>> without locks, but using just atomics.
>>
>> What do you think about this concept in general? Any concerns and 
>> criticism are welcome!
>>
>

Hi Tomas,

Thank you for a quick response.

> I don't think it matters very much whether the workers are started at the
> beginning or allocated ad hoc, that's IMO a minor implementation detail.

OK, I had the same vision about this point. Any minor differences here 
will be negligible for a sufficiently large transaction.

>
> There's one huge challenge that I however don't see mentioned in your
> message or in the patch (after cursory reading) - ensuring the same 
> commit
> order, and introducing deadlocks that would not exist in single-process
> apply.

Probably I haven't explained this part well, sorry for that. In my patch 
I don't use the worker pool for concurrent transaction apply, but rather 
for fast context switching between long-lived streamed transactions. In 
other words, we apply all changes that arrive from the sender in a 
completely serial manner. Written out step by step it looks like this:

1) Read the STREAM START message and figure out the target worker by xid.

2) Pass all changes that belong to this xact to the selected worker 
one by one via shm_mq_send.

3) Read the STREAM STOP message and wait until the worker has applied all 
changes in the queue.

4) Process all other chunks of streamed xacts in the same manner.

5) Process all non-streamed xacts immediately in the main apply worker loop.

6) If we read STREAM COMMIT/ABORT, we again wait until the selected worker 
either commits or aborts.

Thus, it automatically guarantees the same commit order on the replica as 
on the master. Yes, we lose some performance here, since we don't apply 
transactions concurrently, but doing so would bring all those problems 
you have described.

However, you helped me figure out another point I had forgotten. 
Although we ensure commit order automatically, the start of streamed 
xacts may be reordered. It happens if some small xacts have been committed 
on the master since the streamed one started, because we do not start 
streaming immediately, but only after the logical_work_mem limit is hit. 
I have performed some tests with conflicting xacts and it seems that 
it's not a problem, since the locking mechanism in Postgres guarantees 
that if there were any deadlocks, they would have happened earlier on the 
master. So if some records hit the WAL, it is safe to apply them 
sequentially. Am I wrong?

Anyway, I'm going to double check the safety of this part later.


Regards

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Tomas Vondra
Дата:
On Thu, Aug 29, 2019 at 05:37:45PM +0300, Alexey Kondratov wrote:
>On 28.08.2019 22:06, Tomas Vondra wrote:
>>
>>>
>>>>>>>Interesting. Any idea where does the extra overhead in 
>>>>>>>this particular
>>>>>>>case come from? It's hard to deduce that from the single 
>>>>>>>flame graph,
>>>>>>>when I don't have anything to compare it with (i.e. the 
>>>>>>>flame graph for
>>>>>>>the "normal" case).
>>>>>>I guess that bottleneck is in disk operations. You can check
>>>>>>logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
>>>>>>writes (~26%) take around 35% of CPU time in summary. To compare,
>>>>>>please, see attached flame graph for the following transaction:
>>>>>>
>>>>>>INSERT INTO large_text
>>>>>>SELECT (SELECT string_agg('x', ',')
>>>>>>FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
>>>>>>
>>>>>>Execution Time: 44519.816 ms
>>>>>>Time: 98333,642 ms (01:38,334)
>>>>>>
>>>>>>where disk IO is only ~7-8% in total. So we get very roughly the same
>>>>>>~x4-5 performance drop here. JFYI, I am using a machine with 
>>>>>>SSD for tests.
>>>>>>
>>>>>>Therefore, probably you may write changes on receiver in 
>>>>>>bigger chunks,
>>>>>>not each change separately.
>>>>>>
>>>>>Possibly, I/O is certainly a possible culprit, although we should be
>>>>>using buffered I/O and there certainly are not any fsyncs here. So I'm
>>>>>not sure why would it be cheaper to do the writes in batches.
>>>>>
>>>>>BTW does this mean you see the overhead on the apply side? Or are you
>>>>>running this on a single machine, and it's difficult to decide?
>>>>
>>>>I run this on a single machine, but walsender and worker are 
>>>>utilizing almost 100% of CPU per each process all the time, and 
>>>>at apply side I/O syscalls take about 1/3 of CPU time. Though I 
>>>>am still not sure, but for me this result somehow links 
>>>>performance drop with problems at receiver side.
>>>>
>>>>Writing in batches was just a hypothesis and to validate it I 
>>>>have performed test with large txn, but consisting of a smaller 
>>>>number of wide rows. This test does not exhibit any significant 
>>>>performance drop, while it was streamed too. So it seems to be 
>>>>valid. Anyway, I do not have other reasonable ideas beside that 
>>>>right now.
>>>
>>>It seems that overhead added by synchronous replica is lower by 
>>>2-3 times compared with Postgres master and streaming with 
>>>spilling. Therefore, the original patch eliminated delay before 
>>>large transaction processing start by sender, while this 
>>>additional patch speeds up the applier side.
>>>
>>>Although the overall speed up is surely measurable, there is a 
>>>room for improvements yet:
>>>
>>>1) Currently bgworkers are only spawned on demand without some 
>>>initial pool and never stopped. Maybe we should create a small 
>>>pool on replication start and offload some of idle bgworkers if 
>>>they exceed some limit?
>>>
>>>2) Probably we can track somehow that incoming change has 
>>>conflicts with some of being processed xacts, so we can wait for 
>>>specific bgworkers only in that case?
>>>
>>>3) Since the communication between main logical apply worker and 
>>>each bgworker from the pool is a 'single producer --- single 
>>>consumer' problem, then probably it is possible to wait and 
>>>set/check flags without locks, but using just atomics.
>>>
>>>What do you think about this concept in general? Any concerns and 
>>>criticism are welcome!
>>>
>>
>
>Hi Tomas,
>
>Thank you for a quick response.
>
>>I don't think it matters very much whether the workers are started at the
>>beginning or allocated ad hoc, that's IMO a minor implementation detail.
>
>OK, I had the same vision about this point. Any minor differences here 
>will be neglectable for a sufficiently large transaction.
>
>>
>>There's one huge challenge that I however don't see mentioned in your
>>message or in the patch (after cursory reading) - ensuring the same 
>>commit
>>order, and introducing deadlocks that would not exist in single-process
>>apply.
>
>Probably I haven't explained well this part, sorry for that. In my 
>patch I don't use workers pool for a concurrent transaction apply, but 
>rather for a fast context switch between long-lived streamed 
>transactions. In other words we apply all changes arrived from the 
>sender in a completely serial manner. Being written step-by-step it 
>looks like:
>
>1) Read STREAM START message and figure out the target worker by xid.
>
>2) Put all changes, which belongs to this xact to the selected worker 
>one by one via shm_mq_send.
>
>3) Read STREAM STOP message and wait until our worker will apply all 
>changes in the queue.
>
>4) Process all other chunks of streamed xacts in the same manner.
>
>5) Process all non-streamed xacts immediately in the main apply worker loop.
>
>6) If we read STREAMED COMMIT/ABORT we again wait until selected 
>worker either commits or aborts.
>
>Thus, it automatically guaranties the same commit order on replica as 
>on master. Yes, we loose some performance here, since we don't apply 
>transactions concurrently, but it would bring all those problems you 
>have described.
>

OK, so it's apply in multiple processes, but at any moment only a single
apply process is active. 

>However, you helped me to figure out another point I have forgotten. 
>Although we ensure commit order automatically, the beginning of 
>streamed xacts may reorder. It happens if some small xacts have been 
>commited on master since the streamed one started, because we do not 
>start streaming immediately, but only after logical_work_mem hit. I 
>have performed some tests with conflicting xacts and it seems that 
>it's not a problem, since locking mechanism in Postgres guarantees 
>that if there would some deadlocks, they will happen earlier on 
>master. So if some records hit the WAL, it is safe to apply the 
>sequentially. Am I wrong?
>

I think you're right that the way you interleave the changes ensures you
can't introduce new deadlocks between transactions in this stream. I don't
think reordering the blocks of streamed transactions matters, as long
as the commit order is ensured in this case.

>Anyway, I'm going to double check the safety of this part later.
>

OK.

FWIW my understanding is that the speedup comes mostly from elimination of
the serialization to a file. That however requires savepoints to handle
aborts of subtransactions - I'm pretty sure it'd be trivial to create a
workload where this will be much slower (with many aborts of large
subtransactions).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Konstantin Knizhnik
Дата:
>
> FWIW my understanding is that the speedup comes mostly from 
> elimination of
> the serialization to a file. That however requires savepoints to handle
> aborts of subtransactions - I'm pretty sure I'd be trivial to create a
> workload where this will be much slower (with many aborts of large
> subtransactions).
>
>

I think that instead of defining savepoints it is simpler and more 
efficient to use

BeginInternalSubTransaction + 
ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction

as it is done in PL/pgSQL (pl_exec.c).
Not sure if it can pr

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Alvaro Herrera
Дата:
In the interest of moving things forward, how far are we from making
0001 committable?  If I understand correctly, the rest of this patchset
depends on https://commitfest.postgresql.org/24/944/ which seems to be
moving at a glacial pace (or, actually, slower, because glaciers do
move, which cannot be said of that other patch.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Tomas Vondra
Дата:
On Mon, Sep 02, 2019 at 06:06:50PM -0400, Alvaro Herrera wrote:
>In the interest of moving things forward, how far are we from making
>0001 committable?  If I understand correctly, the rest of this patchset
>depends on https://commitfest.postgresql.org/24/944/ which seems to be
>moving at a glacial pace (or, actually, slower, because glaciers do
>move, which cannot be said of that other patch.)
>

I think 0001 is mostly there. I think there's one bug in this patch
version, but I need to check and I'll post an updated version shortly if
needed.

FWIW maybe we should stop comparing things to glaciers. 50 years from now
people won't know what a glacier is, and it'll be just like the floppy
icon on the save button.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Alexey Kondratov
Дата:
>>
>> FWIW my understanding is that the speedup comes mostly from 
>> elimination of
>> the serialization to a file. That however requires savepoints to handle
>> aborts of subtransactions - I'm pretty sure I'd be trivial to create a
>> workload where this will be much slower (with many aborts of large
>> subtransactions).
>>

Yes, and it was my main motivation to eliminate that extra serialization 
to a file. I've experimented a bit with large transactions + savepoints + 
aborts and ended up with the following query (the same schema as before, 
with 600k rows):

BEGIN;
SAVEPOINT s1;
UPDATE large_test SET num1 = num1 + 1, num2 = num2 + 1, num3 = num3 + 1;
SAVEPOINT s2;
UPDATE large_test SET num1 = num1 + 1, num2 = num2 + 1, num3 = num3 + 1;
SAVEPOINT s3;
UPDATE large_test SET num1 = num1 + 1, num2 = num2 + 1, num3 = num3 + 1;
ROLLBACK TO SAVEPOINT s3;
ROLLBACK TO SAVEPOINT s2;
ROLLBACK TO SAVEPOINT s1;
END;
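
Here large_test is the same table as in the earlier test; for this run it 
was populated beforehand with roughly the following (600k rows instead of 
1kk):

INSERT INTO large_test (num1, num2, num3)
    SELECT round(random()*10), random(), random()*142
    FROM generate_series(1, 600000) s(i);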

It looks like the worst-case scenario, as we do a lot of work and then 
abort all the subxacts one by one. As expected, it takes much longer (up 
to 30x) to process using the background worker instead of spilling to a 
file. Surely, it is much easier to truncate a file than to apply all 
changes and then abort. However, I guess that this kind of load pattern 
is not the most typical for real-life applications.

Also, this test helped me find a bug in my current savepoint routine, 
so a new patch is attached.

On 30.08.2019 18:59, Konstantin Knizhnik wrote:
>
> I think that instead of defining savepoints it is simpler and more 
> efficient to use
>
> BeginInternalSubTransaction + 
> ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction
>
> as it is done in PL/pgSQL (pl_exec.c).
> Not sure if it can pr
>

Both BeginInternalSubTransaction and DefineSavepoint use 
PushTransaction() internally for a normal subtransaction start. So they 
seem to be identical from a performance perspective, which is also 
stated in the comment:

/*
 * BeginInternalSubTransaction
 *        This is the same as DefineSavepoint except it allows TBLOCK_STARTED,
 *        TBLOCK_IMPLICIT_INPROGRESS, TBLOCK_END, and TBLOCK_PREPARE states,
 *        and therefore it can safely be used in functions that might be called
 *        when not inside a BEGIN block or when running deferred triggers at
 *        COMMIT/PREPARE time.  Also, it automatically does
 *        CommitTransactionCommand/StartTransactionCommand instead of expecting
 *        the caller to do it.
 */

Please, correct me if I'm wrong.

Anyway, I've profiled my apply worker (the flamegraph is attached) and it 
spends the vast majority of its time (>90%) applying changes. So the 
problem is not in the savepoints themselves, but in the fact that we 
first apply all the changes and then abort all the work. I am not sure 
that anything can be done about this case.


Regards

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company


Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Konstantin Knizhnik
Дата:

On 16.09.2019 19:54, Alexey Kondratov wrote:
> On 30.08.2019 18:59, Konstantin Knizhnik wrote:
>>
>> I think that instead of defining savepoints it is simpler and more 
>> efficient to use
>>
>> BeginInternalSubTransaction + 
>> ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction
>>
>> as it is done in PL/pgSQL (pl_exec.c).
>> Not sure if it can pr
>>
>
> Both BeginInternalSubTransaction and DefineSavepoint use 
> PushTransaction() internally for a normal subtransaction start. So 
> they seems to be identical from the performance perspective, which is 
> also stated in the comment section:

Yes, they definitely use the same mechanism and most likely provide 
similar performance.
But BeginInternalSubTransaction does not require generating a savepoint 
name, which seems redundant in this case.


>
> Anyway, I've performed a profiling of my apply worker (flamegraph is 
> attached) and it spends the vast amount of time (>90%) applying 
> changes. So the problem is not in the savepoints their-self, but in 
> the fact that we first apply all changes and then abort all the work. 
> Not sure, that it is possible to do something in this case.
>

Looks like the only way to increase apply speed is to do it in parallel: 
make it possible to concurrently execute non-conflicting transactions.





Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Tomas Vondra
Дата:
On Mon, Sep 16, 2019 at 10:29:18PM +0300, Konstantin Knizhnik wrote:
>
>
>On 16.09.2019 19:54, Alexey Kondratov wrote:
>>On 30.08.2019 18:59, Konstantin Knizhnik wrote:
>>>
>>>I think that instead of defining savepoints it is simpler and more 
>>>efficient to use
>>>
>>>BeginInternalSubTransaction + 
>>>ReleaseCurrentSubTransaction/RollbackAndReleaseCurrentSubTransaction
>>>
>>>as it is done in PL/pgSQL (pl_exec.c).
>>>Not sure if it can pr
>>>
>>
>>Both BeginInternalSubTransaction and DefineSavepoint use 
>>PushTransaction() internally for a normal subtransaction start. So 
>>they seems to be identical from the performance perspective, which 
>>is also stated in the comment section:
>
>Yes, definitely them are using the same mechanism and most likely 
>provides similar performance.
>But BeginInternalSubTransaction does not require to generate some 
>savepoint name which seems to be redundant in this case.
>
>
>>
>>Anyway, I've performed a profiling of my apply worker (flamegraph is 
>>attached) and it spends the vast amount of time (>90%) applying 
>>changes. So the problem is not in the savepoints their-self, but in 
>>the fact that we first apply all changes and then abort all the 
>>work. Not sure, that it is possible to do something in this case.
>>
>
>Looks like the only way to increase apply speed is to do it in 
>parallel: make it possible to concurrently execute non-conflicting 
>transactions.
>

True, although it seems like a massive can of worms to me. I'm not aware
of a way to identify non-conflicting transactions in advance, so it would
have to be implemented as optimistic apply, with detection of and
recovery from conflicts.

I'm not against doing that, and I'm willing to spend some time on reviews
etc., but it seems like a completely separate effort.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, Sep 3, 2019 at 4:30 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> In the interest of moving things forward, how far are we from making
> 0001 committable?  If I understand correctly, the rest of this patchset
> depends on https://commitfest.postgresql.org/24/944/ which seems to be
> moving at a glacial pace (or, actually, slower, because glaciers do
> move, which cannot be said of that other patch.)
>

I am not sure if it is completely correct that the other part of the
patch is dependent on that CF entry.  I have studied both the threads
(not every detail) and it seems to me it is dependent on one of the
patches from that series which handles concurrent aborts.  It is patch
0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.Jan4.patch
from what Nikhil has posted on that thread [1].  Am I wrong?

So IIUC, the problem of concurrent aborts is that if we allow catalog
scans for in-progress transactions, then we might get wrong answers in
cases where somebody has performed Alter-Abort-Alter which is clearly
explained with an example in email [2].  To solve that problem Nikhil
seems to have written a patch [1] which detects these concurrent
aborts during a system table scan and then aborts the decoding of such
a transaction.

Now, the problem is that that patch was written with 2PC
transactions in mind and might not deal with all cases for in-progress
transactions, especially when sub-transactions are involved, as alluded
to by Arseny Sher [3].  So, the problem seems to be for cases when some
sub-transaction aborts, but the main transaction continues and
we try to decode it.  Nikhil's patch won't be able to deal with that
because I think it just checks the top-level xid, whereas for this we need
to check all subxids, which I think is possible now as Tomas seems to
have added WAL logging for each xid assignment.  It might or might not be
the best solution to check the status of all subxids, but I think
first we need to agree that the problem is just about concurrent aborts
and that we can solve it by using some part of the technology being
developed as part of the patch "Logical decoding of two-phase
transactions" (https://commitfest.postgresql.org/24/944/) rather than
the entire patchset.

I hope I am not saying something very obvious here and it helps in
moving this patch forward.

Thoughts?

[1] - https://www.postgresql.org/message-id/CAMGcDxcBmN6jNeQkgWddfhX8HbSjQpW%3DUo70iBY3P_EPdp%2BLTQ%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/EEBD82AA-61EE-46F4-845E-05B94168E8F2%40postgrespro.ru
[3] - https://www.postgresql.org/message-id/87a7py4iwl.fsf%40ars-thinkpad

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, Sep 3, 2019 at 4:16 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Mon, Sep 02, 2019 at 06:06:50PM -0400, Alvaro Herrera wrote:
> >In the interest of moving things forward, how far are we from making
> >0001 committable?  If I understand correctly, the rest of this patchset
> >depends on https://commitfest.postgresql.org/24/944/ which seems to be
> >moving at a glacial pace (or, actually, slower, because glaciers do
> >move, which cannot be said of that other patch.)
> >
>
> I think 0001 is mostly there. I think there's one bug in this patch
> version, but I need to check and I'll post an updated version shortly if
> needed.
>

Did you get a chance to work on 0001?  I have a few comments on that patch:
1.
+ *   To limit the amount of memory used by decoded changes, we track memory
+ *   used at the reorder buffer level (i.e. total amount of memory), and for
+ *   each toplevel transaction. When the total amount of used memory exceeds
+ *   the limit, the toplevel transaction consuming the most memory is either
+ *   serialized or streamed.

Do we need to mention 'streamed' as part of this patch?  It seems to
me that this is an independent patch which can be committed without
patches that stream the changes. So, we can remove it from here and
other places where it is used.

2.
+ *   deserializing and applying very few changes). We probably to give more
+ *   memory to the oldest subtransactions.

/We probably to/
It seems some word is missing after probably.

3.
+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
+ *
+ * XXX With many subtransactions this might be quite slow, because we'll have
+ * to walk through all of them. There are some options how we could improve
+ * that: (a) maintain some secondary structure with transactions sorted by
+ * amount of changes, (b) not looking for the entirely largest transaction,
+ * but e.g. for transaction using at least some fraction of the memory limit,
+ * and (c) evicting multiple transactions at once, e.g. to free a given portion
+ * of the memory limit (e.g. 50%).
+ */
+static ReorderBufferTXN *
+ReorderBufferLargestTXN(ReorderBuffer *rb)

What is the guarantee that after evicting the largest transaction, we
won't immediately hit the memory limit again?  Say, all of the
transactions are of almost similar size, which I don't think is that
uncommon a case.  Instead, the strategy mentioned in point (c) or
something like that seems more promising.  With that strategy, there is
some risk that it might lead to many smaller disk writes, which we
might want to control via some threshold (like we should not flush more
than N xacts).  With this, we also need to ensure that the total memory
freed is greater than the current change.

I think we had some discussion around this point but didn't reach any
conclusion, which means some more brainstorming is required.
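
To make the idea in (c) concrete, something like the toy sketch below
(all names are made up for illustration, txn->size refers to the
per-transaction accounting added by this patch, and this is not code
from the patch):

/*
 * Toy sketch of strategy (c): evict the largest transactions one by one
 * until at least half of the memory limit has been freed, but never
 * more than max_evictions transactions in one round.
 */
static void
EvictUntilEnoughFreed(ReorderBuffer *rb, Size limit, int max_evictions)
{
    Size    freed = 0;
    int     nevicted = 0;

    while (freed < limit / 2 && nevicted < max_evictions)
    {
        ReorderBufferTXN *txn = ReorderBufferLargestTXN(rb);

        if (txn == NULL)
            break;

        freed += txn->size;                 /* per-transaction accounting */
        ReorderBufferSerializeTXN(rb, txn); /* spill its changes to disk */
        nevicted++;
    }
}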

4.
+int logical_work_mem; /* 4MB */

What does this 4MB in the comment indicate?

5.
+/*
+ * Check whether the logical_work_mem limit was reached, and if yes pick
+ * the transaction tx should spill its data to disk.

The second part of the sentence "pick the transaction tx should spill"
seems to be incomplete.

Apart from this, I see that Peter E. has raised some other points on
this patch which are not yet addressed; those also need some
discussion, so I will respond to them separately with my opinion.

These comments are based on the last patch posted by you on this
thread [1].  You might have fixed some of these already, so ignore if
that is the case.

[1] - https://www.postgresql.org/message-id/76fc440e-91c3-afe2-b78a-987205b3c758%402ndquadrant.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
Hi,

Attached is an updated patch series, rebased on current master. It does
fix one memory accounting bug in ReorderBufferToastReplace (the code was
not properly updating the amount of memory).

I've also included the patch series with decoding of 2PC transactions,
which this depends on. This way we have a chance of making the cfbot
happy. So parts 0001-0004 and 0009-0014 are "this" patch series, while
0005-0008 are the extra pieces from the other patch.

I've done it like this because the initial parts are independent, and so
might be committed irrespective of the other patch series. In practice
that's only reasonable for 0001, which adds the memory limit - the rest
is infrastructure for the streaming of in-progress transactions.

On Wed, Sep 25, 2019 at 06:55:01PM +0530, Amit Kapila wrote:
>On Tue, Sep 3, 2019 at 4:30 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>
>> In the interest of moving things forward, how far are we from making
>> 0001 committable?  If I understand correctly, the rest of this patchset
>> depends on https://commitfest.postgresql.org/24/944/ which seems to be
>> moving at a glacial pace (or, actually, slower, because glaciers do
>> move, which cannot be said of that other patch.)
>>
>
>I am not sure if it is completely correct that the other part of the
>patch is dependent on that CF entry.  I have studied both the threads
>(not every detail) and it seems to me it is dependent on one of the
>patches from that series which handles concurrent aborts.  It is patch
>0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.Jan4.patch
>from what the Nikhil has posted on that thread [1].  Am, I wrong?
>

You're right - the part handling aborts is the only part required. There
are dependencies on some other changes from the 2PC patch, but those are
mostly refactorings that can be undone (e.g. switch from independent
flags to a single bitmap in reorderbuffer).

>So IIUC, the problem of concurrent aborts is that if we allow catalog
>scans for in-progress transactions, then we might get wrong answers in
>cases where somebody has performed Alter-Abort-Alter which is clearly
>explained with an example in email [2].  To solve that problem Nikhil
>seems to have written a patch [1] which detects these concurrent
>aborts during a system table scan and then aborts the decoding of such
>a transaction.
>
>Now, the problem is that patch has written considering 2PC
>transactions and might not deal with all cases for in-progress
>transactions especially when sub-transactions are involved as alluded
>by Arseny Sher [3].  So, the problem seems to be for cases when some
>sub-transaction aborts, but the main transaction still continued and
>we try to decode it.  Nikhil's patch won't be able to deal with it
>because I think it just checks top-level xid whereas for this we need
>to check all-subxids which I think is possible now as Tomas seems to
>have written WAL for each xid-assignment.  It might or might not be
>the best solution to check the status of all-subxids, but I think
>first we need to agree that the problem is just for concurrent aborts
>and that we can solve it by using some part of the technology being
>developed as part of patch "Logical decoding of two-phase
>transactions" (https://commitfest.postgresql.org/24/944/) rather than
>the entire patchset.
>
>I hope I am not saying something very obvious here and it helps in
>moving this patch forward.
>

No, that's a good question, and I'm not sure what the answer is at the
moment. My understanding was that the infrastructure in the 2PC patch is
enough even for subtransactions, but I might be wrong. I need to think
about that for a while.

Maybe we should focus on the 0001 part for now - it can be committed
independently and does provide a useful feature.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Thu, Sep 26, 2019 at 06:58:17PM +0530, Amit Kapila wrote:
>On Tue, Sep 3, 2019 at 4:16 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Mon, Sep 02, 2019 at 06:06:50PM -0400, Alvaro Herrera wrote:
>> >In the interest of moving things forward, how far are we from making
>> >0001 committable?  If I understand correctly, the rest of this patchset
>> >depends on https://commitfest.postgresql.org/24/944/ which seems to be
>> >moving at a glacial pace (or, actually, slower, because glaciers do
>> >move, which cannot be said of that other patch.)
>> >
>>
>> I think 0001 is mostly there. I think there's one bug in this patch
>> version, but I need to check and I'll post an updated version shortly if
>> needed.
>>
>
>Did you get a chance to work on 0001?  I have a few comments on that patch:
>1.
>+ *   To limit the amount of memory used by decoded changes, we track memory
>+ *   used at the reorder buffer level (i.e. total amount of memory), and for
>+ *   each toplevel transaction. When the total amount of used memory exceeds
>+ *   the limit, the toplevel transaction consuming the most memory is either
>+ *   serialized or streamed.
>
>Do we need to mention 'streamed' as part of this patch?  It seems to
>me that this is an independent patch which can be committed without
>patches that stream the changes. So, we can remove it from here and
>other places where it is used.
>

You're right - this patch should not mention streaming because the parts
adding that capability are later in the series. So it can trigger just
the serialization to disk.

>2.
>+ *   deserializing and applying very few changes). We probably to give more
>+ *   memory to the oldest subtransactions.
>
>/We probably to/
>It seems some word is missing after probably.
>

Yes.

>3.
>+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
>+ *
>+ * XXX With many subtransactions this might be quite slow, because we'll have
>+ * to walk through all of them. There are some options how we could improve
>+ * that: (a) maintain some secondary structure with transactions sorted by
>+ * amount of changes, (b) not looking for the entirely largest transaction,
>+ * but e.g. for transaction using at least some fraction of the memory limit,
>+ * and (c) evicting multiple transactions at once, e.g. to free a given portion
>+ * of the memory limit (e.g. 50%).
>+ */
>+static ReorderBufferTXN *
>+ReorderBufferLargestTXN(ReorderBuffer *rb)
>
>What is the guarantee that after evicting largest transaction, we
>won't immediately hit the memory limit?  Say, all of the transactions
>are of almost similar size which I don't think is that uncommon a
>case.

Not sure I understand - what do you mean 'immediately hit'?

We do check the limit after queueing a change, and we know that this
change is what got us over the limit. We pick the largest transaction
(which has to be larger than the change we just entered) and evict it,
getting below the memory limit again.

The next change can get us over the memory limit again, of course, but
there's not much we could do about that.
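
Roughly, the logic is the following (a simplified sketch, not the exact
code from the patch; rb->size and txn->size are the accounting fields
the patch adds, and this assumes the GUC is in kB):

/*
 * Simplified sketch: after queueing a change, if the total memory used
 * by the reorder buffer exceeds the limit, spill the single largest
 * transaction to disk.
 */
static void
CheckMemoryLimit(ReorderBuffer *rb)
{
    ReorderBufferTXN *txn;

    /* nothing to do while we are below the limit */
    if (rb->size < logical_work_mem * 1024L)
        return;

    /*
     * Pick the largest (sub)transaction and spill it.  It has to be at
     * least as large as the change we just queued, so this gets us back
     * below the limit.
     */
    txn = ReorderBufferLargestTXN(rb);

    ReorderBufferSerializeTXN(rb, txn);
}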

>  Instead, the strategy mentioned in point (c) or something like
>that seems more promising.  In that strategy, there is some risk that
>it might lead to many smaller disk writes which we might want to
>control via some threshold (like we should not flush more than N
>xacts).  In this, we also need to ensure that the total memory freed
>must be greater than the current change.
>
>I think we have some discussion around this point but didn't reach any
>conclusion which means some more brainstorming is required.
>

I agree it's worth investigating, but I'm not sure it's necessary before
committing v1 of the feature. I don't think there's a clear winner
strategy, and the current approach works fairly well I think.

The comment is concerned with the cost of ReorderBufferLargestTXN with
many transactions, but we can only have a certain number of top-level
transactions (max_connections + a certain number of not-yet-assigned
subtransactions). And the 0002 patch essentially gets rid of the
subxacts entirely, further reducing the maximum number of xacts to walk.

>4.
>+int logical_work_mem; /* 4MB */
>
>What this 4MB in comments indicate?
>

Sorry, that's a mistake.

>5.
>+/*
>+ * Check whether the logical_work_mem limit was reached, and if yes pick
>+ * the transaction tx should spill its data to disk.
>
>The second part of the sentence "pick the transaction tx should spill"
>seems to be incomplete.
>

Yeah, that's a poor wording. Will fix.

>Apart from this, I see that Peter E. has raised some other points on
>this patch which are not yet addressed as those also need some
>discussion, so I will respond to those separately with my opinion.
>

OK, thanks.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Alvaro Herrera
Date:
On 2019-Sep-26, Tomas Vondra wrote:

> Hi,
> 
> Attached is an updated patch series, rebased on current master. It does
> fix one memory accounting bug in ReorderBufferToastReplace (the code was
> not properly updating the amount of memory).

Cool.

Can we aim to get 0001 pushed during this commitfest, or is that a lost
cause?

The large new comment in reorderbuffer.c says that a transaction might
get spilled *or streamed*, but surely that second thing is not correct,
since before the subsequent patches it's not possible to stream
transactions that have not yet finished?

How certain are you about the approach to measure memory used by a
reorderbuffer transaction ... does it not cause a measurable performance
drop?  I wonder if it would make more sense to use a separate context
per transaction and use context-level accounting (per the patch Jeff
Davis posted elsewhere for hash joins ... though I see now that that
only works for aset.c, not other memcxt implementations), or something
like that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Alvaro Herrera
Date:
On 2019-Sep-26, Alvaro Herrera wrote:

> How certain are you about the approach to measure memory used by a
> reorderbuffer transaction ... does it not cause a measurable performance
> drop?  I wonder if it would make more sense to use a separate contexts
> per transaction and use context-level accounting (per the patch Jeff
> Davis posted elsewhere for hash joins ... though I see now that that
> only works fot aset.c, not other memcxt implementations), or something
> like that.

Oh, I just noticed that that patch was posted separately in its own
thread, and that that improved version does include support for other
memory context implementations.  Excellent.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Thu, Sep 26, 2019 at 04:36:20PM -0300, Alvaro Herrera wrote:
>On 2019-Sep-26, Alvaro Herrera wrote:
>
>> How certain are you about the approach to measure memory used by a
>> reorderbuffer transaction ... does it not cause a measurable performance
>> drop?  I wonder if it would make more sense to use a separate contexts
>> per transaction and use context-level accounting (per the patch Jeff
>> Davis posted elsewhere for hash joins ... though I see now that that
>> only works fot aset.c, not other memcxt implementations), or something
>> like that.
>
>Oh, I just noticed that that patch was posted separately in its own
>thread, and that that improved version does include support for other
>memory context implementations.  Excellent.
>

Unfortunately, that won't fly, for two simple reasons:

1) The memory accounting patch is known to perform poorly with many
child contexts - this was why array_agg/string_agg were problematic,
before we rewrote them not to create a memory context for each group.

It could be done differently (eager accounting) but then the overhead
for regular/common cases (with just a couple of contexts) is higher. So
that seems like a much inferior option.

2) We can't actually have a single context per transaction. Some parts
(REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID) of a transaction are not
evicted, so we'd have to keep them in a separate context.

It'd also mean higher allocation overhead, because right now we can
reuse chunks across transactions - one transaction commits or gets
serialized, and we reuse its chunks for something else. With
per-transaction contexts we'd lose some of this benefit - we could only
reuse chunks within a transaction (i.e. in large transactions that get
spilled to disk) but not across commits.

I don't have any numbers, of course, but I wouldn't be surprised if it
was significant e.g. for small transactions that don't get spilled. And
creating/destroying the contexts is not free either, I think.
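
For comparison, the change-level accounting used by the patch is
conceptually very simple - roughly the sketch below (simplified, with
illustrative names such as ReorderBufferChangeSize; the corresponding
function in the patch is ReorderBufferChangeMemoryUpdate):

/*
 * Simplified sketch: whenever a change is added to or removed from a
 * transaction, adjust both the per-transaction counter and the
 * buffer-wide total.
 */
static void
UpdateMemoryAccounting(ReorderBuffer *rb, ReorderBufferChange *change,
                       bool addition)
{
    Size        sz = ReorderBufferChangeSize(change);
    ReorderBufferTXN *txn = change->txn;

    if (addition)
    {
        txn->size += sz;
        rb->size += sz;
    }
    else
    {
        Assert(txn->size >= sz && rb->size >= sz);
        txn->size -= sz;
        rb->size -= sz;
    }
}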


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Thu, Sep 26, 2019 at 04:33:59PM -0300, Alvaro Herrera wrote:
>On 2019-Sep-26, Tomas Vondra wrote:
>
>> Hi,
>>
>> Attached is an updated patch series, rebased on current master. It does
>> fix one memory accounting bug in ReorderBufferToastReplace (the code was
>> not properly updating the amount of memory).
>
>Cool.
>
>Can we aim to get 0001 pushed during this commitfest, or is that a lost
>cause?
>

It's tempting. The patch has been in the queue for quite a bit of time,
and I think it's solid (at least 0001). I'll address the comments from
Peter's review about separating the GUC etc. and polish it a bit more.
If I manage to do that by Monday, I'll consider pushing it.

If anyone feels I shouldn't do that, let me know.

The one open question pointed out by Amit is how the patch picks the
transaction for eviction. My feeling is that's fine and can be improved
later if necessary, but I'll try to construct a worst case
(max_connections xacts, each with 64 subxacts) to verify.

>The large new comment in reorderbuffer.c says that a transaction might
>get spilled *or streamed*, but surely that second thing is not correct,
>since before the subsequent patches it's not possible to stream
>transactions that have not yet finished?
>

True. That's a residue of reordering the patch series repeatedly, I
think. I'll fix that while polishing the patch.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Sep 27, 2019 at 12:06 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Thu, Sep 26, 2019 at 06:58:17PM +0530, Amit Kapila wrote:
>
> >3.
> >+ * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
> >+ *
> >+ * XXX With many subtransactions this might be quite slow, because we'll have
> >+ * to walk through all of them. There are some options how we could improve
> >+ * that: (a) maintain some secondary structure with transactions sorted by
> >+ * amount of changes, (b) not looking for the entirely largest transaction,
> >+ * but e.g. for transaction using at least some fraction of the memory limit,
> >+ * and (c) evicting multiple transactions at once, e.g. to free a given portion
> >+ * of the memory limit (e.g. 50%).
> >+ */
> >+static ReorderBufferTXN *
> >+ReorderBufferLargestTXN(ReorderBuffer *rb)
> >
> >What is the guarantee that after evicting largest transaction, we
> >won't immediately hit the memory limit?  Say, all of the transactions
> >are of almost similar size which I don't think is that uncommon a
> >case.
>
> Not sure I understand - what do you mean 'immediately hit'?
>
> We do check the limit after queueing a change, and we know that this
> change is what got us over the limit. We pick the largest transaction
> (which has to be larger than the change we just entered) and evict it,
> getting below the memory limit again.
>
> The next change can get us over the memory limit again, of course,
>

Yeah, this is what I wanted to say when I wrote that it can immediately hit the limit again.

> but
> there's not much we could do about that.
>
> >  Instead, the strategy mentioned in point (c) or something like
> >that seems more promising.  In that strategy, there is some risk that
> >it might lead to many smaller disk writes which we might want to
> >control via some threshold (like we should not flush more than N
> >xacts).  In this, we also need to ensure that the total memory freed
> >must be greater than the current change.
> >
> >I think we have some discussion around this point but didn't reach any
> >conclusion which means some more brainstorming is required.
> >
>
> I agree it's worth investigating, but I'm not sure it's necessary before
> committing v1 of the feature. I don't think there's a clear winner
> strategy, and the current approach works fairly well I think.
>
> The comment is concerned with the cost of ReorderBufferLargestTXN with
> many transactions, but we can only have certain number of top-level
> transactions (max_connections + certain number of not-yet-assigned
> subtransactions). And 0002 patch essentially gets rid of the subxacts
> entirely, further reducing the maximum number of xacts to walk.
>

That would be good, but I don't understand how.  The second patch will
update the subxacts in the top-level ReorderBufferTXN, but it won't
remove them from the hash table.  It also doesn't seem to account for
the size of subxacts in the top-level xact, so I am not sure how it
will reduce the number of xacts to walk.  I might be missing something
here.  Can you explain a bit how the 0002 patch would help in reducing
the maximum number of xacts to walk?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
>
> On 1/3/18 14:53, Tomas Vondra wrote:
> >> I don't see the need to tie this setting to maintenance_work_mem.
> >> maintenance_work_mem is often set to very large values, which could
> >> then have undesirable side effects on this use.
> >
> > Well, we need to pick some default value, and we can either use a fixed
> > value (not sure what would be a good default) or tie it to an existing
> > GUC. We only really have work_mem and maintenance_work_mem, and the
> > walsender process will never use more than one such buffer. Which seems
> > to be closer to maintenance_work_mem.
> >
> > Pretty much any default value can have undesirable side effects.
>
> Let's just make it an independent setting unless we know any better.  We
> don't have a lot of settings that depend on other settings, and the ones
> we do have a very specific relationship.
>
> >> Moreover, the name logical_work_mem makes it sound like it's a logical
> >> version of work_mem.  Maybe we could think of another name.
> >
> > I won't object to a better name, of course. Any proposals?
>
> logical_decoding_[work_]mem?
>

Having a separate variable for this can give more flexibility, but
OTOH it will add one more knob which the user might not have a good
idea how to set.  What problems do we see if we directly use work_mem
for this case?

If we can't use work_mem, then I think the name proposed by you
(logical_decoding_work_mem) sounds good to me.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Sep 26, 2019 at 11:37 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Wed, Sep 25, 2019 at 06:55:01PM +0530, Amit Kapila wrote:
> >On Tue, Sep 3, 2019 at 4:30 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> >>
> >> In the interest of moving things forward, how far are we from making
> >> 0001 committable?  If I understand correctly, the rest of this patchset
> >> depends on https://commitfest.postgresql.org/24/944/ which seems to be
> >> moving at a glacial pace (or, actually, slower, because glaciers do
> >> move, which cannot be said of that other patch.)
> >>
> >
> >I am not sure if it is completely correct that the other part of the
> >patch is dependent on that CF entry.  I have studied both the threads
> >(not every detail) and it seems to me it is dependent on one of the
> >patches from that series which handles concurrent aborts.  It is patch
> >0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.Jan4.patch
> >from what the Nikhil has posted on that thread [1].  Am, I wrong?
> >
>
> You're right - the part handling aborts is the only part required. There
> are dependencies on some other changes from the 2PC patch, but those are
> mostly refactorings that can be undone (e.g. switch from independent
> flags to a single bitmap in reorderbuffer).
>
> >So IIUC, the problem of concurrent aborts is that if we allow catalog
> >scans for in-progress transactions, then we might get wrong answers in
> >cases where somebody has performed Alter-Abort-Alter which is clearly
> >explained with an example in email [2].  To solve that problem Nikhil
> >seems to have written a patch [1] which detects these concurrent
> >aborts during a system table scan and then aborts the decoding of such
> >a transaction.
> >
> >Now, the problem is that patch has written considering 2PC
> >transactions and might not deal with all cases for in-progress
> >transactions especially when sub-transactions are involved as alluded
> >by Arseny Sher [3].  So, the problem seems to be for cases when some
> >sub-transaction aborts, but the main transaction still continued and
> >we try to decode it.  Nikhil's patch won't be able to deal with it
> >because I think it just checks top-level xid whereas for this we need
> >to check all-subxids which I think is possible now as Tomas seems to
> >have written WAL for each xid-assignment.  It might or might not be
> >the best solution to check the status of all-subxids, but I think
> >first we need to agree that the problem is just for concurrent aborts
> >and that we can solve it by using some part of the technology being
> >developed as part of patch "Logical decoding of two-phase
> >transactions" (https://commitfest.postgresql.org/24/944/) rather than
> >the entire patchset.
> >
> >I hope I am not saying something very obvious here and it helps in
> >moving this patch forward.
> >
>
> No, that's a good question, and I'm not sure what the answer is at the
> moment. My understanding was that the infrastructure in the 2PC patch is
> enough even for subtransactions, but I might be wrong.
>

I also think the patch that handles concurrent aborts should be
sufficient, but it needs to be integrated with your patch.  Earlier,
I thought we needed to check whether any of the subtransactions is
aborted, as mentioned by Arseny Sher, but after thinking again about
that problem, it seems that checking only the status of the current
subtransaction should be sufficient.  Because, if the user concurrently
does Rollback to Savepoint, which aborts multiple subtransactions, the
latest one must be aborted as well, which is what I think we want to
detect.  Once we detect that, we have two options:
(a) restart the decoding of that transaction by removing the changes of
all subxacts, or (b) somehow mark the transaction such that it gets
decoded only at commit time.

>
> Maybe we should focus on the 0001 part for now - it can be committed
> indepently and does provide useful feature.
>

If that can be done sooner, then it is fine, but otherwise, preparing
the patches on top of HEAD would facilitate their review.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Fri, Sep 27, 2019 at 02:33:32PM +0530, Amit Kapila wrote:
>On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut
><peter.eisentraut@2ndquadrant.com> wrote:
>>
>> On 1/3/18 14:53, Tomas Vondra wrote:
>> >> I don't see the need to tie this setting to maintenance_work_mem.
>> >> maintenance_work_mem is often set to very large values, which could
>> >> then have undesirable side effects on this use.
>> >
>> > Well, we need to pick some default value, and we can either use a fixed
>> > value (not sure what would be a good default) or tie it to an existing
>> > GUC. We only really have work_mem and maintenance_work_mem, and the
>> > walsender process will never use more than one such buffer. Which seems
>> > to be closer to maintenance_work_mem.
>> >
>> > Pretty much any default value can have undesirable side effects.
>>
>> Let's just make it an independent setting unless we know any better.  We
>> don't have a lot of settings that depend on other settings, and the ones
>> we do have a very specific relationship.
>>
>> >> Moreover, the name logical_work_mem makes it sound like it's a logical
>> >> version of work_mem.  Maybe we could think of another name.
>> >
>> > I won't object to a better name, of course. Any proposals?
>>
>> logical_decoding_[work_]mem?
>>
>
>Having a separate variable for this can give more flexibility, but
>OTOH it will add one more knob which user might not have a good idea
>to set.  What are the problems we see if directly use work_mem for
>this case?
>

IMHO it's similar to autovacuum_work_mem - we have an independent
setting, but most people leave it at -1, so we use maintenance_work_mem
as the default value. I think it makes sense to do the same thing here.

It does add an extra knob anyway (I don't think we should just use
maintenance_work_mem directly, the user should have an option to
override it when needed). But most users will not notice.

FWIW I don't think we should use work_mem, maintenance_work_mem seems
somewhat more appropriate here (not related to queries, etc.).
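
Just to spell out the -1 semantics I have in mind (a trivial sketch,
mirroring how autovacuum_work_mem falls back to maintenance_work_mem;
the function name is illustrative):

/* Sketch: resolve the effective limit, -1 meaning "use m_w_m". */
static int
LogicalDecodingMemLimit(void)
{
    if (logical_work_mem == -1)
        return maintenance_work_mem;

    return logical_work_mem;
}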

>If we can't use work_mem, then I think the name proposed by you
>(logical_decoding_work_mem) sounds good to me.
>

Yes, that name seems better.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Sep 27, 2019 at 4:55 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Fri, Sep 27, 2019 at 02:33:32PM +0530, Amit Kapila wrote:
> >On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut
> ><peter.eisentraut@2ndquadrant.com> wrote:
> >>
> >> On 1/3/18 14:53, Tomas Vondra wrote:
> >> >> I don't see the need to tie this setting to maintenance_work_mem.
> >> >> maintenance_work_mem is often set to very large values, which could
> >> >> then have undesirable side effects on this use.
> >> >
> >> > Well, we need to pick some default value, and we can either use a fixed
> >> > value (not sure what would be a good default) or tie it to an existing
> >> > GUC. We only really have work_mem and maintenance_work_mem, and the
> >> > walsender process will never use more than one such buffer. Which seems
> >> > to be closer to maintenance_work_mem.
> >> >
> >> > Pretty much any default value can have undesirable side effects.
> >>
> >> Let's just make it an independent setting unless we know any better.  We
> >> don't have a lot of settings that depend on other settings, and the ones
> >> we do have a very specific relationship.
> >>
> >> >> Moreover, the name logical_work_mem makes it sound like it's a logical
> >> >> version of work_mem.  Maybe we could think of another name.
> >> >
> >> > I won't object to a better name, of course. Any proposals?
> >>
> >> logical_decoding_[work_]mem?
> >>
> >
> >Having a separate variable for this can give more flexibility, but
> >OTOH it will add one more knob which user might not have a good idea
> >to set.  What are the problems we see if directly use work_mem for
> >this case?
> >
>
> IMHO it's similar to autovacuum_work_mem - we have an independent
> setting, but most people use it as -1 so we use maintenance_work_mem as
> a default value. I think it makes sense to do the same thing here.
>
> It does ad an extra knob anyway (I don't think we should just use
> maintenance_work_mem directly, the user should have an option to
> override it when needed). But most users will not notice.
>
> FWIW I don't think we should use work_mem, maintenace_work_mem seems
> somewhat more appropriate here (not related to queries, etc.).
>

I have the same concern about using maintenance_work_mem as Peter E.,
which is that the value of maintenance_work_mem will generally be
higher, which is suitable for its current purpose, but not for the
purpose this patch is using it for.  AFAIU, at this stage we want a
better memory accounting system for logical decoding and we are not
sure what a good value for this variable is.  So, I think using
work_mem or maintenance_work_mem should serve the purpose.  Later, if
we have requirements from people to have better control over the
memory required for this purpose, then we can introduce a new variable.

I understand that currently work_mem is primarily tied to memory
used for query workspaces, but it might be okay to extend it for this
purpose.  Another point is that its default sounds more appealing for
this case.  I can see the argument against it, which is that having a
separate variable will make things look cleaner and give better
control.  So, if we can't convince ourselves to use work_mem, we can
introduce a new GUC variable and keep the default as 4MB or work_mem.

I feel it is always tempting to introduce a new GUC for a different
task unless there is an exact match, but OTOH, having fewer GUCs
has its own advantage, which is that people don't have to bother about
a new setting which they need to tune and especially for which they
can't decide with ease.  I am not saying that we should not introduce
a new GUC when it is required, but just that we should give it more
thought before doing so.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Sep 26, 2019 at 11:37 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> Hi,
>
> Attached is an updated patch series, rebased on current master. It does
> fix one memory accounting bug in ReorderBufferToastReplace (the code was
> not properly updating the amount of memory).
>

A few comments on 0001:
1.
I am getting the below linking error in pgoutput when compiling the
patch on my Windows system:
pgoutput.obj : error LNK2001: unresolved external symbol _logical_work_mem

You need to use PGDLLIMPORT for logical_work_mem.
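
For reference, the usual fix is to mark the extern declaration of the
GUC variable with PGDLLIMPORT, something like this (in whichever header
declares it):

extern PGDLLIMPORT int logical_work_mem;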

2. After I fixed the above and tried some basic tests, it failed with
the below call stack:
  postgres.exe!ExceptionalCondition(const char *
conditionName=0x00d92854, const char * errorType=0x00d928bc, const
char * fileName=0x00d92e60,
int lineNumber=2148)  Line 55
  postgres.exe!ReorderBufferChangeMemoryUpdate(ReorderBuffer *
rb=0x02693390, ReorderBufferChange * change=0x0269dd38, bool
addition=true)  Line 2148
  postgres.exe!ReorderBufferQueueChange(ReorderBuffer * rb=0x02693390,
unsigned int xid=525, unsigned __int64 lsn=36083720,
ReorderBufferChange
* change=0x0269dd38)  Line 635
  postgres.exe!DecodeInsert(LogicalDecodingContext * ctx=0x0268ef80,
XLogRecordBuffer * buf=0x012cf718)  Line 716 + 0x24 bytes C
  postgres.exe!DecodeHeapOp(LogicalDecodingContext * ctx=0x0268ef80,
XLogRecordBuffer * buf=0x012cf718)  Line 437 + 0xd bytes C
  postgres.exe!LogicalDecodingProcessRecord(LogicalDecodingContext *
ctx=0x0268ef80, XLogReaderState * record=0x0268f228)  Line 129
  postgres.exe!pg_logical_slot_get_changes_guts(FunctionCallInfoBaseData
* fcinfo=0x02688680, bool confirm=true, bool binary=false)  Line 307
  postgres.exe!pg_logical_slot_get_changes(FunctionCallInfoBaseData *
fcinfo=0x02688680)  Line 376

Basically, the assert added by you in ReorderBufferChangeMemoryUpdate
failed.  Then, I explored a bit and it seems that you have missed
assigning a value to txn, a new field added by this patch to the
structure ReorderBufferChange:
@@ -77,6 +82,9 @@ typedef struct ReorderBufferChange
  /* The type of change. */
  enum ReorderBufferChangeType action;

+ /* Transaction this change belongs to. */
+ struct ReorderBufferTXN *txn;


3.
@@ -206,6 +206,17 @@ CREATE SUBSCRIPTION <replaceable
class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>work_mem</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Limits the amount of memory used to decode changes on the
+          publisher.  If not specified, the publisher will use the default
+          specified by <varname>logical_work_mem</varname>.
+         </para>
+        </listitem>
+       </varlistentry>

I don't see any explanation of how this will be useful.  How can a
subscriber predict the amount of memory required by the publisher for
decoding?  This is even more unpredictable because when the changes are
initially recorded in the ReorderBuffer, they aren't yet filtered for
any particular publication.  Do we really need this?  I think giving
more knobs to the user is helpful when they can somehow know how to use
them.  In this case, it is not clear whether the user can ever use this
one.

4. Can we somehow expose the memory consumed by the ReorderBuffer?  If
so, we might be able to write some tests covering the new functionality.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Sat, Sep 28, 2019 at 01:36:46PM +0530, Amit Kapila wrote:
>On Fri, Sep 27, 2019 at 4:55 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Fri, Sep 27, 2019 at 02:33:32PM +0530, Amit Kapila wrote:
>> >On Tue, Jan 9, 2018 at 7:55 AM Peter Eisentraut
>> ><peter.eisentraut@2ndquadrant.com> wrote:
>> >>
>> >> On 1/3/18 14:53, Tomas Vondra wrote:
>> >> >> I don't see the need to tie this setting to maintenance_work_mem.
>> >> >> maintenance_work_mem is often set to very large values, which could
>> >> >> then have undesirable side effects on this use.
>> >> >
>> >> > Well, we need to pick some default value, and we can either use a fixed
>> >> > value (not sure what would be a good default) or tie it to an existing
>> >> > GUC. We only really have work_mem and maintenance_work_mem, and the
>> >> > walsender process will never use more than one such buffer. Which seems
>> >> > to be closer to maintenance_work_mem.
>> >> >
>> >> > Pretty much any default value can have undesirable side effects.
>> >>
>> >> Let's just make it an independent setting unless we know any better.  We
>> >> don't have a lot of settings that depend on other settings, and the ones
>> >> we do have a very specific relationship.
>> >>
>> >> >> Moreover, the name logical_work_mem makes it sound like it's a logical
>> >> >> version of work_mem.  Maybe we could think of another name.
>> >> >
>> >> > I won't object to a better name, of course. Any proposals?
>> >>
>> >> logical_decoding_[work_]mem?
>> >>
>> >
>> >Having a separate variable for this can give more flexibility, but
>> >OTOH it will add one more knob which user might not have a good idea
>> >to set.  What are the problems we see if directly use work_mem for
>> >this case?
>> >
>>
>> IMHO it's similar to autovacuum_work_mem - we have an independent
>> setting, but most people use it as -1 so we use maintenance_work_mem as
>> a default value. I think it makes sense to do the same thing here.
>>
>> It does ad an extra knob anyway (I don't think we should just use
>> maintenance_work_mem directly, the user should have an option to
>> override it when needed). But most users will not notice.
>>
>> FWIW I don't think we should use work_mem, maintenace_work_mem seems
>> somewhat more appropriate here (not related to queries, etc.).
>>
>
>I have the same concern for using maintenace_work_mem as Peter E.
>which is that the value of maintenace_work_mem will generally be
>higher which is suitable for its current purpose, but not for the
>purpose this patch is using.  AFAIU, at this stage we want a better
>memory accounting system for logical decoding and we are not sure what
>is a good value for this variable.  So, I think using work_mem or
>maintenace_work_mem should serve the purpose.  Later, if we have
>requirements from people to have better control over the memory
>required for this purpose then we can introduce a new variable.
>
>I understand that currently work_mem is primarily tied with memory
>used for query workspaces, but it might be okay to extend it for this
>purpose.  Another point is that the default for that sound to be more
>appealing for this case.  I can see the argument against it which is
>having a separate variable will make the things look clean and give
>better control.  So, if we can't convince ourselves for using
>work_mem, we can introduce a new guc variable and keep the default as
>4MB or work_mem.
>
>I feel it is always tempting to introduce a new guc for the different
>tasks unless there is an exact match, but OTOH, having lesser guc's
>has its own advantage which is that people don't have to bother about
>a new setting which they need to tune and especially for which they
>can't decide with ease.  I am not telling that we should not introduce
>new guc when it is required, but just to give more thought before
>doing so.
>

I do think having a separate GUC is a must, irrespective of what other
GUC (if any) is used as a default. You're right the maintenance_work_mem
value might be too high (e.g. in cases with many subscriptions), but the
same issue applies to work_mem - there's no guarantee work_mem is lower
than maintenance_work_mem, and in analytics databases it may be set very
high. So work_mem does not really solve the issue.

IMHO we can't really do without a new GUC. It's not difficult to create
examples that would benefit from a small/large memory limit, depending
on the number of subscriptions etc.

I do however agree the GUC does not have to be tied to any existing one,
it was just an attempt to use a more sensible default value. I do think
m_w_m would be fine, but I can live with using an explicit value.

So that's what I did in the attached patch - I've renamed the GUC to
logical_decoding_work_mem, detached it from m_w_m and set the default to
64MB (i.e. the same default as m_w_m). It should also fix all the issues
from the recent reviews (at least I believe so).

I've realized that one of the subsequent patches allows overriding the
limit for individual subscriptions (in the CREATE SUBSCRIPTION command).
I think it'd be good to move this bit forward, but I think it can be
done in a separate patch.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Sep 26, 2019 at 11:38 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> No, that's a good question, and I'm not sure what the answer is at the
> moment. My understanding was that the infrastructure in the 2PC patch is
> enough even for subtransactions, but I might be wrong. I need to think
> about that for a while.
>
IIUC, for 2PC it's enough to check whether the main transaction is
aborted or not, but for an in-progress transaction it's possible that
the current subtransaction has made catalog changes and gets aborted
while we are decoding.  So we need to extend the infrastructure such
that we can check the status of the (sub)transaction whose change we
are decoding.  Also, I think we need to handle
ERRCODE_TRANSACTION_ROLLBACK and ignore it.
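
The kind of handling I mean is roughly the pattern below (an
illustrative sketch only, not the attached patch itself;
DecodeInProgressChanges is a hypothetical placeholder for wherever the
changes of the in-progress transaction get applied):

MemoryContext ccxt = CurrentMemoryContext;

PG_TRY();
{
    /* decode/apply the changes of the in-progress transaction */
    DecodeInProgressChanges(rb, txn);
}
PG_CATCH();
{
    ErrorData  *errdata;

    MemoryContextSwitchTo(ccxt);
    errdata = CopyErrorData();

    if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
    {
        /* concurrent abort - discard what we decoded and move on */
        FlushErrorState();
        FreeErrorData(errdata);
    }
    else
    {
        /* any other error is still fatal */
        PG_RE_THROW();
    }
}
PG_END_TRY();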

I have attached a small patch to handle this which can be applied on
top of your patch set.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Sat, Sep 28, 2019 at 01:36:46PM +0530, Amit Kapila wrote:
> >On Fri, Sep 27, 2019 at 4:55 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
>
> I do think having a separate GUC is a must, irrespectedly of what other
> GUC (if any) is used as a default. You're right the maintenance_work_mem
> value might be too high (e.g. in cases with many subscriptions), but the
> same issue applies to work_mem - there's no guarantee work_mem is lower
> than maintenance_work_mem, and in analytics databases it may be set very
> high. So work_mem does not really solve the issue
>
> IMHO we can't really do without a new GUC. It's not difficult to create
> examples that would benefit from small/large memory limit, depending on
> the number of subscriptions etc.
>
> I do however agree the GUC does not have to be tied to any existing one,
> it was just an attempt to use a more sensible default value. I do think
> m_w_m would be fine, but I can live with using an explicit value.
>
> So that's what I did in the attached patch - I've renamed the GUC to
> logical_decoding_work_mem, detached it from m_w_m and set the default to
> 64MB (i.e. the same default as m_w_m).

Fair enough, let's not argue more on this unless someone else wants to
share his opinion.

> It should also fix all the issues
> from the recent reviews (at least I believe so).
>

Have you given any thought to creating a test case for this patch?  I
think you also said that you would test some worst-case scenarios and
report the numbers so that we are convinced that the current eviction
algorithm is good.

> I've realized that one of the subsequent patches allows overriding the
> limit for individual subscriptions (in the CREATE SUBSCRIPTION command).
> I think it'd be good to move this bit forward, but I think it can be
> done in a separate patch.
>

Yeah, it is better to deal with it separately, as I am also not entirely
convinced about this parameter at this stage.  I have mentioned the
same in the previous email as well.

While glancing through the changes, I noticed a small thing:
+#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use maintenance_work_mem

I guess this needs to be updated.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Alvaro Herrera
Date:
On 2019-Sep-29, Amit Kapila wrote:

> On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

> > So that's what I did in the attached patch - I've renamed the GUC to
> > logical_decoding_work_mem, detached it from m_w_m and set the default to
> > 64MB (i.e. the same default as m_w_m).
> 
> Fair enough, let's not argue more on this unless someone else wants to
> share his opinion.

I just read this part of the conversation and I agree that having a
separate GUC with its own value independent from other GUCs is a good
solution.  Tying it to m_w_m seemed reasonable, but it's true that
people frequently set m_w_m very high, and it would be undesirable to
propagate that value to logical decoding memory usage.


I wonder what would constitute good advice on how to set this value - I
mean, what is the metric that the user needs to be thinking about.  Is
it the total memory required to keep all concurrent write transactions
in memory?  (Quick example: if you do 2048 wTPS, each transaction
lasts 1s, and each transaction produces 1kB of logically decoded
changes, then ~2MB is sufficient for the average case.  Is that
correct?  I *think* that full-page images do not count, correct?  With
these things in mind users could go through pg_waldump output and
figure out what to set the value to.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Sun, Sep 29, 2019 at 02:30:44PM -0300, Alvaro Herrera wrote:
>On 2019-Sep-29, Amit Kapila wrote:
>
>> On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>> > So that's what I did in the attached patch - I've renamed the GUC to
>> > logical_decoding_work_mem, detached it from m_w_m and set the default to
>> > 64MB (i.e. the same default as m_w_m).
>>
>> Fair enough, let's not argue more on this unless someone else wants to
>> share his opinion.
>
>I just read this part of the conversation and I agree that having a
>separate GUC with its own value independent from other GUCs is a good
>solution.  Tying it to m_w_m seemed reasonable, but it's true that
>people frequently set m_w_m very high, and it would be undesirable to
>propagate that value to logical decoding memory usage.
>
>
>I wonder what would constitute good advice on how to set this value, I
>mean what is the metric that the user needs to be thinking about.   Is
>it the total of memory required to keep all concurrent write transactions
>in memory?  (Quick example: if you do 2048 wTPS and each transaction
>lasts 1s, and each transaction does 1kB of logically-decoded changes,
>then ~2MB are sufficient for the average case.  Is that correct? 

Yes, something like that. Essentially we'd like to keep all concurrent
transactions decoded in memory, to eliminate the need to spill to disk.
One of the subsequent patches adds some subscription-level stats, so
maybe we don't need to worry about this too much - the stats seem like a
better source of information for tuning.

>I *think* that full-page images do not count, correct?  With these
>things in mind users could go through pg_waldump output and figure out
>what to set the value to.)
>

Right, FPW do not matter here.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Sun, Sep 29, 2019 at 11:24 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
>
> Yeah, it is better to deal it separately as I am also not entirely
> convinced at this stage about this parameter.  I have mentioned the
> same in the previous email as well.
>
> While glancing through the changes, I noticed a small thing:
> +#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use maintenance_work_mem
>
> I guess this need to be updated.
>

On further testing, I found that the patch seems to have problems with toast.  Consider the below scenario:
Session-1
Create table large_text(t1 text);
INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);

Session-2
SELECT * FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);  --kaboom

The second statement in Session-2 leads to a crash.

Other than that, I am not sure whether the changes that spill to disk once logical_decoding_work_mem is exceeded work for toast tables, as I couldn't hit that code for the toast-table case, but I might be missing something.  As mentioned previously, I feel there should be some way to test whether this patch works for the cases it claims to handle.  As of now, I have to check via debugging.  Let me know if there is any way I can test this.

I am reluctant to say it, but I think this patch still needs some more work (review, test, rework) before we can commit it.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:
>On Sun, Sep 29, 2019 at 11:24 AM Amit Kapila <amit.kapila16@gmail.com>
>wrote:
>> On Sun, Sep 29, 2019 at 12:39 AM Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>> >
>>
>> Yeah, it is better to deal it separately as I am also not entirely
>> convinced at this stage about this parameter.  I have mentioned the
>> same in the previous email as well.
>>
>> While glancing through the changes, I noticed a small thing:
>> +#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use
>maintenance_work_mem
>>
>> I guess this need to be updated.
>>
>
>On further testing, I found that the patch seems to have problems with
>toast.  Consider below scenario:
>Session-1
>Create table large_text(t1 text);
>INSERT INTO large_text
>SELECT (SELECT string_agg('x', ',')
>FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
>
>Session-2
>SELECT * FROM pg_create_logical_replication_slot('regression_slot',
>'test_decoding');
>SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
>*--kaboom*
>
>The second statement in Session-2 leads to a crash.
>

OK, thanks for the report - will investigate.

>Other than that, I am not sure if the changes related to spill to disk
>after logical_decoding_work_mem works for toast table as I couldn't hit
>that code for toast table case, but I might be missing something.  As
>mentioned previously, I feel there should be some way to test whether this
>patch works for the cases it claims to work.  As of now, I have to check
>via debugging.  Let me know if there is any way, I can test this.
>

That's one of the reasons why I proposed to move the statistics (which
say how many transactions / bytes were spilled to disk) forward from a
later patch in the series. I don't think there's a better way.

>I am reluctant to say, but I think this patch still needs some more work
>(review, test, rework) before we can commit it.
>

I agree.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:
> >
> >On further testing, I found that the patch seems to have problems with
> >toast.  Consider below scenario:
> >Session-1
> >Create table large_text(t1 text);
> >INSERT INTO large_text
> >SELECT (SELECT string_agg('x', ',')
> >FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
> >
> >Session-2
> >SELECT * FROM pg_create_logical_replication_slot('regression_slot',
> >'test_decoding');
> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
> >--kaboom
> >
> >The second statement in Session-2 leads to a crash.
> >
>
> OK, thanks for the report - will investigate.
>

It was an assertion failure in ReorderBufferCleanupTXN at below line:
+ /* Check we're not mixing changes from different transactions. */
+ Assert(change->txn == txn);

> >Other than that, I am not sure if the changes related to spill to disk
> >after logical_decoding_work_mem works for toast table as I couldn't hit
> >that code for toast table case, but I might be missing something.  As
> >mentioned previously, I feel there should be some way to test whether this
> >patch works for the cases it claims to work.  As of now, I have to check
> >via debugging.  Let me know if there is any way, I can test this.
> >
>
> That's one of the reasons why I proposed to move the statistics (which
> say how many transactions / bytes were spilled to disk) from a later
> patch in the series. I don't think there's a better way.
>

I like that idea, but I think you need to split that patch to only get
the stats related to the spill.  It would be easier to review if you
can prepare that atop of
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote:
>On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
>wrote:
>
>> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:
>> >
>> >On further testing, I found that the patch seems to have problems with
>> >toast.  Consider below scenario:
>> >Session-1
>> >Create table large_text(t1 text);
>> >INSERT INTO large_text
>> >SELECT (SELECT string_agg('x', ',')
>> >FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
>> >
>> >Session-2
>> >SELECT * FROM pg_create_logical_replication_slot('regression_slot',
>> >'test_decoding');
>> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
>> >*--kaboom*
>> >
>> >The second statement in Session-2 leads to a crash.
>> >
>>
>> OK, thanks for the report - will investigate.
>>
>
>It was an assertion failure in ReorderBufferCleanupTXN at below line:
>+ /* Check we're not mixing changes from different transactions. */
>+ Assert(change->txn == txn);
>

Can you still reproduce this issue with the patch I sent on 28/9? I have
been unable to trigger the failure, and it seems pretty similar to the
failure you reported (and I fixed) on 28/9.

>> >Other than that, I am not sure if the changes related to spill to disk
>> >after logical_decoding_work_mem works for toast table as I couldn't hit
>> >that code for toast table case, but I might be missing something.  As
>> >mentioned previously, I feel there should be some way to test whether this
>> >patch works for the cases it claims to work.  As of now, I have to check
>> >via debugging.  Let me know if there is any way, I can test this.
>> >
>>
>> That's one of the reasons why I proposed to move the statistics (which
>> say how many transactions / bytes were spilled to disk) from a later
>> patch in the series. I don't think there's a better way.
>>
>>
>I like that idea, but I think you need to split that patch to only get the
>stats related to the spill.  It would be easier to review if you can
>prepare that atop of
>0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.
>

Sure, I wasn't really proposing to add all stats from that patch,
including those related to streaming.  We need to extract just those
related to spilling. And yes, it needs to be moved right after 0001.
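
To be clear, the counters in question are essentially just these three
fields in ReorderBuffer (a sketch; the exact names and types in the
extracted patch may differ):

    /* statistics about spilling transactions to disk */
    int64    spillTxns;     /* transactions spilled to disk at least once */
    int64    spillCount;    /* number of times any transaction was spilled */
    int64    spillBytes;    /* total amount of decoded change data spilled */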

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
I have attempted to test the performance of (Stream + Spill) vs
(Stream + BGW pool), and I can see a gain similar to what Alexey had
shown [1].

In addition to this, I have rebased the latest patchset [2] without
the two-phase logical decoding patch set.

Test results:
I have repeated the same test as Alexey [1] for 1kk and 3kk rows, and
here are my results:

Stream + Spill
N      time on master (sec)   Total xact time (sec)
1kk    6                      21
3kk    18                     55

Stream + BGW pool
N      time on master (sec)   Total xact time (sec)
1kk    6                      13
3kk    19                     35

Patch details:
All the patches are the same as posted in [2], except:
1. 0006-Gracefully-handle-concurrent-aborts-of-uncommitted -> I have
removed the error handling that is specific to 2PC.
2. 0007-Implement-streaming-mode-in-ReorderBuffer -> Rebased without 2PC.
3. 0009-Extend-the-concurrent-abort-handling-for-in-progress -> New
patch to handle the concurrent-abort error for in-progress transactions,
and also to add handling for a subtransaction's abort.
4. v3-0014-BGWorkers-pool-for-streamed-transactions-apply -> Rebased
Alexey's patch.

[1] https://www.postgresql.org/message-id/8eda5118-2dd0-79a1-4fe9-eec7e334de17%40postgrespro.ru
[2] https://www.postgresql.org/message-id/20190928190917.hrpknmq76v3ts3lj%40development

On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote:
> >On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
> >wrote:
> >
> >> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:
> >> >
> >> >On further testing, I found that the patch seems to have problems with
> >> >toast.  Consider below scenario:
> >> >Session-1
> >> >Create table large_text(t1 text);
> >> >INSERT INTO large_text
> >> >SELECT (SELECT string_agg('x', ',')
> >> >FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
> >> >
> >> >Session-2
> >> >SELECT * FROM pg_create_logical_replication_slot('regression_slot',
> >> >'test_decoding');
> >> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
> >> >*--kaboom*
> >> >
> >> >The second statement in Session-2 leads to a crash.
> >> >
> >>
> >> OK, thanks for the report - will investigate.
> >>
> >
> >It was an assertion failure in ReorderBufferCleanupTXN at below line:
> >+ /* Check we're not mixing changes from different transactions. */
> >+ Assert(change->txn == txn);
> >
>
> Can you still reproduce this issue with the patch I sent on 28/9? I have
> been unable to trigger the failure, and it seems pretty similar to the
> failure you reported (and I fixed) on 28/9.
>
> >> >Other than that, I am not sure if the changes related to spill to disk
> >> >after logical_decoding_work_mem works for toast table as I couldn't hit
> >> >that code for toast table case, but I might be missing something.  As
> >> >mentioned previously, I feel there should be some way to test whether this
> >> >patch works for the cases it claims to work.  As of now, I have to check
> >> >via debugging.  Let me know if there is any way, I can test this.
> >> >
> >>
> >> That's one of the reasons why I proposed to move the statistics (which
> >> say how many transactions / bytes were spilled to disk) from a later
> >> patch in the series. I don't think there's a better way.
> >>
> >>
> >I like that idea, but I think you need to split that patch to only get the
> >stats related to the spill.  It would be easier to review if you can
> >prepare that atop of
> >0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.
> >
>
> Sure, I wasn't really proposing to adding all stats from that patch,
> including those related to streaming.  We need to extract just those
> related to spilling. And yes, it needs to be moved right after 0001.
>
> regards
>
> --
> Tomas Vondra                  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote:
>On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
>wrote:
>
>> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:
>> >
>> >On further testing, I found that the patch seems to have problems with
>> >toast.  Consider below scenario:
>> >Session-1
>> >Create table large_text(t1 text);
>> >INSERT INTO large_text
>> >SELECT (SELECT string_agg('x', ',')
>> >FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
>> >
>> >Session-2
>> >SELECT * FROM pg_create_logical_replication_slot('regression_slot',
>> >'test_decoding');
>> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
>> >*--kaboom*
>> >
>> >The second statement in Session-2 leads to a crash.
>> >
>>
>> OK, thanks for the report - will investigate.
>>
>
>It was an assertion failure in ReorderBufferCleanupTXN at below line:
>+ /* Check we're not mixing changes from different transactions. */
>+ Assert(change->txn == txn);
>

Can you still reproduce this issue with the patch I sent on 28/9? I have
been unable to trigger the failure, and it seems pretty similar to the
failure you reported (and I fixed) on 28/9.

Yes, it seems we need a similar change in ReorderBufferAddNewTupleCids.  I think you need to create the replication slot in session-2 before creating the table in session-1 to see this problem.

--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2196,6 +2196,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
        change->data.tuplecid.cmax = cmax;
        change->data.tuplecid.combocid = combocid;
        change->lsn = lsn;
+       change->txn = txn;
        change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
        dlist_push_tail(&txn->tuplecids, &change->node);

Few more comments:
-----------------------------------
1.
+static bool
+check_logical_decoding_work_mem(int *newval, void **extra, GucSource source)
+{
+ /*
+ * -1 indicates fallback.
+ *
+ * If we haven't yet changed the boot_val default of -1, just let it be.
+ * logical decoding will look to maintenance_work_mem instead.
+ */
+ if (*newval == -1)
+ return true;
+
+ /*
+ * We clamp manually-set values to at least 64kB. The maintenance_work_mem
+ * uses a higher minimum value (1MB), so this is OK.
+ */
+ if (*newval < 64)
+ *newval = 64;

I think this needs to be changed now that we don't rely on maintenance_work_mem.  Another thing related to this is that the default value for logical_decoding_work_mem still seems to be -1.  We need to make it 64MB.  I noticed this while debugging the memory accounting changes.  I think this is the reason why I was not seeing toast-related changes being serialized: in that test, I hadn't changed the default value of logical_decoding_work_mem.
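
For concreteness, the guc.c entry could end up looking roughly like this
once the -1 fallback is dropped (just a sketch; the exact default/min
values are still up for discussion):

    {
        {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
            gettext_noop("Sets the maximum memory to be used for logical decoding."),
            gettext_noop("This much memory can be used by each internal "
                         "reorder buffer before spilling to disk or streaming."),
            GUC_UNIT_KB
        },
        &logical_decoding_work_mem,
        65536, 64, MAX_KILOBYTES,    /* default 64MB, minimum 64kB */
        NULL, NULL, NULL             /* no check hook needed without the -1 fallback */
    },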

2.
+ /*
+ * We're going modify the size of the change, so to make sure the
+ * accounting is correct we'll make it look like we're removing the
+ * change now (with the old size), and then re-add it at the end.
+ */


/going modify/going to modify/

3.
+ *
+ * While updating the existing change with detoasted tuple data, we need to
+ * update the memory accounting info, because the change size will differ.
+ * Otherwise the accounting may get out of sync, triggering serialization
+ * at unexpected times.
+ *
+ * We simply subtract size of the change before rejiggering the tuple, and
+ * then adding the new size. This makes it look like the change was removed
+ * and then added back, except it only tweaks the accounting info.
+ *
+ * In particular it can't trigger serialization, which would be pointless
+ * anyway as it happens during commit processing right before handing
+ * the change to the output plugin.
  */
 static void
 ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
@@ -3023,6 +3281,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
  if (txn->toast_hash == NULL)
  return;
 
+ /*
+ * We're going modify the size of the change, so to make sure the
+ * accounting is correct we'll make it look like we're removing the
+ * change now (with the old size), and then re-add it at the end.
+ */
+ ReorderBufferChangeMemoryUpdate(rb, change, false);

It is not very clear why this change is required.  Basically, this is done at commit time, after which we shouldn't actually attempt to spill these changes.  This is mentioned in the comments as well, but if that is the case, it is not clear how and when the accounting can create a problem.  If possible, can you explain it with an example?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have attempted to test the performance of (Stream + Spill) vs
> (Stream + BGW pool) and I can see the similar gain what Alexey had
> shown[1].
>
> In addition to this, I have rebased the latest patchset [2] without
> the two-phase logical decoding patch set.
>
> Test results:
> I have repeated the same test as Alexy[1] for 1kk and 1kk data and
> here is my result
> Stream + Spill
> N           time on master(sec)   Total xact time (sec)
> 1kk               6                               21
> 3kk             18                               55
>
> Stream + BGW pool
> N          time on master(sec)  Total xact time (sec)
> 1kk              6                              13
> 3kk            19                              35
>
> Patch details:
> All the patches are the same as posted on [2] except
> 1. 0006-Gracefully-handle-concurrent-aborts-of-uncommitted -> I have
> removed the handling of error which is specific for 2PC

Here [1], I mentioned that I had removed the 2PC changes from
this [0006] patch, but I mistakenly attached the original patch itself
instead of the modified version. So I am attaching the modified version
of only this patch; the other patches are the same.

> 2. 0007-Implement-streaming-mode-in-ReorderBuffer -> Rebased without 2PC
> 3. 0009-Extend-the-concurrent-abort-handling-for-in-progress -> New
> patch to handle concurrent abort error for the in-progress transaction
> and also add handling for the sub transaction's abort.
> 4. v3-0014-BGWorkers-pool-for-streamed-transactions-apply -> Rebased
> Alexey's patch

[1] https://www.postgresql.org/message-id/CAFiTN-vHoksqvV4BZ0479NhugGe4QHq_ezngNdDd-YRQ_2cwug%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Oct 3, 2019 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote:
>> >On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
>> >wrote:
>> >
>> >> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:
>> >> >
>> >> >On further testing, I found that the patch seems to have problems with
>> >> >toast.  Consider below scenario:
>> >> >Session-1
>> >> >Create table large_text(t1 text);
>> >> >INSERT INTO large_text
>> >> >SELECT (SELECT string_agg('x', ',')
>> >> >FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
>> >> >
>> >> >Session-2
>> >> >SELECT * FROM pg_create_logical_replication_slot('regression_slot',
>> >> >'test_decoding');
>> >> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
>> >> >*--kaboom*
>> >> >
>> >> >The second statement in Session-2 leads to a crash.
>> >> >
>> >>
>> >> OK, thanks for the report - will investigate.
>> >>
>> >
>> >It was an assertion failure in ReorderBufferCleanupTXN at below line:
>> >+ /* Check we're not mixing changes from different transactions. */
>> >+ Assert(change->txn == txn);
>> >
>>
>> Can you still reproduce this issue with the patch I sent on 28/9? I have
>> been unable to trigger the failure, and it seems pretty similar to the
>> failure you reported (and I fixed) on 28/9.
>
>
> Yes, it seems we need a similar change in ReorderBufferAddNewTupleCids.  I think in session-2 you need to create
replication slot before creating table in session-1 to see this problem.
>
> --- a/src/backend/replication/logical/reorderbuffer.c
> +++ b/src/backend/replication/logical/reorderbuffer.c
> @@ -2196,6 +2196,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb, TransactionId xid,
>         change->data.tuplecid.cmax = cmax;
>         change->data.tuplecid.combocid = combocid;
>         change->lsn = lsn;
> +       change->txn = txn;
>         change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
>         dlist_push_tail(&txn->tuplecids, &change->node);
>
> Few more comments:
> -----------------------------------
> 1.
> +static bool
> +check_logical_decoding_work_mem(int *newval, void **extra, GucSource source)
> +{
> + /*
> + * -1 indicates fallback.
> + *
> + * If we haven't yet changed the boot_val default of -1, just let it be.
> + * logical decoding will look to maintenance_work_mem instead.
> + */
> + if (*newval == -1)
> + return true;
> +
> + /*
> + * We clamp manually-set values to at least 64kB. The maintenance_work_mem
> + * uses a higher minimum value (1MB), so this is OK.
> + */
> + if (*newval < 64)
> + *newval = 64;
>
> I think this needs to be changed as now we don't rely on maintenance_work_mem.  Another thing related to this is that
I think the default value for logical_decoding_work_mem still seems to be -1.  We need to make it to 64MB.  I have seen
this while debugging memory accounting changes.  I think this is the reason why I was not seeing toast related changes
being serialized because, in that test, I haven't changed the default value of logical_decoding_work_mem.
>
> 2.
> + /*
> + * We're going modify the size of the change, so to make sure the
> + * accounting is correct we'll make it look like we're removing the
> + * change now (with the old size), and then re-add it at the end.
> + */
>
>
> /going modify/going to modify/
>
> 3.
> + *
> + * While updating the existing change with detoasted tuple data, we need to
> + * update the memory accounting info, because the change size will differ.
> + * Otherwise the accounting may get out of sync, triggering serialization
> + * at unexpected times.
> + *
> + * We simply subtract size of the change before rejiggering the tuple, and
> + * then adding the new size. This makes it look like the change was removed
> + * and then added back, except it only tweaks the accounting info.
> + *
> + * In particular it can't trigger serialization, which would be pointless
> + * anyway as it happens during commit processing right before handing
> + * the change to the output plugin.
>   */
>  static void
>  ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
> @@ -3023,6 +3281,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
>   if (txn->toast_hash == NULL)
>   return;
>
> + /*
> + * We're going modify the size of the change, so to make sure the
> + * accounting is correct we'll make it look like we're removing the
> + * change now (with the old size), and then re-add it at the end.
> + */
> + ReorderBufferChangeMemoryUpdate(rb, change, false);
>
> It is not very clear why this change is required.  Basically, this is done at commit time after which actually we
shouldn't attempt to spill these changes.  This is mentioned in comments as well, but it is not clear if that is the
case, then how and when accounting can create a problem.  If possible, can you explain it with an example?
>
IIUC, we are keeping track of the memory in the ReorderBuffer, which is
common across the transactions.  So even if this transaction is
committing and will not spill to disk, we still need to keep the memory
accounting correct for future changes in other transactions.
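
Concretely, the pattern in ReorderBufferToastReplace() amounts to the
following (a sketch based on the hunk quoted above; the matching re-add
with 'true' at the end of the function is implied by the comment):

    /* forget the old (toasted) size of the change */
    ReorderBufferChangeMemoryUpdate(rb, change, false);

    /* ... replace toast pointers with the detoasted data, which changes
     * the size of the change ... */

    /* account for the new (detoasted) size again, per the comment,
     * at the end of the function */
    ReorderBufferChangeMemoryUpdate(rb, change, true);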

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Sun, Oct 13, 2019 at 12:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Oct 3, 2019 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > 3.
> > + *
> > + * While updating the existing change with detoasted tuple data, we need to
> > + * update the memory accounting info, because the change size will differ.
> > + * Otherwise the accounting may get out of sync, triggering serialization
> > + * at unexpected times.
> > + *
> > + * We simply subtract size of the change before rejiggering the tuple, and
> > + * then adding the new size. This makes it look like the change was removed
> > + * and then added back, except it only tweaks the accounting info.
> > + *
> > + * In particular it can't trigger serialization, which would be pointless
> > + * anyway as it happens during commit processing right before handing
> > + * the change to the output plugin.
> >   */
> >  static void
> >  ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > @@ -3023,6 +3281,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
> >   if (txn->toast_hash == NULL)
> >   return;
> >
> > + /*
> > + * We're going modify the size of the change, so to make sure the
> > + * accounting is correct we'll make it look like we're removing the
> > + * change now (with the old size), and then re-add it at the end.
> > + */
> > + ReorderBufferChangeMemoryUpdate(rb, change, false);
> >
> > It is not very clear why this change is required.  Basically, this is done at commit time after which actually we
shouldn't attempt to spill these changes.  This is mentioned in comments as well, but it is not clear if that is the
case, then how and when accounting can create a problem.  If possible, can you explain it with an example?
> >
> IIUC, we are keeping the track of the memory in ReorderBuffer which is
> common across the transactions.  So even if this transaction is
> committing and will not spill to dis but we need to keep the memory
> accounting correct for the future changes in other transactions.
>

You are right.  I somehow missed that we need to keep the size
computation in sync even during commit for other in-progress
transactions in the ReorderBuffer.  You can ignore this point or maybe
slightly adjust the comment to make it explicit.
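
Maybe adjust it along these lines (just a wording sketch):

    /*
     * We're going to modify the size of the change.  To keep the accounting
     * correct (not just for this transaction, but for the whole ReorderBuffer,
     * which is shared by all in-progress transactions), make it look like we
     * remove the change now (with the old size) and re-add it at the end
     * (with the new size).
     */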

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Craig Ringer
Date:
On Sun, 13 Oct 2019 at 19:50, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sun, Oct 13, 2019 at 12:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Oct 3, 2019 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > 3.
> > + *
> > + * While updating the existing change with detoasted tuple data, we need to
> > + * update the memory accounting info, because the change size will differ.
> > + * Otherwise the accounting may get out of sync, triggering serialization
> > + * at unexpected times.
> > + *
> > + * We simply subtract size of the change before rejiggering the tuple, and
> > + * then adding the new size. This makes it look like the change was removed
> > + * and then added back, except it only tweaks the accounting info.
> > + *
> > + * In particular it can't trigger serialization, which would be pointless
> > + * anyway as it happens during commit processing right before handing
> > + * the change to the output plugin.
> >   */
> >  static void
> >  ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > @@ -3023,6 +3281,13 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
> >   if (txn->toast_hash == NULL)
> >   return;
> >
> > + /*
> > + * We're going modify the size of the change, so to make sure the
> > + * accounting is correct we'll make it look like we're removing the
> > + * change now (with the old size), and then re-add it at the end.
> > + */
> > + ReorderBufferChangeMemoryUpdate(rb, change, false);
> >
> > It is not very clear why this change is required.  Basically, this is done at commit time after which actually we shouldn't attempt to spill these changes.  This is mentioned in comments as well, but it is not clear if that is the case, then how and when accounting can create a problem.  If possible, can you explain it with an example?
> >
> IIUC, we are keeping the track of the memory in ReorderBuffer which is
> common across the transactions.  So even if this transaction is
> committing and will not spill to dis but we need to keep the memory
> accounting correct for the future changes in other transactions.
>

You are right.  I somehow missed that we need to keep the size
computation in sync even during commit for other in-progress
transactions in the ReorderBuffer.  You can ignore this point or maybe
slightly adjust the comment to make it explicit.

Does anyone object if we add the reorder buffer total size & in-memory size to struct WalSnd too, so we can report it in pg_stat_replication? 

I can follow up with a patch to add on top of this one if you think it's reasonable. I'll also take the opportunity to add a number of tracepoints across the walsender and logical decoding, since right now it's very opaque in production systems ... and everyone just LOVES hunting down debug syms and attaching gdb to production DBs.
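
For example, something like this, using the existing probes.d machinery
(the probe name and arguments below are made up, purely to illustrate):

    /* hypothetical probe added to src/backend/utils/probes.d:
     *     probe reorder__buffer__spill(unsigned int, long, long);
     * which the build turns into a TRACE_POSTGRESQL_REORDER_BUFFER_SPILL() macro */

    /* hypothetical call at the spill site in reorderbuffer.c */
    TRACE_POSTGRESQL_REORDER_BUFFER_SPILL(txn->xid, txn->nentries_mem, size);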

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Oct 14, 2019 at 6:51 AM Craig Ringer <craig@2ndquadrant.com> wrote:
>
> On Sun, 13 Oct 2019 at 19:50, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>
>
> Does anyone object if we add the reorder buffer total size & in-memory size to struct WalSnd too, so we can report it
in pg_stat_replication?
>

There is already a patch
(0011-Track-statistics-for-streaming-spilling) in this series posted
by Tomas [1], which tracks important statistics in WalSnd that I think
are good enough.  Have you checked that?  I am not sure if adding the
additional sizes will help, but I might be missing something.

> I can follow up with a patch to add on top of this one if you think it's reasonable. I'll also take the opportunity
to add a number of tracepoints across the walsender and logical decoding, since right now it's very opaque in production
systems ... and everyone just LOVES hunting down debug syms and attaching gdb to production DBs.
>

Sure, adding tracepoints can be helpful, but isn't it better to start
that as a separate thread?

[1] - https://www.postgresql.org/message-id/20190928190917.hrpknmq76v3ts3lj%40development

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Wed, Oct 02, 2019 at 04:27:30AM +0530, Amit Kapila wrote:
> >On Tue, Oct 1, 2019 at 7:21 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
> >wrote:
> >
> >> On Tue, Oct 01, 2019 at 06:55:52PM +0530, Amit Kapila wrote:
> >> >
> >> >On further testing, I found that the patch seems to have problems with
> >> >toast.  Consider below scenario:
> >> >Session-1
> >> >Create table large_text(t1 text);
> >> >INSERT INTO large_text
> >> >SELECT (SELECT string_agg('x', ',')
> >> >FROM generate_series(1, 1000000)) FROM generate_series(1, 1000);
> >> >
> >> >Session-2
> >> >SELECT * FROM pg_create_logical_replication_slot('regression_slot',
> >> >'test_decoding');
> >> >SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);
> >> >*--kaboom*
> >> >
> >> >The second statement in Session-2 leads to a crash.
> >> >
> >>
> >> OK, thanks for the report - will investigate.
> >>
> >
> >It was an assertion failure in ReorderBufferCleanupTXN at below line:
> >+ /* Check we're not mixing changes from different transactions. */
> >+ Assert(change->txn == txn);
> >
>
> Can you still reproduce this issue with the patch I sent on 28/9? I have
> been unable to trigger the failure, and it seems pretty similar to the
> failure you reported (and I fixed) on 28/9.
>
> >> >Other than that, I am not sure if the changes related to spill to disk
> >> >after logical_decoding_work_mem works for toast table as I couldn't hit
> >> >that code for toast table case, but I might be missing something.  As
> >> >mentioned previously, I feel there should be some way to test whether this
> >> >patch works for the cases it claims to work.  As of now, I have to check
> >> >via debugging.  Let me know if there is any way, I can test this.
> >> >
> >>
> >> That's one of the reasons why I proposed to move the statistics (which
> >> say how many transactions / bytes were spilled to disk) from a later
> >> patch in the series. I don't think there's a better way.
> >>
> >>
> >I like that idea, but I think you need to split that patch to only get the
> >stats related to the spill.  It would be easier to review if you can
> >prepare that atop of
> >0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer.
> >
>
> Sure, I wasn't really proposing to adding all stats from that patch,
> including those related to streaming.  We need to extract just those
> related to spilling. And yes, it needs to be moved right after 0001.
>
I have extracted the spilling-related code into a separate patch on top
of 0001.  I have also fixed some bugs and review comments and attached
those as a separate patch.  Later I can merge it into the main patch if
you agree with the changes.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Oct 14, 2019 at 3:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> >
> > Sure, I wasn't really proposing to adding all stats from that patch,
> > including those related to streaming.  We need to extract just those
> > related to spilling. And yes, it needs to be moved right after 0001.
> >
> I have extracted the spilling related code to a separate patch on top
> of 0001.  I have also fixed some bugs and review comments and attached
> as a separate patch.  Later I can merge it to the main patch if you
> agree with the changes.
>

Few comments
-------------------------
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer
1.
+ {
+ {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
+ gettext_noop("Sets the maximum memory to be used for logical decoding."),
+ gettext_noop("This much memory can be used by each internal "
+ "reorder buffer before spilling to disk or streaming."),
+ GUC_UNIT_KB
+ },

I think we can remove 'or streaming' from the above sentence for now.  We
can add it later with the patch where streaming will be allowed.

2.
@@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable
class="parameter">subscription_name</replaceabl
          </para>
         </listitem>
        </varlistentry>
+
+       <varlistentry>
+        <term><literal>work_mem</literal> (<type>integer</type>)</term>
+        <listitem>
+         <para>
+          Limits the amount of memory used to decode changes on the
+          publisher.  If not specified, the publisher will use the default
+          specified by <varname>logical_decoding_work_mem</varname>. When
+          needed, additional data are spilled to disk.
+         </para>
+        </listitem>
+       </varlistentry>

It is not clear why we need this parameter, at least with this patch.
I have raised this multiple times [1][2].

bugs_and_review_comments_fix
1.
},
  &logical_decoding_work_mem,
- -1, -1, MAX_KILOBYTES,
- check_logical_decoding_work_mem, NULL, NULL
+ 65536, 64, MAX_KILOBYTES,
+ NULL, NULL, NULL

I think the default value should be 1MB similar to
maintenance_work_mem.  The same was true before this change.

2. -#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use
maintenance_work_mem
+i#logical_decoding_work_mem = 64MB # min 64kB

It seems the 'i' is a leftover character in the above change.  Also,
change the default value considering the previous point.

3.
@@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)

  /* update the statistics */
  rb->spillCount += 1;
- rb->spillTxns += txn->serialized ? 1 : 0;
+ rb->spillTxns += txn->serialized ? 0 : 1;
  rb->spillBytes += size;

Why is this change required?  Shouldn't we increase the spillTxns
count only when the txn is serialized?

0002-Track-statistics-for-spilling
1.
+    <row>
+     <entry><structfield>spill_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of transactions spilled to disk after the memory used by
+      logical decoding exceeds <literal>logical_work_mem</literal>. The
+      counter gets incremented both for toplevel transactions and
+      subtransactions.
+      </entry>
+    </row>

The parameter name is wrong here. /logical_work_mem/logical_decoding_work_mem

2.
+    <row>
+     <entry><structfield>spill_txns</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of transactions spilled to disk after the memory used by
+      logical decoding exceeds <literal>logical_work_mem</literal>. The
+      counter gets incremented both for toplevel transactions and
+      subtransactions.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>spill_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of times transactions were spilled to disk. Transactions
+      may get spilled repeatedly, and this counter gets incremented on every
+      such invocation.
+      </entry>
+    </row>
+    <row>
+     <entry><structfield>spill_bytes</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Amount of decoded transaction data spilled to disk.
+      </entry>
+    </row>

In all the above cases, the explanation text starts immediately after
the <entry> tag, but the general coding practice is to start on the next
line; see the explanation of nearby parameters.

It seems these parameters are added to pg-stat-wal-receiver-view in
the docs, but in the code they are present as part of pg_stat_replication.
It seems the doc needs to be updated.  Am I missing something?

3.
ReorderBufferSerializeTXN()
{
..
/* update the statistics */
rb->spillCount += 1;
rb->spillTxns += txn->serialized ? 0 : 1;
rb->spillBytes += size;

Assert(spilled == txn->nentries_mem);
Assert(dlist_is_empty(&txn->changes));
txn->nentries_mem = 0;
txn->serialized = true;
..
}

I am not able to understand the above code.  We are setting the
serialized parameter a few lines after we check it and increment the
spillTxns count. Can you please explain it?

Also, isn't the spillTxns count a bit confusing, because in some cases it
will include subtransactions and in other cases (where the largest picked
transaction is a subtransaction) it won't include them?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have replied to some of your questions inline.  I will work on the
remaining comments and post the patch for the same.

> > >
> > > Sure, I wasn't really proposing to adding all stats from that patch,
> > > including those related to streaming.  We need to extract just those
> > > related to spilling. And yes, it needs to be moved right after 0001.
> > >
> > I have extracted the spilling related code to a separate patch on top
> > of 0001.  I have also fixed some bugs and review comments and attached
> > as a separate patch.  Later I can merge it to the main patch if you
> > agree with the changes.
> >
>
> Few comments
> -------------------------
> 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer
> 1.
> + {
> + {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
> + gettext_noop("Sets the maximum memory to be used for logical decoding."),
> + gettext_noop("This much memory can be used by each internal "
> + "reorder buffer before spilling to disk or streaming."),
> + GUC_UNIT_KB
> + },
>
> I think we can remove 'or streaming' from above sentence for now.  We
> can add it later with later patch where streaming will be allowed.
>
> 2.
> @@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable
> class="parameter">subscription_name</replaceabl
>           </para>
>          </listitem>
>         </varlistentry>
> +
> +       <varlistentry>
> +        <term><literal>work_mem</literal> (<type>integer</type>)</term>
> +        <listitem>
> +         <para>
> +          Limits the amount of memory used to decode changes on the
> +          publisher.  If not specified, the publisher will use the default
> +          specified by <varname>logical_decoding_work_mem</varname>. When
> +          needed, additional data are spilled to disk.
> +         </para>
> +        </listitem>
> +       </varlistentry>
>
> It is not clear why we need this parameter at least with this patch?
> I have raised this multiple times [1][2].
>
> bugs_and_review_comments_fix
> 1.
> },
>   &logical_decoding_work_mem,
> - -1, -1, MAX_KILOBYTES,
> - check_logical_decoding_work_mem, NULL, NULL
> + 65536, 64, MAX_KILOBYTES,
> + NULL, NULL, NULL
>
> I think the default value should be 1MB similar to
> maintenance_work_mem.  The same was true before this change.
>
> 2. -#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use
> maintenance_work_mem
> +i#logical_decoding_work_mem = 64MB # min 64kB
>
> It seems the 'i' is a leftover character in the above change.  Also,
> change the default value considering the previous point.
>
> 3.
> @@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
> ReorderBufferTXN *txn)
>
>   /* update the statistics */
>   rb->spillCount += 1;
> - rb->spillTxns += txn->serialized ? 1 : 0;
> + rb->spillTxns += txn->serialized ? 0 : 1;
>   rb->spillBytes += size;
>
> Why is this change required?  Shouldn't we increase the spillTxns
> count only when the txn is serialized?

Prior to this change, it was increasing rb->spillTxns every time
we tried to serialize the changes of the transaction.  Now, we only
increase it the first time, when the transaction is not yet serialized.

> 0002-Track-statistics-for-spilling
> 1.
> +    <row>
> +     <entry><structfield>spill_txns</structfield></entry>
> +     <entry><type>integer</type></entry>
> +     <entry>Number of transactions spilled to disk after the memory used by
> +      logical decoding exceeds <literal>logical_work_mem</literal>. The
> +      counter gets incremented both for toplevel transactions and
> +      subtransactions.
> +      </entry>
> +    </row>
>
> The parameter name is wrong here. /logical_work_mem/logical_decoding_work_mem
>
> 2.
> +    <row>
> +     <entry><structfield>spill_txns</structfield></entry>
> +     <entry><type>integer</type></entry>
> +     <entry>Number of transactions spilled to disk after the memory used by
> +      logical decoding exceeds <literal>logical_work_mem</literal>. The
> +      counter gets incremented both for toplevel transactions and
> +      subtransactions.
> +      </entry>
> +    </row>
> +    <row>
> +     <entry><structfield>spill_count</structfield></entry>
> +     <entry><type>integer</type></entry>
> +     <entry>Number of times transactions were spilled to disk. Transactions
> +      may get spilled repeatedly, and this counter gets incremented on every
> +      such invocation.
> +      </entry>
> +    </row>
> +    <row>
> +     <entry><structfield>spill_bytes</structfield></entry>
> +     <entry><type>integer</type></entry>
> +     <entry>Amount of decoded transaction data spilled to disk.
> +      </entry>
> +    </row>
>
> In all the above cases, the explanation text starts immediately after
> <entry> tag, but the general coding practice is to start from the next
> line, see the explanation of nearby parameters.
>
> It seems these parameters are added in pg-stat-wal-receiver-view in
> the docs, but in code, it is present as part of pg_stat_replication.
> It seems doc needs to be updated.  Am, I missing something?
>
> 3.
> ReorderBufferSerializeTXN()
> {
> ..
> /* update the statistics */
> rb->spillCount += 1;
> rb->spillTxns += txn->serialized ? 0 : 1;
> rb->spillBytes += size;
>
> Assert(spilled == txn->nentries_mem);
> Assert(dlist_is_empty(&txn->changes));
> txn->nentries_mem = 0;
> txn->serialized = true;
> ..
> }
>
> I am not able to understand the above code.  We are setting the
> serialized parameter a few lines after we check it and increment the
> spillTxns count. Can you please explain it?

Basically, the first time we attempt to serialize a transaction,
txn->serialized will be false; at that point we increment
rb->spillTxns and then set txn->serialized to true.  From then on,
if we try to serialize the same transaction again, we do not
increment rb->spillTxns, so that we count each transaction only
once.
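
In pseudo-code form, the intent of that block is simply (same logic as
the quoted hunk, just spelled out):

    rb->spillCount += 1;            /* counted on every serialization pass */
    if (!txn->serialized)
        rb->spillTxns += 1;         /* count each transaction only once */
    rb->spillBytes += size;
    ...
    txn->serialized = true;         /* later passes won't bump spillTxns */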

>
> Also, isn't spillTxns count bit confusing, because in some cases it
> will include subtransactions and other cases (where the largest picked
> transaction is a subtransaction) it won't include it?

I did not understand your comment completely.  Basically, for every
transaction we serialize, we increase the count the first time, right?
Whether it is the main transaction or a subtransaction.
Am I missing something?


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Oct 21, 2019 at 10:48 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > 3.
> > @@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
> > ReorderBufferTXN *txn)
> >
> >   /* update the statistics */
> >   rb->spillCount += 1;
> > - rb->spillTxns += txn->serialized ? 1 : 0;
> > + rb->spillTxns += txn->serialized ? 0 : 1;
> >   rb->spillBytes += size;
> >
> > Why is this change required?  Shouldn't we increase the spillTxns
> > count only when the txn is serialized?
>
> Prior to this change it was increasing the rb->spillTxns, every time
> we try to serialize the changes of the transaction.  Now, only we
> increase first time when it is not yet serialized.
>
> >
> > 3.
> > ReorderBufferSerializeTXN()
> > {
> > ..
> > /* update the statistics */
> > rb->spillCount += 1;
> > rb->spillTxns += txn->serialized ? 0 : 1;
> > rb->spillBytes += size;
> >
> > Assert(spilled == txn->nentries_mem);
> > Assert(dlist_is_empty(&txn->changes));
> > txn->nentries_mem = 0;
> > txn->serialized = true;
> > ..
> > }
> >
> > I am not able to understand the above code.  We are setting the
> > serialized parameter a few lines after we check it and increment the
> > spillTxns count. Can you please explain it?
>
> Basically, when the first time we attempt to serialize a transaction,
> txn->serialized will be false, that time we will increment the
> rb->spillTxns and after that set txn->serialized to true.  From next
> time onwards if we try to serialize the same transaction we will not
> increment the rb->spillTxns so that we count each transaction only
> once.
>

Your explanation for both the above comments makes sense to me.  Can
you please add some comments along these lines because it is not
apparent why one wants to increase the spillTxns counter when
txn->serialized is false?
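
Maybe something along these lines (just a wording sketch, not the final
comment):

    /*
     * Update the statistics.  spillTxns is incremented only the first time
     * a transaction is serialized (txn->serialized is still false at this
     * point), so each transaction is counted once no matter how many times
     * its changes are spilled.
     */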

> >
> > Also, isn't spillTxns count bit confusing, because in some cases it
> > will include subtransactions and other cases (where the largest picked
> > transaction is a subtransaction) it won't include it?
>
> I did not understand your comment completely.  Basically,  every
> transaction which we are serializing we will increase the count first
> time right? whether it is the main transaction or the sub-transaction.
>

It was not clear to me earlier whether we always increase the
spillTxns counter for subtransactions or not.  But now, looking at
the code carefully, it is clear that it is getting increased in every
case.  In short, you don't need to do anything for this comment.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, Oct 21, 2019 at 2:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Oct 21, 2019 at 10:48 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > 3.
> > > @@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
> > > ReorderBufferTXN *txn)
> > >
> > >   /* update the statistics */
> > >   rb->spillCount += 1;
> > > - rb->spillTxns += txn->serialized ? 1 : 0;
> > > + rb->spillTxns += txn->serialized ? 0 : 1;
> > >   rb->spillBytes += size;
> > >
> > > Why is this change required?  Shouldn't we increase the spillTxns
> > > count only when the txn is serialized?
> >
> > Prior to this change it was increasing the rb->spillTxns, every time
> > we try to serialize the changes of the transaction.  Now, only we
> > increase first time when it is not yet serialized.
> >
> > >
> > > 3.
> > > ReorderBufferSerializeTXN()
> > > {
> > > ..
> > > /* update the statistics */
> > > rb->spillCount += 1;
> > > rb->spillTxns += txn->serialized ? 0 : 1;
> > > rb->spillBytes += size;
> > >
> > > Assert(spilled == txn->nentries_mem);
> > > Assert(dlist_is_empty(&txn->changes));
> > > txn->nentries_mem = 0;
> > > txn->serialized = true;
> > > ..
> > > }
> > >
> > > I am not able to understand the above code.  We are setting the
> > > serialized parameter a few lines after we check it and increment the
> > > spillTxns count. Can you please explain it?
> >
> > Basically, when the first time we attempt to serialize a transaction,
> > txn->serialized will be false, that time we will increment the
> > rb->spillTxns and after that set txn->serialized to true.  From next
> > time onwards if we try to serialize the same transaction we will not
> > increment the rb->spillTxns so that we count each transaction only
> > once.
> >
>
> Your explanation for both the above comments makes sense to me.  Can
> you please add some comments along these lines because it is not
> apparent why one wants to increase the spillTxns counter when
> txn->serialized is false?
Ok, I will add comments in the next patch.
>
> > >
> > > Also, isn't spillTxns count bit confusing, because in some cases it
> > > will include subtransactions and other cases (where the largest picked
> > > transaction is a subtransaction) it won't include it?
> >
> > I did not understand your comment completely.  Basically,  every
> > transaction which we are serializing we will increase the count first
> > time right? whether it is the main transaction or the sub-transaction.
> >
>
> It was not clear to me earlier whether we always increase the
> spillTxns counter for subtransactions or not.  But now, looking at
> code carefully, it is clear that is it is getting increased in every
> case.  In short, you don't need to do anything for this comment.
ok

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Oct 14, 2019 at 3:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra
> > <tomas.vondra@2ndquadrant.com> wrote:
> > >
> > >
> > > Sure, I wasn't really proposing to adding all stats from that patch,
> > > including those related to streaming.  We need to extract just those
> > > related to spilling. And yes, it needs to be moved right after 0001.
> > >
> > I have extracted the spilling related code to a separate patch on top
> > of 0001.  I have also fixed some bugs and review comments and attached
> > as a separate patch.  Later I can merge it to the main patch if you
> > agree with the changes.
> >
>
> Few comments
> -------------------------
> 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer
> 1.
> + {
> + {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
> + gettext_noop("Sets the maximum memory to be used for logical decoding."),
> + gettext_noop("This much memory can be used by each internal "
> + "reorder buffer before spilling to disk or streaming."),
> + GUC_UNIT_KB
> + },
>
> I think we can remove 'or streaming' from above sentence for now.  We
> can add it later with later patch where streaming will be allowed.
Done
>
> 2.
> @@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable
> class="parameter">subscription_name</replaceabl
>           </para>
>          </listitem>
>         </varlistentry>
> +
> +       <varlistentry>
> +        <term><literal>work_mem</literal> (<type>integer</type>)</term>
> +        <listitem>
> +         <para>
> +          Limits the amount of memory used to decode changes on the
> +          publisher.  If not specified, the publisher will use the default
> +          specified by <varname>logical_decoding_work_mem</varname>. When
> +          needed, additional data are spilled to disk.
> +         </para>
> +        </listitem>
> +       </varlistentry>
>
> It is not clear why we need this parameter at least with this patch?
> I have raised this multiple times [1][2].

I have moved it out as a separate patch (0003), so that if we need it
for the streaming transactions we can keep it.
>
> bugs_and_review_comments_fix
> 1.
> },
>   &logical_decoding_work_mem,
> - -1, -1, MAX_KILOBYTES,
> - check_logical_decoding_work_mem, NULL, NULL
> + 65536, 64, MAX_KILOBYTES,
> + NULL, NULL, NULL
>
> I think the default value should be 1MB similar to
> maintenance_work_mem.  The same was true before this change.
The default value for maintenance_work_mem is also 64MB.  Did you mean the min value?
>
> 2. -#logical_decoding_work_mem = 64MB # min 1MB, or -1 to use
> maintenance_work_mem
> +i#logical_decoding_work_mem = 64MB # min 64kB
>
> It seems the 'i' is a leftover character in the above change.  Also,
> change the default value considering the previous point.
oops, fixed.
>
> 3.
> @@ -2479,7 +2480,7 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
> ReorderBufferTXN *txn)
>
>   /* update the statistics */
>   rb->spillCount += 1;
> - rb->spillTxns += txn->serialized ? 1 : 0;
> + rb->spillTxns += txn->serialized ? 0 : 1;
>   rb->spillBytes += size;
>
> Why is this change required?  Shouldn't we increase the spillTxns
> count only when the txn is serialized?
Already agreed in the previous mail, so I added comments.
>
> 0002-Track-statistics-for-spilling
> 1.
> +    <row>
> +     <entry><structfield>spill_txns</structfield></entry>
> +     <entry><type>integer</type></entry>
> +     <entry>Number of transactions spilled to disk after the memory used by
> +      logical decoding exceeds <literal>logical_work_mem</literal>. The
> +      counter gets incremented both for toplevel transactions and
> +      subtransactions.
> +      </entry>
> +    </row>
>
> The parameter name is wrong here. /logical_work_mem/logical_decoding_work_mem
done
>
> 2.
> +    <row>
> +     <entry><structfield>spill_txns</structfield></entry>
> +     <entry><type>integer</type></entry>
> +     <entry>Number of transactions spilled to disk after the memory used by
> +      logical decoding exceeds <literal>logical_work_mem</literal>. The
> +      counter gets incremented both for toplevel transactions and
> +      subtransactions.
> +      </entry>
> +    </row>
> +    <row>
> +     <entry><structfield>spill_count</structfield></entry>
> +     <entry><type>integer</type></entry>
> +     <entry>Number of times transactions were spilled to disk. Transactions
> +      may get spilled repeatedly, and this counter gets incremented on every
> +      such invocation.
> +      </entry>
> +    </row>
> +    <row>
> +     <entry><structfield>spill_bytes</structfield></entry>
> +     <entry><type>integer</type></entry>
> +     <entry>Amount of decoded transaction data spilled to disk.
> +      </entry>
> +    </row>
>
> In all the above cases, the explanation text starts immediately after
> <entry> tag, but the general coding practice is to start from the next
> line, see the explanation of nearby parameters.
It seems it's mixed, for example, you can see
   <entry>Timeline number of last write-ahead log location received and
      flushed to disk, the initial value of this field being the timeline
      number of the first log location used when WAL receiver is started
     </entry>

or
    <entry>Timeline number of last write-ahead log location received and
      flushed to disk, the initial value of this field being the timeline
      number of the first log location used when WAL receiver is started
     </entry>

>
> It seems these parameters are added in pg-stat-wal-receiver-view in
> the docs, but in code, it is present as part of pg_stat_replication.
> It seems doc needs to be updated.  Am, I missing something?
Fixed
>
> 3.
> ReorderBufferSerializeTXN()
> {
> ..
> /* update the statistics */
> rb->spillCount += 1;
> rb->spillTxns += txn->serialized ? 0 : 1;
> rb->spillBytes += size;
>
> Assert(spilled == txn->nentries_mem);
> Assert(dlist_is_empty(&txn->changes));
> txn->nentries_mem = 0;
> txn->serialized = true;
> ..
> }
>
> I am not able to understand the above code.  We are setting the
> serialized parameter a few lines after we check it and increment the
> spillTxns count. Can you please explain it?
>
> Also, isn't spillTxns count bit confusing, because in some cases it
> will include subtransactions and other cases (where the largest picked
> transaction is a subtransaction) it won't include it?
>
Already discussed in the last mail.

I have merged the bugs_and_review_comments_fix.patch changes into 0001 and 0002.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have attempted to test the performance of (Stream + Spill) vs
> (Stream + BGW pool) and I can see the similar gain what Alexey had
> shown[1].
>
> In addition to this, I have rebased the latest patchset [2] without
> the two-phase logical decoding patch set.
>
> Test results:
> I have repeated the same test as Alexy[1] for 1kk and 1kk data and
> here is my result
> Stream + Spill
> N           time on master(sec)   Total xact time (sec)
> 1kk               6                               21
> 3kk             18                               55
>
> Stream + BGW pool
> N          time on master(sec)  Total xact time (sec)
> 1kk              6                              13
> 3kk            19                              35
>

I think the test results for the master are missing.  Also, how about
running these tests over a network (meaning master and subscriber are
not on the same machine)?  In general, yours and Alexey's test results
show that there is merit in having workers apply such transactions.
OTOH, as noted above [1], we are also worried about the performance
of rollbacks if we follow that approach.  I am not sure how much we
need to worry about rollbacks if commits are faster, but can we think
of recording the changes in memory and only writing them to a file if
the changes are above a certain threshold?  I think that might help
save I/O in many cases.  I am not very sure how much additional workers
can help if we do that, but they might still help.  I think we need to
do some tests and experiments to figure out the best approach.  What do you think?
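
To illustrate the threshold idea, here is a toy sketch (not PostgreSQL
code; all names and the threshold value are invented for illustration):

    #include <stdio.h>
    #include <string.h>

    #define SPILL_THRESHOLD_BYTES (4 * 1024 * 1024)     /* say, 4MB per transaction */

    typedef struct StreamedXact
    {
        char    buf[SPILL_THRESHOLD_BYTES];     /* in-memory buffer for small xacts */
        size_t  used;                           /* bytes currently buffered */
        FILE   *spill_file;                     /* NULL until we actually spill */
    } StreamedXact;

    static void
    record_change(StreamedXact *sx, const char *data, size_t len)
    {
        if (sx->spill_file == NULL && sx->used + len <= sizeof(sx->buf))
        {
            /* still below the threshold: no I/O at all, cheap to discard on abort */
            memcpy(sx->buf + sx->used, data, len);
            sx->used += len;
            return;
        }

        if (sx->spill_file == NULL)
        {
            /* first time over the threshold: flush what was buffered so far */
            sx->spill_file = tmpfile();
            fwrite(sx->buf, 1, sx->used, sx->spill_file);
            sx->used = 0;
        }
        fwrite(data, 1, len, sx->spill_file);
    }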

Tomas, Alexey, do you have any thoughts on this matter?  I think it is
important that we figure out the way to proceed in this patch.

[1] - https://www.postgresql.org/message-id/b25ce80e-f536-78c8-d5c8-a5df3e230785%40postgrespro.ru

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, Oct 22, 2019 at 10:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have attempted to test the performance of (Stream + Spill) vs
> > (Stream + BGW pool) and I can see the similar gain what Alexey had
> > shown[1].
> >
> > In addition to this, I have rebased the latest patchset [2] without
> > the two-phase logical decoding patch set.
> >
> > Test results:
> > I have repeated the same test as Alexy[1] for 1kk and 1kk data and
> > here is my result
> > Stream + Spill
> > N           time on master(sec)   Total xact time (sec)
> > 1kk               6                               21
> > 3kk             18                               55
> >
> > Stream + BGW pool
> > N          time on master(sec)  Total xact time (sec)
> > 1kk              6                              13
> > 3kk            19                              35
> >
>
> I think the test results for the master are missing.
Yeah, at that time I was planning to compare spill vs. bgworker.
> Also, how about
> running these tests over a network (means master and subscriber are
> not on the same machine)?

Yeah, we should do that; it will show the merit of streaming the
in-progress transactions.

>    In general, yours and Alexey's test results
> show that there is merit by having workers applying such transactions.
>   OTOH, as noted above [1], we are also worried about the performance
> of Rollbacks if we follow that approach.  I am not sure how much we
> need to worry about Rollabcks if commits are faster, but can we think
> of recording the changes in memory and only write to a file if the
> changes are above a certain threshold?  I think that might help saving
> I/O in many cases.  I am not very sure if we do that how much
> additional workers can help, but they might still help.  I think we
> need to do some tests and experiments to figure out what is the best
> approach?  What do you think?
I agree with that point.  I think we might need to make some small
changes and run tests to see what the best method is for handling the
streamed changes at the subscriber end.

>
> Tomas, Alexey, do you have any thoughts on this matter?  I think it is
> important that we figure out the way to proceed in this patch.
>
> [1] - https://www.postgresql.org/message-id/b25ce80e-f536-78c8-d5c8-a5df3e230785%40postgrespro.ru
>


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Tue, Oct 22, 2019 at 10:30:16AM +0530, Dilip Kumar wrote:
>On Fri, Oct 18, 2019 at 5:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Mon, Oct 14, 2019 at 3:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> >
>> > On Thu, Oct 3, 2019 at 4:03 AM Tomas Vondra
>> > <tomas.vondra@2ndquadrant.com> wrote:
>> > >
>> > >
>> > > Sure, I wasn't really proposing to adding all stats from that patch,
>> > > including those related to streaming.  We need to extract just those
>> > > related to spilling. And yes, it needs to be moved right after 0001.
>> > >
>> > I have extracted the spilling related code to a separate patch on top
>> > of 0001.  I have also fixed some bugs and review comments and attached
>> > as a separate patch.  Later I can merge it to the main patch if you
>> > agree with the changes.
>> >
>>
>> Few comments
>> -------------------------
>> 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer
>> 1.
>> + {
>> + {"logical_decoding_work_mem", PGC_USERSET, RESOURCES_MEM,
>> + gettext_noop("Sets the maximum memory to be used for logical decoding."),
>> + gettext_noop("This much memory can be used by each internal "
>> + "reorder buffer before spilling to disk or streaming."),
>> + GUC_UNIT_KB
>> + },
>>
>> I think we can remove 'or streaming' from above sentence for now.  We
>> can add it later with later patch where streaming will be allowed.
>Done
>>
>> 2.
>> @@ -206,6 +206,18 @@ CREATE SUBSCRIPTION <replaceable
>> class="parameter">subscription_name</replaceabl
>>           </para>
>>          </listitem>
>>         </varlistentry>
>> +
>> +       <varlistentry>
>> +        <term><literal>work_mem</literal> (<type>integer</type>)</term>
>> +        <listitem>
>> +         <para>
>> +          Limits the amount of memory used to decode changes on the
>> +          publisher.  If not specified, the publisher will use the default
>> +          specified by <varname>logical_decoding_work_mem</varname>. When
>> +          needed, additional data are spilled to disk.
>> +         </para>
>> +        </listitem>
>> +       </varlistentry>
>>
>> It is not clear why we need this parameter at least with this patch?
>> I have raised this multiple times [1][2].
>
>I have moved it out as a separate patch (0003) so that if we need that
>we need this for the streaming transaction then we can keep this.
>>

I'm OK with moving it to a separate patch. That being said, I think the
ability to control memory usage for individual subscriptions is very
useful. Saying "we don't need such a parameter" is essentially equivalent
to saying "one size fits all", and I think we know that's not true.

Imagine a system with multiple subscriptions, some of them mostly
replicating OLTP changes, but one or two replicating tables that are
updated in batches. What we'd want is to allow a higher limit for the
batch subscriptions, but a much lower limit for the OLTP ones (which
should never hit it in practice).

With a single global GUC, you'll either have a high value - risking
OOM when the OLTP subscriptions happen to decode a batch update, or a
low value affecting the batch subscriptions.

It's not strictly necessary (and we already have such limit), so I'm OK
with treating it as an enhancement for the future.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Tomas Vondra
Дата:
On Tue, Oct 22, 2019 at 11:01:48AM +0530, Dilip Kumar wrote:
>On Tue, Oct 22, 2019 at 10:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> >
>> > I have attempted to test the performance of (Stream + Spill) vs
>> > (Stream + BGW pool) and I can see the similar gain what Alexey had
>> > shown[1].
>> >
>> > In addition to this, I have rebased the latest patchset [2] without
>> > the two-phase logical decoding patch set.
>> >
>> > Test results:
> > I have repeated the same test as Alexey[1] for 1kk and 3kk data and
>> > here is my result
>> > Stream + Spill
>> > N           time on master(sec)   Total xact time (sec)
>> > 1kk               6                               21
>> > 3kk             18                               55
>> >
>> > Stream + BGW pool
>> > N          time on master(sec)  Total xact time (sec)
>> > 1kk              6                              13
>> > 3kk            19                              35
>> >
>>
>> I think the test results for the master are missing.
>Yeah, That time, I was planning to compare spill vs bgworker.
>  Also, how about
>> running these tests over a network (means master and subscriber are
>> not on the same machine)?
>
>Yeah, we should do that that will show the merit of streaming the
>in-progress transactions.
>

While I agree it's an interesting feature, I think we need to stop
adding more stuff to this patch series - it's already complex enough, and
adding even more (unnecessary) stuff is a distraction and will make
it harder to get anything committed. Typical "scope creep".

I think the current behavior (spill to file) is sufficient for v0 and
can be improved later - that's fine. I don't think we need to bother
with comparisons to master very much, because while it might be a bit
slower in some cases, you can always disable streaming (so if there's a
regression for your workload, you can undo that).

>>   In general, yours and Alexey's test results
>> show that there is merit by having workers applying such transactions.
>>   OTOH, as noted above [1], we are also worried about the performance
>> of Rollbacks if we follow that approach.  I am not sure how much we
>> need to worry about Rollbacks if commits are faster, but can we think
>> of recording the changes in memory and only write to a file if the
>> changes are above a certain threshold?  I think that might help saving
>> I/O in many cases.  I am not very sure if we do that how much
>> additional workers can help, but they might still help.  I think we
>> need to do some tests and experiments to figure out what is the best
>> approach?  What do you think?
>I agree with the point.  I think we might need to do some small
>changes and test to see what could be the best method to handle the
>streamed changes at the subscriber end.
>
>>
>> Tomas, Alexey, do you have any thoughts on this matter?  I think it is
>> important that we figure out the way to proceed in this patch.
>>
>> [1] - https://www.postgresql.org/message-id/b25ce80e-f536-78c8-d5c8-a5df3e230785%40postgrespro.ru
>>
>

I think the patch should do the simplest thing possible, i.e. what it
does today. Otherwise we'll never get it committed.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Alexey Kondratov
Дата:
On 22.10.2019 20:22, Tomas Vondra wrote:
> On Tue, Oct 22, 2019 at 11:01:48AM +0530, Dilip Kumar wrote:
>> On Tue, Oct 22, 2019 at 10:46 AM Amit Kapila 
>> <amit.kapila16@gmail.com> wrote:
>>>   In general, yours and Alexey's test results
>>> show that there is merit by having workers applying such transactions.
>>>   OTOH, as noted above [1], we are also worried about the performance
>>> of Rollbacks if we follow that approach.  I am not sure how much we
>>> need to worry about Rollbacks if commits are faster, but can we think
>>> of recording the changes in memory and only write to a file if the
>>> changes are above a certain threshold?  I think that might help saving
>>> I/O in many cases.  I am not very sure if we do that how much
>>> additional workers can help, but they might still help.  I think we
>>> need to do some tests and experiments to figure out what is the best
>>> approach?  What do you think?
>> I agree with the point.  I think we might need to do some small
>> changes and test to see what could be the best method to handle the
>> streamed changes at the subscriber end.
>>
>>>
>>> Tomas, Alexey, do you have any thoughts on this matter?  I think it is
>>> important that we figure out the way to proceed in this patch.
>>>
>>> [1] - 
>>> https://www.postgresql.org/message-id/b25ce80e-f536-78c8-d5c8-a5df3e230785%40postgrespro.ru
>>>
>>
>
> I think the patch should do the simplest thing possible, i.e. what it
> does today. Otherwise we'll never get it committed.
>

I have to agree with Tomas that keeping things as simple as possible 
should be the main priority right now. Otherwise, the entire patch set 
will pass another release cycle without being committed even partially. 
At the same time, it resolves an important problem from my perspective: it 
moves I/O overhead from the primary to the replica by streaming large 
transactions, which is a nice-to-have feature I guess.

Later it would be possible to replace the logical apply worker with a 
bgworkers pool in a separate patch, if we decide that it is a viable 
solution. Anyway, regarding Amit's questions:

- I doubt that maintaining a separate buffer on the apply side before 
spilling to disk would help enough. We already have the ReorderBuffer with 
the logical_work_mem limit, and if we exceeded that limit on the sender 
side, then most probably we would exceed it on the applier side as well, 
except in the case where this new buffer size is significantly higher 
than logical_work_mem, so that it can keep multiple open xacts.

- I still think that we should optimize the database for commits, not 
rollbacks. The BGworkers pool is dramatically slower for a rollbacks-only 
load, though it is at least twice as fast for commits-only. I do not 
know how it will perform with a real-life load, but this drawback may be 
inappropriate for a general-purpose database like Postgres.

- Tomas' implementation of streaming with spilling does not have this 
bias between commits/aborts. However, it has a noticeable performance 
drop (~x5 slower compared with master [1]) for large transactions 
consisting of many small rows, although it is not an order of 
magnitude slower.

Another thing is that about a year ago I found some problems 
with MVCC/visibility and fixed them [1]. If I get it correctly, 
Tomas adapted some of those fixes into his patch set, but I think that 
this part should be reviewed carefully again. I would be glad to check 
it, but right now I am a little bit confused by all the patch set variants 
in the thread. Which is the latest one? Is it still dependent on 2PC decoding?

[1] 

https://www.postgresql.org/message-id/flat/40c38758-04b5-74f4-c963-cf300f9e5dff%40postgrespro.ru#98d06fefc88122385dacb2f03f7c30f7


Thanks for moving this patch forward!

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, Oct 22, 2019 at 10:42 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Oct 22, 2019 at 10:30:16AM +0530, Dilip Kumar wrote:
> >
> >I have moved it out as a separate patch (0003) so that if we need that
> >we need this for the streaming transaction then we can keep this.
> >>
>
> I'm OK with moving it to a separate patch. That being said I think
> ability to control memory usage for individual subscriptions is very
> useful. Saying "We don't need such parameter" is essentially equivalent
> to saying "One size fits all" and I think we know that's not true.
>
> Imagine a system with multiple subscriptions, some of them mostly
> replicating OLTP changes, but one or two replicating tables that are
> updated in batches. What we'd have is to allow higher limit for the
> batch subscriptions, but much lower limit for the OLTP ones (which they
> should never hit in practice).
>

This point is not clear to me.  The changes are recorded in the
ReorderBuffer, which doesn't do any filtering, i.e. it will have all
the changes irrespective of the subscriber.  How will it make a
difference to have different limits?

> With a single global GUC, you'll either have a high value - risking
> OOM when the OLTP subscriptions happen to decode a batch update, or a
> low value affecting the batch subscriptions.
>
> It's not strictly necessary (and we already have such limit), so I'm OK
> with treating it as an enhancement for the future.
>

I am fine too if its usage is clear.  I might be missing something here.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Wed, Oct 23, 2019 at 12:32 AM Alexey Kondratov
<a.kondratov@postgrespro.ru> wrote:
>
> On 22.10.2019 20:22, Tomas Vondra wrote:
> >
> > I think the patch should do the simplest thing possible, i.e. what it
> > does today. Otherwise we'll never get it committed.
> >
>
> I have to agree with Tomas, that keeping things as simple as possible
> should be a main priority right now. Otherwise, the entire patch set
> will pass next release cycle without being committed at least partially.
> In the same time, it resolves important problem from my perspective. It
> moves I/O overhead from primary to replica using large transactions
> streaming, which is a nice to have feature I guess.
>
> Later it would be possible to replace logical apply worker with
> bgworkers pool in a separated patch, if we decide that it is a viable
> solution. Anyway, regarding the Amit's questions:
>
> - I doubt that maintaining a separate buffer on the apply side before
> spilling to disk would help enough. We already have ReorderBuffer with
> logical_work_mem limit, and if we exceeded that limit on the sender
> side, then most probably we exceed it on the applier side as well,
>

I think on the sender side the limit is for unfiltered changes (which
means on the ReorderBuffer, which has all the changes), whereas on the
receiver side we will only have the requested changes, which can make
a difference?

> excepting the case when this new buffer size will be significantly
> higher then logical_work_mem to keep multiple open xacts.
>

I am not sure but I think we can have different controlling parameters
on the subscriber-side.

> - I still think that we should optimize database for commits, not
> rollbacks. BGworkers pool is dramatically slower for rollbacks-only
> load, though being at least twice as faster for commits-only. I do not
> know how it will perform with real life load, but this drawback may be
> inappropriate for such a general purpose database like Postgres.
>
> - Tomas' implementation of streaming with spilling does not have this
> bias between commits/aborts. However, it has a noticeable performance
> drop (~x5 slower compared with master [1]) for large transaction
> consisting of many small rows. Although it is not of an order of
> magnitude slower.
>

Did you ever identify the reason why it was slower in that case?  I
can see the numbers shared by you and Dilip, which show that the
BGWorker pool is a really good idea and will work great for
commit-mostly workloads, whereas the numbers without it are not very
encouraging; maybe we have not benchmarked enough.  This is the reason
I am trying to see if we can do something to get benefits similar
to what is shown by your idea.

I am not against doing something simple for the first version and then
enhancing it later, but it won't be good if we commit it with a
regression in some typical cases and depend on the user to enable it
only when it seems favorable to their case.  Also, sometimes it becomes
difficult to generate enthusiasm to enhance a feature once the main
patch is committed.  I am not saying that always happens or will happen
in this case.  It is better if we put in some energy and get things as
good as possible in the first go itself.  I am as much interested as
you, Tomas, or others are; otherwise, I wouldn't have spent a lot of
time on this to disentangle it from the 2PC patch, which seems to have
stalled due to lack of interest.

> Another thing is it that about a year ago I have found some problems
> with MVCC/visibility and fixed them somehow [1]. If I get it correctly
> Tomas adapted some of those fixes into his patch set, but I think that
> this part should be reviewed carefully again.
>

Agreed.  I have read your emails and could see that you have done very
good work on this project along with Tomas.  But unfortunately, it
didn't get committed.  At this stage, we are working on just the first
part of the patch, which is to allow the data to spill once it crosses
logical_decoding_work_mem on the master side.  I think there will be
more problems to discuss and solve once that is done.

> I would be glad to check
> it, but now I am a little bit confused with all the patch set variants
> in the thread. Which is the last one? Is it still dependent on 2pc decoding?
>

I think the latest patches posted by Dilip are not dependent on
two-phase decoding, but I haven't studied them yet.  You can find those
at [1][2].  As per the discussion in this thread, we are also trying to
see if we can get some part of the patch series committed first; the
latest patches corresponding to that are posted at [3].

[1] - https://www.postgresql.org/message-id/CAFiTN-vHoksqvV4BZ0479NhugGe4QHq_ezngNdDd-YRQ_2cwug%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAFiTN-vT%2B42xRbkw%3DhBnp44XkAyZaKZVA5hcvAMsYth3rk7vhg%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAFiTN-vkFB0RBEjVkLWhdgTYShSrSu3kCYObMghgXEwKA1FXRA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, Oct 22, 2019 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have merged bugs_and_review_comments_fix.patch changes to 0001 and 0002.
>

I was wondering whether we have checked the code coverage after this
patch?  Previously, the existing tests seem to be covering most parts
of the function ReorderBufferSerializeTXN [1].  After this patch, the
timing to call ReorderBufferSerializeTXN will change, so that might
impact the testing of the same.  If it is already covered, then I
would like to either add a new test or extend existing test with the
help of new spill counters.  If it is not getting covered, then we
need to think of extending the existing test or write a new test to
cover the function ReorderBufferSerializeTXN.

[1] - https://coverage.postgresql.org/src/backend/replication/logical/reorderbuffer.c.gcov.html

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
vignesh C
Дата:
On Tue, Oct 22, 2019 at 10:52 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> I think the patch should do the simplest thing possible, i.e. what it
> does today. Otherwise we'll never get it committed.
>
I found a couple of crashes while reviewing and testing flushing of
open transaction data:
Issue 1:
#0  0x00007f22c5722337 in raise () from /lib64/libc.so.6
#1  0x00007f22c5723a28 in abort () from /lib64/libc.so.6
#2  0x0000000000ec5390 in ExceptionalCondition
(conditionName=0x10ea814 "!dlist_is_empty(head)", errorType=0x10ea804
"FailedAssertion",
    fileName=0x10ea7e0 "../../../../src/include/lib/ilist.h",
lineNumber=458) at assert.c:54
#3  0x0000000000b4fb91 in dlist_tail_element_off (head=0x19e4db8,
off=64) at ../../../../src/include/lib/ilist.h:458
#4  0x0000000000b546d0 in ReorderBufferAbortOld (rb=0x191b6b0,
oldestRunningXid=3834) at reorderbuffer.c:1966
#5  0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x19af990,
buf=0x7ffcbc26dc50) at decode.c:332
#6  0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x19af990,
record=0x19afc50) at decode.c:121
#7  0x0000000000b7109e in XLogSendLogical () at walsender.c:2845
#8  0x0000000000b6f5e4 in WalSndLoop (send_data=0xb70f77
<XLogSendLogical>) at walsender.c:2199
#9  0x0000000000b6c7e1 in StartLogicalReplication (cmd=0x1983168) at
walsender.c:1128
#10 0x0000000000b6da6f in exec_replication_command
(cmd_string=0x18f70a0 "START_REPLICATION SLOT \"sub1\" LOGICAL 0/0
(proto_version '1', publication_names '\"pub1\"')")
    at walsender.c:1545

Issue 2:
#0  0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
#1  0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
#2  0x0000000000ec4e1d in ExceptionalCondition
(conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
errorType=0x10ea284 "FailedAssertion",
    fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
#3  0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:3052
#4  0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2bafb08) at reorderbuffer.c:1318
#5  0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
txn=0x2b9d778) at reorderbuffer.c:1257
#6  0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
oldestRunningXid=3835) at reorderbuffer.c:1973
#7  0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0,
buf=0x7ffcbc74cc00) at decode.c:332
#8  0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0,
record=0x2b67990) at decode.c:121
#9  0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845

These failures come randomly.
I'm not able to reproduce this issue with simple test case.
I have attached the test case which I used to test.
I will further try to find a scenario which could reproduce consistently.
Posting it so that it can help someone in identifying the problem
parallelly through code review by experts.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Wed, Oct 30, 2019 at 9:38 AM vignesh C <vignesh21@gmail.com> wrote:
>
I have noticed one more problem in the logic of setting the logical
decoding work mem from the CREATE SUBSCRIPTION command.  Suppose we
don't give work_mem in the subscription command; then it sends a
garbage value to the walsender, and the walsender overwrites its value
with that garbage value.  After investigating a bit, I have found the
reason for this.

@@ -406,6 +406,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
  appendStringInfo(&cmd, "proto_version '%u'",
  options->proto.logical.proto_version);

+ appendStringInfo(&cmd, ", work_mem '%d'",
+ options->proto.logical.work_mem);

I think the problem is that we are unconditionally sending the work_mem
as part of the CREATE REPLICATION SLOT command, without checking whether
it's valid or not.
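
Just to illustrate the kind of guard being suggested inside
libpqrcv_startstreaming (a rough sketch against the hunk quoted above,
not the actual fix):

    /* sketch: forward work_mem only when the subscription actually set it */
    if (options->proto.logical.work_mem > 0)
        appendStringInfo(&cmd, ", work_mem '%d'",
                         options->proto.logical.work_mem);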

--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -71,6 +71,7 @@ GetSubscription(Oid subid, bool missing_ok)
  sub->name = pstrdup(NameStr(subform->subname));
  sub->owner = subform->subowner;
  sub->enabled = subform->subenabled;
+ sub->workmem = subform->subworkmem;

Another problem is that there is no handling if the subform->subworkmem is NULL.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Kuntal Ghosh
Дата:
Hello hackers,

I've done some performance testing of this feature. Following is my
test case (taken from an earlier thread):

postgres=# CREATE TABLE large_test (num1 bigint, num2 double
precision, num3 double precision);
postgres=# \timing on
postgres=# EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1,
num2, num3) SELECT round(random()*10), random(), random()*142 FROM
generate_series(1, 1000000) s(i);

I've kept the publisher and subscriber in two different systems.

HEAD:
With 1000000 tuples,
Execution Time: 2576.821 ms, Time: 9632.158 ms (00:09.632), Spill count: 245
With 10000000 tuples (10 times more),
Execution Time: 30359.509 ms, Time: 95261.024 ms (01:35.261), Spill count: 2442

With the memory accounting patch, following are the performance results:
With 100000 tuples,
logical_decoding_work_mem=64kB, Execution Time: 2414.371 ms, Time:
9648.223 ms (00:09.648), Spill count: 2315
logical_decoding_work_mem=64MB, Execution Time: 2477.830 ms, Time:
9895.161 ms (00:09.895), Spill count 3
With 1000000 tuples (10 times more),
logical_decoding_work_mem=64kB, Execution Time: 38259.227 ms, Time:
105761.978 ms (01:45.762), Spill count: 23149
logical_decoding_work_mem=64MB, Execution Time: 24624.639 ms, Time:
89985.342 ms (01:29.985), Spill count: 23

With logical decoding of in-progress transactions patch and with
streaming on, following are the performance results:
With 100000 tuples,
logical_decoding_work_mem=64kB, Execution Time: 2674.034 ms, Time:
20779.601 ms (00:20.780)
logical_decoding_work_mem=64MB, Execution Time: 2062.404 ms, Time:
9559.953 ms (00:09.560)
With 1000000 tuples (10 times more),
logical_decoding_work_mem=64kB, Execution Time: 26949.588 ms, Time:
196261.892 ms (03:16.262)
logical_decoding_work_mem=64MB, Execution Time: 27084.403 ms, Time:
90079.286 ms (01:30.079)
-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Nov 4, 2019 at 2:43 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
> Hello hackers,
>
> I've done some performance testing of this feature. Following is my
> test case (taken from an earlier thread):
>
> postgres=# CREATE TABLE large_test (num1 bigint, num2 double
> precision, num3 double precision);
> postgres=# \timing on
> postgres=# EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1,
> num2, num3) SELECT round(random()*10), random(), random()*142 FROM
> generate_series(1, 1000000) s(i);
>
> I've kept the publisher and subscriber in two different system.
>
> HEAD:
> With 1000000 tuples,
> Execution Time: 2576.821 ms, Time: 9632.158 ms (00:09.632), Spill count: 245
> With 10000000 tuples (10 times more),
> Execution Time: 30359.509 ms, Time: 95261.024 ms (01:35.261), Spill count: 2442
>
> With the memory accounting patch, following are the performance results:
> With 100000 tuples,
> logical_decoding_work_mem=64kB, Execution Time: 2414.371 ms, Time:
> 9648.223 ms (00:09.648), Spill count: 2315
> logical_decoding_work_mem=64MB, Execution Time: 2477.830 ms, Time:
> 9895.161 ms (00:09.895), Spill count 3
> With 1000000 tuples (10 times more),
> logical_decoding_work_mem=64kB, Execution Time: 38259.227 ms, Time:
> 105761.978 ms (01:45.762), Spill count: 23149
> logical_decoding_work_mem=64MB, Execution Time: 24624.639 ms, Time:
> 89985.342 ms (01:29.985), Spill count: 23
>
> With logical decoding of in-progress transactions patch and with
> streaming on, following are the performance results:
> With 100000 tuples,
> logical_decoding_work_mem=64kB, Execution Time: 2674.034 ms, Time:
> 20779.601 ms (00:20.780)
> logical_decoding_work_mem=64MB, Execution Time: 2062.404 ms, Time:
> 9559.953 ms (00:09.560)
> With 1000000 tuples (10 times more),
> logical_decoding_work_mem=64kB, Execution Time: 26949.588 ms, Time:
> 196261.892 ms (03:16.262)
> logical_decoding_work_mem=64MB, Execution Time: 27084.403 ms, Time:
> 90079.286 ms (01:30.079)
So your results show that with "streaming on", performance is
degrading?  By any chance, did you try to see where the bottleneck is?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Kuntal Ghosh
Дата:
On Mon, Nov 4, 2019 at 3:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> So your result shows that with "streaming on", performance is
> degrading?  By any chance did you try to see where is the bottleneck?
>
Right. But, as we increase the logical_decoding_work_mem, the
performance improves. I've not analyzed the bottleneck yet. I'm
looking into the same.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
vignesh C
Дата:
On Thu, Oct 24, 2019 at 7:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Oct 22, 2019 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have merged bugs_and_review_comments_fix.patch changes to 0001 and 0002.
> >
>
> I was wondering whether we have checked the code coverage after this
> patch?  Previously, the existing tests seem to be covering most parts
> of the function ReorderBufferSerializeTXN [1].  After this patch, the
> timing to call ReorderBufferSerializeTXN will change, so that might
> impact the testing of the same.  If it is already covered, then I
> would like to either add a new test or extend existing test with the
> help of new spill counters.  If it is not getting covered, then we
> need to think of extending the existing test or write a new test to
> cover the function ReorderBufferSerializeTXN.
>
I have run the tests with coverage and found that
ReorderBufferSerializeTXN is not being hit.
The reason it is not being hit is because of the following check in
ReorderBufferCheckMemoryLimit:
    /* bail out if we haven't exceeded the memory limit */
    if (rb->size < logical_decoding_work_mem * 1024L)
        return;
Previously the tests from contrib/test_decoding could hit
ReorderBufferSerializeTXN function.
I'm checking if we can modify the test or add new test to hit
ReorderBufferSerializeTXN function.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Wed, Oct 30, 2019 at 9:38 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, Oct 22, 2019 at 10:52 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> > I think the patch should do the simplest thing possible, i.e. what it
> > does today. Otherwise we'll never get it committed.
> >
> I found a couple of crashes while reviewing and testing flushing of
> open transaction data:
>

Thanks for doing these tests.  However, I don't think these issues are
anyway related to this patch.  It seems to be base code issues
manifested by this patch.  See my analysis below.

> Issue 1:
> #0  0x00007f22c5722337 in raise () from /lib64/libc.so.6
> #1  0x00007f22c5723a28 in abort () from /lib64/libc.so.6
> #2  0x0000000000ec5390 in ExceptionalCondition
> (conditionName=0x10ea814 "!dlist_is_empty(head)", errorType=0x10ea804
> "FailedAssertion",
>     fileName=0x10ea7e0 "../../../../src/include/lib/ilist.h",
> lineNumber=458) at assert.c:54
> #3  0x0000000000b4fb91 in dlist_tail_element_off (head=0x19e4db8,
> off=64) at ../../../../src/include/lib/ilist.h:458
> #4  0x0000000000b546d0 in ReorderBufferAbortOld (rb=0x191b6b0,
> oldestRunningXid=3834) at reorderbuffer.c:1966
> #5  0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x19af990,
> buf=0x7ffcbc26dc50) at decode.c:332
>

This seems to be a problem in the base code: we abort immediately
after serializing the changes, and in that case the changes list
will be empty.  I think you can try to reproduce it via the debugger,
or by hacking the code such that it serializes after every change;
then, if you abort after one change, it should hit this problem.

>
> Issue 2:
> #0  0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
> #1  0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
> #2  0x0000000000ec4e1d in ExceptionalCondition
> (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
> errorType=0x10ea284 "FailedAssertion",
>     fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
> #3  0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
> txn=0x2bafb08) at reorderbuffer.c:3052
> > #4  0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> txn=0x2bafb08) at reorderbuffer.c:1318
> #5  0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> txn=0x2b9d778) at reorderbuffer.c:1257
> #6  0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
> oldestRunningXid=3835) at reorderbuffer.c:1973
>

This again seems to be a problem with the base code, as we don't update
the final_lsn for subtransactions during ReorderBufferAbortOld.  This
can also be reproduced with some hacking in the code or via the debugger
in a similar way as explained for the previous problem, with the
difference that a subtransaction must be involved in this case.

> #7  0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0,
> buf=0x7ffcbc74cc00) at decode.c:332
> #8  0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0,
> record=0x2b67990) at decode.c:121
> #9  0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845
>
> These failures come randomly.
> I'm not able to reproduce this issue with simple test case.

Yeah, it appears to be difficult to reproduce unless you hack the code
to serialize every change or use a debugger to forcibly flush the
changes every time.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Nov 4, 2019 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Oct 30, 2019 at 9:38 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Tue, Oct 22, 2019 at 10:52 PM Tomas Vondra
> > <tomas.vondra@2ndquadrant.com> wrote:
> > >
> > > I think the patch should do the simplest thing possible, i.e. what it
> > > does today. Otherwise we'll never get it committed.
> > >
> > I found a couple of crashes while reviewing and testing flushing of
> > open transaction data:
> >
>
> Thanks for doing these tests.  However, I don't think these issues are
> anyway related to this patch.  It seems to be base code issues
> manifested by this patch.  See my analysis below.
>
> > Issue 1:
> > #0  0x00007f22c5722337 in raise () from /lib64/libc.so.6
> > #1  0x00007f22c5723a28 in abort () from /lib64/libc.so.6
> > #2  0x0000000000ec5390 in ExceptionalCondition
> > (conditionName=0x10ea814 "!dlist_is_empty(head)", errorType=0x10ea804
> > "FailedAssertion",
> >     fileName=0x10ea7e0 "../../../../src/include/lib/ilist.h",
> > lineNumber=458) at assert.c:54
> > #3  0x0000000000b4fb91 in dlist_tail_element_off (head=0x19e4db8,
> > off=64) at ../../../../src/include/lib/ilist.h:458
> > #4  0x0000000000b546d0 in ReorderBufferAbortOld (rb=0x191b6b0,
> > oldestRunningXid=3834) at reorderbuffer.c:1966
> > #5  0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x19af990,
> > buf=0x7ffcbc26dc50) at decode.c:332
> >
>
> This seems to be the problem of base code where we abort immediately
> after serializing the changes because in that case, the changes list
> will be empty.  I think you can try to reproduce it via the debugger
> or by hacking the code such that it serializes after every change and
> then if you abort after one change, it should hit this problem.
>
I think you might need to kill the server after all the changes are
serialized; otherwise a normal abort will hit ReorderBufferAbort, which
will remove your ReorderBufferTXN entry, and you will never hit this
case.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
vignesh C
Дата:
On Mon, Nov 4, 2019 at 3:46 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Thu, Oct 24, 2019 at 7:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Oct 22, 2019 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > I have merged bugs_and_review_comments_fix.patch changes to 0001 and 0002.
> > >
> >
> > I was wondering whether we have checked the code coverage after this
> > patch?  Previously, the existing tests seem to be covering most parts
> > of the function ReorderBufferSerializeTXN [1].  After this patch, the
> > timing to call ReorderBufferSerializeTXN will change, so that might
> > impact the testing of the same.  If it is already covered, then I
> > would like to either add a new test or extend existing test with the
> > help of new spill counters.  If it is not getting covered, then we
> > need to think of extending the existing test or write a new test to
> > cover the function ReorderBufferSerializeTXN.
> >
> I have run the tests with coverage and found that
> ReorderBufferSerializeTXN is not being hit.
> The reason it is not being hit is because of the following check in
> ReorderBufferCheckMemoryLimit:
>     /* bail out if we haven't exceeded the memory limit */
>     if (rb->size < logical_decoding_work_mem * 1024L)
>         return;
> Previously the tests from contrib/test_decoding could hit
> ReorderBufferSerializeTXN function.
> I'm checking if we can modify the test or add new test to hit
> ReorderBufferSerializeTXN function.

I have made one change to the configuration file in the
contrib/test_decoding directory; with that, the coverage seems to be
fine. I have seen that the coverage is almost the same as before
applying the patch. I have attached the test change and the coverage
report for reference. The coverage report includes the core logical work
memory files for the base code and after applying the
0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer and
0002-Track-statistics-for-spilling patches.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
vignesh C
Дата:
On Mon, Nov 4, 2019 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Oct 30, 2019 at 9:38 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Tue, Oct 22, 2019 at 10:52 PM Tomas Vondra
> > <tomas.vondra@2ndquadrant.com> wrote:
> > >
> > > I think the patch should do the simplest thing possible, i.e. what it
> > > does today. Otherwise we'll never get it committed.
> > >
> > I found a couple of crashes while reviewing and testing flushing of
> > open transaction data:
> >
>
> Thanks for doing these tests.  However, I don't think these issues are
> anyway related to this patch.  It seems to be base code issues
> manifested by this patch.  See my analysis below.
>
> > Issue 1:
> > #0  0x00007f22c5722337 in raise () from /lib64/libc.so.6
> > #1  0x00007f22c5723a28 in abort () from /lib64/libc.so.6
> > #2  0x0000000000ec5390 in ExceptionalCondition
> > (conditionName=0x10ea814 "!dlist_is_empty(head)", errorType=0x10ea804
> > "FailedAssertion",
> >     fileName=0x10ea7e0 "../../../../src/include/lib/ilist.h",
> > lineNumber=458) at assert.c:54
> > #3  0x0000000000b4fb91 in dlist_tail_element_off (head=0x19e4db8,
> > off=64) at ../../../../src/include/lib/ilist.h:458
> > #4  0x0000000000b546d0 in ReorderBufferAbortOld (rb=0x191b6b0,
> > oldestRunningXid=3834) at reorderbuffer.c:1966
> > #5  0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x19af990,
> > buf=0x7ffcbc26dc50) at decode.c:332
> >
>
> This seems to be the problem of base code where we abort immediately
> after serializing the changes because in that case, the changes list
> will be empty.  I think you can try to reproduce it via the debugger
> or by hacking the code such that it serializes after every change and
> then if you abort after one change, it should hit this problem.
>
> >
> > Issue 2:
> > #0  0x00007f1d7ddc4337 in raise () from /lib64/libc.so.6
> > #1  0x00007f1d7ddc5a28 in abort () from /lib64/libc.so.6
> > #2  0x0000000000ec4e1d in ExceptionalCondition
> > (conditionName=0x10ead30 "txn->final_lsn != InvalidXLogRecPtr",
> > errorType=0x10ea284 "FailedAssertion",
> >     fileName=0x10ea2d0 "reorderbuffer.c", lineNumber=3052) at assert.c:54
> > #3  0x0000000000b577e0 in ReorderBufferRestoreCleanup (rb=0x2ae36b0,
> > txn=0x2bafb08) at reorderbuffer.c:3052
> > #4  0x0000000000b52b1c in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> > txn=0x2bafb08) at reorderbuffer.c:1318
> > #5  0x0000000000b5279d in ReorderBufferCleanupTXN (rb=0x2ae36b0,
> > txn=0x2b9d778) at reorderbuffer.c:1257
> > #6  0x0000000000b5475c in ReorderBufferAbortOld (rb=0x2ae36b0,
> > oldestRunningXid=3835) at reorderbuffer.c:1973
> >
>
> This seems to be again the problem with base code as we don't update
> the final_lsn for subtransactions during ReorderBufferAbortOld.  This
> can also be reproduced with some hacking in code or via debugger in a
> similar way as explained for the previous problem but with a
> difference that there must be subtransaction involved in this case.
>
> > #7  0x0000000000b3ca03 in DecodeStandbyOp (ctx=0x2b676d0,
> > buf=0x7ffcbc74cc00) at decode.c:332
> > #8  0x0000000000b3c208 in LogicalDecodingProcessRecord (ctx=0x2b676d0,
> > record=0x2b67990) at decode.c:121
> > #9  0x0000000000b70b2b in XLogSendLogical () at walsender.c:2845
> >
> > These failures come randomly.
> > I'm not able to reproduce this issue with simple test case.
>
> Yeah, it appears to be difficult to reproduce unless you hack the code
> to serialize every change or use debugger to forcefully flush the
> changes every time.
>

Thanks Amit for your analysis. I was able to reproduce the above issue
consistently by making some code changes and with the help of a
debugger. I made one change so that it flushes every time instead of
flushing only after the buffer size exceeds logical_decoding_work_mem,
attached to one of the transactions, and called abort. When the server
restarts after the abort, this problem occurs consistently. I could
reproduce the issue with the base code as well, so it seems this is not
an issue of the 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer
patch but exists in the base code. I will post the issue on hackers with
details.
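
The change was essentially to disable the early bail-out in
ReorderBufferCheckMemoryLimit, roughly like this (a sketch of the
debugging hack only, never meant for commit), so that every change gets
serialized:

    /* force serialization on every change, for reproducing the crash only */
#if 0
    /* bail out if we haven't exceeded the memory limit */
    if (rb->size < logical_decoding_work_mem * 1024L)
        return;
#endif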

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Wed, Nov 6, 2019 at 11:33 AM vignesh C <vignesh21@gmail.com> wrote:
>
> I have made one change to the configuration file in
> contrib/test_decoding directory, with that the coverage seems to be
> fine. I have seen that the coverage is almost like the code before
> applying the patch. I have attached the test change and the coverage
> report for reference. Coverage report includes the core logical work
> memory files for base code and by applying
> 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer and
> 0002-Track-statistics-for-spilling patches.
>

Thanks,  I have incorporated your test changes and modified the two
patches.  Please see attached.

Changes:
---------------
1. In guc.c, we should include reorderbuffer.h, not logical.h, as
logical_decoding_work_mem is now defined in reorderbuffer.h.

2.
+ *   To limit the amount of memory used by decoded changes, we track memory
+ *   used at the reorder buffer level (i.e. total amount of memory), and for
+ *   each toplevel transaction. When the total amount of used memory exceeds
+ *   the limit, the toplevel transaction consuming the most memory is then
+ *   serialized to disk.

In the above comments, removed 'toplevel' as we track memory usage for
both toplevel and subtransactions.

3. There were still a few mentions of streaming which I have removed.

4. In the docs, the type for stats spill_* was integer whereas it
should be bigint.

5.
+UpdateSpillStats(LogicalDecodingContext *ctx)
+{
+ ReorderBuffer *rb = ctx->reorder;
+
+ SpinLockAcquire(&MyWalSnd->mutex);
+
+ MyWalSnd->spillTxns = rb->spillTxns;
+ MyWalSnd->spillCount = rb->spillCount;
+ MyWalSnd->spillBytes = rb->spillBytes;
+
+ elog(WARNING, "UpdateSpillStats: updating stats %p %ld %ld %ld",
+ rb, rb->spillTxns, rb->spillCount, rb->spillBytes);

Changed the above elog to DEBUG1 as otherwise it was getting printed
very frequently.  I think we can make it DEBUG2 if we want.

6. There was an extra space in rules.out due to which test was
failing.  I have fixed it.

What do you think?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Thu, Nov 7, 2019 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Nov 6, 2019 at 11:33 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > I have made one change to the configuration file in
> > contrib/test_decoding directory, with that the coverage seems to be
> > fine. I have seen that the coverage is almost like the code before
> > applying the patch. I have attached the test change and the coverage
> > report for reference. Coverage report includes the core logical work
> > memory files for base code and by applying
> > 0001-Add-logical_decoding_work_mem-to-limit-ReorderBuffer and
> > 0002-Track-statistics-for-spilling patches.
> >
>
> Thanks,  I have incorporated your test changes and modified the two
> patches.  Please see attached.
>
> Changes:
> ---------------
> 1. In guc.c, we should include reorderbuffer.h, not logical.h as we
> define logical_decoding_work_mem in earlier.
Yeah Right.
>
> 2.
> + *   To limit the amount of memory used by decoded changes, we track memory
> + *   used at the reorder buffer level (i.e. total amount of memory), and for
> + *   each toplevel transaction. When the total amount of used memory exceeds
> + *   the limit, the toplevel transaction consuming the most memory is then
> + *   serialized to disk.
>
> In the above comments, removed 'toplevel' as we track memory usage for
> both toplevel and subtransactions.
Correct.
>
> 3. There were still a few mentions of streaming which I have removed.
>
ok
> 4. In the docs, the type for stats spill_* was integer whereas it
> should be bigint.
ok
>
> 5.
> +UpdateSpillStats(LogicalDecodingContext *ctx)
> +{
> + ReorderBuffer *rb = ctx->reorder;
> +
> + SpinLockAcquire(&MyWalSnd->mutex);
> +
> + MyWalSnd->spillTxns = rb->spillTxns;
> + MyWalSnd->spillCount = rb->spillCount;
> + MyWalSnd->spillBytes = rb->spillBytes;
> +
> + elog(WARNING, "UpdateSpillStats: updating stats %p %ld %ld %ld",
> + rb, rb->spillTxns, rb->spillCount, rb->spillBytes);
>
> Changed the above elog to DEBUG1 as otherwise it was getting printed
> very frequently.  I think we can make it DEBUG2 if we want.
Yeah, it should not be WARNING.
>
> 6. There was an extra space in rules.out due to which test was
> failing.  I have fixed it.
My bad.  I introduced it while separating out the changes for the spilling.

> What do you think?
I have reviewed your changes and looks fine to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Thu, Nov 7, 2019 at 3:50 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Nov 7, 2019 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > What do you think?
> I have reviewed your changes and looks fine to me.
>

Okay, thanks.  I am also happy with the two patches I have posted in
my last email [1].

Tomas, would you like to take a look at those patches and commit them
if you are happy with them, or would you like me to do it?

Some notes before commit:
--------------------------------------
1.
Commit message need to be changed for the first patch
-------------------------------------------------------------------------
A.
> The memory limit is defined by a new logical_decoding_work_mem GUC, so for example we can do this

    SET logical_decoding_work_mem = '128kB'

> to trigger very aggressive streaming. The minimum value is 64kB.

I think this patch doesn't contain streaming, so we either need to
reword it or remove it.

B.
> The logical_decoding_work_mem may be set either in postgresql.conf, in which case it serves as the default for all
> publishers on that instance, or when creating the
> subscription, using a work_mem parameter in the WITH clause (specifies number of kilobytes).

We need to reword this as we have decided to remove the setting from
the subscription side as of now.

2. I think we can change the message level in UpdateSpillStats() to DEBUG2.

3. I think we need catversion bump for the second patch.

4. I think we can combine both patches and commit as one patch, but it
is okay to commit them separately as well.


[1] - https://www.postgresql.org/message-id/CAA4eK1Kdmi6VVguKEHV6Ho2isCPVFdQtt0WLsK10fiuE59_0Yw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Alexey Kondratov
Дата:
On 04.11.2019 13:05, Kuntal Ghosh wrote:
> On Mon, Nov 4, 2019 at 3:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> So your result shows that with "streaming on", performance is
>> degrading?  By any chance did you try to see where is the bottleneck?
>>
> Right. But, as we increase the logical_decoding_work_mem, the
> performance improves. I've not analyzed the bottleneck yet. I'm
> looking into the same.

My guess is that 64 kB is just too small a value. In the table schema used 
for the tests every row takes at least 24 bytes for storing column values. 
Thus, with this logical_decoding_work_mem value the limit should be hit 
after about 2500+ rows, i.e. about 400 times during a transaction of 
1000000 rows.
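
As a quick sanity check of that estimate, a toy calculation (not patch
code):

#include <stdio.h>

int
main(void)
{
    const long  work_mem = 64 * 1024;   /* logical_decoding_work_mem = 64kB */
    const long  row_size = 24;          /* minimum column payload per row */
    const long  xact_rows = 1000000;
    const long  rows_per_limit = work_mem / row_size;   /* ~2730 rows */

    printf("limit hit after ~%ld rows, i.e. at least ~%ld times per xact\n",
           rows_per_limit, xact_rows / rows_per_limit);
    return 0;
}

The real count is somewhat higher, since each change carries additional
per-change overhead on top of the 24 bytes of column data.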

That is just too frequent, while ReorderBufferStreamTXN includes a whole 
bunch of logic, e.g. it always starts an internal transaction:

/*
  * Decoding needs access to syscaches et al., which in turn use
  * heavyweight locks and such. Thus we need to have enough state around to
  * keep track of those.  The easiest way is to simply use a transaction
  * internally.  That also allows us to easily enforce that nothing writes
  * to the database by checking for xid assignments. ...
  */

Also, it issues separate stream_start/stop messages around each streamed 
transaction chunk. So if streaming starts and stops too frequently, it 
adds additional overhead and may even interfere with the current 
in-progress transaction.

If I get it correctly, then this is rather expected with too small values 
of logical_decoding_work_mem. It could probably be optimized, but I am not 
sure that it is worth doing right now.


Regards

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Kuntal Ghosh
Дата:
On Tue, Nov 12, 2019 at 4:12 PM Alexey Kondratov
<a.kondratov@postgrespro.ru> wrote:
>
> On 04.11.2019 13:05, Kuntal Ghosh wrote:
> > On Mon, Nov 4, 2019 at 3:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >> So your result shows that with "streaming on", performance is
> >> degrading?  By any chance did you try to see where is the bottleneck?
> >>
> > Right. But, as we increase the logical_decoding_work_mem, the
> > performance improves. I've not analyzed the bottleneck yet. I'm
> > looking into the same.
>
> My guess is that 64 kB is just too small value. In the table schema used
> for tests every rows takes at least 24 bytes for storing column values.
> Thus, with this logical_decoding_work_mem value the limit should be hit
> after about 2500+ rows, or about 400 times during transaction of 1000000
> rows size.
>
> It is just too frequent, while ReorderBufferStreamTXN includes a whole
> bunch of logic, e.g. it always starts internal transaction:
>
> /*
>   * Decoding needs access to syscaches et al., which in turn use
>   * heavyweight locks and such. Thus we need to have enough state around to
>   * keep track of those.  The easiest way is to simply use a transaction
>   * internally.  That also allows us to easily enforce that nothing writes
>   * to the database by checking for xid assignments. ...
>   */
>
> Also it issues separated stream_start/stop messages around each streamed
> transaction chunk. So if streaming starts and stops too frequently it
> adds additional overhead and may even interfere with current in-progress
> transaction.
>
Yeah, I've also found the same. With each stream_start/stop message, it
writes 1 byte of checksum and 4 bytes for the number of sub-transactions,
which increases the write amplification significantly.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>

As I mentioned a few days back, the first patch in this series
is ready to go [1] (I am hoping Tomas will pick it up), so I have
started the review of the other patches.

Review/Questions on 0002-Immediately-WAL-log-assignments.patch
-------------------------------------------------------------------------------------------------
1. This patch adds the top_xid to WAL the first time WAL for
a subtransaction XID is written, so that the changes of an
in-progress transaction can be decoded correctly.  This patch also
removes logging and applying of WAL for XLOG_XACT_ASSIGNMENT, which
might have some effect.  On replay of that record, we prune
KnownAssignedXids to prevent overflow of that array.  See comments in
procarray.c (KnownAssignedTransactionIds sub-module).  Can you please
explain how, after removing the WAL for XLOG_XACT_ASSIGNMENT, we will
handle that, or am I missing something and there is no impact?

2.
+#define XLOG_INCLUDE_INVALS 0x08 /* include invalidations */

This doesn't seem to be used in this patch.

[1] - https://www.postgresql.org/message-id/CAA4eK1JM0%3DRwODZQrn8DTQ3dbcb9xwKDdHCmVOryAk_xoKf9Nw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Wed, Nov 13, 2019 at 5:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
>
> As mentioned by me a few days back that the first patch in this series
> is ready to go [1] (I am hoping Tomas will pick it up), so I have
> started the review of other patches
>
> Review/Questions on 0002-Immediately-WAL-log-assignments.patch
> -------------------------------------------------------------------------------------------------
> 1. This patch adds the top_xid in WAL whenever the first time WAL for
> a subtransaction XID is written to correctly decode the changes of
> in-progress transaction.  This patch also removes logging and applying
> WAL for XLOG_XACT_ASSIGNMENT which might have some effect.  As replay
> of that, it prunes KnownAssignedXids to prevent overflow of that
> array.  See comments in procarray.c (KnownAssignedTransactionIds
> sub-module).  Can you please explain how after removing the WAL for
> XLOG_XACT_ASSIGNMENT, we will handle that or I am missing something
> and there is no impact of same?

It seems like a problem to me as well.  One option could be that,
since we now add the top transaction id in the first WAL record of the
subtransaction, we could directly update pg_subtrans and avoid adding
the subtransaction id to KnownAssignedXids, marking it as
lastOverflowedXid instead.  But I don't think we should go in that
direction; otherwise it will impact the performance of visibility
checks on the hot standby.  Let's see what Tomas has in mind.

>
> 2.
> +#define XLOG_INCLUDE_INVALS 0x08 /* include invalidations */
>
> This doesn't seem to be used in this patch.

>
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Thu, Nov 14, 2019 at 9:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Nov 13, 2019 at 5:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> >
> > As mentioned by me a few days back that the first patch in this series
> > is ready to go [1] (I am hoping Tomas will pick it up), so I have
> > started the review of other patches
> >
> > Review/Questions on 0002-Immediately-WAL-log-assignments.patch
> > -------------------------------------------------------------------------------------------------
> > 1. This patch adds the top_xid in WAL whenever the first time WAL for
> > a subtransaction XID is written to correctly decode the changes of
> > in-progress transaction.  This patch also removes logging and applying
> > WAL for XLOG_XACT_ASSIGNMENT which might have some effect.  As replay
> > of that, it prunes KnownAssignedXids to prevent overflow of that
> > array.  See comments in procarray.c (KnownAssignedTransactionIds
> > sub-module).  Can you please explain how after removing the WAL for
> > XLOG_XACT_ASSIGNMENT, we will handle that or I am missing something
> > and there is no impact of same?
>
> It seems like a problem to me as well.   One option could be that
> since now we are adding the top transaction id in the first WAL of the
> subtransaction we can directly update the pg_subtrans and avoid adding
> sub transaction id in the KnownAssignedXids and mark it as
> lastOverflowedXid.
>

Hmm, I am not sure if we can do that easily because I think in
RecordKnownAssignedTransactionIds, we add those based on the gap via
KnownAssignedXidsAdd and only remove them later while applying WAL for
XLOG_XACT_ASSIGNMENT.  I think if we really want to go in this
direction then for each WAL record we need to check if it has
XLR_BLOCK_ID_TOPLEVEL_XID set and then call function
ProcArrayApplyXidAssignment() with the required information.  I think
this line of attack has WAL overhead both on master whenever
subtransactions are involved and also on hot-standby for doing the
work for each subtransaction separately.  The WAL apply needs to
acquire and release PROCArrayLock in exclusive mode for each
subtransaction whereas now it does it once for
PGPROC_MAX_CACHED_SUBXIDS number of subtransactions which can conflict
with queries running on standby.

The other idea could be that we keep the current XLOG_XACT_ASSIGNMENT
mechanism (WAL logging and apply of same on hot-standby) as it is and
additionally log top_xid the first time when WAL is written for a
subtransaction only when wal_level >= WAL_LEVEL_LOGICAL.  Then use the
same for logical decoding.  The advantage of this approach is that we
will incur the overhead of additional transactionid only when required
especially not with default server configuration.
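A minimal sketch of the gating condition that second idea implies (illustrative only; the "not yet logged for this subxact" check and the register helper are hypothetical names, not existing functions):

	/*
	 * Attach the top-level xid to a subtransaction's first WAL record only
	 * when logical decoding can need it (wal_level >= logical).
	 */
	if (XLogLogicalInfoActive() &&
		IsSubTransaction() &&
		SubXactTopXidNotYetLogged())	/* hypothetical check */
	{
		TransactionId top_xid = GetTopTransactionIdIfAny();

		/* include top_xid with the record, e.g. under XLR_BLOCK_ID_TOPLEVEL_XID */
		XLogIncludeTopXid(top_xid);		/* hypothetical helper */
	}

With the default wal_level the condition is never true, so the extra bytes are paid only by setups that actually use logical decoding.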

Thoughts?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Nov 14, 2019 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Nov 14, 2019 at 9:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Nov 13, 2019 at 5:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Oct 3, 2019 at 1:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > >
> > > As mentioned by me a few days back that the first patch in this series
> > > is ready to go [1] (I am hoping Tomas will pick it up), so I have
> > > started the review of other patches
> > >
> > > Review/Questions on 0002-Immediately-WAL-log-assignments.patch
> > > -------------------------------------------------------------------------------------------------
> > > 1. This patch adds the top_xid in WAL whenever the first time WAL for
> > > a subtransaction XID is written to correctly decode the changes of
> > > in-progress transaction.  This patch also removes logging and applying
> > > WAL for XLOG_XACT_ASSIGNMENT which might have some effect.  As replay
> > > of that, it prunes KnownAssignedXids to prevent overflow of that
> > > array.  See comments in procarray.c (KnownAssignedTransactionIds
> > > sub-module).  Can you please explain how after removing the WAL for
> > > XLOG_XACT_ASSIGNMENT, we will handle that or I am missing something
> > > and there is no impact of same?
> >
> > It seems like a problem to me as well.   One option could be that
> > since now we are adding the top transaction id in the first WAL of the
> > subtransaction we can directly update the pg_subtrans and avoid adding
> > sub transaction id in the KnownAssignedXids and mark it as
> > lastOverflowedXid.
> >
>
> Hmm, I am not sure if we can do that easily because I think in
> RecordKnownAssignedTransactionIds, we add those based on the gap via
> KnownAssignedXidsAdd and only remove them later while applying WAL for
> XLOG_XACT_ASSIGNMENT.  I think if we really want to go in this
> direction then for each WAL record we need to check if it has
> XLR_BLOCK_ID_TOPLEVEL_XID set and then call function
> ProcArrayApplyXidAssignment() with the required information.  I think
> this line of attack has WAL overhead both on master whenever
> subtransactions are involved and also on hot-standby for doing the
> work for each subtransaction separately.  The WAL apply needs to
> acquire and release PROCArrayLock in exclusive mode for each
> subtransaction whereas now it does it once for
> PGPROC_MAX_CACHED_SUBXIDS number of subtransactions which can conflict
> with queries running on standby.
Right
>
> The other idea could be that we keep the current XLOG_XACT_ASSIGNMENT
> mechanism (WAL logging and apply of same on hot-standby) as it is and
> additionally log top_xid the first time when WAL is written for a
> subtransaction only when wal_level >= WAL_LEVEL_LOGICAL.  Then use the
> same for logical decoding.  The advantage of this approach is that we
> will incur the overhead of additional transactionid only when required
> especially not with default server configuration.
>
> Thoughts?
The idea seems reasonable to me.

Apart from this, I have another question in
0003-Issue-individual-invalidations-with-wal_level-logical.patch

@@ -543,6 +588,18 @@ RegisterSnapshotInvalidation(Oid dbId, Oid relId)
 {
  AddSnapshotInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
     dbId, relId);
+
+ /* Issue an invalidation WAL record (when wal_level=logical) */
+ if (XLogLogicalInfoActive())
+ {
+ SharedInvalidationMessage msg;
+
+ msg.sn.id = SHAREDINVALSNAPSHOT_ID;
+ msg.sn.dbId = dbId;
+ msg.sn.relId = relId;
+
+ LogLogicalInvalidations(1, &msg, false);
+ }
 }

I am not sure why do we need to explicitly WAL log the snapshot
invalidation? because this is logged for invalidating the catalog
snapshot and for logical decoding we use HistoricSnapshot, not the
catalog snapshot.  I might be missing something?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Nov 14, 2019 at 3:40 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>
> Apart from this, I have another question in
> 0003-Issue-individual-invalidations-with-wal_level-logical.patch
>
> @@ -543,6 +588,18 @@ RegisterSnapshotInvalidation(Oid dbId, Oid relId)
>  {
>   AddSnapshotInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
>      dbId, relId);
> +
> + /* Issue an invalidation WAL record (when wal_level=logical) */
> + if (XLogLogicalInfoActive())
> + {
> + SharedInvalidationMessage msg;
> +
> + msg.sn.id = SHAREDINVALSNAPSHOT_ID;
> + msg.sn.dbId = dbId;
> + msg.sn.relId = relId;
> +
> + LogLogicalInvalidations(1, &msg, false);
> + }
>  }
>
> I am not sure why do we need to explicitly WAL log the snapshot
> invalidation? because this is logged for invalidating the catalog
> snapshot and for logical decoding we use HistoricSnapshot, not the
> catalog snapshot.
>

I think it has been logged because without this patch as well we log
all the invalidation messages at commit time and process them during
decoding.  However, I agree that this particular invalidation message
is not required for logical decoding for the reason you mentioned.  I
think as we are explicitly logging invalidations, so it is better to
avoid this if we can.

Few other comments on this patch:
1.
+ case REORDER_BUFFER_CHANGE_INVALIDATION:
+
+ /*
+ * Execute the invalidation message locally.
+ *
+ * XXX Do we need to care about relcacheInitFileInval and
+ * the other fields added to ReorderBufferChange, or just
+ * about the message itself?
+ */
+ LocalExecuteInvalidationMessage(&change->data.inval.msg);
+ break;

Here, why are we executing messages individually?  Can't we just
follow what we do in DecodeCommit which is to record the invalidations
in ReorderBufferTXN as we encounter them and then allow them to
execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.  Is there a
reason why we don't do ReorderBufferXidSetCatalogChanges when we
receive any invalidation message?

2.
@@ -3025,8 +3073,8 @@ ReorderBufferRestoreChange(ReorderBuffer *rb,
ReorderBufferTXN *txn,
  * although we don't check the memory limit when restoring the changes in
  * this branch (we only do that when initially queueing the changes after
  * decoding), because we will release the changes later, and that will
- * update the accounting too (subtracting the size from the counters).
- * And we don't want to underflow there.
+ * update the accounting too (subtracting the size from the counters). And
+ * we don't want to underflow there.
  */

This seems like an unrelated change.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Nov 14, 2019 at 3:40 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> >
> > Apart from this, I have another question in
> > 0003-Issue-individual-invalidations-with-wal_level-logical.patch
> >
> > @@ -543,6 +588,18 @@ RegisterSnapshotInvalidation(Oid dbId, Oid relId)
> >  {
> >   AddSnapshotInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs,
> >      dbId, relId);
> > +
> > + /* Issue an invalidation WAL record (when wal_level=logical) */
> > + if (XLogLogicalInfoActive())
> > + {
> > + SharedInvalidationMessage msg;
> > +
> > + msg.sn.id = SHAREDINVALSNAPSHOT_ID;
> > + msg.sn.dbId = dbId;
> > + msg.sn.relId = relId;
> > +
> > + LogLogicalInvalidations(1, &msg, false);
> > + }
> >  }
> >
> > I am not sure why do we need to explicitly WAL log the snapshot
> > invalidation? because this is logged for invalidating the catalog
> > snapshot and for logical decoding we use HistoricSnapshot, not the
> > catalog snapshot.
> >
>
> I think it has been logged because without this patch as well we log
> all the invalidation messages at commit time and process them during
> decoding.  However, I agree that this particular invalidation message
> is not required for logical decoding for the reason you mentioned.  I
> think as we are explicitly logging invalidations, so it is better to
> avoid this if we can.

Ok
>
> Few other comments on this patch:
> 1.
> + case REORDER_BUFFER_CHANGE_INVALIDATION:
> +
> + /*
> + * Execute the invalidation message locally.
> + *
> + * XXX Do we need to care about relcacheInitFileInval and
> + * the other fields added to ReorderBufferChange, or just
> + * about the message itself?
> + */
> + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> + break;
>
> Here, why are we executing messages individually?  Can't we just
> follow what we do in DecodeCommit which is to record the invalidations
> in ReorderBufferTXN as we encounter them and then allow them to
> execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.  Is there a
> reason why we don't do ReorderBufferXidSetCatalogChanges when we
> receive any invalidation message?
IMHO, the reason is that in DecodeCommit, we get all the invalidation
at one time so, at REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID, we don't
know which invalidation message to execute so for being safe we have
to execute all.  But, since we are logging all invalidation
individually, we exactly know at this stage which cache to invalidate.
So it is better to only invalidate required cache not all.

>
> 2.
> @@ -3025,8 +3073,8 @@ ReorderBufferRestoreChange(ReorderBuffer *rb,
> ReorderBufferTXN *txn,
>   * although we don't check the memory limit when restoring the changes in
>   * this branch (we only do that when initially queueing the changes after
>   * decoding), because we will release the changes later, and that will
> - * update the accounting too (subtracting the size from the counters).
> - * And we don't want to underflow there.
> + * update the accounting too (subtracting the size from the counters). And
> + * we don't want to underflow there.
>   */
>
> This seems like an unrelated change.
Indeed.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Few other comments on this patch:
> > 1.
> > + case REORDER_BUFFER_CHANGE_INVALIDATION:
> > +
> > + /*
> > + * Execute the invalidation message locally.
> > + *
> > + * XXX Do we need to care about relcacheInitFileInval and
> > + * the other fields added to ReorderBufferChange, or just
> > + * about the message itself?
> > + */
> > + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> > + break;
> >
> > Here, why are we executing messages individually?  Can't we just
> > follow what we do in DecodeCommit which is to record the invalidations
> > in ReorderBufferTXN as we encounter them and then allow them to
> > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.  Is there a
> > reason why we don't do ReorderBufferXidSetCatalogChanges when we
> > receive any invalidation message?
> IMHO, the reason is that in DecodeCommit, we get all the invalidation
> at one time so, at REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID, we don't
> know which invalidation message to execute so for being safe we have
> to execute all.  But, since we are logging all invalidation
> individually, we exactly know at this stage which cache to invalidate.
> So it is better to only invalidate required cache not all.
>

In that case, invalidations can be processed multiple times, the first
time when these individual WAL logs for invalidation are processed and
then later at commit time when we accumulate all invalidation messages
and then execute them for REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.
Can we avoid to execute invalidations from other places after this
patch which also includes executing them as part of XLOG_INVALIDATIONS
processing?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Nov 7, 2019 at 5:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Some notes before commit:
> --------------------------------------
> 1.
> Commit message need to be changed for the first patch
> -------------------------------------------------------------------------
> A.
> > The memory limit is defined by a new logical_decoding_work_mem GUC, so for example we can do this
>
>     SET logical_decoding_work_mem = '128kB'
>
> > to trigger very aggressive streaming. The minimum value is 64kB.
>
> I think this patch doesn't contain streaming, so we either need to
> reword it or remove it.
>
> B.
> > The logical_decoding_work_mem may be set either in postgresql.conf, in which case it serves as the default for all publishers on that instance, or when creating the subscription, using a work_mem paramemter in the WITH clause (specifies number of kilobytes).
>
> We need to reword this as we have decided to remove the setting from
> the subscription side as of now.
>
> 2. I think we can change the message level in UpdateSpillStats() to DEBUG2.
>

I have made these modifications and additionally ran pgindent.

> 4. I think we can combine both patches and commit as one patch, but it
> is okay to commit them separately as well.
>

I am not sure if this is a good idea, so still kept them as separate.

Tomas, do let me know if you want to commit these or if you have any
comments, otherwise, I will commit these on Tuesday (19-Nov)?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Few other comments on this patch:
> > > 1.
> > > + case REORDER_BUFFER_CHANGE_INVALIDATION:
> > > +
> > > + /*
> > > + * Execute the invalidation message locally.
> > > + *
> > > + * XXX Do we need to care about relcacheInitFileInval and
> > > + * the other fields added to ReorderBufferChange, or just
> > > + * about the message itself?
> > > + */
> > > + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> > > + break;
> > >
> > > Here, why are we executing messages individually?  Can't we just
> > > follow what we do in DecodeCommit which is to record the invalidations
> > > in ReorderBufferTXN as we encounter them and then allow them to
> > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.  Is there a
> > > reason why we don't do ReorderBufferXidSetCatalogChanges when we
> > > receive any invalidation message?

I think it's fine to call ReorderBufferXidSetCatalogChanges, only on
commit.  Because this is required to add any committed transaction to
the snapshot if it has done any catalog changes.  So I think there is
no point in setting that flag every time we get an invalidation
message.


> > IMHO, the reason is that in DecodeCommit, we get all the invalidation
> > at one time so, at REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID, we don't
> > know which invalidation message to execute so for being safe we have
> > to execute all.  But, since we are logging all invalidation
> > individually, we exactly know at this stage which cache to invalidate.
> > So it is better to only invalidate required cache not all.
> >
>
> In that case, invalidations can be processed multiple times, the first
> time when these individual WAL logs for invalidation are processed and
> then later at commit time when we accumulate all invalidation messages
> and then execute them for REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.
> Can we avoid to execute invalidations from other places after this
> patch which also includes executing them as part of XLOG_INVALIDATIONS
> processing?
I think we can avoid invalidation which is done as part of
REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.  I need to further
investigate the invalidation which is done as part of
XLOG_INVALIDATIONS.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > >
> > > > Few other comments on this patch:
> > > > 1.
> > > > + case REORDER_BUFFER_CHANGE_INVALIDATION:
> > > > +
> > > > + /*
> > > > + * Execute the invalidation message locally.
> > > > + *
> > > > + * XXX Do we need to care about relcacheInitFileInval and
> > > > + * the other fields added to ReorderBufferChange, or just
> > > > + * about the message itself?
> > > > + */
> > > > + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> > > > + break;
> > > >
> > > > Here, why are we executing messages individually?  Can't we just
> > > > follow what we do in DecodeCommit which is to record the invalidations
> > > > in ReorderBufferTXN as we encounter them and then allow them to
> > > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.  Is there a
> > > > reason why we don't do ReorderBufferXidSetCatalogChanges when we
> > > > receive any invalidation message?
>
> I think it's fine to call ReorderBufferXidSetCatalogChanges, only on
> commit.  Because this is required to add any committed transaction to
> the snapshot if it has done any catalog changes.
>

Hmm, this is also used to build cid hash map (see
ReorderBufferBuildTupleCidHash) which we need to use while streaming
changes for the in-progress transactions.  So, I think that it would
be required earlier (before commit) as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Sat, Nov 16, 2019 at 6:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Nov 7, 2019 at 5:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Some notes before commit:
> > --------------------------------------
> > 1.
> > Commit message need to be changed for the first patch
> > -------------------------------------------------------------------------
> > A.
> > > The memory limit is defined by a new logical_decoding_work_mem GUC, so for example we can do this
> >
> >     SET logical_decoding_work_mem = '128kB'
> >
> > > to trigger very aggressive streaming. The minimum value is 64kB.
> >
> > I think this patch doesn't contain streaming, so we either need to
> > reword it or remove it.
> >
> > B.
> > > The logical_decoding_work_mem may be set either in postgresql.conf, in which case it serves as the default for all publishers on that instance, or when creating the subscription, using a work_mem paramemter in the WITH clause (specifies number of kilobytes).
> >
> > We need to reword this as we have decided to remove the setting from
> > the subscription side as of now.
> >
> > 2. I think we can change the message level in UpdateSpillStats() to DEBUG2.
> >
>
> I have made these modifications and additionally ran pgindent.
>
> > 4. I think we can combine both patches and commit as one patch, but it
> > is okay to commit them separately as well.
> >
>
> I am not sure if this is a good idea, so still kept them as separate.
>

I have committed the first patch.  I will commit the second one
related to stats of spilled xacts on Thursday.  The second patch needs
catalog version bump as well because we are modifying the catalog
contents in that patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, Nov 19, 2019 at 5:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > >
> > > > > Few other comments on this patch:
> > > > > 1.
> > > > > + case REORDER_BUFFER_CHANGE_INVALIDATION:
> > > > > +
> > > > > + /*
> > > > > + * Execute the invalidation message locally.
> > > > > + *
> > > > > + * XXX Do we need to care about relcacheInitFileInval and
> > > > > + * the other fields added to ReorderBufferChange, or just
> > > > > + * about the message itself?
> > > > > + */
> > > > > + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> > > > > + break;
> > > > >
> > > > > Here, why are we executing messages individually?  Can't we just
> > > > > follow what we do in DecodeCommit which is to record the invalidations
> > > > > in ReorderBufferTXN as we encounter them and then allow them to
> > > > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.  Is there a
> > > > > reason why we don't do ReorderBufferXidSetCatalogChanges when we
> > > > > receive any invalidation message?
> >
> > I think it's fine to call ReorderBufferXidSetCatalogChanges, only on
> > commit.  Because this is required to add any committed transaction to
> > the snapshot if it has done any catalog changes.
> >
>
> Hmm, this is also used to build cid hash map (see
> ReorderBufferBuildTupleCidHash) which we need to use while streaming
> changes for the in-progress transactions.  So, I think that it would
> be required earlier (before commit) as well.
>
Oh right,  I guess I missed that part.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Nov 20, 2019 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Nov 19, 2019 at 5:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > >
> > > > > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > >
> > > > > >
> > > > > > Few other comments on this patch:
> > > > > > 1.
> > > > > > + case REORDER_BUFFER_CHANGE_INVALIDATION:
> > > > > > +
> > > > > > + /*
> > > > > > + * Execute the invalidation message locally.
> > > > > > + *
> > > > > > + * XXX Do we need to care about relcacheInitFileInval and
> > > > > > + * the other fields added to ReorderBufferChange, or just
> > > > > > + * about the message itself?
> > > > > > + */
> > > > > > + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> > > > > > + break;
> > > > > >
> > > > > > Here, why are we executing messages individually?  Can't we just
> > > > > > follow what we do in DecodeCommit which is to record the invalidations
> > > > > > in ReorderBufferTXN as we encounter them and then allow them to
> > > > > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.  Is there a
> > > > > > reason why we don't do ReorderBufferXidSetCatalogChanges when we
> > > > > > receive any invalidation message?
> > >
> > > I think it's fine to call ReorderBufferXidSetCatalogChanges, only on
> > > commit.  Because this is required to add any committed transaction to
> > > the snapshot if it has done any catalog changes.
> > >
> >
> > Hmm, this is also used to build cid hash map (see
> > ReorderBufferBuildTupleCidHash) which we need to use while streaming
> > changes for the in-progress transactions.  So, I think that it would
> > be required earlier (before commit) as well.
> >
> Oh right,  I guess I missed that part.

Attached a new rebased version of the patch set.   I have fixed all
the issues discussed up-thread and agreed upon.

Pending Issues:
1. The default value of the logical_decoding_work_mem is set to 64kb
in test_decoding/logical.conf.  So we need to change the expected
output files for the test decoding module.
2. Need to complete the patch for concurrent abort handling of the
(sub)transaction.  There are some pending issues with the existing
patch[1].

[1] https://www.postgresql.org/message-id/CAFiTN-ud98kWHCo2YKS55H8rGw3_A7ESyssHwU0xPU6KJsoy6A%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Nov 20, 2019 at 8:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Nov 20, 2019 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Nov 19, 2019 at 5:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > > >
> > > > > > >
> > > > > > > Few other comments on this patch:
> > > > > > > 1.
> > > > > > > + case REORDER_BUFFER_CHANGE_INVALIDATION:
> > > > > > > +
> > > > > > > + /*
> > > > > > > + * Execute the invalidation message locally.
> > > > > > > + *
> > > > > > > + * XXX Do we need to care about relcacheInitFileInval and
> > > > > > > + * the other fields added to ReorderBufferChange, or just
> > > > > > > + * about the message itself?
> > > > > > > + */
> > > > > > > + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> > > > > > > + break;
> > > > > > >
> > > > > > > Here, why are we executing messages individually?  Can't we just
> > > > > > > follow what we do in DecodeCommit which is to record the invalidations
> > > > > > > in ReorderBufferTXN as we encounter them and then allow them to
> > > > > > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.  Is there a
> > > > > > > reason why we don't do ReorderBufferXidSetCatalogChanges when we
> > > > > > > receive any invalidation message?
> > > >
> > > > I think it's fine to call ReorderBufferXidSetCatalogChanges, only on
> > > > commit.  Because this is required to add any committed transaction to
> > > > the snapshot if it has done any catalog changes.
> > > >
> > >
> > > Hmm, this is also used to build cid hash map (see
> > > ReorderBufferBuildTupleCidHash) which we need to use while streaming
> > > changes for the in-progress transactions.  So, I think that it would
> > > be required earlier (before commit) as well.
> > >
> > Oh right,  I guess I missed that part.
>
> Attached a new rebased version of the patch set.   I have fixed all
> the issues discussed up-thread and agreed upon.
>
> Pending Issues:
> 1. The default value of the logical_decoding_work_mem is set to 64kb
> in test_decoding/logical.conf.  So we need to change the expected
> output files for the test decoding module.
> 2. Need to complete the patch for concurrent abort handling of the
> (sub)transaction.  There are some pending issues with the existing
> patch[1].
> [1] https://www.postgresql.org/message-id/CAFiTN-ud98kWHCo2YKS55H8rGw3_A7ESyssHwU0xPU6KJsoy6A%40mail.gmail.com
Apart from these there is one more issue reported upthread[2]
[2] https://www.postgresql.org/message-id/CAFiTN-vrSNkAfRVrWKe2R1dqFBTubjt%3DDYS%3DjhH%2BjiCoBODdaw%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Nov 19, 2019 at 5:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Nov 16, 2019 at 6:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Nov 7, 2019 at 5:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Some notes before commit:
> > > --------------------------------------
> > > 1.
> > > Commit message need to be changed for the first patch
> > > -------------------------------------------------------------------------
> > > A.
> > > > The memory limit is defined by a new logical_decoding_work_mem GUC, so for example we can do this
> > >
> > >     SET logical_decoding_work_mem = '128kB'
> > >
> > > > to trigger very aggressive streaming. The minimum value is 64kB.
> > >
> > > I think this patch doesn't contain streaming, so we either need to
> > > reword it or remove it.
> > >
> > > B.
> > > > The logical_decoding_work_mem may be set either in postgresql.conf, in which case it serves as the default for all publishers on that instance, or when creating the subscription, using a work_mem paramemter in the WITH clause (specifies number of kilobytes).
> > >
> > > We need to reword this as we have decided to remove the setting from
> > > the subscription side as of now.
> > >
> > > 2. I think we can change the message level in UpdateSpillStats() to DEBUG2.
> > >
> >
> > I have made these modifications and additionally ran pgindent.
> >
> > > 4. I think we can combine both patches and commit as one patch, but it
> > > is okay to commit them separately as well.
> > >
> >
> > I am not sure if this is a good idea, so still kept them as separate.
> >
>
> I have committed the first patch.  I will commit the second one
> related to stats of spilled xacts on Thursday.  The second patch needs
> catalog version bump as well because we are modifying the catalog
> contents in that patch.
>

Committed the second one as well.  Now, we can move to a review of
patches for "streaming of in-progress transactions".

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Nov 21, 2019 at 9:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Nov 20, 2019 at 8:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Nov 20, 2019 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Tue, Nov 19, 2019 at 5:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Mon, Nov 18, 2019 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > >
> > > > > On Fri, Nov 15, 2019 at 4:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Nov 15, 2019 at 4:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > > > >
> > > > > > > On Fri, Nov 15, 2019 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > Few other comments on this patch:
> > > > > > > > 1.
> > > > > > > > + case REORDER_BUFFER_CHANGE_INVALIDATION:
> > > > > > > > +
> > > > > > > > + /*
> > > > > > > > + * Execute the invalidation message locally.
> > > > > > > > + *
> > > > > > > > + * XXX Do we need to care about relcacheInitFileInval and
> > > > > > > > + * the other fields added to ReorderBufferChange, or just
> > > > > > > > + * about the message itself?
> > > > > > > > + */
> > > > > > > > + LocalExecuteInvalidationMessage(&change->data.inval.msg);
> > > > > > > > + break;
> > > > > > > >
> > > > > > > > Here, why are we executing messages individually?  Can't we just
> > > > > > > > follow what we do in DecodeCommit which is to record the invalidations
> > > > > > > > in ReorderBufferTXN as we encounter them and then allow them to
> > > > > > > > execute on each REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID.  Is there a
> > > > > > > > reason why we don't do ReorderBufferXidSetCatalogChanges when we
> > > > > > > > receive any invalidation message?
> > > > >
> > > > > I think it's fine to call ReorderBufferXidSetCatalogChanges, only on
> > > > > commit.  Because this is required to add any committed transaction to
> > > > > the snapshot if it has done any catalog changes.
> > > > >
> > > >
> > > > Hmm, this is also used to build cid hash map (see
> > > > ReorderBufferBuildTupleCidHash) which we need to use while streaming
> > > > changes for the in-progress transactions.  So, I think that it would
> > > > be required earlier (before commit) as well.
> > > >
> > > Oh right,  I guess I missed that part.
> >
> > Attached a new rebased version of the patch set.   I have fixed all
> > the issues discussed up-thread and agreed upon.
> >
> > Pending Issues:
> > 1. The default value of the logical_decoding_work_mem is set to 64kb
> > in test_decoding/logical.conf.  So we need to change the expected
> > output files for the test decoding module.
> > 2. Need to complete the patch for concurrent abort handling of the
> > (sub)transaction.  There are some pending issues with the existing
> > patch[1].
> > [1] https://www.postgresql.org/message-id/CAFiTN-ud98kWHCo2YKS55H8rGw3_A7ESyssHwU0xPU6KJsoy6A%40mail.gmail.com
> Apart from these there is one more issue reported upthread[2]
> [2] https://www.postgresql.org/message-id/CAFiTN-vrSNkAfRVrWKe2R1dqFBTubjt%3DDYS%3DjhH%2BjiCoBODdaw%40mail.gmail.com
>
I have rebased the patch on the latest head and also fix the issue of
"concurrent abort handling of the (sub)transaction." and attached as
(v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
the complete patch set.  I have added the version number so that we
can track the changes.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Michael Paquier
Date:
On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> I have rebased the patch on the latest head and also fix the issue of
> "concurrent abort handling of the (sub)transaction." and attached as
> (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> the complete patch set.  I have added the version number so that we
> can track the changes.

The patch has rotten a bit and does not apply anymore.  Could you
please send a rebased version?  I have moved it to next CF, waiting on
author.
--
Michael

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > I have rebased the patch on the latest head and also fix the issue of
> > "concurrent abort handling of the (sub)transaction." and attached as
> > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > the complete patch set.  I have added the version number so that we
> > can track the changes.
>
> The patch has rotten a bit and does not apply anymore.  Could you
> please send a rebased version?  I have moved it to next CF, waiting on
> author.

I have rebased the patch set on the latest head.

Apart from this, there is one issue reported by my colleague Vignesh.
The issue is that if we use more than two relations in a transaction
then there is an error on standby (no relation map entry for remote
relation ID 16390).  After analyzing I have found that for the
streaming transaction an "is_schema_sent" flag is kept in
ReorderBufferTXN.  And, I think that is done so that we can send the
schema for each transaction stream so that if any subtransaction gets
aborted we don't lose the logical WAL for that schema.  But, this
solution has induced a very basic issue that if a transaction operate
on more than 1 relation then after sending the schema for the first
relation it will mark the flag true and the schema for the subsequent
relations will never be sent.  I am still working on finding a better
solution for this if anyone has any opinion/solution about this feel
free to suggest.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, Dec 2, 2019 at 2:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > > I have rebased the patch on the latest head and also fix the issue of
> > > "concurrent abort handling of the (sub)transaction." and attached as
> > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > > the complete patch set.  I have added the version number so that we
> > > can track the changes.
> >
> > The patch has rotten a bit and does not apply anymore.  Could you
> > please send a rebased version?  I have moved it to next CF, waiting on
> > author.
>
> I have rebased the patch set on the latest head.
>
I have reviewed the patch set and here are a few comments/questions

1.
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ Relation relation,
+ ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}

Should we show the tuple in the streamed change like we do for the
pg_decode_change?

2. pg_logical_slot_get_changes_guts
It recreate the decoding slot [ctx =
CreateDecodingContext(InvalidXLogRecPtr] but doesn't set the streaming
to false, should we pass a parameter to
pg_logical_slot_get_changes_guts saying whether we want streamed results or not

3.
+ XLogRecPtr prev_lsn = InvalidXLogRecPtr;
  ReorderBufferChange *change;
  ReorderBufferChange *specinsert = NULL;

@@ -1565,6 +1965,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
  Relation relation = NULL;
  Oid reloid;

+ /*
+ * Enforce correct ordering of changes, merged from multiple
+ * subtransactions. The changes may have the same LSN due to
+ * MULTI_INSERT xlog records.
+ */
+ if (prev_lsn != InvalidXLogRecPtr)
+ Assert(prev_lsn <= change->lsn);
+
+ prev_lsn = change->lsn;
I did not understand, how this change is relavent to this patch

4.
+ /*
+ * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
+ * information about subtransactions, which could arrive after streaming start.
+ */
+ if (!txn->is_schema_sent)
+ snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
+ txn, command_id);

In which case, txn->is_schema_sent will be true, because at the end of
the stream in ReorderBufferExecuteInvalidations we are always setting
it false,
so while sending next stream it will always be false.  That means we
never required snapshot_now variable in ReorderBufferTXN.

5.
@@ -2299,6 +2746,23 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
*rb, TransactionId xid,
  txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

  txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+ /*
+ * We read catalog changes from WAL, which are not yet sent, so
+ * invalidate current schema in order output plugin can resend
+ * schema again.
+ */
+ txn->is_schema_sent = false;

Same as point 4, during decode time it will never be true.

6.
+ /* send fields */
+ pq_sendint64(out, commit_lsn);
+ pq_sendint64(out, txn->end_lsn);
+ pq_sendint64(out, txn->commit_time);

Commit_time and end_lsn is used in standby_feedback


7.
+ /* FIXME optimize the search by bsearch on sorted data */
+ for (i = nsubxacts; i > 0; i--)
+ {
+ if (subxacts[i - 1].xid == subxid)
+ {
+ subidx = (i - 1);
+ found = true;
+ break;
+ }
+ }
We can not rollback intermediate subtransaction without rollbacking
latest sub-transaction, so why do we need
to search in the array?  It will always be the the last subxact no?

8.
+ /*
+ * send feedback to upstream
+ *
+ * XXX Probably should send a valid LSN. But which one?
+ */
+ send_feedback(InvalidXLogRecPtr, false, false);

Why feedback is sent for every change?


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > > I have rebased the patch on the latest head and also fix the issue of
> > > "concurrent abort handling of the (sub)transaction." and attached as
> > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > > the complete patch set.  I have added the version number so that we
> > > can track the changes.
> >
> > The patch has rotten a bit and does not apply anymore.  Could you
> > please send a rebased version?  I have moved it to next CF, waiting on
> > author.
>
> I have rebased the patch set on the latest head.
>
> Apart from this, there is one issue reported by my colleague Vignesh.
> The issue is that if we use more than two relations in a transaction
> then there is an error on standby (no relation map entry for remote
> relation ID 16390).  After analyzing I have found that for the
> streaming transaction an "is_schema_sent" flag is kept in
> ReorderBufferTXN.  And, I think that is done so that we can send the
> schema for each transaction stream so that if any subtransaction gets
> aborted we don't lose the logical WAL for that schema.  But, this
> solution has induced a very basic issue that if a transaction operate
> on more than 1 relation then after sending the schema for the first
> relation it will mark the flag true and the schema for the subsequent
> relations will never be sent.
>

How about keeping a list of top-level xids in each RelationSyncEntry?
Basically, whenever we send the schema for any transaction, we note
that in RelationSyncEntry and at abort time we can remove xid from the
list.  Now, whenever, we check whether to send schema for any
operation in a transaction, we will check if our xid is present in
that list for a particular RelationSyncEntry and take an action based
on that (if xid is present, then we won't send the schema, otherwise,
send it).
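A minimal sketch of that bookkeeping in pgoutput (field and helper names are assumptions, not patch code; xid-list helpers such as lappend_xid/list_member_xid may need to be added):

/*
 * Track, per relation, which top-level xids have already been sent this
 * relation's schema while streaming; the xid is dropped again on abort.
 */
typedef struct RelationSyncEntry
{
	Oid			relid;			/* relation oid */
	bool		schema_sent;	/* used for regular (committed) decoding */
	List	   *streamed_xids;	/* top-level xids that already got the schema */
} RelationSyncEntry;

static bool
schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId topxid)
{
	return list_member_xid(entry->streamed_xids, topxid);	/* assumed helper */
}

static void
set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId topxid)
{
	/* the list must be allocated in the entry's long-lived memory context */
	entry->streamed_xids = lappend_xid(entry->streamed_xids, topxid);	/* assumed helper */
}

On a stream abort of a top-level transaction, its xid would simply be removed from every entry, so a later retry re-sends the schema.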


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, Dec 10, 2019 at 9:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:
> > >
> > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > > > I have rebased the patch on the latest head and also fix the issue of
> > > > "concurrent abort handling of the (sub)transaction." and attached as
> > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > > > the complete patch set.  I have added the version number so that we
> > > > can track the changes.
> > >
> > > The patch has rotten a bit and does not apply anymore.  Could you
> > > please send a rebased version?  I have moved it to next CF, waiting on
> > > author.
> >
> > I have rebased the patch set on the latest head.
> >
> > Apart from this, there is one issue reported by my colleague Vignesh.
> > The issue is that if we use more than two relations in a transaction
> > then there is an error on standby (no relation map entry for remote
> > relation ID 16390).  After analyzing I have found that for the
> > streaming transaction an "is_schema_sent" flag is kept in
> > ReorderBufferTXN.  And, I think that is done so that we can send the
> > schema for each transaction stream so that if any subtransaction gets
> > aborted we don't lose the logical WAL for that schema.  But, this
> > solution has induced a very basic issue that if a transaction operate
> > on more than 1 relation then after sending the schema for the first
> > relation it will mark the flag true and the schema for the subsequent
> > relations will never be sent.
> >
>
> How about keeping a list of top-level xids in each RelationSyncEntry?
> Basically, whenever we send the schema for any transaction, we note
> that in RelationSyncEntry and at abort time we can remove xid from the
> list.  Now, whenever, we check whether to send schema for any
> operation in a transaction, we will check if our xid is present in
> that list for a particular RelationSyncEntry and take an action based
> on that (if xid is present, then we won't send the schema, otherwise,
> send it).
The idea make sense to me.  I will try to write a patch for this and test.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Dec 9, 2019 at 1:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have reviewed the patch set and here are a few comments/questions
>
> 1.
> +static void
> +pg_decode_stream_change(LogicalDecodingContext *ctx,
> + ReorderBufferTXN *txn,
> + Relation relation,
> + ReorderBufferChange *change)
> +{
> + OutputPluginPrepareWrite(ctx, true);
> + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> + OutputPluginWrite(ctx, true);
> +}
>
> Should we show the tuple in the streamed change like we do for the
> pg_decode_change?
>

I think so.  The patch shows the message in
pg_decode_stream_message(), so why not show the tuple here as well?

> 2. pg_logical_slot_get_changes_guts
> It recreate the decoding slot [ctx =
> CreateDecodingContext(InvalidXLogRecPtr] but doesn't set the streaming
> to false, should we pass a parameter to
> pg_logical_slot_get_changes_guts saying whether we want streamed results or not
>

CreateDecodingContext internally calls StartupDecodingContext, which
sets the value of streaming based on whether the plugin has provided
callbacks for the streaming functions.  Isn't that sufficient?  Why do we
need additional parameters here?
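For reference, the kind of startup check being referred to is roughly this (a sketch only, mirroring the stream_* callback names used elsewhere in this thread):

	/* enable streaming only if the plugin provides all stream callbacks */
	ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) &&
		(ctx->callbacks.stream_stop_cb != NULL) &&
		(ctx->callbacks.stream_abort_cb != NULL) &&
		(ctx->callbacks.stream_commit_cb != NULL) &&
		(ctx->callbacks.stream_change_cb != NULL);

That is, whether to stream is derived from what the plugin advertises rather than from an extra parameter to the SQL functions.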

> 3.
> + XLogRecPtr prev_lsn = InvalidXLogRecPtr;
>   ReorderBufferChange *change;
>   ReorderBufferChange *specinsert = NULL;
>
> @@ -1565,6 +1965,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
>   Relation relation = NULL;
>   Oid reloid;
>
> + /*
> + * Enforce correct ordering of changes, merged from multiple
> + * subtransactions. The changes may have the same LSN due to
> + * MULTI_INSERT xlog records.
> + */
> + if (prev_lsn != InvalidXLogRecPtr)
> + Assert(prev_lsn <= change->lsn);
> +
> + prev_lsn = change->lsn;
> I did not understand, how this change is relavent to this patch
>

This is just to ensure that changes are in LSN order.  I think as we
are merging the changes before commit for streaming, it is good to
have such an Assertion for ReorderBufferStreamTXN.   And, if we want
to have it in ReorderBufferStreamTXN, then there is no harm in keeping
it in ReorderBufferCommit() at least to keep the code consistent.  Do
you see any problem with this?

> 4.
> + /*
> + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
> + * information about subtransactions, which could arrive after streaming start.
> + */
> + if (!txn->is_schema_sent)
> + snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
> + txn, command_id);
>
> In which case, txn->is_schema_sent will be true, because at the end of
> the stream in ReorderBufferExecuteInvalidations we are always setting
> it false,
> so while sending next stream it will always be false.  That means we
> never required snapshot_now variable in ReorderBufferTXN.
>

You are probably right, but as discussed we need to change this part
of design/code (when to send schema changes) due to the issues
discovered.  So, I think this part will anyway change when we fix that
problem.

> 5.
> @@ -2299,6 +2746,23 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> *rb, TransactionId xid,
>   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
>
>   txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> +
> + /*
> + * We read catalog changes from WAL, which are not yet sent, so
> + * invalidate current schema in order output plugin can resend
> + * schema again.
> + */
> + txn->is_schema_sent = false;
>
> Same as point 4, during decode time it will never be true.
>

Sure, my previous point's reply applies here as well.

> 6.
> + /* send fields */
> + pq_sendint64(out, commit_lsn);
> + pq_sendint64(out, txn->end_lsn);
> + pq_sendint64(out, txn->commit_time);
>
> Commit_time and end_lsn is used in standby_feedback
>

I don't understand what you mean by this.  Can you be a bit more clear?

>
> 7.
> + /* FIXME optimize the search by bsearch on sorted data */
> + for (i = nsubxacts; i > 0; i--)
> + {
> + if (subxacts[i - 1].xid == subxid)
> + {
> + subidx = (i - 1);
> + found = true;
> + break;
> + }
> + }
> We can not rollback intermediate subtransaction without rollbacking
> latest sub-transaction, so why do we need
> to search in the array?  It will always be the the last subxact no?
>

The same thing is already mentioned in the comments above this code
("XXX Or perhaps we can rely on the aborts to arrive in the reverse
order, i.e. from the inner-most subxact (when nested)? In which case
we could simply check the last element.").  I think what you are
saying is probably right, but we can leave this as it is for now
because this is a minor optimization which can be done later as well
if required.  However, if you see any correctness issue, then we can
discuss.
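For reference, a minimal sketch of the last-element check being discussed (assuming abort records really do arrive innermost-first; variable names as in the quoted snippet):

	/* the aborted subxact, if tracked at all, must be the most recent entry */
	if (nsubxacts > 0 && subxacts[nsubxacts - 1].xid == subxid)
	{
		subidx = nsubxacts - 1;
		found = true;
	}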


> 8.
> + /*
> + * send feedback to upstream
> + *
> + * XXX Probably should send a valid LSN. But which one?
> + */
> + send_feedback(InvalidXLogRecPtr, false, false);
>
> Why feedback is sent for every change?
>

I will study this part of the patch and let you know my opinion.

Few comments on this patch series:

0001-Immediately-WAL-log-assignments:
------------------------------------------------------------

The commit message still refers to the old design for this patch.  I
think you need to modify the commit message as per the latest patch.

0002-Issue-individual-invalidations-with-wal_level-log
----------------------------------------------------------------------------
1.
xact_desc_invalidations(StringInfo buf,
{
..
+ else if (msg->id == SHAREDINVALSNAPSHOT_ID)
+ appendStringInfo(buf, " snapshot %u", msg->sn.relId);

You have removed logging for the above cache but forgot to remove its
reference from one of the places.  Also, I think you need to add a
comment somewhere in inval.c to say why you are writing WAL for
some types of invalidations and not for others.

0003-Extend-the-output-plugin-API-with-stream-methods
--------------------------------------------------------------------------------
1.
+     are required, while <function>stream_message_cb</function> and
+     <function>stream_message_cb</function> are optional.

stream_message_cb is mentioned twice.  It seems the second one is for truncate.

2.
size of the transaction size and network bandwidth, the transfer time
+    may significantly increase the apply lag.

/size of the transaction size/size of the transaction

no need to mention size twice.

3.
+    Similarly to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress
transactions)
+    exceeds limit defined by <varname>logical_work_mem</varname> setting.

The guc name used is wrong.  /Similarly to/Similar to/

4.
stream_start_cb_wrapper()
{
..
+ /* state.report_location = apply_lsn; */
..
+ /* FIXME ctx->write_location = apply_lsn; */
..
}

See if we can fix these, and similar ones, in the callback for the stop.  I
think we don't have final_lsn till we commit/abort.  Can we compute it
before calling these APIs?


0005-Gracefully-handle-concurrent-aborts-of-uncommitte
----------------------------------------------------------------------------------
1.
@@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
  PG_CATCH();
  {
  /* TODO: Encapsulate cleanup from the PG_TRY and PG_CATCH blocks */
+
  if (iterstate)
  ReorderBufferIterTXNFinish(rb, iterstate);

Spurious line change.

2. The commit message of this patch refers to Prepared transactions.
I think that needs to be changed.

0006-Implement-streaming-mode-in-ReorderBuffer
-------------------------------------------------------------------------
1.
+
+/* iterator for streaming (only get data from memory) */
+static ReorderBufferStreamIterTXNState *ReorderBufferStreamIterTXNInit(ReorderBuffer *rb,
+                                                                       ReorderBufferTXN *txn);
+
+static ReorderBufferChange *ReorderBufferStreamIterTXNNext(ReorderBuffer *rb,
+                                                           ReorderBufferStreamIterTXNState *state);
+
+static void ReorderBufferStreamIterTXNFinish(ReorderBuffer *rb,
+                                             ReorderBufferStreamIterTXNState *state);

Do we really need to introduce new APIs for iterating over changes
from streamed transactions?  Why can't we reuse the same APIs as we
use for committed xacts?

2.
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)

Please write some comments atop ReorderBufferStreamCommit.

3.
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
..
+ if (txn->snapshot_now == NULL)
+ {
+     dlist_iter subxact_i;
+
+     /* make sure this transaction is streamed for the first time */
+     Assert(!rbtxn_is_streamed(txn));
+
+     /* at the beginning we should have invalid command ID */
+     Assert(txn->command_id == InvalidCommandId);
+
+     dlist_foreach(subxact_i, &txn->subtxns)
+     {
+         ReorderBufferTXN *subtxn;
+
+         subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+
+         if (subtxn->base_snapshot != NULL &&
+             (txn->base_snapshot == NULL ||
+              txn->base_snapshot_lsn > subtxn->base_snapshot_lsn))
+         {
+             txn->base_snapshot = subtxn->base_snapshot;

The logic here seems correct, but I am not sure why it does not consider
purging the base snapshot before assigning the subtxn's snapshot, and
similarly why we do not purge the subtxn's snapshot once we are done
with it.  I think we can use ReorderBufferTransferSnapToParent to
replace part of the logic here.  Do you see any reason for doing things
differently here?

4. In ReorderBufferStreamTXN, why do you need to use
ReorderBufferCopySnap to assign txn->base_snapshot to snapshot_now.

5. I see a lot of code similarity in ReorderBufferStreamTXN and
existing ReorderBufferCommit. I understand that there are some subtle
differences due to which we need to write this new function but can't
we encapsulate the specific parts of code in functions and then call
from both places.  I am talking about code in different cases for
change->action.

6. + * Note: We never stream and serialize a transaction at the same time (e
/(e/(we

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Robert Haas
Date:
On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have rebased the patch set on the latest head.

0001 looks like a clever approach, but are you sure it doesn't hurt
performance when many small XLOG records are being inserted? I think
XLogRecordAssemble() can get pretty hot in some workloads.

With regard to 0002, logging a separate WAL record for each
invalidation seems painful; I think most operations that generate
invalidations generate a bunch of them all at once. Perhaps you could
just queue up invalidations as they happen, and then force anything
that's been queued up to be emitted into WAL just before you emit any
WAL record that might need to be decoded.

Regarding 0005, it seems to me that this is no good:

+ errmsg("improper heap_getnext call")));

I think we should be using elog() rather than ereport() here, because
this should only happen if there's a bug in a logical decoding plugin.
At first, I thought maybe this should just be an Assert(), but since
there are third-party logical decoding plugins available, checking
this even in non-assert builds seems like a good idea. However, I
think making it translatable is overkill; users should never see this,
only developers.

I also think that the message is really bad, because it just tells you
that you did something bad. It gives no inkling as to why it was bad.

0006 contains lots of XXX comments that look like real issues. I guess
those need to be fixed. Also, why don't we do the thing that the
commit message for 0006 says we could "theoretically" do? I don't
understand why we need the k-way merge at all.

+ if (prev_lsn != InvalidXLogRecPtr)
+ Assert(prev_lsn <= change->lsn);

There is no reason to ever write an if statement that contains only an
Assert, and it's bad style. Write Assert(prev_lsn == InvalidXLogRecPtr
|| prev_lsn <= change->lsn), or better yet, use XLogRecPtrIsInvalid.
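
Just to spell out that suggestion, a minimal sketch of the rewritten
check, using the existing XLogRecPtrIsInvalid macro (variable names
follow the quoted hunk):

    /* enforce LSN ordering without an if-wrapped Assert */
    Assert(XLogRecPtrIsInvalid(prev_lsn) || prev_lsn <= change->lsn);
    prev_lsn = change->lsn;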

The purpose and mechanism of the is_schema_sent flag is not clear to
me. The word "schema" here seems to be being used to mean "snapshot,"
which is rather confusing.

I'm also somewhat unclear on what's happening here with invalidations.
Perhaps that's as much a defect in my understanding as it is
reflective of any problem with the patch, but I also don't see any
comments either in 0002 or later patches explaining the theory of
operation. If I've missed some, please point me in the right
direction. Hypothetically speaking, it seems to me that if you just
did InvalidateSystemCaches() every time the snapshot changed, you
wouldn't need anything else (unless we're concerned with
non-transactional invalidation messages like smgr and relmapper
invalidations; not quite sure how those are handled). And, on the
other hand, if we don't do InvalidateSystemCaches() every time the
snapshot changes, then I don't understand why this works now, even
without streaming.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Dec 11, 2019 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Dec 9, 2019 at 1:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have review the patch set and here are few comments/questions
> >
> > 1.
> > +static void
> > +pg_decode_stream_change(LogicalDecodingContext *ctx,
> > + ReorderBufferTXN *txn,
> > + Relation relation,
> > + ReorderBufferChange *change)
> > +{
> > + OutputPluginPrepareWrite(ctx, true);
> > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> > + OutputPluginWrite(ctx, true);
> > +}
> >
> > Should we show the tuple in the streamed change like we do for the
> > pg_decode_change?
> >
>
> I think so.  The patch shows the message in
> pg_decode_stream_message(), so why to prohibit showing tuple here?
>
> > 2. pg_logical_slot_get_changes_guts
> > It recreate the decoding slot [ctx =
> > CreateDecodingContext(InvalidXLogRecPtr] but doesn't set the streaming
> > to false, should we pass a parameter to
> > pg_logical_slot_get_changes_guts saying whether we want streamed results or not
> >
>
> CreateDecodingContext internally calls StartupDecodingContext which
> sets the value of streaming based on if the plugin has provided
> callbacks for streaming functions. Isn't that sufficient?  Why do we
> need additional parameters here?

I don't think we should stream just because the plugin provides the
streaming functions.  For example, the pgoutput plugin provides the
streaming functions, but we only stream if streaming is enabled in the
CREATE SUBSCRIPTION command.  So I feel the same should be true for any
plugin.

>
> > 3.
> > + XLogRecPtr prev_lsn = InvalidXLogRecPtr;
> >   ReorderBufferChange *change;
> >   ReorderBufferChange *specinsert = NULL;
> >
> > @@ -1565,6 +1965,16 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> >   Relation relation = NULL;
> >   Oid reloid;
> >
> > + /*
> > + * Enforce correct ordering of changes, merged from multiple
> > + * subtransactions. The changes may have the same LSN due to
> > + * MULTI_INSERT xlog records.
> > + */
> > + if (prev_lsn != InvalidXLogRecPtr)
> > + Assert(prev_lsn <= change->lsn);
> > +
> > + prev_lsn = change->lsn;
> > I did not understand, how this change is relavent to this patch
> >
>
> This is just to ensure that changes are in LSN order.  I think as we
> are merging the changes before commit for streaming, it is good to
> have such an Assertion for ReorderBufferStreamTXN.   And, if we want
> to have it in ReorderBufferStreamTXN, then there is no harm in keeping
> it in ReorderBufferCommit() at least to keep the code consistent.  Do
> you see any problem with this?
I am fine with this.
>
> > 4.
> > + /*
> > + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
> > + * information about subtransactions, which could arrive after streaming start.
> > + */
> > + if (!txn->is_schema_sent)
> > + snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
> > + txn, command_id);
> >
> > In which case, txn->is_schema_sent will be true, because at the end of
> > the stream in ReorderBufferExecuteInvalidations we are always setting
> > it false,
> > so while sending next stream it will always be false.  That means we
> > never required snapshot_now variable in ReorderBufferTXN.
> >
>
> You are probably right, but as discussed we need to change this part
> of design/code (when to send schema changes) due to the issues
> discovered.  So, I think this part will anyway change when we fix that
> problem.
Makes sense.
>
> > 5.
> > @@ -2299,6 +2746,23 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> > *rb, TransactionId xid,
> >   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> >
> >   txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > +
> > + /*
> > + * We read catalog changes from WAL, which are not yet sent, so
> > + * invalidate current schema in order output plugin can resend
> > + * schema again.
> > + */
> > + txn->is_schema_sent = false;
> >
> > Same as point 4, during decode time it will never be true.
> >
>
> Sure, my previous point's reply applies here as well.
ok
>
> > 6.
> > + /* send fields */
> > + pq_sendint64(out, commit_lsn);
> > + pq_sendint64(out, txn->end_lsn);
> > + pq_sendint64(out, txn->commit_time);
> >
> > Commit_time and end_lsn is used in standby_feedback
> >
>
> I don't understand what you mean by this.  Can you be a bit more clear?
I think I pasted it here by mistake.  Just ignore it.
>
> >
> > 7.
> > + /* FIXME optimize the search by bsearch on sorted data */
> > + for (i = nsubxacts; i > 0; i--)
> > + {
> > + if (subxacts[i - 1].xid == subxid)
> > + {
> > + subidx = (i - 1);
> > + found = true;
> > + break;
> > + }
> > + }
> > We can not rollback intermediate subtransaction without rollbacking
> > latest sub-transaction, so why do we need
> > to search in the array?  It will always be the the last subxact no?
> >
>
> The same thing is already mentioned in the comments above this code
> ("XXX Or perhaps we can rely on the aborts to arrive in the reverse
> order, i.e. from the inner-most subxact (when nested)? In which case
> we could simply check the last element.").  I think what you are
> saying is probably right, but we can leave this as it is for now
> because this is a minor optimization which can be done later as well
> if required.  However, if you see any correctness issue, then we can
> discuss.
I think more than an optimization, the question here is whether this
loop is required at all.  By optimizing it we would not be adding
complexity; in fact, it would become simpler.  I think we need more
analysis of whether we need to traverse the array at all, so maybe for
the time being we can leave this as it is.
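
For illustration, a rough sketch of that simplification, assuming the
aborts always arrive for the inner-most subtransaction first so that
only the last array entry needs to be checked (names are taken from the
quoted hunk; whether the assumption holds is exactly the open question):

    /* Sketch: assume aborts arrive inner-most first; check only the last entry. */
    if (nsubxacts > 0 && subxacts[nsubxacts - 1].xid == subxid)
    {
        subidx = nsubxacts - 1;
        found = true;
    }

    /* We should not receive aborts for unknown subtransactions. */
    Assert(found);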
>
> > 8.
> > + /*
> > + * send feedback to upstream
> > + *
> > + * XXX Probably should send a valid LSN. But which one?
> > + */
> > + send_feedback(InvalidXLogRecPtr, false, false);
> >
> > Why feedback is sent for every change?
> >
>
> I will study this part of the patch and let you know my opinion.
Sure.
>
> Few comments on this patch series:
>
> 0001-Immediately-WAL-log-assignments:
> ------------------------------------------------------------
>
> The commit message still refers to the old design for this patch.  I
> think you need to modify the commit message as per the latest patch.
>
> 0002-Issue-individual-invalidations-with-wal_level-log
> ----------------------------------------------------------------------------
> 1.
> xact_desc_invalidations(StringInfo buf,
> {
> ..
> + else if (msg->id == SHAREDINVALSNAPSHOT_ID)
> + appendStringInfo(buf, " snapshot %u", msg->sn.relId);
>
> You have removed logging for the above cache but forgot to remove its
> reference from one of the places.  Also, I think you need to add a
> comment somewhere in inval.c to say why you are writing for WAL for
> some types of invalidations and not for others?
>
> 0003-Extend-the-output-plugin-API-with-stream-methods
> --------------------------------------------------------------------------------
> 1.
> +     are required, while <function>stream_message_cb</function> and
> +     <function>stream_message_cb</function> are optional.
>
> stream_message_cb is mentioned twice.  It seems the second one is for truncate.
>
> 2.
> size of the transaction size and network bandwidth, the transfer time
> +    may significantly increase the apply lag.
>
> /size of the transaction size/size of the transaction
>
> no need to mention size twice.
>
> 3.
> +    Similarly to spill-to-disk behavior, streaming is triggered when the total
> +    amount of changes decoded from the WAL (for all in-progress
> transactions)
> +    exceeds limit defined by <varname>logical_work_mem</varname> setting.
>
> The guc name used is wrong.  /Similarly to/Similar to/
>
> 4.
> stream_start_cb_wrapper()
> {
> ..
> + /* state.report_location = apply_lsn; */
> ..
> + /* FIXME ctx->write_location = apply_lsn; */
> ..
> }
>
> See, if we can fix these and similar in the callback for the stop.  I
> think we don't have final_lsn till we commit/abort.  Can we compute
> before calling these API's?
>
>
> 0005-Gracefully-handle-concurrent-aborts-of-uncommitte
> ----------------------------------------------------------------------------------
> 1.
> @@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
>   PG_CATCH();
>   {
>   /* TODO: Encapsulate cleanup
> from the PG_TRY and PG_CATCH blocks */
> +
>   if (iterstate)
>   ReorderBufferIterTXNFinish(rb, iterstate);
>
> Spurious line change.
>
> 2. The commit message of this patch refers to Prepared transactions.
> I think that needs to be changed.
>
> 0006-Implement-streaming-mode-in-ReorderBuffer
> -------------------------------------------------------------------------
> 1.
> +
> +/* iterator for streaming (only get data from memory) */
> +static ReorderBufferStreamIterTXNState * ReorderBufferStreamIterTXNInit(
> +
> ReorderBuffer *rb,
> +
> ReorderBufferTXN
> *txn);
> +
> +static ReorderBufferChange *ReorderBufferStreamIterTXNNext(
> +    ReorderBuffer *rb,
> +
>    ReorderBufferStreamIterTXNState * state);
> +
> +static void ReorderBufferStreamIterTXNFinish(
> +
> ReorderBuffer *rb,
> +
> ReorderBufferStreamIterTXNState * state);
>
> Do we really need to introduce new APIs for iterating over changes
> from streamed transactions?  Why can't we reuse the same API's as we
> use for committed xacts?
>
> 2.
> +static void
> +ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
>
> Please write some comments atop ReorderBufferStreamCommit.
>
> 3.
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> ..
> + if (txn->snapshot_now
> == NULL)
> + {
> + dlist_iter subxact_i;
> +
> + /* make sure this transaction is streamed for the first time */
> +
> Assert(!rbtxn_is_streamed(txn));
> +
> + /* at the beginning we should have invalid command ID */
> + Assert(txn->command_id ==
> InvalidCommandId);
> +
> + dlist_foreach(subxact_i, &txn->subtxns)
> + {
> + ReorderBufferTXN *subtxn;
> +
> +
> subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
> +
> + if (subtxn->base_snapshot != NULL &&
> +
> (txn->base_snapshot == NULL ||
> + txn->base_snapshot_lsn > subtxn->base_snapshot_lsn))
> + {
> +
> txn->base_snapshot = subtxn->base_snapshot;
>
> The logic here seems to be correct, but I am not sure why it is not
> considered to purge the base snapshot before assigning the subtxn's
> snapshot and similarly, we have not purged snapshot for subtxn once we
> are done with it.  I think we can use
> ReorderBufferTransferSnapToParent to replace part of the logic here.
> Do you see any reason for doing things differently here?
>
> 4. In ReorderBufferStreamTXN, why do you need to use
> ReorderBufferCopySnap to assign txn->base_snapshot to snapshot_now.
>
> 5. I see a lot of code similarity in ReorderBufferStreamTXN and
> existing ReorderBufferCommit. I understand that there are some subtle
> differences due to which we need to write this new function but can't
> we encapsulate the specific parts of code in functions and then call
> from both places.  I am talking about code in different cases for
> change->action.
>
> 6. + * Note: We never stream and serialize a transaction at the same time (e
> /(e/(we
>
I will look into these comments and reply separately.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Dec 11, 2019 at 11:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have rebased the patch set on the latest head.
>
> 0001 looks like a clever approach, but are you sure it doesn't hurt
> performance when many small XLOG records are being inserted? I think
> XLogRecordAssemble() can get pretty hot in some workloads.
>

I don't think we have evaluated it yet, but we should do it.  The
point to note is that it is only for the case when wal_level is
'logical' (see IsSubTransactionAssignmentPending) in which case we
already log more WAL, so this might not impact much.  I guess that it
might be better to have that check in XLogRecordAssemble for the sake
of clarity.
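
For context, a sketch of the kind of cheap fast-path guards being
discussed; the function is introduced by 0001, so the exact body in the
patch may differ, and the 'assigned' flag is part of that patch rather
than existing code:

    /* Sketch: return quickly unless an assignment record could be needed at all. */
    bool
    IsSubTransactionAssignmentPending(void)
    {
        /* wal_level has to be logical */
        if (!XLogLogicalInfoActive())
            return false;

        /* we need to be in a transaction state */
        if (!IsTransactionState())
            return false;

        /* it has to be a subtransaction */
        if (!IsSubTransaction())
            return false;

        /* the subtransaction has to have an XID assigned */
        if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
            return false;

        /* and the assignment must not have been WAL-logged yet (patch-added flag) */
        return !CurrentTransactionState->assigned;
    }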

>
> Regarding 0005, it seems to me that this is no good:
>
> + errmsg("improper heap_getnext call")));
>
> I think we should be using elog() rather than ereport() here, because
> this should only happen if there's a bug in a logical decoding plugin.
> At first, I thought maybe this should just be an Assert(), but since
> there are third-party logical decoding plugins available, checking
> this even in non-assert builds seems like a good idea. However, I
> think making it translatable is overkill; users should never see this,
> only developers.
>

Makes sense.  I think we should change it.

>
> + if (prev_lsn != InvalidXLogRecPtr)
> + Assert(prev_lsn <= change->lsn);
>
> There is no reason to ever write an if statement that contains only an
> Assert, and it's bad style. Write Assert(prev_lsn == InvalidXLogRecPtr
> || prev_lsn <= change->lsn), or better yet, use XLogRecPtrIsInvalid.
>

Agreed.

> The purpose and mechanism of the is_schema_sent flag is not clear to
> me. The word "schema" here seems to be being used to mean "snapshot,"
> which is rather confusing.
>

I have explained this flag below along with invalidations as both are
slightly related.

> I'm also somewhat unclear on what's happening here with invalidations.
> Perhaps that's as much a defect in my understanding as it is
> reflective of any problem with the patch, but I also don't see any
> comments either in 0002 or later patches explaining the theory of
> operation. If I've missed some, please point me in the right
> direction. Hypothetically speaking, it seems to me that if you just
> did InvalidateSystemCaches() every time the snapshot changed, you
> wouldn't need anything else (unless we're concerned with
> non-transactional invalidation messages like smgr and relmapper
> invalidations; not quite sure how those are handled). And, on the
> other hand, if we don't do InvalidateSystemCaches() every time the
> snapshot changes, then I don't understand why this works now, even
> without streaming.
>

I think the way invalidations work for logical replication is that
normally, we always start a new transaction before decoding each
commit, which allows us to accept the invalidations (via
AtStart_Cache).  However, if there are catalog changes within the
transaction being decoded, we need to reflect those before trying to
decode the WAL of the operation that happened after that catalog change.
As we are not logging the WAL for each invalidation, we need to
execute all the invalidation messages for this transaction at each
catalog change. We are able to do that now because we decode the entire
WAL for a transaction only once we get the commit's WAL, which contains
all the invalidation messages.  So, we queue them up and execute them at
each catalog change, which we identify by the WAL record
XLOG_HEAP2_NEW_CID.

The second related concept is that before sending each change to the
downstream (via pgoutput), we check whether we need to send the
schema.  We decide this based on the local map entry
(RelationSyncEntry), which indicates whether the schema for the
relation has already been sent or not. Once the schema of the relation
is sent, the entry for that relation in the map will indicate it. At the
time of invalidation processing we also blow away this map, so it always
reflects the correct state.

Now, to decode an in-progress transaction, we need to ensure that we
have received the WAL for all the invalidations before decoding the
WAL of action that happened immediately after that catalog change.
This is the reason we started WAL logging individual Invalidations.
So, with this change we don't need to execute all the invalidations
for each catalog change, rather execute them as and when their WAL is
being decoded.

The current mechanism to send schema changes won't work for streaming
transactions because after sending the change, the subtransaction might
abort.  On subtransaction abort, the downstream will simply discard
the changes, in which case we lose the schema change previously sent.
There is no such problem currently because we process all the aborts
before sending any change.  So, the current idea of having a schema_sent
flag in each map entry (RelationSyncEntry) won't work for streaming
transactions.  To solve this problem, the patch initially kept a flag
'is_schema_sent' for each top-level transaction (in ReorderBufferTXN)
so that we can always send the schema for each (sub)transaction of
streaming transactions, but that won't work if we access multiple
relations in the same subtransaction.  To solve this problem, we are
thinking of keeping a list/array of top-level xids in each
RelationSyncEntry.  Basically, whenever we send the schema for any
transaction, we note that in the RelationSyncEntry, and at abort/commit
time we remove the xid from the list.  Now, whenever we check whether
to send the schema for any operation in a transaction, we check if
our xid is present in that list for the particular RelationSyncEntry and
take an action based on that (if the xid is present, then we won't send
the schema, otherwise we send it). I think during decoding we should not
have that many open transactions, so the search in the array should be
cheap enough, but we can consider some other data structure like a hash
table as well.  (A rough sketch of this idea follows.)
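
To make that idea a bit more concrete, here is a rough sketch of how
pgoutput could track it; the field and helper names (streamed_txns,
get/set_schema_sent_in_streamed_txn) are placeholders for illustration,
not settled API:

    typedef struct RelationSyncEntry
    {
        Oid         relid;          /* relation oid */
        bool        schema_sent;    /* schema sent for non-streamed work */
        List       *streamed_txns;  /* top-level xids for which schema was streamed */
        /* ... other existing fields ... */
    } RelationSyncEntry;

    /* Was the schema already sent within the stream of this top-level xid? */
    static bool
    get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
    {
        ListCell   *lc;

        foreach(lc, entry->streamed_txns)
        {
            if ((TransactionId) lfirst_int(lc) == xid)
                return true;
        }

        return false;
    }

    /* Remember that the schema was sent for this relation in this top-level xid. */
    static void
    set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
    {
        entry->streamed_txns = lappend_int(entry->streamed_txns, (int) xid);
    }

At stream commit/abort time the xid would simply be removed from
streamed_txns, which is the cleanup step described above.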

I will think some more and respond to your remaining comments/suggestions.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Dec 11, 2019 at 11:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have rebased the patch set on the latest head.
>
> 0001 looks like a clever approach, but are you sure it doesn't hurt
> performance when many small XLOG records are being inserted? I think
> XLogRecordAssemble() can get pretty hot in some workloads.
>
> With regard to 0002, logging a separate WAL record for each
> invalidation seems painful; I think most operations that generate
> invalidations generate a bunch of them all at once. Perhaps you could
> just queue up invalidations as they happen, and then force anything
> that's been queued up to be emitted into WAL just before you emit any
> WAL record that might need to be decoded.
>

I feel we can log the invalidations of the entire command in one go if
we log them at CommandEndInvalidationMessages.  We already have all the
invalidations of the current command in
transInvalInfo->CurrentCmdInvalidMsgs.  This can save us the effort of
maintaining a new separate list/queue for invalidations and, to a good
extent, it will ameliorate your concern about logging each invalidation
separately.
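
A rough sketch of how that could look in CommandEndInvalidationMessages
(existing code abridged; LogLogicalInvalidations stands for the
WAL-logging routine added by the patch, and the exact placement and
guard are assumptions):

    void
    CommandEndInvalidationMessages(void)
    {
        if (transInvalInfo == NULL)
            return;

        /* existing behavior: apply the current command's invalidations locally */
        ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
                                    LocalExecuteInvalidationMessage);

        /*
         * Sketch of the proposal: with wal_level = logical, emit one WAL
         * record carrying all invalidations accumulated for this command,
         * instead of a separate record per invalidation message.
         */
        if (XLogLogicalInfoActive())
            LogLogicalInvalidations();

        /* existing behavior: carry them over to the prior-commands list */
        AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
                                   &transInvalInfo->CurrentCmdInvalidMsgs);
    }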

>
> 0006 contains lots of XXX comments that look like real issues. I guess
> those need to be fixed. Also, why don't we do the thing that the
> commit message for 0006 says we could "theoretically" do? I don't
> understand why we need the k-way merge at all,
>

I think we can do what is written in the commit message, but then we
need to maintain two paths (one for streaming contexts and another for
non-streaming contexts), unless we want to entirely get rid of storing
subtransaction changes separately, which seems like a more fundamental
change.  Right now such duplication is there to some extent as well, but
I have already given a comment to minimize it.  Having said that, I
think we can go either way.  I think the original intention was to
avoid doing more stuff unless it is really required, as this is already
a big patch set, but maybe Tomas has a different idea about this.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Dec 12, 2019 at 9:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Dec 11, 2019 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Dec 9, 2019 at 1:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > I have review the patch set and here are few comments/questions
> > >
> > > 1.
> > > +static void
> > > +pg_decode_stream_change(LogicalDecodingContext *ctx,
> > > + ReorderBufferTXN *txn,
> > > + Relation relation,
> > > + ReorderBufferChange *change)
> > > +{
> > > + OutputPluginPrepareWrite(ctx, true);
> > > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> > > + OutputPluginWrite(ctx, true);
> > > +}
> > >
> > > Should we show the tuple in the streamed change like we do for the
> > > pg_decode_change?
> > >
> >
> > I think so.  The patch shows the message in
> > pg_decode_stream_message(), so why to prohibit showing tuple here?
> >
> > > 2. pg_logical_slot_get_changes_guts
> > > It recreate the decoding slot [ctx =
> > > CreateDecodingContext(InvalidXLogRecPtr] but doesn't set the streaming
> > > to false, should we pass a parameter to
> > > pg_logical_slot_get_changes_guts saying whether we want streamed results or not
> > >
> >
> > CreateDecodingContext internally calls StartupDecodingContext which
> > sets the value of streaming based on if the plugin has provided
> > callbacks for streaming functions. Isn't that sufficient?  Why do we
> > need additional parameters here?
>
> I don't think that if plugin provides streaming function then we
> should stream.  Like pgoutput plugin provides streaming function but
> we only stream if streaming is on in create subscription command.  So
> I feel that should be true with any plugin.
>

How about adding a new boolean parameter (streaming) in
pg_create_logical_replication_slot()?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Masahiko Sawada
Date:
On Mon, 2 Dec 2019 at 17:32, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > > I have rebased the patch on the latest head and also fix the issue of
> > > "concurrent abort handling of the (sub)transaction." and attached as
> > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > > the complete patch set.  I have added the version number so that we
> > > can track the changes.
> >
> > The patch has rotten a bit and does not apply anymore.  Could you
> > please send a rebased version?  I have moved it to next CF, waiting on
> > author.
>
> I have rebased the patch set on the latest head.

Thank you for working on this.

This might have already been discussed, but I have a question about the
changes to the logical replication worker. In the current logical
replication there is a problem that the response time is doubled when
using synchronous replication, because walsenders send changes only
after commit. It's worse especially when a transaction makes a lot of
changes. So I expected this feature to reduce the response time by
sending changes even while the transaction is in progress, but it
doesn't seem to. The logical replication worker writes changes to
temporary files and applies these changes when the worker receives the
commit record (STREAM COMMIT). Since the worker sends the LSN of the
commit record as the flush LSN to the publisher after applying all
changes, the publisher must wait until all changes are applied on the
subscriber.  Another problem would be that the worker doesn't receive
changes while applying changes of other transactions. These things
make me think it's better to have a new worker dedicated to applying
changes, like we have the wal receiver process and the startup process.
Maybe we can have 2 workers (receiver and applier) per subscription.
Any thoughts?

Regards,


--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Kyotaro Horiguchi
Date:
Hello.

At Fri, 13 Dec 2019 14:46:20 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in 
> On Wed, Dec 11, 2019 at 11:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > I have rebased the patch set on the latest head.
> >
> > 0001 looks like a clever approach, but are you sure it doesn't hurt
> > performance when many small XLOG records are being inserted? I think
> > XLogRecordAssemble() can get pretty hot in some workloads.
> >
> > With regard to 0002, logging a separate WAL record for each
> > invalidation seems painful; I think most operations that generate
> > invalidations generate a bunch of them all at once. Perhaps you could
> > just queue up invalidations as they happen, and then force anything
> > that's been queued up to be emitted into WAL just before you emit any
> > WAL record that might need to be decoded.
> >
> 
> I feel we can log the invalidations of the entire command at one go if
> we log at CommandEndInvalidationMessages.  We already have all the
> invalidations of current command in
> transInvalInfo->CurrentCmdInvalidMsgs.  This can save us the effort of
> maintaining a new separate list/queue for invalidations and to a good
> extent, it will ameliorate your concern of logging each invalidation
> separately.

I have a question on this. Does that mean that the current logical
decoder (or reorderbuffer) may emit incorrect results if a catalog
change was made during the transaction currently being decoded? If so,
this is not a feature but a bug fix.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Dec 20, 2019 at 11:47 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Mon, 2 Dec 2019 at 17:32, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:
> > >
> > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > > > I have rebased the patch on the latest head and also fix the issue of
> > > > "concurrent abort handling of the (sub)transaction." and attached as
> > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > > > the complete patch set.  I have added the version number so that we
> > > > can track the changes.
> > >
> > > The patch has rotten a bit and does not apply anymore.  Could you
> > > please send a rebased version?  I have moved it to next CF, waiting on
> > > author.
> >
> > I have rebased the patch set on the latest head.
>
> Thank you for working on this.
>
> This might have already been discussed but I have a question about the
> changes of logical replication worker. In the current logical
> replication there is a problem that the response time are doubled when
> using synchronous replication because wal senders send changes after
> commit. It's worse especially when a transaction makes a lot of
> changes. So I expected this feature to reduce the response time by
> sending changes even while the transaction is progressing but it
> doesn't seem to be. The logical replication worker writes changes to
> temporary files and applies these changes when the worker received
> commit record (STREAM COMMIT). Since the worker sends the LSN of
> commit record as flush LSN to the publisher after applying all
> changes, the publisher must wait for all changes are applied to the
> subscriber.
>

The main aim of this feature is to reduce apply lag.  If we send all
the changes together, their apply can be delayed by the network
transfer, whereas if most of the changes have already been sent, then
we save the effort of sending the entire data at commit time.
This in itself gives us decent benefits.  Sure, we can further improve
it by having separate workers (dedicated to applying the changes) as you
are suggesting, and in fact there is a patch for that as well (see the
performance results and bgworker patch at [1]), but if we try to shove
all the things in at one go, then it will be difficult to get this patch
committed (there are already enough things in it, and the patch is big
enough that getting it right takes a lot of energy).  So, the plan is
something like this: first we get the basic feature in, and then try to
improve it by having dedicated workers or things like that.  Does this
make sense to you?

[1] - https://www.postgresql.org/message-id/8eda5118-2dd0-79a1-4fe9-eec7e334de17%40postgrespro.ru

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Dec 20, 2019 at 2:00 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> Hello.
>
> At Fri, 13 Dec 2019 14:46:20 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> > On Wed, Dec 11, 2019 at 11:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > >
> > > On Mon, Dec 2, 2019 at 3:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > I have rebased the patch set on the latest head.
> > >
> > > 0001 looks like a clever approach, but are you sure it doesn't hurt
> > > performance when many small XLOG records are being inserted? I think
> > > XLogRecordAssemble() can get pretty hot in some workloads.
> > >
> > > With regard to 0002, logging a separate WAL record for each
> > > invalidation seems painful; I think most operations that generate
> > > invalidations generate a bunch of them all at once. Perhaps you could
> > > just queue up invalidations as they happen, and then force anything
> > > that's been queued up to be emitted into WAL just before you emit any
> > > WAL record that might need to be decoded.
> > >
> >
> > I feel we can log the invalidations of the entire command at one go if
> > we log at CommandEndInvalidationMessages.  We already have all the
> > invalidations of current command in
> > transInvalInfo->CurrentCmdInvalidMsgs.  This can save us the effort of
> > maintaining a new separate list/queue for invalidations and to a good
> > extent, it will ameliorate your concern of logging each invalidation
> > separately.
>
> I have a question on this. Does that mean that the current logical
> decoder (or reorderbuffer)
>

What does "current" refer to here?  Is it about HEAD or about the
patch?   Without the patch, we decode only at commit time, and by that
time we have all invalidations (logged with the commit WAL record), so
we just execute them at each catalog change (see the actions in
REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID).  The patch has to
WAL-log each invalidation separately because we can decode the
intermediate changes, so we can't wait till commit.  The above is just
an optimization for the patch.  AFAIK, there is no correctness issue
here, but let me know if you see any.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
vignesh C
Date:
On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > > I have rebased the patch on the latest head and also fix the issue of
> > > "concurrent abort handling of the (sub)transaction." and attached as
> > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > > the complete patch set.  I have added the version number so that we
> > > can track the changes.
> >
> > The patch has rotten a bit and does not apply anymore.  Could you
> > please send a rebased version?  I have moved it to next CF, waiting on
> > author.
>
> I have rebased the patch set on the latest head.
>

Few comments:
The assert variable should be within #ifdef USE_ASSERT_CHECKING in patch
v2-0008-Add-support-for-streaming-to-built-in-replication.patch:
+               int64           subidx;
+               bool            found = false;
+               char            path[MAXPGPATH];
+
+               subidx = -1;
+               subxact_info_read(MyLogicalRepWorker->subid, xid);
+
+               /* FIXME optimize the search by bsearch on sorted data */
+               for (i = nsubxacts; i > 0; i--)
+               {
+                       if (subxacts[i - 1].xid == subxid)
+                       {
+                               subidx = (i - 1);
+                               found = true;
+                               break;
+                       }
+               }
+
+               /* We should not receive aborts for unknown subtransactions. */
+               Assert(found);

Add the typedefs like below in typedefs.lst common across the patches:
xl_xact_invalidations, ReorderBufferStreamIterTXNEntry,
ReorderBufferStreamIterTXNState, SubXactInfo

"are written" appears twice in commit message of
v2-0002-Issue-individual-invalidations-with-wal_level-log.patch:
The individual invalidations are written are written using a new
xlog record type XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource
manager. See LogLogicalInvalidations for details.

v2-0002-Issue-individual-invalidations-with-wal_level-log.patch patch
does not compile by itself:
reorderbuffer.c:1822:9: error: ‘ReorderBufferTXN’ has no member named
‘is_schema_sent’
+
LocalExecuteInvalidationMessage(&change->data.inval.msg);
+                                       txn->is_schema_sent = false;
+                                       break;

Should we include printing of id here like in earlier cases in
v2-0002-Issue-individual-invalidations-with-wal_level-log.patch:
+                       appendStringInfo(buf, " relcache %u", msg->rc.relId);
+               /* not expected, but print something anyway */
+               else if (msg->id == SHAREDINVALSMGR_ID)
+                       appendStringInfoString(buf, " smgr");
+               /* not expected, but print something anyway */
+               else if (msg->id == SHAREDINVALRELMAP_ID)
+                       appendStringInfo(buf, " relmap db %u", msg->rm.dbId);

There is some code duplication in stream_change_cb_wrapper,
stream_truncate_cb_wrapper, stream_message_cb_wrapper,
stream_abort_cb_wrapper, stream_commit_cb_wrapper,
stream_start_cb_wrapper and stream_stop_cb_wrapper functions in
v2-0003-Extend-the-output-plugin-API-with-stream-methods.patch patch.
Should we have a separate function for common code?

Should we add a function header for AssertChangeLsnOrder in
v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch:
+static void
+AssertChangeLsnOrder(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{

This "Assert(txn->first_lsn != InvalidXLogRecPtr)"can be before the
loop, can be checked only once:
+       dlist_foreach(iter, &txn->changes)
+       {
+               ReorderBufferChange *cur_change;
+
+               cur_change = dlist_container(ReorderBufferChange,
node, iter.cur);
+
+               Assert(txn->first_lsn != InvalidXLogRecPtr);
+               Assert(cur_change->lsn != InvalidXLogRecPtr);
+               Assert(txn->first_lsn <= cur_change->lsn);
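
A sketch of the rearranged checks (same names as the quoted hunk):

    /* first_lsn does not change inside the loop, so check it once up front */
    Assert(txn->first_lsn != InvalidXLogRecPtr);

    dlist_foreach(iter, &txn->changes)
    {
        ReorderBufferChange *cur_change;

        cur_change = dlist_container(ReorderBufferChange, node, iter.cur);

        Assert(cur_change->lsn != InvalidXLogRecPtr);
        Assert(txn->first_lsn <= cur_change->lsn);
    }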

Should we add a function header for ReorderBufferDestroyTupleCidHash in
v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch:
+static void
+ReorderBufferDestroyTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+       if (txn->tuplecid_hash != NULL)
+       {
+               hash_destroy(txn->tuplecid_hash);
+               txn->tuplecid_hash = NULL;
+       }
+}
+

Should we add a function header for ReorderBufferStreamCommit in
v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch:
+static void
+ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+       /* we should only call this for previously streamed transactions */
+       Assert(rbtxn_is_streamed(txn));
+
+       ReorderBufferStreamTXN(rb, txn);
+
+       rb->stream_commit(rb, txn, txn->final_lsn);
+
+       ReorderBufferCleanupTXN(rb, txn);
+}
+

Should we add a function header for ReorderBufferCanStream in
v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch:
+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+       LogicalDecodingContext *ctx = rb->private_data;
+
+       return ctx->streaming;
+}

patch v2-0008-Add-support-for-streaming-to-built-in-replication.patch
does not apply:
Hunk #18 FAILED at 2035.
Hunk #19 succeeded at 2199 (offset -16 lines).
1 out of 19 hunks FAILED -- saving rejects to file
src/backend/replication/logical/worker.c.rej

Header inclusion may not be required in patch
v2-0008-Add-support-for-streaming-to-built-in-replication.patch:
+++ b/src/backend/replication/logical/launcher.c
@@ -14,6 +14,8 @@
  *
  *-------------------------------------------------------------------------
  */
+#include <sys/types.h>
+#include <unistd.h>

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Sun, Dec 22, 2019 at 5:04 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Few comments:
> assert variable should be within #ifdef USE_ASSERT_CHECKING in patch
> v2-0008-Add-support-for-streaming-to-built-in-replication.patch:
> +               int64           subidx;
> +               bool            found = false;
> +               char            path[MAXPGPATH];
> +
> +               subidx = -1;
> +               subxact_info_read(MyLogicalRepWorker->subid, xid);
> +
> +               /* FIXME optimize the search by bsearch on sorted data */
> +               for (i = nsubxacts; i > 0; i--)
> +               {
> +                       if (subxacts[i - 1].xid == subxid)
> +                       {
> +                               subidx = (i - 1);
> +                               found = true;
> +                               break;
> +                       }
> +               }
> +
> +               /* We should not receive aborts for unknown subtransactions. */
> +               Assert(found);
>

We can use PG_USED_FOR_ASSERTS_ONLY for that variable.
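
For reference, a minimal sketch of that, applied to the quoted hunk
(PG_USED_FOR_ASSERTS_ONLY marks the variable so non-assert builds don't
warn about it being set but never used):

    bool        found PG_USED_FOR_ASSERTS_ONLY = false;
    ...
    /* We should not receive aborts for unknown subtransactions. */
    Assert(found);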

>
> Should we include printing of id here like in earlier cases in
> v2-0002-Issue-individual-invalidations-with-wal_level-log.patch:
> +                       appendStringInfo(buf, " relcache %u", msg->rc.relId);
> +               /* not expected, but print something anyway */
> +               else if (msg->id == SHAREDINVALSMGR_ID)
> +                       appendStringInfoString(buf, " smgr");
> +               /* not expected, but print something anyway */
> +               else if (msg->id == SHAREDINVALRELMAP_ID)
> +                       appendStringInfo(buf, " relmap db %u", msg->rm.dbId);
>

I am not sure this patch logs these invalidations, so I am not sure it
makes sense to add more ids in the cases you are referring to.
However, if we change it to logging all invalidations at command end, as
being discussed in this thread, then it might be better to do what you
are suggesting.

>
> Should we can add function header for AssertChangeLsnOrder in
> v2-0006-Implement-streaming-mode-in-ReorderBuffer.patch:
> +static void
> +AssertChangeLsnOrder(ReorderBuffer *rb, ReorderBufferTXN *txn)
> +{
>
> This "Assert(txn->first_lsn != InvalidXLogRecPtr)"can be before the
> loop, can be checked only once:
> +       dlist_foreach(iter, &txn->changes)
> +       {
> +               ReorderBufferChange *cur_change;
> +
> +               cur_change = dlist_container(ReorderBufferChange,
> node, iter.cur);
> +
> +               Assert(txn->first_lsn != InvalidXLogRecPtr);
> +               Assert(cur_change->lsn != InvalidXLogRecPtr);
> +               Assert(txn->first_lsn <= cur_change->lsn);
>

This makes sense to me.  Another thing about this function: do we
really need the "ReorderBuffer *rb" parameter here?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Robert Haas
Date:
On Thu, Dec 12, 2019 at 3:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I don't think we have evaluated it yet, but we should do it.  The
> point to note is that it is only for the case when wal_level is
> 'logical' (see IsSubTransactionAssignmentPending) in which case we
> already log more WAL, so this might not impact much.  I guess that it
> might be better to have that check in XLogRecordAssemble for the sake
> of clarity.

I don't think that this is really a valid argument. Just because we
have some overhead now doesn't mean that adding more won't hurt. Even
testing the wal_level costs a little something.

> I think the way invalidations work for logical replication is that
> normally, we always start a new transaction before decoding each
> commit which allows us to accept the invalidations (via
> AtStart_Cache).  However, if there are catalog changes within the
> transaction being decoded, we need to reflect those before trying to
> decode the WAL of operation which happened after that catalog change.
> As we are not logging the WAL for each invalidation, we need to
> execute all the invalidation messages for this transaction at each
> catalog change. We are able to do that now as we decode the entire WAL
> for a transaction only once we get the commit's WAL which contains all
> the invalidation messages.  So, we queue them up and execute them for
> each catalog change which we identify by WAL record
> XLOG_HEAP2_NEW_CID.

Thanks for the explanation. That makes sense. But, it's still true,
AFAICS, that instead of doing this stuff with logging invalidations
you could just InvalidateSystemCaches() in the cases where you are
currently applying all of the transaction's invalidations. That
approach might be worse than changing the way invalidations are
logged, but the two approaches deserve to be compared. One approach
has more CPU overhead and the other has more WAL overhead, so it's a
little hard to compare them, but it seems worth mulling over.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Masahiko Sawada
Date:
On Fri, 20 Dec 2019 at 22:30, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Dec 20, 2019 at 11:47 AM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > On Mon, 2 Dec 2019 at 17:32, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:
> > > >
> > > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> > > > > I have rebased the patch on the latest head and also fix the issue of
> > > > > "concurrent abort handling of the (sub)transaction." and attached as
> > > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> > > > > the complete patch set.  I have added the version number so that we
> > > > > can track the changes.
> > > >
> > > > The patch has rotten a bit and does not apply anymore.  Could you
> > > > please send a rebased version?  I have moved it to next CF, waiting on
> > > > author.
> > >
> > > I have rebased the patch set on the latest head.
> >
> > Thank you for working on this.
> >
> > This might have already been discussed but I have a question about the
> > changes of logical replication worker. In the current logical
> > replication there is a problem that the response time are doubled when
> > using synchronous replication because wal senders send changes after
> > commit. It's worse especially when a transaction makes a lot of
> > changes. So I expected this feature to reduce the response time by
> > sending changes even while the transaction is progressing but it
> > doesn't seem to be. The logical replication worker writes changes to
> > temporary files and applies these changes when the worker received
> > commit record (STREAM COMMIT). Since the worker sends the LSN of
> > commit record as flush LSN to the publisher after applying all
> > changes, the publisher must wait for all changes are applied to the
> > subscriber.
> >
>
> The main aim of this feature is to reduce apply lag.  Because if we
> send all the changes together it can delay there apply because of
> network delay, whereas if most of the changes are already sent, then
> we will save the effort on sending the entire data at commit time.
> This in itself gives us decent benefits.  Sure, we can further improve
> it by having separate workers (dedicated to apply the changes) as you
> are suggesting and in fact, there is a patch for that as well(see the
> performance results and bgworker patch at [1]), but if try to shove in
> all the things in one go, then it will be difficult to get this patch
> committed (there are already enough things and the patch is quite big
> that to get it right takes a lot of energy).  So, the plan is
> something like that first we get the basic feature and then try to
> improve by having dedicated workers or things like that.  Does this
> make sense to you?
>

Thank you for the explanation. The plan makes sense. But I think in the
current design it's a problem that the logical replication worker
doesn't receive changes (and doesn't check interrupts) while applying
committed changes, even if we don't have a worker dedicated to
applying. I think the worker should continue to receive changes and
save them to temporary files even while applying changes. Otherwise
the buffer would easily fill up and replication would get stuck.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Dec 24, 2019 at 11:17 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Fri, 20 Dec 2019 at 22:30, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > The main aim of this feature is to reduce apply lag.  Because if we
> > send all the changes together it can delay there apply because of
> > network delay, whereas if most of the changes are already sent, then
> > we will save the effort on sending the entire data at commit time.
> > This in itself gives us decent benefits.  Sure, we can further improve
> > it by having separate workers (dedicated to apply the changes) as you
> > are suggesting and in fact, there is a patch for that as well(see the
> > performance results and bgworker patch at [1]), but if try to shove in
> > all the things in one go, then it will be difficult to get this patch
> > committed (there are already enough things and the patch is quite big
> > that to get it right takes a lot of energy).  So, the plan is
> > something like that first we get the basic feature and then try to
> > improve by having dedicated workers or things like that.  Does this
> > make sense to you?
> >
>
> Thank you for explanation. The plan makes sense. But I think in the
> current design it's a problem that logical replication worker doesn't
> receive changes (and doesn't check interrupts) during applying
> committed changes even if we don't have a worker dedicated for
> applying. I think the worker should continue to receive changes and
> save them to temporary files even during applying changes.
>

Won't it defeat the purpose of this feature, which is to reduce the apply
lag?  Basically, it can happen that while applying a commit, it
constantly gets changes of other transactions, which will delay the
apply of the current transaction.  Also, won't it create some further
work to identify the order of commits?  Say while applying commit-1,
it receives 5 other commits that are written to separate temporary
files.  How will we later identify which transaction's WAL we need to
apply first?  We might deduce it by LSNs, but I think that could be
tricky.  Another thing is that I think it could lead to some design
complications as well, because while applying a commit, you need some
sort of callback or something like that to receive and flush totally
unrelated changes.  It could lead to another kind of failure mode
wherein, while applying a commit, it tries to receive another
transaction's data and some failure happens while writing the data of
that transaction.  I am not sure if it is a good idea to try something
like that.

> Otherwise
> the buffer would be easily full and replication gets stuck.
>

Are you talking about the network buffer?  I think the best way, as
discussed, is to launch new workers for streamed transactions, but we
can do that as an additional feature. Anyway, as proposed, users can
choose the streaming mode per subscription, so there is an option to
turn this on selectively.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Masahiko Sawada
Date:
On Tue, 24 Dec 2019 at 17:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Dec 24, 2019 at 11:17 AM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > On Fri, 20 Dec 2019 at 22:30, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > The main aim of this feature is to reduce apply lag.  Because if we
> > > send all the changes together it can delay there apply because of
> > > network delay, whereas if most of the changes are already sent, then
> > > we will save the effort on sending the entire data at commit time.
> > > This in itself gives us decent benefits.  Sure, we can further improve
> > > it by having separate workers (dedicated to apply the changes) as you
> > > are suggesting and in fact, there is a patch for that as well(see the
> > > performance results and bgworker patch at [1]), but if try to shove in
> > > all the things in one go, then it will be difficult to get this patch
> > > committed (there are already enough things and the patch is quite big
> > > that to get it right takes a lot of energy).  So, the plan is
> > > something like that first we get the basic feature and then try to
> > > improve by having dedicated workers or things like that.  Does this
> > > make sense to you?
> > >
> >
> > Thank you for explanation. The plan makes sense. But I think in the
> > current design it's a problem that logical replication worker doesn't
> > receive changes (and doesn't check interrupts) during applying
> > committed changes even if we don't have a worker dedicated for
> > applying. I think the worker should continue to receive changes and
> > save them to temporary files even during applying changes.
> >
>
> Won't it beat the purpose of this feature which is to reduce the apply
> lag?  Basically, it can so happen that while applying commit, it
> constantly gets changes of other transactions which will delay the
> apply of the current transaction.

You're right. But it seems to me that it optimizes the apply lag only
for a transaction that made many changes. On the other hand, while
applying a transaction that made many changes, the apply of subsequent
changes is delayed.

>  Also, won't it create some further
> work to identify the order of commits?  Say while applying commit-1,
> it receives 5 other commits that are written to separate temporary
> files.  How will we later identify which transaction's WAL we need to
> apply first?  We might deduce by LSN's, but I think that could be
> tricky.  Another thing is that I think it could lead to some design
> complications as well because while applying commit, you need some
> sort of callback or something like that to receive and flush totally
> unrelated changes.  It could lead to another kind of failure mode
> wherein while applying commit if it tries to receive another
> transaction data and some failure happens while writing the data of
> that transaction.  I am not sure if it is a good idea to try something
> like that.

It's just an idea, but we might want to have new workers dedicated to
applying changes first, and then add the streaming option later. That
way we can reduce the flush lag depending on the use case. The commit
order can be determined by the receiver and shared with the applier
in shared memory. Once we have separated the workers, the streaming
option can be introduced without such a downside.

>
> > Otherwise
> > the buffer would be easily full and replication gets stuck.
> >
>
> Are you telling about network buffer?

Yes.

>   I think the best way as
> discussed is to launch new workers for streamed transactions, but we
> can do that as an additional feature. Anyway, as proposed, users can
> choose the streaming mode for subscriptions, so there is an option to
> turn this selectively.

Yes. But a user who wants to use this feature would want to replicate
many changes, and I guess the side effect is quite big. I think that at
least we need to make logical replication tolerate such a situation.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Tue, Dec 10, 2019 at 10:23:19AM +0530, Dilip Kumar wrote:
>On Tue, Dec 10, 2019 at 9:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> >
>> > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:
>> > >
>> > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
>> > > > I have rebased the patch on the latest head and also fix the issue of
>> > > > "concurrent abort handling of the (sub)transaction." and attached as
>> > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
>> > > > the complete patch set.  I have added the version number so that we
>> > > > can track the changes.
>> > >
>> > > The patch has rotten a bit and does not apply anymore.  Could you
>> > > please send a rebased version?  I have moved it to next CF, waiting on
>> > > author.
>> >
>> > I have rebased the patch set on the latest head.
>> >
>> > Apart from this, there is one issue reported by my colleague Vignesh.
>> > The issue is that if we use more than two relations in a transaction
>> > then there is an error on standby (no relation map entry for remote
>> > relation ID 16390).  After analyzing I have found that for the
>> > streaming transaction an "is_schema_sent" flag is kept in
>> > ReorderBufferTXN.  And, I think that is done so that we can send the
>> > schema for each transaction stream so that if any subtransaction gets
>> > aborted we don't lose the logical WAL for that schema.  But, this
>> > solution has induced a very basic issue that if a transaction operate
>> > on more than 1 relation then after sending the schema for the first
>> > relation it will mark the flag true and the schema for the subsequent
>> > relations will never be sent.
>> >
>>
>> How about keeping a list of top-level xids in each RelationSyncEntry?
>> Basically, whenever we send the schema for any transaction, we note
>> that in RelationSyncEntry and at abort time we can remove xid from the
>> list.  Now, whenever, we check whether to send schema for any
>> operation in a transaction, we will check if our xid is present in
>> that list for a particular RelationSyncEntry and take an action based
>> on that (if xid is present, then we won't send the schema, otherwise,
>> send it).
>The idea make sense to me.  I will try to write a patch for this and test.
>

Yeah, the "is_schema_sent" flag in ReorderBufferTXN does not work - it
needs to be in the RelationSyncEntry. In fact, I already have code for
that in my private repository - I thought the patches I sent here do
include this, but apparently I forgot to include this bit :-(

Attached is a rebased patch series, fixing this. It's essentially v2
with a couple of patches (0003, 0008, 0009 and 0012) replacing the
is_schema_sent with correct handling.


0003 - removes an is_schema_sent reference added prematurely (it's added
by a later patch, causing compile failure)

0008 - adds the is_schema_sent back (essentially reverting 0003)

0009 - removes is_schema_sent entirely

0012 - adds the correct handling of schema flags in pgoutput


I don't know what other changes you've made since v2, so this way it
should be possible to just take 0003, 0008, 0009 and 0012 and slip them
in with minimal hassle.

FWIW thanks to everyone (and Amit and Dilip in particular) working on
this patch series.  There's been a lot of great reviews and improvements
since I abandoned this thread for a while. I expect to be able to spend
more time working on this in January.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Sat, Dec 28, 2019 at 9:33 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Dec 10, 2019 at 10:23:19AM +0530, Dilip Kumar wrote:
> >On Tue, Dec 10, 2019 at 9:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>
> >> On Mon, Dec 2, 2019 at 2:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >> >
> >> > On Sun, Dec 1, 2019 at 7:58 AM Michael Paquier <michael@paquier.xyz> wrote:
> >> > >
> >> > > On Fri, Nov 22, 2019 at 01:18:11PM +0530, Dilip Kumar wrote:
> >> > > > I have rebased the patch on the latest head and also fix the issue of
> >> > > > "concurrent abort handling of the (sub)transaction." and attached as
> >> > > > (v1-0013-Extend-handling-of-concurrent-aborts-for-streamin) along with
> >> > > > the complete patch set.  I have added the version number so that we
> >> > > > can track the changes.
> >> > >
> >> > > The patch has rotten a bit and does not apply anymore.  Could you
> >> > > please send a rebased version?  I have moved it to next CF, waiting on
> >> > > author.
> >> >
> >> > I have rebased the patch set on the latest head.
> >> >
> >> > Apart from this, there is one issue reported by my colleague Vignesh.
> >> > The issue is that if we use more than two relations in a transaction
> >> > then there is an error on standby (no relation map entry for remote
> >> > relation ID 16390).  After analyzing I have found that for the
> >> > streaming transaction an "is_schema_sent" flag is kept in
> >> > ReorderBufferTXN.  And, I think that is done so that we can send the
> >> > schema for each transaction stream so that if any subtransaction gets
> >> > aborted we don't lose the logical WAL for that schema.  But, this
> >> > solution has induced a very basic issue that if a transaction operate
> >> > on more than 1 relation then after sending the schema for the first
> >> > relation it will mark the flag true and the schema for the subsequent
> >> > relations will never be sent.
> >> >
> >>
> >> How about keeping a list of top-level xids in each RelationSyncEntry?
> >> Basically, whenever we send the schema for any transaction, we note
> >> that in RelationSyncEntry and at abort time we can remove xid from the
> >> list.  Now, whenever, we check whether to send schema for any
> >> operation in a transaction, we will check if our xid is present in
> >> that list for a particular RelationSyncEntry and take an action based
> >> on that (if xid is present, then we won't send the schema, otherwise,
> >> send it).
> >The idea make sense to me.  I will try to write a patch for this and test.
> >
>
> Yeah, the "is_schema_sent" flag in ReorderBufferTXN does not work - it
> needs to be in the RelationSyncEntry. In fact, I already have code for
> that in my private repository - I thought the patches I sent here do
> include this, but apparently I forgot to include this bit :-(
>
> Attached is a rebased patch series, fixing this. It's essentially v2
> with a couple of patches (0003, 0008, 0009 and 0012) replacing the
> is_schema_sent with correct handling.
>
>
> 0003 - removes an is_schema_sent reference added prematurely (it's added
> by a later patch, causing compile failure)
>
> 0008 - adds the is_schema_sent back (essentially reverting 0003)
>
> 0009 - removes is_schema_sent entirely
>
> 0012 - adds the correct handling of schema flags in pgoutput
>
>
> I don't know what other changes you've made since v2, so this way it
> should be possible to just take 0003, 0008, 0009 and 0012 and slip them
> in with minimal hassle.
>
> FWIW thanks to everyone (and Amit and Dilip in particular) working on
> this patch series.  There's been a lot of great reviews and improvements
> since I abandoned this thread for a while. I expect to be able to spend
> more time working on this in January.
>
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+ MemoryContext oldctx;
+
+ oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+ entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+
+ MemoryContextSwitchTo(oldctx);
+}
I was looking into the schema tracking solution and I have one
question: shouldn't we remove the topxid from the list if the
(sub)transaction is aborted? Because once it is aborted we need to
resend the schema.  I think we can remove the xid from the list in the
cleanup_rel_sync_cache function?


I have observed some more issues

1. Currently, in ReorderBufferCommit, it is always expected that
whenever we get REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we must
have already got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT, and in
SPEC_CONFIRM we send the tuple we got in SPEC_INSERT.  But now those
two messages can be in different streams.  So we need to find a way to
handle this.  Maybe once we get SPEC_INSERT we can remember the
tuple, and then if we get the SPEC_CONFIRM in the next stream we can
send that tuple?

2. At commit time, in DecodeCommit, we check whether we need to skip
the changes of the transaction or not by calling
SnapBuildXactNeedsSkip.  But since we now support streaming, it's
possible that before we decode the commit WAL, we might have already
sent the changes to the output plugin even though we could have
skipped those changes.  So my question is: instead of checking at
commit time, can't we check before adding to the ReorderBuffer itself,
or we can truncate the changes if SnapBuildXactNeedsSkip is true
whenever the logical_decoding_work_mem limit is reached.  Am I missing
something here?


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Dec 12, 2019 at 9:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
Yesterday, Tomas posted the latest version of the patch set, which
contains the fix for the schema-send part.  Meanwhile, I was working on a
few review comments/bugfixes and refactoring.  I have tried to merge those
changes with the latest patch set except the refactoring related to the
"0006-Implement-streaming-mode-in-ReorderBuffer" patch, because Tomas
has also made some changes in the same patch.  I have created a
separate patch for the same so that we can review the changes and then
we can merge them into the main patch.

> On Wed, Dec 11, 2019 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Dec 9, 2019 at 1:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > I have review the patch set and here are few comments/questions
> > >
> > > 1.
> > > +static void
> > > +pg_decode_stream_change(LogicalDecodingContext *ctx,
> > > + ReorderBufferTXN *txn,
> > > + Relation relation,
> > > + ReorderBufferChange *change)
> > > +{
> > > + OutputPluginPrepareWrite(ctx, true);
> > > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> > > + OutputPluginWrite(ctx, true);
> > > +}
> > >
> > > Should we show the tuple in the streamed change like we do for the
> > > pg_decode_change?
> > >
> >
> > I think so.  The patch shows the message in
> > pg_decode_stream_message(), so why to prohibit showing tuple here?

Yeah, we can do that.  One option is that we can directly register the
"pg_decode_change" function as the stream_change_cb callback, and that
will show the tuple.  Another option is that we can write a function
similar to pg_decode_change and change the message to include the text
"STREAM", so that the user can distinguish between a tuple from a
committed transaction and one from an in-progress transaction.
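
A rough sketch of that second option (following test_decoding's style;
the tuple-printing part is only indicated by a comment, not written out):

static void
pg_decode_stream_change(LogicalDecodingContext *ctx,
                        ReorderBufferTXN *txn,
                        Relation relation,
                        ReorderBufferChange *change)
{
    OutputPluginPrepareWrite(ctx, true);

    /* prefix with STREAM so in-progress output is distinguishable */
    appendStringInfo(ctx->out, "STREAM change for TXN %u: ", txn->xid);

    /*
     * Here we would reuse the same tuple-serialization logic that
     * pg_decode_change uses for change->data.tp.newtuple/oldtuple
     * (omitted in this sketch).
     */

    OutputPluginWrite(ctx, true);
}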

While analyzing this solution I have encountered one more issue.  The
problem is that currently, at commit time in DecodeCommit, we check
whether we need to skip the changes of the transaction or not by
calling SnapBuildXactNeedsSkip.  But since we now support streaming,
it's possible that before the commit WAL arrives we might have already
sent the changes to the output plugin even though we could have skipped
those changes.  So my question is: instead of checking at commit time,
can't we check before adding to the ReorderBuffer itself, or we can
truncate the changes if SnapBuildXactNeedsSkip is true whenever the
logical_decoding_work_mem limit is reached.

> > Few comments on this patch series:
> >
> > 0001-Immediately-WAL-log-assignments:
> > ------------------------------------------------------------
> >
> > The commit message still refers to the old design for this patch.  I
> > think you need to modify the commit message as per the latest patch.
Done
> >
> > 0002-Issue-individual-invalidations-with-wal_level-log
> > ----------------------------------------------------------------------------
> > 1.
> > xact_desc_invalidations(StringInfo buf,
> > {
> > ..
> > + else if (msg->id == SHAREDINVALSNAPSHOT_ID)
> > + appendStringInfo(buf, " snapshot %u", msg->sn.relId);
> >
> > You have removed logging for the above cache but forgot to remove its
> > reference from one of the places.  Also, I think you need to add a
> > comment somewhere in inval.c to say why you are writing for WAL for
> > some types of invalidations and not for others?
Done
> >
> > 0003-Extend-the-output-plugin-API-with-stream-methods
> > --------------------------------------------------------------------------------
> > 1.
> > +     are required, while <function>stream_message_cb</function> and
> > +     <function>stream_message_cb</function> are optional.
> >
> > stream_message_cb is mentioned twice.  It seems the second one is for truncate.
Done
> >
> > 2.
> > size of the transaction size and network bandwidth, the transfer time
> > +    may significantly increase the apply lag.
> >
> > /size of the transaction size/size of the transaction
> >
> > no need to mention size twice.
Done
> >
> > 3.
> > +    Similarly to spill-to-disk behavior, streaming is triggered when the total
> > +    amount of changes decoded from the WAL (for all in-progress
> > transactions)
> > +    exceeds limit defined by <varname>logical_work_mem</varname> setting.
> >
> > The guc name used is wrong.  /Similarly to/Similar to/
Done
> >
> > 4.
> > stream_start_cb_wrapper()
> > {
> > ..
> > + /* state.report_location = apply_lsn; */
> > ..
> > + /* FIXME ctx->write_location = apply_lsn; */
> > ..
> > }
> >
> > See, if we can fix these and similar in the callback for the stop.  I
> > think we don't have final_lsn till we commit/abort.  Can we compute
> > before calling these API's?
Done
> >
> >
> > 0005-Gracefully-handle-concurrent-aborts-of-uncommitte
> > ----------------------------------------------------------------------------------
> > 1.
> > @@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> >   PG_CATCH();
> >   {
> >   /* TODO: Encapsulate cleanup
> > from the PG_TRY and PG_CATCH blocks */
> > +
> >   if (iterstate)
> >   ReorderBufferIterTXNFinish(rb, iterstate);
> >
> > Spurious line change.
> >
Done
> > 2. The commit message of this patch refers to Prepared transactions.
> > I think that needs to be changed.
> >
> > 0006-Implement-streaming-mode-in-ReorderBuffer
> > -------------------------------------------------------------------------
> > 1.
> > +
> > +/* iterator for streaming (only get data from memory) */
> > +static ReorderBufferStreamIterTXNState * ReorderBufferStreamIterTXNInit(
> > +
> > ReorderBuffer *rb,
> > +
> > ReorderBufferTXN
> > *txn);
> > +
> > +static ReorderBufferChange *ReorderBufferStreamIterTXNNext(
> > +    ReorderBuffer *rb,
> > +
> >    ReorderBufferStreamIterTXNState * state);
> > +
> > +static void ReorderBufferStreamIterTXNFinish(
> > +
> > ReorderBuffer *rb,
> > +
> > ReorderBufferStreamIterTXNState * state);
> >
> > Do we really need to introduce new APIs for iterating over changes
> > from streamed transactions?  Why can't we reuse the same API's as we
> > use for committed xacts?
Done
> >
> > 2.
> > +static void
> > +ReorderBufferStreamCommit(ReorderBuffer *rb, ReorderBufferTXN *txn)
> >
> > Please write some comments atop ReorderBufferStreamCommit.
Done
> >
> > 3.
> > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > {
> > ..
> > ..
> > + if (txn->snapshot_now
> > == NULL)
> > + {
> > + dlist_iter subxact_i;
> > +
> > + /* make sure this transaction is streamed for the first time */
> > +
> > Assert(!rbtxn_is_streamed(txn));
> > +
> > + /* at the beginning we should have invalid command ID */
> > + Assert(txn->command_id ==
> > InvalidCommandId);
> > +
> > + dlist_foreach(subxact_i, &txn->subtxns)
> > + {
> > + ReorderBufferTXN *subtxn;
> > +
> > +
> > subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
> > +
> > + if (subtxn->base_snapshot != NULL &&
> > +
> > (txn->base_snapshot == NULL ||
> > + txn->base_snapshot_lsn > subtxn->base_snapshot_lsn))
> > + {
> > +
> > txn->base_snapshot = subtxn->base_snapshot;
> >
> > The logic here seems to be correct, but I am not sure why it is not
> > considered to purge the base snapshot before assigning the subtxn's
> > snapshot and similarly, we have not purged snapshot for subtxn once we
> > are done with it.  I think we can use
> > ReorderBufferTransferSnapToParent to replace part of the logic here.
> > Do you see any reason for doing things differently here?
Done
> >
> > 4. In ReorderBufferStreamTXN, why do you need to use
> > ReorderBufferCopySnap to assign txn->base_snapshot to snapshot_now.

IMHO, here we don't assign the base snapshot directly because we
modify it by passing the command id, and that's the reason we make a
copy of it.
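
In other words (the ReorderBufferCopySnap call is the one from the
patch; the Assert is only there to spell out the point):

    /* take a private copy so we can set its command id without touching
     * txn->base_snapshot */
    snapshot_now = ReorderBufferCopySnap(rb, txn->base_snapshot,
                                         txn, command_id);
    /* ReorderBufferCopySnap sets the copy's curcid to command_id */
    Assert(snapshot_now->curcid == command_id);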
> >
> > 5. I see a lot of code similarity in ReorderBufferStreamTXN and
> > existing ReorderBufferCommit. I understand that there are some subtle
> > differences due to which we need to write this new function but can't
> > we encapsulate the specific parts of code in functions and then call
> > from both places.  I am talking about code in different cases for
> > change->action.
Done
> >
> > 6. + * Note: We never stream and serialize a transaction at the same time (e
> > /(e/(we
Done

I have also found one bug in
"v3-0012-fixup-add-proper-schema-tracking.patch" due to which some of
the streaming test cases were failing.  I have created a separate patch
to fix it.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have observed some more issues
>
> 1. Currently, In ReorderBufferCommit, it is always expected that
> whenever we get REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we must
> have already got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT and in
> SPEC_CONFIRM we send the tuple we got in SPEC_INSERT.  But, now those
> two messages can be in different streams.  So we need to find a way to
> handle this.  Maybe once we get SPEC_INSERT then we can remember the
> tuple and then if we get the SPEC_CONFIRM in the next stream we can
> send that tuple?
>

Your suggestion makes sense to me.  So, we can try it.

> 2. During commit time in DecodeCommit we check whether we need to skip
> the changes of the transaction or not by calling
> SnapBuildXactNeedsSkip but since now we support streaming so it's
> possible that before we decode the commit WAL, we might have already
> sent the changes to the output plugin even though we could have
> skipped those changes.  So my question is instead of checking at the
> commit time can't we check before adding to ReorderBuffer itself
>

I think if we can do that then the same will be true for current code
irrespective of this patch.  I think it is possible that we can't take
that decision while decoding because we haven't assembled a consistent
snapshot yet.  I think we might be able to do that while we try to
stream the changes.  I think we need to take care of all the
conditions during streaming (when the logical_decoding_work_mem limit
is reached) as we do in DecodeCommit.  This needs a bit more study.
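
To illustrate, the streaming path would need roughly this kind of check
(the wrapper below is hypothetical; SnapBuildXactNeedsSkip and the
fast_forward flag are things DecodeCommit already consults, and it
additionally filters by database and replication origin):

static bool
skip_streaming_txn(LogicalDecodingContext *ctx, XLogRecPtr lsn)
{
    /* same kind of conditions DecodeCommit checks before replay */
    if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, lsn))
        return true;

    /* don't produce output while merely fast-forwarding a slot */
    if (ctx->fast_forward)
        return true;

    return false;
}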

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Dec 26, 2019 at 12:36 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Tue, 24 Dec 2019 at 17:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > >
> > > Thank you for explanation. The plan makes sense. But I think in the
> > > current design it's a problem that logical replication worker doesn't
> > > receive changes (and doesn't check interrupts) during applying
> > > committed changes even if we don't have a worker dedicated for
> > > applying. I think the worker should continue to receive changes and
> > > save them to temporary files even during applying changes.
> > >
> >
> > Won't it beat the purpose of this feature which is to reduce the apply
> > lag?  Basically, it can so happen that while applying commit, it
> > constantly gets changes of other transactions which will delay the
> > apply of the current transaction.
>
> You're right. But it seems to me that it only optimizes the apply lag of
> a transaction that made many changes. On the other hand, while such a
> transaction is being applied, the apply of subsequent changes is
> delayed.
>

Hmm, how would it be worse than the current situation where once
commit is encountered on the publisher, we won't start with other
transactions until the replay of the same is finished on subscriber?

>
> >   I think the best way as
> > discussed is to launch new workers for streamed transactions, but we
> > can do that as an additional feature. Anyway, as proposed, users can
> > choose the streaming mode for subscriptions, so there is an option to
> > turn this selectively.
>
> Yes. But user who wants to use this feature would want to replicate
> many changes but I guess the side effect is quite big. I think that at
> least we need to make the logical replication tolerate such situation.
>

What exactly you mean by "at least we need to make the logical
replication tolerate such situation."?  Do you have something specific
in mind?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, Dec 30, 2019 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have observed some more issues
> >
> > 1. Currently, In ReorderBufferCommit, it is always expected that
> > whenever we get REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we must
> > have already got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT and in
> > SPEC_CONFIRM we send the tuple we got in SPEC_INSERT.  But, now those
> > two messages can be in different streams.  So we need to find a way to
> > handle this.  Maybe once we get SPEC_INSERT then we can remember the
> > tuple and then if we get the SPEC_CONFIRM in the next stream we can
> > send that tuple?
> >
>
> Your suggestion makes sense to me.  So, we can try it.
Sure.
>
> > 2. During commit time in DecodeCommit we check whether we need to skip
> > the changes of the transaction or not by calling
> > SnapBuildXactNeedsSkip but since now we support streaming so it's
> > possible that before we decode the commit WAL, we might have already
> > sent the changes to the output plugin even though we could have
> > skipped those changes.  So my question is instead of checking at the
> > commit time can't we check before adding to ReorderBuffer itself
> >
>
> I think if we can do that then the same will be true for current code
> irrespective of this patch.  I think it is possible that we can't take
> that decision while decoding because we haven't assembled a consistent
> snapshot yet.  I think we might be able to do that while we try to
> stream the changes.  I think we need to take care of all the
> conditions during streaming (when the logical_decoding_work_mem limit
> is reached) as we do in DecodeCommit.  This needs a bit more study.
I agree.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Dec 24, 2019 at 10:58 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Dec 12, 2019 at 3:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > I think the way invalidations work for logical replication is that
> > normally, we always start a new transaction before decoding each
> > commit which allows us to accept the invalidations (via
> > AtStart_Cache).  However, if there are catalog changes within the
> > transaction being decoded, we need to reflect those before trying to
> > decode the WAL of operation which happened after that catalog change.
> > As we are not logging the WAL for each invalidation, we need to
> > execute all the invalidation messages for this transaction at each
> > catalog change. We are able to do that now as we decode the entire WAL
> > for a transaction only once we get the commit's WAL which contains all
> > the invalidation messages.  So, we queue them up and execute them for
> > each catalog change which we identify by WAL record
> > XLOG_HEAP2_NEW_CID.
>
> Thanks for the explanation. That makes sense. But, it's still true,
> AFAICS, that instead of doing this stuff with logging invalidations
> you could just InvalidateSystemCaches() in the cases where you are
> currently applying all of the transaction's invalidations. That
> approach might be worse than changing the way invalidations are
> logged, but the two approaches deserve to be compared. One approach
> has more CPU overhead and the other has more WAL overhead, so it's a
> little hard to compare them, but it seems worth mulling over.
>

I have given this some thought and it seems to me that this will
increase not only CPU usage but also network usage.  The increase in
CPU usage will be for all walsenders that decode a transaction that
has performed DDL.  The increase in network usage comes from the fact
that we need to send the schema of relations again even though they
didn't require invalidation.  That is because the blanket invalidation
blows away our local map that remembers which relation schemas have
already been sent.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Sat, Dec 28, 2019 at 9:33 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> >
> > Yeah, the "is_schema_sent" flag in ReorderBufferTXN does not work - it
> > needs to be in the RelationSyncEntry. In fact, I already have code for
> > that in my private repository - I thought the patches I sent here do
> > include this, but apparently I forgot to include this bit :-(
> >
> > Attached is a rebased patch series, fixing this. It's essentially v2
> > with a couple of patches (0003, 0008, 0009 and 0012) replacing the
> > is_schema_sent with correct handling.
> >
> >
> > 0003 - removes an is_schema_sent reference added prematurely (it's added
> > by a later patch, causing compile failure)
> >
> > 0008 - adds the is_schema_sent back (essentially reverting 0003)
> >
> > 0009 - removes is_schema_sent entirely
> >
> > 0012 - adds the correct handling of schema flags in pgoutput
> >

Thanks for splitting the changes.  They are quite clear.

> >
> > I don't know what other changes you've made since v2, so this way it
> > should be possible to just take 0003, 0008, 0009 and 0012 and slip them
> > in with minimal hassle.
> >
> > FWIW thanks to everyone (and Amit and Dilip in particular) working on
> > this patch series.  There's been a lot of great reviews and improvements
> > since I abandoned this thread for a while. I expect to be able to spend
> > more time working on this in January.
> >
> +static void
> +set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
> +{
> + MemoryContext oldctx;
> +
> + oldctx = MemoryContextSwitchTo(CacheMemoryContext);
> +
> + entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
> +
> + MemoryContextSwitchTo(oldctx);
> +}
> I was looking into the schema tracking solution and I have one
> question, Shouldn't we remove the topxid from the list if the
> (sub)transaction is aborted?  because once it is aborted we need to
> resend the schema.
>

I think you are right because, at abort, the subscriber would remove
the changes (for a subtransaction) including the schema changes sent
and then it won't be able to understand the subsequent changes sent by
the publisher.  Won't we need to remove xid from the list at commit
time as well, otherwise, the list will keep on growing.  One more
thing, we need to search the list of all the relations in the local
map to find xid being aborted/committed, right?  If so, won't it be
costly doing at each transaction abort/commit?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Sat, Jan 4, 2020 at 10:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > On Sat, Dec 28, 2019 at 9:33 PM Tomas Vondra
> > <tomas.vondra@2ndquadrant.com> wrote:
> > +static void
> > +set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
> > +{
> > + MemoryContext oldctx;
> > +
> > + oldctx = MemoryContextSwitchTo(CacheMemoryContext);
> > +
> > + entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
> > +
> > + MemoryContextSwitchTo(oldctx);
> > +}
> > I was looking into the schema tracking solution and I have one
> > question, Shouldn't we remove the topxid from the list if the
> > (sub)transaction is aborted?  because once it is aborted we need to
> > resend the schema.
> >
>
> I think you are right because, at abort, the subscriber would remove
> the changes (for a subtransaction) including the schema changes sent
> and then it won't be able to understand the subsequent changes sent by
> the publisher.  Won't we need to remove xid from the list at commit
> time as well, otherwise, the list will keep on growing.
Yes, we need to remove the xid from the list at the time of commit as well.

 One more
> thing, we need to search the list of all the relations in the local
> map to find xid being aborted/committed, right?  If so, won't it be
> costly doing at each transaction abort/commit?
Yeah, if multiple concurrent transactions operate on common
relations then the list can grow longer.  I am not sure how many
concurrent large transactions are possible; maybe the list won't be so
long that searching becomes very costly.  Otherwise, we can maintain a
sorted array of the xids and do a binary search, or we can maintain a
hash table?
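
For illustration, the simple linear-list version of the cleanup could
look roughly like this (RelationSyncCache, RelationSyncEntry and
streamed_txns are from pgoutput/the patch; the function name and the
idea of calling it at commit/abort time are just a sketch):

static void
cleanup_streamed_txn(TransactionId xid)
{
    HASH_SEQ_STATUS hash_seq;
    RelationSyncEntry *entry;

    /* walk pgoutput's relation sync cache */
    hash_seq_init(&hash_seq, RelationSyncCache);
    while ((entry = hash_seq_search(&hash_seq)) != NULL)
    {
        /* forget that this xid's schema was sent for this relation */
        entry->streamed_txns = list_delete_int(entry->streamed_txns, xid);
    }
}

With a hash or sorted array per entry, only the inner removal would
change.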

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Dec 30, 2019 at 3:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Dec 12, 2019 at 9:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> Yesterday, Tomas has posted the latest version of the patch set which
> contain the fix for schema send part.  Meanwhile, I was working on few
> review comments/bugfixes and refactoring.  I have tried to merge those
> changes with the latest patch set except the refactoring related to
> "0006-Implement-streaming-mode-in-ReorderBuffer" patch, because Tomas
> has also made some changes in the same patch.
>

I don't see any changes by Tomas in that particular patch, am I
missing something?

>  I have created a
> separate patch for the same so that we can review the changes and then
> we can merge them to the main patch.
>

It is better to merge it with the main patch for
"Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit
difficult to review.

> > > 0002-Issue-individual-invalidations-with-wal_level-log
> > > ----------------------------------------------------------------------------
> > > 1.
> > > xact_desc_invalidations(StringInfo buf,
> > > {
> > > ..
> > > + else if (msg->id == SHAREDINVALSNAPSHOT_ID)
> > > + appendStringInfo(buf, " snapshot %u", msg->sn.relId);
> > >
> > > You have removed logging for the above cache but forgot to remove its
> > > reference from one of the places.  Also, I think you need to add a
> > > comment somewhere in inval.c to say why you are writing for WAL for
> > > some types of invalidations and not for others?
> Done
>

I don't see any new comments as asked by me.  I think we should also
consider WAL logging at each command end instead of doing it piecemeal
as discussed in another email [1], which will involve fewer code changes
and may be better in performance.  You might want to evaluate the
performance of both approaches.

> > >
> > > 0003-Extend-the-output-plugin-API-with-stream-methods
> > > --------------------------------------------------------------------------------
> > >
> > > 4.
> > > stream_start_cb_wrapper()
> > > {
> > > ..
> > > + /* state.report_location = apply_lsn; */
> > > ..
> > > + /* FIXME ctx->write_location = apply_lsn; */
> > > ..
> > > }
> > >
> > > See, if we can fix these and similar in the callback for the stop.  I
> > > think we don't have final_lsn till we commit/abort.  Can we compute
> > > before calling these API's?
> Done
>

You have just used final_lsn, but I don't see where you have ensured
that it is set before the API stream_stop_cb_wrapper.  I think we need
something similar to what Vignesh has done in one of his bug-fix patches
[2].  See my comment below in this regard.

> > >
> > >
> > > 0005-Gracefully-handle-concurrent-aborts-of-uncommitte
> > > ----------------------------------------------------------------------------------
> > > 1.
> > > @@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> > >   PG_CATCH();
> > >   {
> > >   /* TODO: Encapsulate cleanup
> > > from the PG_TRY and PG_CATCH blocks */
> > > +
> > >   if (iterstate)
> > >   ReorderBufferIterTXNFinish(rb, iterstate);
> > >
> > > Spurious line change.
> > >
> Done

+ /*
+ * We don't expect direct calls to heap_getnext with valid
+ * CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(scan->rs_base.rs_rd) ||
+   RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
+ elog(ERROR, "improper heap_getnext call");

Earlier, I thought we don't need to check if it is a regular table in
this check, but it is required because output plugins can try to do
that, and if they do so during decoding (with historic snapshots), the
same should not be allowed.

How about changing the error message to "unexpected heap_getnext call
during logical decoding" or something like that?

> > > 2. The commit message of this patch refers to Prepared transactions.
> > > I think that needs to be changed.
> > >
> > > 0006-Implement-streaming-mode-in-ReorderBuffer
> > > -------------------------------------------------------------------------

Few comments on v4-0018-Review-comment-fix-and-refactoring:
1.
+ if (streaming)
+ {
+ /*
+ * Set the last last of the stream as the final lsn before calling
+ * stream stop.
+ */
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+ }

Shouldn't we try to set final_lsn as is done by Vignesh's patch [2]?

2.
+ if (streaming)
+ {
+ /*
+ * Set the CheckXidAlive to the current (sub)xid for which this
+ * change belongs to so that we can detect the abort while we are
+ * decoding.
+ */
+ CheckXidAlive = change->txn->xid;
+
+ /* Increment the stream count. */
+ streamed++;
+ }

Is the variable 'streamed' used anywhere?

3.
+ /*
+ * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
+ * any memory. We could also keep the hash table and update it with
+ * new ctid values, but this seems simpler and good enough for now.
+ */
+ ReorderBufferDestroyTupleCidHash(rb, txn);

Won't this be required only when we are streaming changes?

As per my understanding apart from the above comments, the known
pending work for this patchset is as follows:
a. The two open items agreed to you in the email [3].
b. Complete the handling of schema_sent as discussed above [4].
c. Few comments by Vignesh and the response on the same by me [5][6].
d. WAL overhead and performance testing for additional WAL logging by
this patchset.
e. Some way to see the tuple for streamed transactions by decoding API
as speculated by you [7].

Have I missed anything?

[1] -
https://www.postgresql.org/message-id/CAA4eK1LOa%2B2KqNX%3Dm%3D1qMBDW%2Bo50AuwjAOX6ZqL-rWGiH1F9MQ%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CALDaNm3MDxFnsZsnSqVhPBLS3%3DqzNH6%2BYzB%3DxYuX2vbtsUeFgw%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAFiTN-uT5YZE0egGhKdTteTjcGrPi8hb%3DFMPpr9_hEB7hozQ-Q%40mail.gmail.com
[4] - https://www.postgresql.org/message-id/CAA4eK1KjD9x0mS4JxzCbu3gu-r6K7XJRV%2BZcGb3BH6U3x2uxew%40mail.gmail.com
[5] - https://www.postgresql.org/message-id/CALDaNm0DNUojjt7CV-fa59_kFbQQ3rcMBtauvo44ttea7r9KaA%40mail.gmail.com
[6] - https://www.postgresql.org/message-id/CAA4eK1%2BZvupW00c--dqEg8f3dHZDOGmA9xOQLyQHjRSoDi6AkQ%40mail.gmail.com
[7] - https://www.postgresql.org/message-id/CAFiTN-t8PmKA1X4jEqKmkvs0ggWpy0APWpPuaJwpx2YpfAf97w%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Dec 30, 2019 at 3:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Dec 12, 2019 at 9:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > Yesterday, Tomas has posted the latest version of the patch set which
> > contain the fix for schema send part.  Meanwhile, I was working on few
> > review comments/bugfixes and refactoring.  I have tried to merge those
> > changes with the latest patch set except the refactoring related to
> > "0006-Implement-streaming-mode-in-ReorderBuffer" patch, because Tomas
> > has also made some changes in the same patch.
> >
>
> I don't see any changes by Tomas in that particular patch, am I
> missing something?
He has created some sub-patches from the main patch for handling the
schema-sent issue.  So if I make changes in that patch, all the other
patches will conflict.

>
> >  I have created a
> > separate patch for the same so that we can review the changes and then
> > we can merge them to the main patch.
> >
>
> It is better to merge it with the main patch for
> "Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit
> difficult to review.
Actually, we can merge 0008, 0009, 0012, 0018 to the main patch
(0007).  Basically, if we merge all of them then we don't need to deal
with the conflict.  I think Tomas has kept them separate so that we
can review the solution for the schema sent.  And, I kept 0018 as a
separate patch to avoid conflict and rebasing in 0008, 0009 and 0012.
In the next patch set, I will merge all of them to 0007.

>
> > > > 0002-Issue-individual-invalidations-with-wal_level-log
> > > > ----------------------------------------------------------------------------
> > > > 1.
> > > > xact_desc_invalidations(StringInfo buf,
> > > > {
> > > > ..
> > > > + else if (msg->id == SHAREDINVALSNAPSHOT_ID)
> > > > + appendStringInfo(buf, " snapshot %u", msg->sn.relId);
> > > >
> > > > You have removed logging for the above cache but forgot to remove its
> > > > reference from one of the places.  Also, I think you need to add a
> > > > comment somewhere in inval.c to say why you are writing for WAL for
> > > > some types of invalidations and not for others?
> > Done
> >
>
> I don't see any new comments as asked by me.
Oh, I just fixed one part of the comment and overlooked the rest.  Will fix.
>   I think we should also
> consider WAL logging at each command end instead of doing piecemeal as
> discussed in another email [1], which will have lesser code changes
> and maybe better in performance.  You might want to evaluate the
> performance of both approaches.
Ok
>
> > > >
> > > > 0003-Extend-the-output-plugin-API-with-stream-methods
> > > > --------------------------------------------------------------------------------
> > > >
> > > > 4.
> > > > stream_start_cb_wrapper()
> > > > {
> > > > ..
> > > > + /* state.report_location = apply_lsn; */
> > > > ..
> > > > + /* FIXME ctx->write_location = apply_lsn; */
> > > > ..
> > > > }
> > > >
> > > > See, if we can fix these and similar in the callback for the stop.  I
> > > > think we don't have final_lsn till we commit/abort.  Can we compute
> > > > before calling these API's?
> > Done
> >
>
> You have just used final_lsn, but I don't see where you have ensured
> that it is set before the API stream_stop_cb_wrapper.  I think we need
> something similar to what Vignesh has done in one of his bug-fix patch
> [2].  See my comment below in this regard.

You can refer to the below hunk in 0018.

+ /*
+ * Done with current changes, call stream_stop callback for streaming
+ * transaction, commit callback otherwise.
+ */
+ if (streaming)
+ {
+ /*
+ * Set the last last of the stream as the final lsn before calling
+ * stream stop.
+ */
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+ }

>
> > > >
> > > >
> > > > 0005-Gracefully-handle-concurrent-aborts-of-uncommitte
> > > > ----------------------------------------------------------------------------------
> > > > 1.
> > > > @@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> > > >   PG_CATCH();
> > > >   {
> > > >   /* TODO: Encapsulate cleanup
> > > > from the PG_TRY and PG_CATCH blocks */
> > > > +
> > > >   if (iterstate)
> > > >   ReorderBufferIterTXNFinish(rb, iterstate);
> > > >
> > > > Spurious line change.
> > > >
> > Done
>
> + /*
> + * We don't expect direct calls to heap_getnext with valid
> + * CheckXidAlive for regular tables. Track that below.
> + */
> + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> + !(IsCatalogRelation(scan->rs_base.rs_rd) ||
> +   RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
> + elog(ERROR, "improper heap_getnext call");
>
> Earlier, I thought we don't need to check if it is a regular table in
> this check, but it is required because output plugins can try to do
> that
I did not understand that, can you give some example?
> and if they do so during decoding (with historic snapshots), the
> same should be not allowed.
>
> How about changing the error message to "unexpected heap_getnext call
> during logical decoding" or something like that?
Ok
>
> > > > 2. The commit message of this patch refers to Prepared transactions.
> > > > I think that needs to be changed.
> > > >
> > > > 0006-Implement-streaming-mode-in-ReorderBuffer
> > > > -------------------------------------------------------------------------
>
> Few comments on v4-0018-Review-comment-fix-and-refactoring:
> 1.
> + if (streaming)
> + {
> + /*
> + * Set the last last of the stream as the final lsn before calling
> + * stream stop.
> + */
> + txn->final_lsn = prev_lsn;
> + rb->stream_stop(rb, txn);
> + }
>
> Shouldn't we try to final_lsn as is done by Vignesh's patch [2]?
Isn't it the same, there we are doing while serializing and here we
are doing while streaming?  Basically, the last LSN we streamed.  Am I
missing something?

>
> 2.
> + if (streaming)
> + {
> + /*
> + * Set the CheckXidAlive to the current (sub)xid for which this
> + * change belongs to so that we can detect the abort while we are
> + * decoding.
> + */
> + CheckXidAlive = change->txn->xid;
> +
> + /* Increment the stream count. */
> + streamed++;
> + }
>
> Is the variable 'streamed' used anywhere?
>
> 3.
> + /*
> + * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
> + * any memory. We could also keep the hash table and update it with
> + * new ctid values, but this seems simpler and good enough for now.
> + */
> + ReorderBufferDestroyTupleCidHash(rb, txn);
>
> Won't this be required only when we are streaming changes?

I will work on these review comments and reply to them separately along
with the patch.
>
> As per my understanding apart from the above comments, the known
> pending work for this patchset is as follows:
> a. The two open items agreed to you in the email [3].
> b. Complete the handling of schema_sent as discussed above [4].
> c. Few comments by Vignesh and the response on the same by me [5][6].
> d. WAL overhead and performance testing for additional WAL logging by
> this patchset.
> e. Some way to see the tuple for streamed transactions by decoding API
> as speculated by you [7].
>
> Have I missed anything?
I think this is the list I remember, apart from these few points by
Robert which are still under discussion [8].

>
> [1] -
https://www.postgresql.org/message-id/CAA4eK1LOa%2B2KqNX%3Dm%3D1qMBDW%2Bo50AuwjAOX6ZqL-rWGiH1F9MQ%40mail.gmail.com
> [2] -
https://www.postgresql.org/message-id/CALDaNm3MDxFnsZsnSqVhPBLS3%3DqzNH6%2BYzB%3DxYuX2vbtsUeFgw%40mail.gmail.com
> [3] - https://www.postgresql.org/message-id/CAFiTN-uT5YZE0egGhKdTteTjcGrPi8hb%3DFMPpr9_hEB7hozQ-Q%40mail.gmail.com
> [4] - https://www.postgresql.org/message-id/CAA4eK1KjD9x0mS4JxzCbu3gu-r6K7XJRV%2BZcGb3BH6U3x2uxew%40mail.gmail.com
> [5] - https://www.postgresql.org/message-id/CALDaNm0DNUojjt7CV-fa59_kFbQQ3rcMBtauvo44ttea7r9KaA%40mail.gmail.com
> [6] - https://www.postgresql.org/message-id/CAA4eK1%2BZvupW00c--dqEg8f3dHZDOGmA9xOQLyQHjRSoDi6AkQ%40mail.gmail.com
> [7] - https://www.postgresql.org/message-id/CAFiTN-t8PmKA1X4jEqKmkvs0ggWpy0APWpPuaJwpx2YpfAf97w%40mail.gmail.com

[8] https://www.postgresql.org/message-id/CA%2BTgmoYH6N_YDvKH9AaAJo5ZTHn142K%3DB75VO9yKvjjjHcoZhA%40mail.gmail.com


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Jan 6, 2020 at 9:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > It is better to merge it with the main patch for
> > "Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit
> > difficult to review.
> Actually, we can merge 0008, 0009, 0012, 0018 to the main patch
> (0007).  Basically, if we merge all of them then we don't need to deal
> with the conflict.  I think Tomas has kept them separate so that we
> can review the solution for the schema sent.  And, I kept 0018 as a
> separate patch to avoid conflict and rebasing in 0008, 0009 and 0012.
> In the next patch set, I will merge all of them to 0007.
>

Okay, I think we can merge those patches.

> >
> > + /*
> > + * We don't expect direct calls to heap_getnext with valid
> > + * CheckXidAlive for regular tables. Track that below.
> > + */
> > + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> > + !(IsCatalogRelation(scan->rs_base.rs_rd) ||
> > +   RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
> > + elog(ERROR, "improper heap_getnext call");
> >
> > Earlier, I thought we don't need to check if it is a regular table in
> > this check, but it is required because output plugins can try to do
> > that
> I did not understand that, can you give some example?
>

I think it can lead to the same problem of concurrent aborts as for
catalog scans.

> >
> > > > > 2. The commit message of this patch refers to Prepared transactions.
> > > > > I think that needs to be changed.
> > > > >
> > > > > 0006-Implement-streaming-mode-in-ReorderBuffer
> > > > > -------------------------------------------------------------------------
> >
> > Few comments on v4-0018-Review-comment-fix-and-refactoring:
> > 1.
> > + if (streaming)
> > + {
> > + /*
> > + * Set the last last of the stream as the final lsn before calling
> > + * stream stop.
> > + */
> > + txn->final_lsn = prev_lsn;
> > + rb->stream_stop(rb, txn);
> > + }
> >
> > Shouldn't we try to final_lsn as is done by Vignesh's patch [2]?
> Isn't it the same, there we are doing while serializing and here we
> are doing while streaming?  Basically, the last LSN we streamed.  Am I
> missing something?
>

No, I think you are right.

Few more comments:
--------------------------------
v4-0007-Implement-streaming-mode-in-ReorderBuffer
1.
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
+ /*
+ * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
+ * information about
subtransactions, which could arrive after streaming start.
+ */
+ if (!txn->is_schema_sent)
+ snapshot_now
= ReorderBufferCopySnap(rb, txn->base_snapshot,
+ txn,
command_id);
..
}

Why are we using base snapshot here instead of the snapshot we saved
the first time streaming has happened?  And as mentioned in comments,
won't we need to consider the snapshots for subtransactions that
arrived after the last time we have streamed the changes?

2.
+ /* remember the command ID and snapshot for the streaming run */
+ txn->command_id = command_id;
+ txn-
>snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+
  txn, command_id);

I don't see where the txn->snapshot_now is getting freed.  The
base_snapshot is freed in ReorderBufferCleanupTXN, but I don't see
this getting freed.
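
Just to spell out the kind of cleanup I mean, something along these
lines in ReorderBufferCleanupTXN would do (a sketch only; snapshot_now
is the field added by the patch, ReorderBufferFreeSnap is the existing
helper):

    /* free the snapshot kept around for the next streaming run, if any */
    if (txn->snapshot_now != NULL)
    {
        ReorderBufferFreeSnap(rb, txn->snapshot_now);
        txn->snapshot_now = NULL;
    }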

3.
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
+ /*
+ * If this is a subxact, we need to stream the top-level transaction
+ * instead.
+ */
+ if (txn->toptxn)
+ {
+
ReorderBufferStreamTXN(rb, txn->toptxn);
+ return;
+ }

Is it ever possible that we reach here for subtransaction, if not,
then it should be Assert rather than if condition?

4. In ReorderBufferStreamTXN(), don't we need to set some of the txn
fields like origin_id and origin_lsn as we do in ReorderBufferCommit(),
especially to cover the case when it gets called due to memory
overflow (aka via ReorderBufferCheckMemoryLimit)?

v4-0017-Extend-handling-of-concurrent-aborts-for-streamin
1.
@@ -3712,7 +3727,22 @@ ReorderBufferStreamTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)
  if (using_subtxn)

RollbackAndReleaseCurrentSubTransaction();

- PG_RE_THROW();
+ /* re-throw only if it's not an abort */
+ if (errdata-
>sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+ {
+ MemoryContextSwitchTo(ecxt);
+ PG_RE_THROW();
+
}
+ else
+ {
+ /* remember the command ID and snapshot for the streaming run */
+ txn-
>command_id = command_id;
+ txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+
  txn, command_id);
+ rb->stream_stop(rb, txn);
+
+
FlushErrorState();
+ }

Can you update comments either in the above code block or some other
place to explain what is the concurrent abort problem and how we dealt
with it?  Also, please explain how the above error handling is
sufficient to address all the various scenarios (sub-transaction got
aborted when we have already sent some changes, or when we have not
sent any changes yet).
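
For example, a comment along these lines atop the PG_CATCH block would
help (the wording is only a suggestion, based on what the quoted code
does):

	/*
	 * If a concurrent abort of the (sub)transaction being streamed is
	 * detected (the catalog access routines checking CheckXidAlive raise
	 * ERRCODE_TRANSACTION_ROLLBACK), stop streaming cleanly: remember the
	 * command id and snapshot for the next streaming run, call stream_stop,
	 * and swallow the error.  Changes already streamed are discarded on the
	 * subscriber side once it sees the abort for this (sub)transaction.
	 */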

v4-0006-Gracefully-handle-concurrent-aborts-of-uncommitte
1.
+ /*
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out
+ */
+ if (TransactionIdIsValid(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));

Why can't we use TransactionIdDidAbort here?  If we can't use it, then
can you add a comment stating the reason for the same.

2.
/*
+ * An xid value pointing to a possibly ongoing or a prepared transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;

In comments, there is a mention of a prepared transaction.  Do we
allow prepared transactions to be decoded as part of this patch?

3.
+ /*
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out
+ */
+ if (TransactionIdIsValid
(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))

This comment just says what the code below is doing; can you explain the
rationale behind this check?  It would be better if it is clear from
the comments why we are doing this check after fetching the
tuple.  I think this can refer to the comment I suggested to add for
the changes in patch
v4-0017-Extend-handling-of-concurrent-aborts-for-streamin.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jan 6, 2020 at 9:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > It is better to merge it with the main patch for
> > > "Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit
> > > difficult to review.
> > Actually, we can merge 0008, 0009, 0012, 0018 to the main patch
> > (0007).  Basically, if we merge all of them then we don't need to deal
> > with the conflict.  I think Tomas has kept them separate so that we
> > can review the solution for the schema sent.  And, I kept 0018 as a
> > separate patch to avoid conflict and rebasing in 0008, 0009 and 0012.
> > In the next patch set, I will merge all of them to 0007.
> >
>
> Okay, I think we can merge those patches.
ok
>
> > >
> > > + /*
> > > + * We don't expect direct calls to heap_getnext with valid
> > > + * CheckXidAlive for regular tables. Track that below.
> > > + */
> > > + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> > > + !(IsCatalogRelation(scan->rs_base.rs_rd) ||
> > > +   RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
> > > + elog(ERROR, "improper heap_getnext call");
> > >
> > > Earlier, I thought we don't need to check if it is a regular table in
> > > this check, but it is required because output plugins can try to do
> > > that
> > I did not understand that, can you give some example?
> >
>
> I think it can lead to the same problem of concurrent aborts as for
> catalog scans.
Yeah, got it.
>
> > >
> > > > > > 2. The commit message of this patch refers to Prepared transactions.
> > > > > > I think that needs to be changed.
> > > > > >
> > > > > > 0006-Implement-streaming-mode-in-ReorderBuffer
> > > > > > -------------------------------------------------------------------------
> > >
> > > Few comments on v4-0018-Review-comment-fix-and-refactoring:
> > > 1.
> > > + if (streaming)
> > > + {
> > > + /*
> > > + * Set the last last of the stream as the final lsn before calling
> > > + * stream stop.
> > > + */
> > > + txn->final_lsn = prev_lsn;
> > > + rb->stream_stop(rb, txn);
> > > + }
> > >
> > > Shouldn't we try to final_lsn as is done by Vignesh's patch [2]?
> > Isn't it the same, there we are doing while serializing and here we
> > are doing while streaming?  Basically, the last LSN we streamed.  Am I
> > missing something?
> >
>
> No, I think you are right.
>
> Few more comments:
> --------------------------------
> v4-0007-Implement-streaming-mode-in-ReorderBuffer
> 1.
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> + /*
> + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
> + * information about
> subtransactions, which could arrive after streaming start.
> + */
> + if (!txn->is_schema_sent)
> + snapshot_now
> = ReorderBufferCopySnap(rb, txn->base_snapshot,
> + txn,
> command_id);
> ..
> }
>
> Why are we using base snapshot here instead of the snapshot we saved
> the first time streaming has happened?  And as mentioned in comments,
> won't we need to consider the snapshots for subtransactions that
> arrived after the last time we have streamed the changes?
>
> 2.
> + /* remember the command ID and snapshot for the streaming run */
> + txn->command_id = command_id;
> + txn-
> >snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
> +
>   txn, command_id);
>
> I don't see where the txn->snapshot_now is getting freed.  The
> base_snapshot is freed in ReorderBufferCleanupTXN, but I don't see
> this getting freed.
Ok, I will check that and fix.
>
> 3.
> +static void
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> + /*
> + * If this is a subxact, we need to stream the top-level transaction
> + * instead.
> + */
> + if (txn->toptxn)
> + {
> +
> ReorderBufferStreamTXN(rb, txn->toptxn);
> + return;
> + }
>
> Is it ever possible that we reach here for subtransaction, if not,
> then it should be Assert rather than if condition?

ReorderBufferCheckMemoryLimit can call it either for the
subtransaction or for the main transaction, depending on which
ReorderBufferTXN you are adding the current change to.

I will analyze your other comments and fix them in the next version.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Jan 6, 2020 at 3:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > 3.
> > +static void
> > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > {
> > ..
> > + /*
> > + * If this is a subxact, we need to stream the top-level transaction
> > + * instead.
> > + */
> > + if (txn->toptxn)
> > + {
> > +
> > ReorderBufferStreamTXN(rb, txn->toptxn);
> > + return;
> > + }
> >
> > Is it ever possible that we reach here for subtransaction, if not,
> > then it should be Assert rather than if condition?
>
> ReorderBufferCheckMemoryLimit, can call it either for the
> subtransaction or for the main transaction, depends upon in which
> ReorderBufferTXN you are adding the current change.
>

That function has code like below:

ReorderBufferCheckMemoryLimit()
{
..
if (ReorderBufferCanStream(rb))
{
/*
* Pick the largest toplevel transaction and evict it from memory by
* streaming the already decoded part.
*/
txn = ReorderBufferLargestTopTXN(rb);
/* we know there has to be one, because the size is not zero */
Assert(txn && !txn->toptxn);
..
ReorderBufferStreamTXN(rb, txn);
..
}

How can it then pass a ReorderBufferTXN for a subtransaction?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, Jan 6, 2020 at 4:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jan 6, 2020 at 3:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > 3.
> > > +static void
> > > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > {
> > > ..
> > > + /*
> > > + * If this is a subxact, we need to stream the top-level transaction
> > > + * instead.
> > > + */
> > > + if (txn->toptxn)
> > > + {
> > > +
> > > ReorderBufferStreamTXN(rb, txn->toptxn);
> > > + return;
> > > + }
> > >
> > > Is it ever possible that we reach here for subtransaction, if not,
> > > then it should be Assert rather than if condition?
> >
> > ReorderBufferCheckMemoryLimit, can call it either for the
> > subtransaction or for the main transaction, depends upon in which
> > ReorderBufferTXN you are adding the current change.
> >
>
> That function has code like below:
>
> ReorderBufferCheckMemoryLimit()
> {
> ..
> if (ReorderBufferCanStream(rb))
> {
> /*
> * Pick the largest toplevel transaction and evict it from memory by
> * streaming the already decoded part.
> */
> txn = ReorderBufferLargestTopTXN(rb);
> /* we know there has to be one, because the size is not zero */
> Assert(txn && !txn->toptxn);
> ..
> ReorderBufferStreamTXN(rb, txn);
> ..
> }
>
> How can it ReorderBufferTXN pass for subtransaction?
>
Hmm, I missed it. You are right, will fix it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, Jan 6, 2020 at 4:44 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jan 6, 2020 at 4:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jan 6, 2020 at 3:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > 3.
> > > > +static void
> > > > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > > {
> > > > ..
> > > > + /*
> > > > + * If this is a subxact, we need to stream the top-level transaction
> > > > + * instead.
> > > > + */
> > > > + if (txn->toptxn)
> > > > + {
> > > > +
> > > > ReorderBufferStreamTXN(rb, txn->toptxn);
> > > > + return;
> > > > + }
> > > >
> > > > Is it ever possible that we reach here for subtransaction, if not,
> > > > then it should be Assert rather than if condition?
> > >
> > > ReorderBufferCheckMemoryLimit, can call it either for the
> > > subtransaction or for the main transaction, depends upon in which
> > > ReorderBufferTXN you are adding the current change.
> > >
> >
> > That function has code like below:
> >
> > ReorderBufferCheckMemoryLimit()
> > {
> > ..
> > if (ReorderBufferCanStream(rb))
> > {
> > /*
> > * Pick the largest toplevel transaction and evict it from memory by
> > * streaming the already decoded part.
> > */
> > txn = ReorderBufferLargestTopTXN(rb);
> > /* we know there has to be one, because the size is not zero */
> > Assert(txn && !txn->toptxn);
> > ..
> > ReorderBufferStreamTXN(rb, txn);
> > ..
> > }
> >
> > How can it ReorderBufferTXN pass for subtransaction?
> >
> Hmm, I missed it. You are right, will fix it.
>
I have observed one more design issue.  The problem is that when we
get toasted chunks we remember the changes in memory (in a hash table)
but don't stream them until we get the actual change on the main table.
Now, the problem is that we might get the change for the toast table
and for the main table in different streams.  So basically, if in a
stream we have only got the toasted tuples, then even after
ReorderBufferStreamTXN the memory usage will not be reduced.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have observed one more design issue.
>

Good observation.

>  The problem is that when we
> get a toasted chunks we remember the changes in the memory(hash table)
> but don't stream until we get the actual change on the main table.
> Now, the problem is that we might get the change of the toasted table
> and the main table in different streams.  So basically, in a stream,
> if we have only got the toasted tuples then even after
> ReorderBufferStreamTXN the memory usage will not be reduced.
>

I think we can't split such changes across different streams (unless we
design an entirely new solution to send partial changes of toast data),
so we need to send them together.  We can keep a flag like
data_complete in ReorderBufferTXN and mark it complete only when we are
able to assemble the entire tuple.  Then, whenever we try to stream the
changes once we reach the memory threshold, we can check whether the
data_complete flag is true; if so, send the changes, otherwise pick the
next largest transaction.  I think we can retry this a few times, and
if we keep getting incomplete data for multiple transactions, we can
decide to spill the transaction, or maybe directly spill the first
largest transaction which has incomplete data.
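
To make that concrete, here is a minimal sketch of how the selection
could look; the data_complete flag, the helper name and the use of
total_size are illustrative assumptions, not existing reorderbuffer
code:

/*
 * Sketch only: "data_complete" and this helper are hypothetical names
 * used to illustrate the idea above.  Return the largest top-level
 * transaction whose toast data has been fully assembled, or NULL if
 * none qualifies (the caller would then fall back to spilling, e.g.
 * via ReorderBufferSerializeTXN).
 */
static ReorderBufferTXN *
ReorderBufferLargestCompleteTopTXN(ReorderBuffer *rb)
{
    ReorderBufferTXN *largest = NULL;
    dlist_iter  iter;

    dlist_foreach(iter, &rb->toplevel_by_lsn)
    {
        ReorderBufferTXN *txn;

        txn = dlist_container(ReorderBufferTXN, node, iter.cur);

        /* skip transactions whose toast tuple is not yet assembled */
        if (!txn->data_complete)
            continue;

        if (largest == NULL || txn->total_size > largest->total_size)
            largest = txn;
    }

    return largest;
}

ReorderBufferCheckMemoryLimit() could call something like this instead
of ReorderBufferLargestTopTXN() and stream the result, falling back to
serializing the largest (sub)transaction when it returns NULL.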

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have observed one more design issue.
> >
>
> Good observation.
>
> >  The problem is that when we
> > get a toasted chunks we remember the changes in the memory(hash table)
> > but don't stream until we get the actual change on the main table.
> > Now, the problem is that we might get the change of the toasted table
> > and the main table in different streams.  So basically, in a stream,
> > if we have only got the toasted tuples then even after
> > ReorderBufferStreamTXN the memory usage will not be reduced.
> >
>
> I think we can't split such changes in a different stream (unless we
> design an entirely new solution to send partial changes of toast
> data), so we need to send them together. We can keep a flag like
> data_complete in ReorderBufferTxn and mark it complete only when we
> are able to assemble the entire tuple.  Now, whenever, we try to
> stream the changes once we reach the memory threshold, we can check
> whether the data_complete flag is true, if so, then only send the
> changes, otherwise, we can pick the next largest transaction.  I think
> we can retry it for few times and if we get the incomplete data for
> multiple transactions, then we can decide to spill the transaction or
> maybe we can directly spill the first largest transaction which has
> incomplete data.
>
Yeah, we might do something along this line.  Basically, we need to mark
the top-level transaction as data-incomplete if any of its
subtransactions has incomplete data (it will always be the latest
sub-transaction of the top transaction).  Also, for streaming we are
checking the largest top-level transaction, whereas for spilling we just
need the largest (sub)transaction.  So while picking the largest
top-level transaction for streaming, we also need to decide what to do
if a few transactions have incomplete data: do we spill all the
sub-transactions under this top-level transaction, or do we again find
the largest (sub)transaction for spilling?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > I have observed one more design issue.
> > >
> >
> > Good observation.
> >
> > >  The problem is that when we
> > > get a toasted chunks we remember the changes in the memory(hash table)
> > > but don't stream until we get the actual change on the main table.
> > > Now, the problem is that we might get the change of the toasted table
> > > and the main table in different streams.  So basically, in a stream,
> > > if we have only got the toasted tuples then even after
> > > ReorderBufferStreamTXN the memory usage will not be reduced.
> > >
> >
> > I think we can't split such changes in a different stream (unless we
> > design an entirely new solution to send partial changes of toast
> > data), so we need to send them together. We can keep a flag like
> > data_complete in ReorderBufferTxn and mark it complete only when we
> > are able to assemble the entire tuple.  Now, whenever, we try to
> > stream the changes once we reach the memory threshold, we can check
> > whether the data_complete flag is true, if so, then only send the
> > changes, otherwise, we can pick the next largest transaction.  I think
> > we can retry it for few times and if we get the incomplete data for
> > multiple transactions, then we can decide to spill the transaction or
> > maybe we can directly spill the first largest transaction which has
> > incomplete data.
> >
> Yeah, we might do something on this line.  Basically, we need to mark
> the top-transaction as data-incomplete if any of its subtransaction is
> having data-incomplete (it will always be the latest sub-transaction
> of the top transaction).  Also, for streaming, we are checking the
> largest top transaction whereas for spilling we just need the larget
> (sub) transaction.   So we also need to decide while picking the
> largest top transaction for streaming, if we get a few transactions
> with in-complete data then how we will go for the spill.  Do we spill
> all the sub-transactions under this top transaction or we will again
> find the larget (sub) transaction for spilling.
>

I think it is better to do the latter, as that will lead to spilling
only the required changes (the minimum needed to get the memory below
the threshold).

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Jan 9, 2020 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > I have observed one more design issue.
> > > >
> > >
> > > Good observation.
> > >
> > > >  The problem is that when we
> > > > get a toasted chunks we remember the changes in the memory(hash table)
> > > > but don't stream until we get the actual change on the main table.
> > > > Now, the problem is that we might get the change of the toasted table
> > > > and the main table in different streams.  So basically, in a stream,
> > > > if we have only got the toasted tuples then even after
> > > > ReorderBufferStreamTXN the memory usage will not be reduced.
> > > >
> > >
> > > I think we can't split such changes in a different stream (unless we
> > > design an entirely new solution to send partial changes of toast
> > > data), so we need to send them together. We can keep a flag like
> > > data_complete in ReorderBufferTxn and mark it complete only when we
> > > are able to assemble the entire tuple.  Now, whenever, we try to
> > > stream the changes once we reach the memory threshold, we can check
> > > whether the data_complete flag is true, if so, then only send the
> > > changes, otherwise, we can pick the next largest transaction.  I think
> > > we can retry it for few times and if we get the incomplete data for
> > > multiple transactions, then we can decide to spill the transaction or
> > > maybe we can directly spill the first largest transaction which has
> > > incomplete data.
> > >
> > Yeah, we might do something on this line.  Basically, we need to mark
> > the top-transaction as data-incomplete if any of its subtransaction is
> > having data-incomplete (it will always be the latest sub-transaction
> > of the top transaction).  Also, for streaming, we are checking the
> > largest top transaction whereas for spilling we just need the larget
> > (sub) transaction.   So we also need to decide while picking the
> > largest top transaction for streaming, if we get a few transactions
> > with in-complete data then how we will go for the spill.  Do we spill
> > all the sub-transactions under this top transaction or we will again
> > find the larget (sub) transaction for spilling.
> >
>
> I think it is better to do later as that will lead to the spill of
> only required (minimum changes to get the memory below threshold)
> changes.
Makes sense to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jan 6, 2020 at 9:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > It is better to merge it with the main patch for
> > > "Implement-streaming-mode-in-ReorderBuffer", otherwise, it is a bit
> > > difficult to review.
> > Actually, we can merge 0008, 0009, 0012, 0018 to the main patch
> > (0007).  Basically, if we merge all of them then we don't need to deal
> > with the conflict.  I think Tomas has kept them separate so that we
> > can review the solution for the schema sent.  And, I kept 0018 as a
> > separate patch to avoid conflict and rebasing in 0008, 0009 and 0012.
> > In the next patch set, I will merge all of them to 0007.
> >
>
> Okay, I think we can merge those patches.
Done.
0008, 0009, 0017 and 0018 are merged into 0007; 0012 is merged into 0010.

>
> Few more comments:
> --------------------------------
> v4-0007-Implement-streaming-mode-in-ReorderBuffer
> 1.
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> + /*
> + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
> + * information about
> subtransactions, which could arrive after streaming start.
> + */
> + if (!txn->is_schema_sent)
> + snapshot_now
> = ReorderBufferCopySnap(rb, txn->base_snapshot,
> + txn,
> command_id);
> ..
> }
>
> Why are we using base snapshot here instead of the snapshot we saved
> the first time streaming has happened?  And as mentioned in comments,
> won't we need to consider the snapshots for subtransactions that
> arrived after the last time we have streamed the changes?
Fixed
>
> 2.
> + /* remember the command ID and snapshot for the streaming run */
> + txn->command_id = command_id;
> + txn-
> >snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
> +
>   txn, command_id);
>
> I don't see where the txn->snapshot_now is getting freed.  The
> base_snapshot is freed in ReorderBufferCleanupTXN, but I don't see
> this getting freed.
I have freed this in ReorderBufferCleanupTXN.
>
> 3.
> +static void
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> + /*
> + * If this is a subxact, we need to stream the top-level transaction
> + * instead.
> + */
> + if (txn->toptxn)
> + {
> +
> ReorderBufferStreamTXN(rb, txn->toptxn);
> + return;
> + }
>
> Is it ever possible that we reach here for subtransaction, if not,
> then it should be Assert rather than if condition?
Fixed
>
> 4. In ReorderBufferStreamTXN(), don't we need to set some of the txn
> fields like origin_id, origin_lsn as we do in ReorderBufferCommit()
> especially to cover the case when it gets called due to memory
> overflow (aka via ReorderBufferCheckMemoryLimit).
We only get origin_lsn at commit time, so I am not sure how we can do
that.  I have also noticed that currently we are not using origin_lsn
on the subscriber side.  I think this needs more investigation: if we
want this, do we need to log it earlier?

>
> v4-0017-Extend-handling-of-concurrent-aborts-for-streamin
> 1.
> @@ -3712,7 +3727,22 @@ ReorderBufferStreamTXN(ReorderBuffer *rb,
> ReorderBufferTXN *txn)
>   if (using_subtxn)
>
> RollbackAndReleaseCurrentSubTransaction();
>
> - PG_RE_THROW();
> + /* re-throw only if it's not an abort */
> + if (errdata-
> >sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
> + {
> + MemoryContextSwitchTo(ecxt);
> + PG_RE_THROW();
> +
> }
> + else
> + {
> + /* remember the command ID and snapshot for the streaming run */
> + txn-
> >command_id = command_id;
> + txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
> +
>   txn, command_id);
> + rb->stream_stop(rb, txn);
> +
> +
> FlushErrorState();
> + }
>
> Can you update comments either in the above code block or some other
> place to explain what is the concurrent abort problem and how we dealt
> with it?  Also, please explain how the above error handling is
> sufficient to address all the various scenarios (sub-transaction got
> aborted when we have already sent some changes, or when we have not
> sent any changes yet).

Done
>
> v4-0006-Gracefully-handle-concurrent-aborts-of-uncommitte
> 1.
> + /*
> + * If CheckXidAlive is valid, then we check if it aborted. If it did, we
> + * error out
> + */
> + if (TransactionIdIsValid(CheckXidAlive) &&
> + !TransactionIdIsInProgress(CheckXidAlive) &&
> + !TransactionIdDidCommit(CheckXidAlive))
> + ereport(ERROR,
> + (errcode(ERRCODE_TRANSACTION_ROLLBACK),
> + errmsg("transaction aborted during system catalog scan")));
>
> Why here we can't use TransactionIdDidAbort?  If we can't use it, then
> can you add comments stating the reason of the same.
Done
>
> 2.
> /*
> + * An xid value pointing to a possibly ongoing or a prepared transaction.
> + * Currently used in logical decoding.  It's possible that such transactions
> + * can get aborted while the decoding is ongoing.
> + */
> +TransactionId CheckXidAlive = InvalidTransactionId;
>
> In comments, there is a mention of a prepared transaction.  Do we
> allow prepared transactions to be decoded as part of this patch?
Fixed
>
> 3.
> + /*
> + * If CheckXidAlive is valid, then we check if it aborted. If it did, we
> + * error out
> + */
> + if (TransactionIdIsValid
> (CheckXidAlive) &&
> + !TransactionIdIsInProgress(CheckXidAlive) &&
> + !TransactionIdDidCommit(CheckXidAlive))
>
> This comment just says what code below is doing, can you explain the
> rationale behind this check.  It would be better if it is clear by
> reading comments, why we are doing this check after fetching the
> tuple.  I think this can refer to the comment I suggested to add for
> changes in patch
> v4-0017-Extend-handling-of-concurrent-aborts-for-streamin.
Done


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Dec 30, 2019 at 3:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Dec 12, 2019 at 9:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> > > > 0002-Issue-individual-invalidations-with-wal_level-log
> > > > ----------------------------------------------------------------------------
> > > > 1.
> > > > xact_desc_invalidations(StringInfo buf,
> > > > {
> > > > ..
> > > > + else if (msg->id == SHAREDINVALSNAPSHOT_ID)
> > > > + appendStringInfo(buf, " snapshot %u", msg->sn.relId);
> > > >
> > > > You have removed logging for the above cache but forgot to remove its
> > > > reference from one of the places.  Also, I think you need to add a
> > > > comment somewhere in inval.c to say why you are writing for WAL for
> > > > some types of invalidations and not for others?
> > Done
> >
>
> I don't see any new comments as asked by me.
Done

> I think we should also
> consider WAL logging at each command end instead of doing piecemeal as
> discussed in another email [1], which will have lesser code changes
> and maybe better in performance.  You might want to evaluate the
> performance of both approaches.

Still pending, will work on this.
>
> > > > 0005-Gracefully-handle-concurrent-aborts-of-uncommitte
> > > > ----------------------------------------------------------------------------------
> > > > 1.
> > > > @@ -1877,6 +1877,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> > > >   PG_CATCH();
> > > >   {
> > > >   /* TODO: Encapsulate cleanup
> > > > from the PG_TRY and PG_CATCH blocks */
> > > > +
> > > >   if (iterstate)
> > > >   ReorderBufferIterTXNFinish(rb, iterstate);
> > > >
> > > > Spurious line change.
> > > >
> > Done
>
> + /*
> + * We don't expect direct calls to heap_getnext with valid
> + * CheckXidAlive for regular tables. Track that below.
> + */
> + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> + !(IsCatalogRelation(scan->rs_base.rs_rd) ||
> +   RelationIsUsedAsCatalogTable(scan->rs_base.rs_rd))))
> + elog(ERROR, "improper heap_getnext call");
>
> Earlier, I thought we don't need to check if it is a regular table in
> this check, but it is required because output plugins can try to do
> that and if they do so during decoding (with historic snapshots), the
> same should be not allowed.
>
> How about changing the error message to "unexpected heap_getnext call
> during logical decoding" or something like that?
Done
>
> > > > 2. The commit message of this patch refers to Prepared transactions.
> > > > I think that needs to be changed.
> > > >
> > > > 0006-Implement-streaming-mode-in-ReorderBuffer
> > > > -------------------------------------------------------------------------
>
> Few comments on v4-0018-Review-comment-fix-and-refactoring:
> 1.
> + if (streaming)
> + {
> + /*
> + * Set the last last of the stream as the final lsn before calling
> + * stream stop.
> + */
> + txn->final_lsn = prev_lsn;
> + rb->stream_stop(rb, txn);
> + }
>
> Shouldn't we try to final_lsn as is done by Vignesh's patch [2]?
We already agreed on the current implementation.
>
> 2.
> + if (streaming)
> + {
> + /*
> + * Set the CheckXidAlive to the current (sub)xid for which this
> + * change belongs to so that we can detect the abort while we are
> + * decoding.
> + */
> + CheckXidAlive = change->txn->xid;
> +
> + /* Increment the stream count. */
> + streamed++;
> + }
>
> Is the variable 'streamed' used anywhere?
Removed
>
> 3.
> + /*
> + * Destroy the (relfilenode, ctid) hashtable, so that we don't leak
> + * any memory. We could also keep the hash table and update it with
> + * new ctid values, but this seems simpler and good enough for now.
> + */
> + ReorderBufferDestroyTupleCidHash(rb, txn);
>
> Won't this be required only when we are streaming changes?
Fixed
>
> As per my understanding apart from the above comments, the known
> pending work for this patchset is as follows:
> a. The two open items agreed to you in the email [3].
> b. Complete the handling of schema_sent as discussed above [4].
> c. Few comments by Vignesh and the response on the same by me [5][6].
> d. WAL overhead and performance testing for additional WAL logging by
> this patchset.
> e. Some way to see the tuple for streamed transactions by decoding API
> as speculated by you [7].
>
> Have I missed anything?
I have worked on most of these items; I will reply to them separately.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, Dec 30, 2019 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Dec 29, 2019 at 1:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have observed some more issues
> >
> > 1. Currently, In ReorderBufferCommit, it is always expected that
> > whenever we get REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM, we must
> > have already got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT and in
> > SPEC_CONFIRM we send the tuple we got in SPECT_INSERT.  But, now those
> > two messages can be in different streams.  So we need to find a way to
> > handle this.  Maybe once we get SPEC_INSERT then we can remember the
> > tuple and then if we get the SPECT_CONFIRM in the next stream we can
> > send that tuple?
> >
>
> Your suggestion makes sense to me.  So, we can try it.

I have implemented this and attached it as a separate patch in my
latest patch set [1].
>
> > 2. During commit time in DecodeCommit we check whether we need to skip
> > the changes of the transaction or not by calling
> > SnapBuildXactNeedsSkip but since now we support streaming so it's
> > possible that before we decode the commit WAL, we might have already
> > sent the changes to the output plugin even though we could have
> > skipped those changes.  So my question is instead of checking at the
> > commit time can't we check before adding to ReorderBuffer itself
> >
>
> I think if we can do that then the same will be true for current code
> irrespective of this patch.  I think it is possible that we can't take
> that decision while decoding because we haven't assembled a consistent
> snapshot yet.  I think we might be able to do that while we try to
> stream the changes.  I think we need to take care of all the
> conditions during streaming (when the logical_decoding_workmem limit
> is reached) as we do in DecodeCommit.  This needs a bit more study.

I have analyzed this further and I think we cannot decide all the
conditions even while streaming.  IMHO, once we reach
SNAPBUILD_FULL_SNAPSHOT we can add the changes to the reorder buffer,
so that they can be sent if the transaction commits after we reach
SNAPBUILD_CONSISTENT.  However, if the commit happens before we reach
SNAPBUILD_CONSISTENT, then we need to ignore this transaction.  So even
with SNAPBUILD_FULL_SNAPSHOT we might stream changes that later get
dropped, and we cannot decide that while streaming.

[1] https://www.postgresql.org/message-id/CAFiTN-snMb%3D53oqkM8av8Lqfxojjm4OBwCNxmFssgLCceY_zgg%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Alvaro Herrera
Date:
I pushed 0005 (the rbtxn flags thing) after some light editing.
It's been around for long enough ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Alvaro Herrera
Date:
Here's a rebase of this patch series.  I didn't change anything except

1. disregard what was 0005, since I already pushed it.
2. roll 0003 into 0002.
3. rebase 0007 (now 0005) to account for the reorderbuffer changes.

(I did notice that 0005 adds a new boolean any_data_sent, which is
silly -- it should be another txn_flags bit.)
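
For example, something along these lines (the bit value and names are
only illustrative, mirroring the existing rbtxn flag macros):

/* Illustrative only: fold any_data_sent into txn_flags. */
#define RBTXN_ANY_DATA_SENT     0x0010

#define rbtxn_any_data_sent(txn) \
    (((txn)->txn_flags & RBTXN_ANY_DATA_SENT) != 0)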

However, tests don't pass for me; notably, test_decoding crashes.
OTOH I noticed that the streamed transaction support in test_decoding
writes the XID to the output, which is going to make it useless for
regression testing.  It probably should not emit the numerical values.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Alvaro Herrera
Date:

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Alvaro Herrera
Date:
On 2020-Jan-10, Alvaro Herrera wrote:

> From 7d671806584fff71067c8bde38b2f642ba1331a9 Mon Sep 17 00:00:00 2001
> From: Dilip Kumar <dilip.kumar@enterprisedb.com>
> Date: Wed, 20 Nov 2019 16:41:13 +0530
> Subject: [PATCH v6 10/12] Enable streaming for all subscription TAP tests

This patch turns a lot of tests into streamed mode.  While it's
great that streaming mode is tested, we should add new tests for it
rather than failing to keep tests for the non-streamed mode.  I suggest
that we add two versions of each test, one for each mode.  Maybe the way
to do that is to create some subroutine that can be called twice.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Jan 9, 2020 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Jan 8, 2020 at 1:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > I have observed one more design issue.
> > > >
> > >
> > > Good observation.
> > >
> > > >  The problem is that when we
> > > > get a toasted chunks we remember the changes in the memory(hash table)
> > > > but don't stream until we get the actual change on the main table.
> > > > Now, the problem is that we might get the change of the toasted table
> > > > and the main table in different streams.  So basically, in a stream,
> > > > if we have only got the toasted tuples then even after
> > > > ReorderBufferStreamTXN the memory usage will not be reduced.
> > > >
> > >
> > > I think we can't split such changes in a different stream (unless we
> > > design an entirely new solution to send partial changes of toast
> > > data), so we need to send them together. We can keep a flag like
> > > data_complete in ReorderBufferTxn and mark it complete only when we
> > > are able to assemble the entire tuple.  Now, whenever, we try to
> > > stream the changes once we reach the memory threshold, we can check
> > > whether the data_complete flag is true, if so, then only send the
> > > changes, otherwise, we can pick the next largest transaction.  I think
> > > we can retry it for few times and if we get the incomplete data for
> > > multiple transactions, then we can decide to spill the transaction or
> > > maybe we can directly spill the first largest transaction which has
> > > incomplete data.
> > >
> > Yeah, we might do something on this line.  Basically, we need to mark
> > the top-transaction as data-incomplete if any of its subtransaction is
> > having data-incomplete (it will always be the latest sub-transaction
> > of the top transaction).  Also, for streaming, we are checking the
> > largest top transaction whereas for spilling we just need the larget
> > (sub) transaction.   So we also need to decide while picking the
> > largest top transaction for streaming, if we get a few transactions
> > with in-complete data then how we will go for the spill.  Do we spill
> > all the sub-transactions under this top transaction or we will again
> > find the larget (sub) transaction for spilling.
> >
>
> I think it is better to do later as that will lead to the spill of
> only required (minimum changes to get the memory below threshold)
> changes.
Instead of doing this, can't we just spill the changes which are in
the toast_hash?  Basically, at the end of the stream we may have some
toast tuples which we could not stream because we did not get the
insert for the main table; we could spill only those changes which are
in the toast hash.  Then, in a subsequent stream, whenever we get the
insert for the main table we can restore those changes and stream them.
We could also maintain a flag saying the data is not complete (with the
LSN of that change), and after that LSN spill any toast change to disk
until we get the change for the main table; that way we can avoid
building the toast hash until we get the change for the main table.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Jan 13, 2020 at 3:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Jan 9, 2020 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > >  The problem is that when we
> > > > > get a toasted chunks we remember the changes in the memory(hash table)
> > > > > but don't stream until we get the actual change on the main table.
> > > > > Now, the problem is that we might get the change of the toasted table
> > > > > and the main table in different streams.  So basically, in a stream,
> > > > > if we have only got the toasted tuples then even after
> > > > > ReorderBufferStreamTXN the memory usage will not be reduced.
> > > > >
> > > >
> > > > I think we can't split such changes in a different stream (unless we
> > > > design an entirely new solution to send partial changes of toast
> > > > data), so we need to send them together. We can keep a flag like
> > > > data_complete in ReorderBufferTxn and mark it complete only when we
> > > > are able to assemble the entire tuple.  Now, whenever, we try to
> > > > stream the changes once we reach the memory threshold, we can check
> > > > whether the data_complete flag is true

Here, we can also consider streaming the changes when data_complete is
false but some additional changes have been added to the same txn since
the last attempt, as the new changes might complete the tuple.

> > > > , if so, then only send the
> > > > changes, otherwise, we can pick the next largest transaction.  I think
> > > > we can retry it for few times and if we get the incomplete data for
> > > > multiple transactions, then we can decide to spill the transaction or
> > > > maybe we can directly spill the first largest transaction which has
> > > > incomplete data.
> > > >
> > > Yeah, we might do something on this line.  Basically, we need to mark
> > > the top-transaction as data-incomplete if any of its subtransaction is
> > > having data-incomplete (it will always be the latest sub-transaction
> > > of the top transaction).  Also, for streaming, we are checking the
> > > largest top transaction whereas for spilling we just need the larget
> > > (sub) transaction.   So we also need to decide while picking the
> > > largest top transaction for streaming, if we get a few transactions
> > > with in-complete data then how we will go for the spill.  Do we spill
> > > all the sub-transactions under this top transaction or we will again
> > > find the larget (sub) transaction for spilling.
> > >
> >
> > I think it is better to do later as that will lead to the spill of
> > only required (minimum changes to get the memory below threshold)
> > changes.
> I think instead of doing this can't we just spill the changes which
> are in toast_hash.  Basically, at the end of the stream, we have some
> toast tuple which we could not stream because we did not have the
> insert for the main table then we can spill only those changes which
> are in tuple hash.
>

Hmm, I think this can turn out to be inefficient because we can easily
end up spilling the data even when we don't need to do so.  Consider
cases where part of the streamed changes are for toast and the
remaining are changes which we would have streamed and hence can be
removed.  In such cases, we could have easily consumed the remaining
changes for toast without spilling.  Also, I am not sure if spilling
changes from the hash table is a good idea, as they are no longer in
the same order as they were in the ReorderBuffer, which means the order
in which we would normally serialize the changes would change and that
might have some impact; so we would need some more study if we want to
pursue this idea.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Tue, Jan 14, 2020 at 10:56:37AM +0530, Dilip Kumar wrote:
>On Sat, Jan 11, 2020 at 3:07 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>
>> On 2020-Jan-10, Alvaro Herrera wrote:
>>
>> > Here's a rebase of this patch series.  I didn't change anything except
>>
>> ... this time with attachments ...
>The patch set fails to apply on the head so rebased. (Rebased on
>commit cebf9d6e6ee13cbf9f1a91ec633cf96780ffc985)
>

I've noticed the patch was in WoA state since 2019/12/01, but there's
been quite a lot of traffic on this thread and a bunch of new patch
versions. So I've switched it to "needs review" - if that's not the
right status, let me know.

Also, the patch was moved forward mostly by Amit and Dilip, so I've
added them as authors in the CF app (well, what matters is the commit
message, of course, but let's keep this up to date too).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jan 13, 2020 at 3:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Jan 9, 2020 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Jan 9, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Thu, Jan 9, 2020 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > >  The problem is that when we
> > > > > > get a toasted chunks we remember the changes in the memory(hash table)
> > > > > > but don't stream until we get the actual change on the main table.
> > > > > > Now, the problem is that we might get the change of the toasted table
> > > > > > and the main table in different streams.  So basically, in a stream,
> > > > > > if we have only got the toasted tuples then even after
> > > > > > ReorderBufferStreamTXN the memory usage will not be reduced.
> > > > > >
> > > > >
> > > > > I think we can't split such changes in a different stream (unless we
> > > > > design an entirely new solution to send partial changes of toast
> > > > > data), so we need to send them together. We can keep a flag like
> > > > > data_complete in ReorderBufferTxn and mark it complete only when we
> > > > > are able to assemble the entire tuple.  Now, whenever, we try to
> > > > > stream the changes once we reach the memory threshold, we can check
> > > > > whether the data_complete flag is true
>
> Here, we can also consider streaming the changes when data_complete is
> false, but some additional changes have been added to the same txn as
> the new changes might complete the tuple.
>
> > > > > , if so, then only send the
> > > > > changes, otherwise, we can pick the next largest transaction.  I think
> > > > > we can retry it for few times and if we get the incomplete data for
> > > > > multiple transactions, then we can decide to spill the transaction or
> > > > > maybe we can directly spill the first largest transaction which has
> > > > > incomplete data.
> > > > >
> > > > Yeah, we might do something on this line.  Basically, we need to mark
> > > > the top-transaction as data-incomplete if any of its subtransaction is
> > > > having data-incomplete (it will always be the latest sub-transaction
> > > > of the top transaction).  Also, for streaming, we are checking the
> > > > largest top transaction whereas for spilling we just need the larget
> > > > (sub) transaction.   So we also need to decide while picking the
> > > > largest top transaction for streaming, if we get a few transactions
> > > > with in-complete data then how we will go for the spill.  Do we spill
> > > > all the sub-transactions under this top transaction or we will again
> > > > find the larget (sub) transaction for spilling.
> > > >
> > >
> > > I think it is better to do later as that will lead to the spill of
> > > only required (minimum changes to get the memory below threshold)
> > > changes.
> > I think instead of doing this can't we just spill the changes which
> > are in toast_hash.  Basically, at the end of the stream, we have some
> > toast tuple which we could not stream because we did not have the
> > insert for the main table then we can spill only those changes which
> > are in tuple hash.
> >
>
> Hmm, I think this can turn out to be inefficient because we can easily
> end up spilling the data even when we don't need to so.  Consider
> cases, where part of the streamed changes are for toast, and remaining
> are the changes which we would have streamed and hence can be removed.
> In such cases, we could have easily consumed remaining changes for
> toast without spilling.  Also, I am not sure if spilling changes from
> the hash table is a good idea as they are no more in the same order as
> they were in ReorderBuffer which means the order in which we serialize
> the changes normally would change and that might have some impact, so
> we would need some more study if we want to pursue this idea.
I have fixed this bug and attached it as a separate patch.  I will
merge it into the main patch once we agree on the idea and after some
more testing.

The idea is that whenever we get a toasted chunk, instead of directly
inserting it into the toast hash I insert it into a local list, so that
if we don't get the change for the main table we can insert these
changes back into txn->changes.  Once we get the change for the main
table, at that point I prepare the hash table to merge the chunks.  If
the stream is over and we haven't got the changes for the main table,
we mark the txn as having pending toast changes so that next time we
will not pick the same transaction for streaming.  This flag is cleared
whenever we get any change for the txn (insert or update).  There is
also the possibility that even after we stream the changes, rb->size is
not below logical_decoding_work_mem because we could not stream the
changes; to handle this, after streaming we recheck the size, and if it
is still not under control we pick another transaction.  In some cases
we might not get any transaction to stream because every candidate has
the pending-toast-change flag set; in that case we go for the spill.
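
For clarity, here is a rough sketch of the "give the chunks back" step
described above; the partial_toast_changes list and the
RBTXN_PENDING_TOAST flag are hypothetical names I am using only for
illustration:

/*
 * Sketch only, not the attached patch.  At the end of a stream, if the
 * change for the main table never arrived, push the collected toast
 * chunks back onto txn->changes and remember that this transaction is
 * currently not a good streaming candidate.
 */
static void
ReorderBufferReturnPendingToast(ReorderBufferTXN *txn,
                                dlist_head *partial_toast_changes)
{
    dlist_mutable_iter iter;

    dlist_foreach_modify(iter, partial_toast_changes)
    {
        ReorderBufferChange *change;

        change = dlist_container(ReorderBufferChange, node, iter.cur);

        dlist_delete(&change->node);
        dlist_push_tail(&txn->changes, &change->node);
    }

    /* cleared again once any non-toast change arrives for this txn */
    txn->txn_flags |= RBTXN_PENDING_TOAST;  /* hypothetical flag */
}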

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Sat, Jan 4, 2020 at 4:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
Update on the open items
> As per my understanding apart from the above comments, the known
> pending work for this patchset is as follows:
> a. The two open items agreed to you in the email [3].  -> The first part is done; the second part is an
improvement, not a bugfix.  I will try to work on it in the next patch set.

> b. Complete the handling of schema_sent as discussed above [4].  -> Done
> c. Few comments by Vignesh and the response on the same by me [5][6]. -> Done
> d. WAL overhead and performance testing for additional WAL logging by
> this patchset. -> Pending
> e. Some way to see the tuple for streamed transactions by decoding API
> as speculated by you [7]. -> Pending
f. Bug in the toast table handling -> Submitted as a separate POC
patch, which can be merged into the main patch after review and more testing.

> [3] - https://www.postgresql.org/message-id/CAFiTN-uT5YZE0egGhKdTteTjcGrPi8hb%3DFMPpr9_hEB7hozQ-Q%40mail.gmail.com
> [4] - https://www.postgresql.org/message-id/CAA4eK1KjD9x0mS4JxzCbu3gu-r6K7XJRV%2BZcGb3BH6U3x2uxew%40mail.gmail.com
> [5] - https://www.postgresql.org/message-id/CALDaNm0DNUojjt7CV-fa59_kFbQQ3rcMBtauvo44ttea7r9KaA%40mail.gmail.com
> [6] - https://www.postgresql.org/message-id/CAA4eK1%2BZvupW00c--dqEg8f3dHZDOGmA9xOQLyQHjRSoDi6AkQ%40mail.gmail.com
> [7] - https://www.postgresql.org/message-id/CAFiTN-t8PmKA1X4jEqKmkvs0ggWpy0APWpPuaJwpx2YpfAf97w%40mail.gmail.com


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Alvaro Herrera
Date:
I looked at this patchset and it seemed natural to apply 0008 next
(adding work_mem to subscriptions).  Attached is Dilip's latest version,
plus my review changes.  This will break the patch tester's logic; sorry
about that.

What part of this change is what sets the process's
logical_decoding_work_mem to the given value?  I was unable to figure
that out.  Is it missing or am I just stupid?

Changes:
* the patch adds logical_decoding_work_mem SGML, but that has already
  been applied (cec2edfa7859); remove dupe.

* parse_subscription_options() comment says that it will raise an error if a
  caller does not pass the pointer for an option but option list
  specifies that option.  It does not really implement that behavior (an
  existing problem): instead, if the pointer is not passed, the option
  is ignored.  Moreover, this new patch continued to fail to handle
  things as the comment says.  I decided to implement the documented
  behavior instead; it's now inconsistent with how the other options are
  implemented.  I think we should fix the other options to behave as the
  comment says, because it's a more convenient API; if we instead opted
  to update the code comment to match the code, each caller would have
  to be checked to verify that the correct options are passed, which is
  pointless and error prone.

* the parse_subscription_options API is a mess.  I reordered the
  arguments a little bit; also change the argument layout in callers so
  that each caller is grouped more sensibly.  Also added comments to
  simplify reading the argument lists.  I think this could be fixed by
  using an ad-hoc struct to pass in and out.  Didn't get around to doing
  that, seems an unrelated potential improvement.

* trying to do own range checking in pgoutput and subscriptioncmds.c
  seems pointless and likely to get out of sync with guc.c.  Simpler is
  to call set_config_option() to verify that the argument is in range.
  (Note a further problem in the patch series: the range check in
  subscriptioncmds.c is only added in patch 0009).

* parsing integers using scanint8() seemed weird (error messages there
  do not correspond to what we want).  After a couple of false starts, I
  decided to rely on guc.c's set_config_option() followed by parse_int().
  That also has the benefit that you can give it units.  (A small sketch
  of that pattern follows after this list.)

* psql \dRs+ should display the work_mem; patch failed to do that.
  Added.  Unit display is done by pg_size_pretty(), which might be
  different from what guc.c does, but I think it works OK.
  It's the first place where we use pg_size_pretty to show a memory
  limit, however.
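
To illustrate the parse_int() point from the list above, roughly this
pattern (the helper name and error wording are mine, not the attached
patch):

/*
 * Sketch only: validate a work_mem-style option the way guc.c does,
 * so values such as "64kB" are accepted and malformed input gets the
 * usual GUC-style hint.
 */
static int
validate_work_mem_option(const char *valuestr)
{
    int         wm_kb;
    const char *hintmsg;

    if (!parse_int(valuestr, &wm_kb, GUC_UNIT_KB, &hintmsg))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("invalid value for work_mem: \"%s\"", valuestr),
                 hintmsg ? errhint("%s", _(hintmsg)) : 0));

    return wm_kb;
}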

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Jan 22, 2020 at 10:07 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
>
> I looked at this patchset and it seemed natural to apply 0008 next
> (adding work_mem to subscriptions).
>

I am not so sure whether we need this patch, as the exact scenario
where it can help is not very clear to me, nor has anyone explained it.
I have raised this concern earlier as well [1].  The point is that
'logical_decoding_work_mem' applies to the entire ReorderBuffer on the
publisher's side, so how will a parameter from a particular
subscription help with that?

[1] - https://www.postgresql.org/message-id/CAA4eK1J%2B3kab6RSZrgj0YiQV1r%2BH3FWVaNjKhWvpEe5-bpZiBw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Hmm, I think this can turn out to be inefficient because we can easily
> > end up spilling the data even when we don't need to so.  Consider
> > cases, where part of the streamed changes are for toast, and remaining
> > are the changes which we would have streamed and hence can be removed.
> > In such cases, we could have easily consumed remaining changes for
> > toast without spilling.  Also, I am not sure if spilling changes from
> > the hash table is a good idea as they are no more in the same order as
> > they were in ReorderBuffer which means the order in which we serialize
> > the changes normally would change and that might have some impact, so
> > we would need some more study if we want to pursue this idea.
> I have fixed this bug and attached it as a separate patch.  I will
> merge it to the main patch after we agree with the idea and after some
> more testing.
>
> The idea is that whenever we get the toasted chunk instead of directly
> inserting it into the toast hash I am inserting it into some local
> list so that if we don't get the change for the main table then we can
> insert these changes back to the txn->changes.  So once we get the
> change for the main table at that time I am preparing the hash table
> to merge the chunks.
>


I think this idea will work but it appears to be quite costly, because
(a) you might need to serialize/deserialize the changes multiple times
and might attempt streaming multiple times even though you can't do so,
and (b) you need to remove/add the same set of changes from the main
list multiple times.

It seems to me that we need to add all of this new handling because,
while taking the decision whether to stream or not, we don't know
whether the txn has changes that can't be streamed.  One idea to make
it work is to identify this while decoding the WAL.  I think we need to
set a bit in the insert/delete WAL record to identify whether the tuple
belongs to a toast relation.  This won't add any additional overhead to
the WAL, will reduce a lot of complexity in the logical decoding, and
decoding will also be efficient.  If this is feasible, then we can do
the same for speculative insertions.
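
Roughly what I have in mind, as two fragments (the flag name and the
decode-side field are hypothetical, shown only to illustrate the idea;
the actual bit value and placement would be decided in the patch):

/* heapam.c, while filling in the xl_heap_insert record (sketch only) */
if (IsToastRelation(relation))
    xlrec.flags |= XLH_INSERT_ON_TOAST_RELATION;    /* hypothetical bit */

/* decode.c, DecodeInsert(): remember the fact on the change (sketch only) */
change->data.tp.on_toast_relation =                 /* hypothetical field */
    (xlrec->flags & XLH_INSERT_ON_TOAST_RELATION) != 0;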

In patch v8-0013-Bugfix-handling-of-incomplete-toast-tuple, why is
below change required?

--- a/contrib/test_decoding/logical.conf
+++ b/contrib/test_decoding/logical.conf
@@ -1,3 +1,4 @@
 wal_level = logical
 max_replication_slots = 4
 logical_decoding_work_mem = 64kB
+logging_collector=on


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, Jan 28, 2020 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Hmm, I think this can turn out to be inefficient because we can easily
> > > end up spilling the data even when we don't need to so.  Consider
> > > cases, where part of the streamed changes are for toast, and remaining
> > > are the changes which we would have streamed and hence can be removed.
> > > In such cases, we could have easily consumed remaining changes for
> > > toast without spilling.  Also, I am not sure if spilling changes from
> > > the hash table is a good idea as they are no more in the same order as
> > > they were in ReorderBuffer which means the order in which we serialize
> > > the changes normally would change and that might have some impact, so
> > > we would need some more study if we want to pursue this idea.
> > I have fixed this bug and attached it as a separate patch.  I will
> > merge it to the main patch after we agree with the idea and after some
> > more testing.
> >
> > The idea is that whenever we get the toasted chunk instead of directly
> > inserting it into the toast hash I am inserting it into some local
> > list so that if we don't get the change for the main table then we can
> > insert these changes back to the txn->changes.  So once we get the
> > change for the main table at that time I am preparing the hash table
> > to merge the chunks.
> >
>
>
> I think this idea will work but appears to be quite costly because (a)
> you might need to serialize/deserialize the changes multiple times and
> might attempt streaming multiple times even though you can't do (b)
> you need to remove/add the same set of changes from the main list
> multiple times.
I agree with this.
>
> It seems to me that we need to add all of this new handling because
> while taking the decision whether to stream or not we don't know
> whether the txn has changes that can't be streamed.  One idea to make
> it work is that we identify it while decoding the WAL.  I think we
> need to set a bit in the insert/delete WAL record to identify if the
> tuple belongs to a toast relation.  This won't add any additional
> overhead in WAL and reduce a lot of complexity in the logical decoding
> and also decoding will be efficient.  If this is feasible, then we can
> do the same for speculative insertions.
The idea looks good to me.  I will work on this.

>
> In patch v8-0013-Bugfix-handling-of-incomplete-toast-tuple, why is
> below change required?
>
> --- a/contrib/test_decoding/logical.conf
> +++ b/contrib/test_decoding/logical.conf
> @@ -1,3 +1,4 @@
>  wal_level = logical
>  max_replication_slots = 4
>  logical_decoding_work_mem = 64kB
> +logging_collector=on
Sorry, these are some local changes which got included in the patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Jan 28, 2020 at 11:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jan 28, 2020 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > >
> > > > Hmm, I think this can turn out to be inefficient because we can easily
> > > > end up spilling the data even when we don't need to so.  Consider
> > > > cases, where part of the streamed changes are for toast, and remaining
> > > > are the changes which we would have streamed and hence can be removed.
> > > > In such cases, we could have easily consumed remaining changes for
> > > > toast without spilling.  Also, I am not sure if spilling changes from
> > > > the hash table is a good idea as they are no more in the same order as
> > > > they were in ReorderBuffer which means the order in which we serialize
> > > > the changes normally would change and that might have some impact, so
> > > > we would need some more study if we want to pursue this idea.
> > > I have fixed this bug and attached it as a separate patch.  I will
> > > merge it to the main patch after we agree with the idea and after some
> > > more testing.
> > >
> > > The idea is that whenever we get the toasted chunk instead of directly
> > > inserting it into the toast hash I am inserting it into some local
> > > list so that if we don't get the change for the main table then we can
> > > insert these changes back to the txn->changes.  So once we get the
> > > change for the main table at that time I am preparing the hash table
> > > to merge the chunks.
> > >
> >
> >
> > I think this idea will work but appears to be quite costly because (a)
> > you might need to serialize/deserialize the changes multiple times and
> > might attempt streaming multiple times even though you can't do (b)
> > you need to remove/add the same set of changes from the main list
> > multiple times.
> I agree with this.
> >
> > It seems to me that we need to add all of this new handling because
> > while taking the decision whether to stream or not we don't know
> > whether the txn has changes that can't be streamed.  One idea to make
> > it work is that we identify it while decoding the WAL.  I think we
> > need to set a bit in the insert/delete WAL record to identify if the
> > tuple belongs to a toast relation.  This won't add any additional
> > overhead in WAL and reduce a lot of complexity in the logical decoding
> > and also decoding will be efficient.  If this is feasible, then we can
> > do the same for speculative insertions.
> The Idea looks good to me.  I will work on this.
>

One more thing we can do is to identify whether the tuple belongs to a
toast relation while decoding it.  However, I think to do that we need
to have access to the relcache at that time, and that might add some
overhead as we need to do it for each tuple.  Can we investigate
what it will take to do that and whether it is better than setting a bit
during WAL logging?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jan 28, 2020 at 11:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Jan 28, 2020 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > >
> > > > > Hmm, I think this can turn out to be inefficient because we can easily
> > > > > end up spilling the data even when we don't need to so.  Consider
> > > > > cases, where part of the streamed changes are for toast, and remaining
> > > > > are the changes which we would have streamed and hence can be removed.
> > > > > In such cases, we could have easily consumed remaining changes for
> > > > > toast without spilling.  Also, I am not sure if spilling changes from
> > > > > the hash table is a good idea as they are no more in the same order as
> > > > > they were in ReorderBuffer which means the order in which we serialize
> > > > > the changes normally would change and that might have some impact, so
> > > > > we would need some more study if we want to pursue this idea.
> > > > I have fixed this bug and attached it as a separate patch.  I will
> > > > merge it to the main patch after we agree with the idea and after some
> > > > more testing.
> > > >
> > > > The idea is that whenever we get the toasted chunk instead of directly
> > > > inserting it into the toast hash I am inserting it into some local
> > > > list so that if we don't get the change for the main table then we can
> > > > insert these changes back to the txn->changes.  So once we get the
> > > > change for the main table at that time I am preparing the hash table
> > > > to merge the chunks.
> > > >
> > >
> > >
> > > I think this idea will work but appears to be quite costly because (a)
> > > you might need to serialize/deserialize the changes multiple times and
> > > might attempt streaming multiple times even though you can't do (b)
> > > you need to remove/add the same set of changes from the main list
> > > multiple times.
> > I agree with this.
> > >
> > > It seems to me that we need to add all of this new handling because
> > > while taking the decision whether to stream or not we don't know
> > > whether the txn has changes that can't be streamed.  One idea to make
> > > it work is that we identify it while decoding the WAL.  I think we
> > > need to set a bit in the insert/delete WAL record to identify if the
> > > tuple belongs to a toast relation.  This won't add any additional
> > > overhead in WAL and reduce a lot of complexity in the logical decoding
> > > and also decoding will be efficient.  If this is feasible, then we can
> > > do the same for speculative insertions.
> > The Idea looks good to me.  I will work on this.
> >
>
> One more thing we can do is to identify whether the tuple belongs to
> toast relation while decoding it.  However, I think to do that we need
> to have access to relcache at that time and that might add some
> overhead as we need to do that for each tuple. Can we investigate
> what it will take to do that and if it is better than setting a bit
> during WAL logging.

IMHO, for the catalog scan we will have to start/stop the transaction
for each change.  So do you want us to evaluate its performance?
Also, at the time we get the change we might not have the complete
historic snapshot ready to fetch the relcache entry.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Jan 28, 2020 at 11:58 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > > It seems to me that we need to add all of this new handling because
> > > > while taking the decision whether to stream or not we don't know
> > > > whether the txn has changes that can't be streamed.  One idea to make
> > > > it work is that we identify it while decoding the WAL.  I think we
> > > > need to set a bit in the insert/delete WAL record to identify if the
> > > > tuple belongs to a toast relation.  This won't add any additional
> > > > overhead in WAL and reduce a lot of complexity in the logical decoding
> > > > and also decoding will be efficient.  If this is feasible, then we can
> > > > do the same for speculative insertions.
> > > The Idea looks good to me.  I will work on this.
> > >
> >
> > One more thing we can do is to identify whether the tuple belongs to
> > toast relation while decoding it.  However, I think to do that we need
> > to have access to relcache at that time and that might add some
> > overhead as we need to do that for each tuple. Can we investigate
> > what it will take to do that and if it is better than setting a bit
> > during WAL logging.
>
> IMHO, for the catalog scan, we will have to start/stop the transaction
> for each change.  So do you want that we should evaluate its
> performance?
>

No, I was not thinking about each change, but at the level of ReorderBufferTXN.

>  Also, during we get the change we might not have the
> complete historic snapshot ready to fetch the rel cache entry.
>

Before decoding each change (say DecodeInsert), we call
SnapBuildProcessChange.  Isn't that sufficient?

Even if the above is possible, I am not sure how good it is to fetch
the relcache entry for each change; that is the point I was worried about.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, Jan 28, 2020 at 1:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jan 28, 2020 at 11:58 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > > > It seems to me that we need to add all of this new handling because
> > > > > while taking the decision whether to stream or not we don't know
> > > > > whether the txn has changes that can't be streamed.  One idea to make
> > > > > it work is that we identify it while decoding the WAL.  I think we
> > > > > need to set a bit in the insert/delete WAL record to identify if the
> > > > > tuple belongs to a toast relation.  This won't add any additional
> > > > > overhead in WAL and reduce a lot of complexity in the logical decoding
> > > > > and also decoding will be efficient.  If this is feasible, then we can
> > > > > do the same for speculative insertions.
> > > > The Idea looks good to me.  I will work on this.
> > > >
> > >
> > > One more thing we can do is to identify whether the tuple belongs to
> > > toast relation while decoding it.  However, I think to do that we need
> > > to have access to relcache at that time and that might add some
> > > overhead as we need to do that for each tuple. Can we investigate
> > > what it will take to do that and if it is better than setting a bit
> > > during WAL logging.
> >
> > IMHO, for the catalog scan, we will have to start/stop the transaction
> > for each change.  So do you want that we should evaluate its
> > performance?
> >
>
> No, I was not thinking about each change, but at the level of ReorderBufferTXN.
That means we will have to keep that transaction open until we decode
the commit WAL for that ReorderBufferTXN, or do you have something else
in mind?

>
> >  Also, during we get the change we might not have the
> > complete historic snapshot ready to fetch the rel cache entry.
> >
>
> Before decoding each change (say DecodeInsert), we call
> SnapBuildProcessChange.  Isn't that sufficient?
Yeah, right, we can get the relcache entry based on the base snapshot,
and that might be sufficient to know whether it's a toast relation or
not.
>
> Even, if the above is possible, I am not sure how good is it for each
> change we fetch rel cache entry, that is the point I was worried.

We might not need to scan the catalog every time; we might get it from
the cache itself.
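
For what it's worth, a rough sketch of the relcache-based check being
discussed, assuming a usable historic snapshot has already been set up
(the helper name is made up; error handling and cache invalidation are
omitted):

    static bool
    ChangeIsForToastRelation(ReorderBufferChange *change)
    {
        Oid         reloid;
        Relation    relation;
        bool        result;

        /* map the change's relfilenode to a relation OID via the catalog */
        reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
                                    change->data.tp.relnode.relNode);
        if (!OidIsValid(reloid))
            return false;       /* relation dropped or rewritten later */

        relation = RelationIdGetRelation(reloid);
        result = (relation->rd_rel->relkind == RELKIND_TOASTVALUE);
        RelationClose(relation);

        return result;
    }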

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Jan 28, 2020 at 1:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jan 28, 2020 at 1:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jan 28, 2020 at 11:58 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > > > It seems to me that we need to add all of this new handling because
> > > > > > while taking the decision whether to stream or not we don't know
> > > > > > whether the txn has changes that can't be streamed.  One idea to make
> > > > > > it work is that we identify it while decoding the WAL.  I think we
> > > > > > need to set a bit in the insert/delete WAL record to identify if the
> > > > > > tuple belongs to a toast relation.  This won't add any additional
> > > > > > overhead in WAL and reduce a lot of complexity in the logical decoding
> > > > > > and also decoding will be efficient.  If this is feasible, then we can
> > > > > > do the same for speculative insertions.
> > > > > The Idea looks good to me.  I will work on this.
> > > > >
> > > >
> > > > One more thing we can do is to identify whether the tuple belongs to
> > > > toast relation while decoding it.  However, I think to do that we need
> > > > to have access to relcache at that time and that might add some
> > > > overhead as we need to do that for each tuple. Can we investigate
> > > > what it will take to do that and if it is better than setting a bit
> > > > during WAL logging.
> > >
> > > IMHO, for the catalog scan, we will have to start/stop the transaction
> > > for each change.  So do you want that we should evaluate its
> > > performance?
> > >
> >
> > No, I was not thinking about each change, but at the level of ReorderBufferTXN.
> That means we will have to keep that transaction open until we decode
> the commit WAL for that ReorderBufferTXN or you have anything else in
> mind?
>

or probably till we start streaming.

> >
> > >  Also, during we get the change we might not have the
> > > complete historic snapshot ready to fetch the rel cache entry.
> > >
> >
> > Before decoding each change (say DecodeInsert), we call
> > SnapBuildProcessChange.  Isn't that sufficient?
> Yeah, Right, we can get some recache entry based on the base snapshot.
> And, that might be sufficient to know whether it's a toast relation or
> not.
> >
> > Even, if the above is possible, I am not sure how good is it for each
> > change we fetch rel cache entry, that is the point I was worried.
>
> We might not need to scan the catalog every time, we might get it from
> the cache itself.
>

Right, but I am not completely sure if that is better than setting a
bit in the WAL record for toast tuples.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Jan 10, 2020 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> >
> > Few more comments:
> > --------------------------------
> > v4-0007-Implement-streaming-mode-in-ReorderBuffer
> > 1.
> > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > {
> > ..
> > + /*
> > + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
> > + * information about
> > subtransactions, which could arrive after streaming start.
> > + */
> > + if (!txn->is_schema_sent)
> > + snapshot_now
> > = ReorderBufferCopySnap(rb, txn->base_snapshot,
> > + txn,
> > command_id);
> > ..
> > }
> >
> > Why are we using base snapshot here instead of the snapshot we saved
> > the first time streaming has happened?  And as mentioned in comments,
> > won't we need to consider the snapshots for subtransactions that
> > arrived after the last time we have streamed the changes?
> Fixed
>

+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
+ /*
+ * We can not use txn->snapshot_now directly because after we there
+ * might be some new sub-transaction which after the last streaming run
+ * so we need to add those sub-xip in the snapshot.
+ */
+ snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
+ txn, command_id);

"because after we there", you seem to forget a word between 'we' and
'there'.  So as we are copying it now, does this mean it will consider
the snapshots for subtransactions that arrived after the last time we
have streamed the changes? If so, have you tested it and can we add
the same in comments.

Also, if we need to copy the snapshot here, then do we need to again
copy it in ReorderBufferProcessTXN(in below code and in catch block in
the same function).

{
..
+ /*
+ * Remember the command ID and snapshot if transaction is streaming
+ * otherwise free the snapshot if we have copied it.
+ */
+ if (streaming)
+ {
+ txn->command_id = command_id;
+ txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+   txn, command_id);
+ }
+ else if (snapshot_now->copied)
+ ReorderBufferFreeSnap(rb, snapshot_now);
..
}

> >
> > 4. In ReorderBufferStreamTXN(), don't we need to set some of the txn
> > fields like origin_id, origin_lsn as we do in ReorderBufferCommit()
> > especially to cover the case when it gets called due to memory
> > overflow (aka via ReorderBufferCheckMemoryLimit).
> We get origin_lsn during commit time so I am not sure how can we do
> that.  I have also noticed that currently, we are not using origin_lsn
> on the subscriber side.  I think need more investigation that if we
> want this then do we need to log it early.
>

Have you done any investigation of this point?  You might want to look
at the pg_replication_origin* APIs.  Today, looking at this code again,
I think with the current coding it won't be used even when we encounter
the commit record, because ReorderBufferCommit calls
ReorderBufferStreamCommit, which will make sure that origin_id and
origin_lsn are never sent.  I think at least that should be fixed; if
not, we probably need a comment with the reasoning why we think it is
okay not to do it in this case.

+ /*
+ * If we are streaming the in-progress transaction then Discard the

/Discard/discard

> >
> > v4-0006-Gracefully-handle-concurrent-aborts-of-uncommitte
> > 1.
> > + /*
> > + * If CheckXidAlive is valid, then we check if it aborted. If it did, we
> > + * error out
> > + */
> > + if (TransactionIdIsValid(CheckXidAlive) &&
> > + !TransactionIdIsInProgress(CheckXidAlive) &&
> > + !TransactionIdDidCommit(CheckXidAlive))
> > + ereport(ERROR,
> > + (errcode(ERRCODE_TRANSACTION_ROLLBACK),
> > + errmsg("transaction aborted during system catalog scan")));
> >
> > Why here we can't use TransactionIdDidAbort?  If we can't use it, then
> > can you add comments stating the reason of the same.
> Done

+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+ * error out.  Instead of directly checking the abort status we do check
+ * if it is not in progress transaction and no committed. Because if there
+ * were a system crash then status of the the transaction which were running
+ * at that time might not have marked.  So we need to consider them as
+ * aborted.  Refer detailed comments at snapmgr.c where the variable is
+ * declared.


How about replacing the above comment with below one:

If CheckXidAlive is valid, then we check if it aborted. If it did, we
error out.  We can't directly use TransactionIdDidAbort as after crash
such transaction might not have been marked as aborted.  See detailed
comments at snapmgr.c where the variable is declared.


I am not able to understand the change in
v8-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.  Do you have
any explanation for the same?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Jan 30, 2020 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 10, 2020 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > >
> > > Few more comments:
> > > --------------------------------
> > > v4-0007-Implement-streaming-mode-in-ReorderBuffer
> > > 1.
> > > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > {
> > > ..
> > > + /*
> > > + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
> > > + * information about
> > > subtransactions, which could arrive after streaming start.
> > > + */
> > > + if (!txn->is_schema_sent)
> > > + snapshot_now
> > > = ReorderBufferCopySnap(rb, txn->base_snapshot,
> > > + txn,
> > > command_id);
> > > ..
> > > }
> > >
> > > Why are we using base snapshot here instead of the snapshot we saved
> > > the first time streaming has happened?  And as mentioned in comments,
> > > won't we need to consider the snapshots for subtransactions that
> > > arrived after the last time we have streamed the changes?
> > Fixed
> >
>
> +static void
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> + /*
> + * We can not use txn->snapshot_now directly because after we there
> + * might be some new sub-transaction which after the last streaming run
> + * so we need to add those sub-xip in the snapshot.
> + */
> + snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
> + txn, command_id);
>
> "because after we there", you seem to forget a word between 'we' and
> 'there'.  So as we are copying it now, does this mean it will consider
> the snapshots for subtransactions that arrived after the last time we
> have streamed the changes? If so, have you tested it and can we add
> the same in comments.

Ok
> Also, if we need to copy the snapshot here, then do we need to again
> copy it in ReorderBufferProcessTXN(in below code and in catch block in
> the same function).
I think so, because as part of the
"REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT" change we might point directly
to the snapshot, and that will get truncated when we truncate all the
changes of the ReorderBufferTXN.   So I think we can check: if
snapshot_now->copied is true then we can avoid copying, otherwise we
copy?

Other comments look fine to me so I will reply to them along with the
next version of the patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Jan 30, 2020 at 6:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Jan 30, 2020 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Also, if we need to copy the snapshot here, then do we need to again
> > copy it in ReorderBufferProcessTXN(in below code and in catch block in
> > the same function).
> I think so because as part of the
> "REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT" change, we might directly
> point to the snapshot and that will get truncated when we truncate all
> the changes of the ReorderBufferTXN.   So I think we can check if
> snapshot_now->copied is true then we can avoid copying otherwise we
> can copy?
>

Yeah, that makes sense, but I think then we also need to ensure that
ReorderBufferStreamTXN frees the snapshot only when it is copied.  It
seems to me it should always be copied at the place where we are
trying to free it, so probably we should have an Assert there.
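
A sketch of how that could look, with the copy-avoidance in
ReorderBufferProcessTXN and the Assert at the free site (placement is
illustrative only, not the final patch code):

    /* ReorderBufferProcessTXN(), when remembering the snapshot for streaming: */
    if (streaming)
    {
        txn->command_id = command_id;
        txn->snapshot_now = snapshot_now->copied
            ? snapshot_now
            : ReorderBufferCopySnap(rb, snapshot_now, txn, command_id);
    }

    /* ReorderBufferStreamTXN(), where the snapshot is eventually freed: */
    Assert(snapshot_now->copied);
    ReorderBufferFreeSnap(rb, snapshot_now);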

One more thing:
ReorderBufferProcessTXN()
{
..
+ if (streaming)
+ {
+ /*
+ * While streaming an in-progress transaction there is a
+ * possibility that the (sub)transaction might get aborted
+ * concurrently.  In such case if the (sub)transaction has
+ * catalog update then we might decode the tuple using wrong
+ * catalog version.  So for detecting the concurrent abort we
+ * set CheckXidAlive to the current (sub)transaction's xid for
+ * which this change belongs to.  And, during catalog scan we
+ * can check the status of the xid and if it is aborted we will
+ * report an specific error which we can ignore.  We might have
+ * already streamed some of the changes for the aborted
+ * (sub)transaction, but that is fine because when we decode the
+ * abort we will stream abort message to truncate the changes in
+ * the subscriber.
+ */
+ CheckXidAlive = change->txn->xid;
+ }
..
}

I think it is better to move the above code into an inline function
(something like SetXidAlive).  It will make the code in function
ReorderBufferProcessTXN look cleaner and easier to understand.
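
One possible shape for the suggested helper; SetXidAlive is only a
proposed name here, not existing code:

    static inline void
    SetXidAlive(TransactionId xid)
    {
        /*
         * Remember which (sub)transaction's change is being processed so
         * that catalog scans can detect a concurrent abort and raise the
         * specific error the caller knows how to ignore.
         */
        CheckXidAlive = xid;
    }

    /* ReorderBufferProcessTXN(), per change: */
    if (streaming)
        SetXidAlive(change->txn->xid);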

> Other comments look fine to me so I will reply to them along with the
> next version of the patch.
>

Okay, thanks.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Jan 30, 2020 at 6:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Other comments look fine to me so I will reply to them along with the
> next version of the patch.
>

This still needs more work, so I have moved this to the next CF.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Jan 10, 2020 at 10:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Dec 30, 2019 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > > 2. During commit time in DecodeCommit we check whether we need to skip
> > > the changes of the transaction or not by calling
> > > SnapBuildXactNeedsSkip but since now we support streaming so it's
> > > possible that before we decode the commit WAL, we might have already
> > > sent the changes to the output plugin even though we could have
> > > skipped those changes.  So my question is instead of checking at the
> > > commit time can't we check before adding to ReorderBuffer itself
> > >
> >
> > I think if we can do that then the same will be true for current code
> > irrespective of this patch.  I think it is possible that we can't take
> > that decision while decoding because we haven't assembled a consistent
> > snapshot yet.  I think we might be able to do that while we try to
> > stream the changes.  I think we need to take care of all the
> > conditions during streaming (when the logical_decoding_workmem limit
> > is reached) as we do in DecodeCommit.  This needs a bit more study.
>
> I have analyzed this further and I think we can not decide all the
> conditions even while streaming.  Because IMHO once we get the
> SNAPBUILD_FULL_SNAPSHOT we can add the changes to the reorder buffer
> so that if we get the commit of the transaction after we reach to the
> SNAPBUILD_CONSISTENT.  However, if we get the commit before we reach
> to SNAPBUILD_CONSISTENT then we need to ignore this transaction.  Now,
> even if we have SNAPBUILD_FULL_SNAPSHOT we can stream the changes
> which might get dropped later but that we can not decide while
> streaming.
>

This makes sense to me, but when we are streaming we should add a
comment saying that we can't skip here the way we do at commit time,
for the reason you described above.  Also, what about the other
conditions under which we can skip the transaction, basically cases
like (a) when the transaction happened in another database, (b) when
the output plugin is not interested in the origin, and (c) when we are
doing fast-forwarding?
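
For reference, a condensed sketch of the skip checks DecodeCommit
applies today; the open question is which of these can be evaluated at
stream time (the snapshot-state check cannot, for the reason above):

    if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
        (parsed->dbId != InvalidOid &&
         parsed->dbId != ctx->slot->data.database) ||    /* (a) other database */
        FilterByOrigin(ctx, origin_id) ||                /* (b) origin filtered */
        ctx->fast_forward)                               /* (c) fast-forwarding */
    {
        /* skip: don't stream the transaction's changes */
    }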

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, Feb 3, 2020 at 9:51 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 10, 2020 at 10:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Dec 30, 2019 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > > 2. During commit time in DecodeCommit we check whether we need to skip
> > > > the changes of the transaction or not by calling
> > > > SnapBuildXactNeedsSkip but since now we support streaming so it's
> > > > possible that before we decode the commit WAL, we might have already
> > > > sent the changes to the output plugin even though we could have
> > > > skipped those changes.  So my question is instead of checking at the
> > > > commit time can't we check before adding to ReorderBuffer itself
> > > >
> > >
> > > I think if we can do that then the same will be true for current code
> > > irrespective of this patch.  I think it is possible that we can't take
> > > that decision while decoding because we haven't assembled a consistent
> > > snapshot yet.  I think we might be able to do that while we try to
> > > stream the changes.  I think we need to take care of all the
> > > conditions during streaming (when the logical_decoding_workmem limit
> > > is reached) as we do in DecodeCommit.  This needs a bit more study.
> >
> > I have analyzed this further and I think we can not decide all the
> > conditions even while streaming.  Because IMHO once we get the
> > SNAPBUILD_FULL_SNAPSHOT we can add the changes to the reorder buffer
> > so that if we get the commit of the transaction after we reach to the
> > SNAPBUILD_CONSISTENT.  However, if we get the commit before we reach
> > to SNAPBUILD_CONSISTENT then we need to ignore this transaction.  Now,
> > even if we have SNAPBUILD_FULL_SNAPSHOT we can stream the changes
> > which might get dropped later but that we can not decide while
> > streaming.
> >
>
> This makes sense to me, but we should add a comment for the same when
> we are streaming to say we can't skip similar to how we do during
> commit time because of the above reason described by you.  Also, what
> about other conditions where we can skip the transaction, basically
> cases like (a) when the transaction happened in another database,  (b)
> when the output plugin is not interested in the origin and (c) when we
> are doing fast-forwarding
I will analyze those and fix them in my next version of the patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jan 28, 2020 at 11:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Jan 28, 2020 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > >
> > > > > Hmm, I think this can turn out to be inefficient because we can easily
> > > > > end up spilling the data even when we don't need to so.  Consider
> > > > > cases, where part of the streamed changes are for toast, and remaining
> > > > > are the changes which we would have streamed and hence can be removed.
> > > > > In such cases, we could have easily consumed remaining changes for
> > > > > toast without spilling.  Also, I am not sure if spilling changes from
> > > > > the hash table is a good idea as they are no more in the same order as
> > > > > they were in ReorderBuffer which means the order in which we serialize
> > > > > the changes normally would change and that might have some impact, so
> > > > > we would need some more study if we want to pursue this idea.
> > > > I have fixed this bug and attached it as a separate patch.  I will
> > > > merge it to the main patch after we agree with the idea and after some
> > > > more testing.
> > > >
> > > > The idea is that whenever we get the toasted chunk instead of directly
> > > > inserting it into the toast hash I am inserting it into some local
> > > > list so that if we don't get the change for the main table then we can
> > > > insert these changes back to the txn->changes.  So once we get the
> > > > change for the main table at that time I am preparing the hash table
> > > > to merge the chunks.
> > > >
> > >
> > >
> > > I think this idea will work but appears to be quite costly because (a)
> > > you might need to serialize/deserialize the changes multiple times and
> > > might attempt streaming multiple times even though you can't do (b)
> > > you need to remove/add the same set of changes from the main list
> > > multiple times.
> > I agree with this.
> > >
> > > It seems to me that we need to add all of this new handling because
> > > while taking the decision whether to stream or not we don't know
> > > whether the txn has changes that can't be streamed.  One idea to make
> > > it work is that we identify it while decoding the WAL.  I think we
> > > need to set a bit in the insert/delete WAL record to identify if the
> > > tuple belongs to a toast relation.  This won't add any additional
> > > overhead in WAL and reduce a lot of complexity in the logical decoding
> > > and also decoding will be efficient.  If this is feasible, then we can
> > > do the same for speculative insertions.
> > The Idea looks good to me.  I will work on this.
> >
>
> One more thing we can do is to identify whether the tuple belongs to
> toast relation while decoding it.  However, I think to do that we need
> to have access to relcache at that time and that might add some
> overhead as we need to do that for each tuple.  Can we investigate
> what it will take to do that and if it is better than setting a bit
> during WAL logging.
>
I have done some more analysis on this and it appears that there are a
few problems in doing this.  Basically, once we get the confirmed flush
location, we advance the replication_slot_catalog_xmin so that vacuum
can garbage collect the old tuples.  So the problem is that while we are
collecting the changes in the ReorderBuffer, our catalog version might
have been removed, and we might not find any relation entry with that
relfilenodeid (because the relation was dropped or altered later).

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Feb 4, 2020 at 11:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > One more thing we can do is to identify whether the tuple belongs to
> > toast relation while decoding it.  However, I think to do that we need
> > to have access to relcache at that time and that might add some
> > overhead as we need to do that for each tuple.  Can we investigate
> > what it will take to do that and if it is better than setting a bit
> > during WAL logging.
> >
> I have done some more analysis on this and it appears that there are
> few problems in doing this.  Basically, once we get the confirmed
> flush location, we advance the replication_slot_catalog_xmin so that
> vacuum can garbage collect the old tuple.  So the problem is that
> while we are collecting the changes in the ReorderBuffer our catalog
> version might have removed,  and we might not find any relation entry
> with that relfilenodeid (because it is dropped or altered in the
> future).
>

Hmm, this means this can also occur while streaming the changes.  The
main reason, as I understand it, is that before decoding the commit we
don't know whether these changes have already been sent to the
subscriber (based on confirmed_flush_location/start_decoding_at).  I
think it is better to skip streaming such transactions, as we can't
make the right decision about them; and since this can generally happen
only for the first few transactions after a crash, it shouldn't matter
much if we serialize such transactions instead of streaming them.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Jan 30, 2020 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 10, 2020 at 10:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jan 6, 2020 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > >
> > > Few more comments:
> > > --------------------------------
> > > v4-0007-Implement-streaming-mode-in-ReorderBuffer
> > > 1.
> > > +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > {
> > > ..
> > > + /*
> > > + * TOCHECK: We have to rebuild historic snapshot to be sure it includes all
> > > + * information about
> > > subtransactions, which could arrive after streaming start.
> > > + */
> > > + if (!txn->is_schema_sent)
> > > + snapshot_now
> > > = ReorderBufferCopySnap(rb, txn->base_snapshot,
> > > + txn,
> > > command_id);
> > > ..
> > > }
> > >
> > > Why are we using base snapshot here instead of the snapshot we saved
> > > the first time streaming has happened?  And as mentioned in comments,
> > > won't we need to consider the snapshots for subtransactions that
> > > arrived after the last time we have streamed the changes?
> > Fixed
> >
>
> +static void
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> + /*
> + * We can not use txn->snapshot_now directly because after we there
> + * might be some new sub-transaction which after the last streaming run
> + * so we need to add those sub-xip in the snapshot.
> + */
> + snapshot_now = ReorderBufferCopySnap(rb, txn->snapshot_now,
> + txn, command_id);
>
> "because after we there", you seem to forget a word between 'we' and
> 'there'.
Fixed

> So as we are copying it now, does this mean it will consider
> the snapshots for subtransactions that arrived after the last time we
> have streamed the changes? If so, have you tested it and can we add
> the same in comments.
Yes, I have tested it.  Comment added.
>
> Also, if we need to copy the snapshot here, then do we need to again
> copy it in ReorderBufferProcessTXN(in below code and in catch block in
> the same function).
>
> {
> ..
> + /*
> + * Remember the command ID and snapshot if transaction is streaming
> + * otherwise free the snapshot if we have copied it.
> + */
> + if (streaming)
> + {
> + txn->command_id = command_id;
> + txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
> +   txn, command_id);
> + }
> + else if (snapshot_now->copied)
> + ReorderBufferFreeSnap(rb, snapshot_now);
> ..
> }
>
Fixed
> > >
> > > 4. In ReorderBufferStreamTXN(), don't we need to set some of the txn
> > > fields like origin_id, origin_lsn as we do in ReorderBufferCommit()
> > > especially to cover the case when it gets called due to memory
> > > overflow (aka via ReorderBufferCheckMemoryLimit).
> > We get origin_lsn during commit time so I am not sure how can we do
> > that.  I have also noticed that currently, we are not using origin_lsn
> > on the subscriber side.  I think need more investigation that if we
> > want this then do we need to log it early.
> >
>
> Have you done any investigation of this point?  You might want to look
> at pg_replication_origin* APIs.  Today, again looking at this code, I
> think with current coding, it won't be used even when we encounter
> commit record.  Because ReorderBufferCommit calls
> ReorderBufferStreamCommit which will make sure that origin_id and
> origin_lsn is never sent.  I think at least that should be fixed, if
> not, probably, we need a comment with reasoning why we think it is
> okay not to do in this case.
Still, the problem is the same because, currently, we are sending
origin_lsn as part of the "pgoutput_begin" message.  Now, for a
streaming transaction we have already sent the stream start.  We might
send this during the stream commit instead, but I am not completely
sure, because currently the consumer of this message,
"apply_handle_origin", just ignores it.  I have also looked into the
pg_replication_origin* APIs: they are used for setting the origin id
and tracking the progress, but they will not consume the origin_lsn we
are sending in pgoutput_begin, so this is not directly related.

>
> + /*
> + * If we are streaming the in-progress transaction then Discard the
>
> /Discard/discard
Done
>
> > >
> > > v4-0006-Gracefully-handle-concurrent-aborts-of-uncommitte
> > > 1.
> > > + /*
> > > + * If CheckXidAlive is valid, then we check if it aborted. If it did, we
> > > + * error out
> > > + */
> > > + if (TransactionIdIsValid(CheckXidAlive) &&
> > > + !TransactionIdIsInProgress(CheckXidAlive) &&
> > > + !TransactionIdDidCommit(CheckXidAlive))
> > > + ereport(ERROR,
> > > + (errcode(ERRCODE_TRANSACTION_ROLLBACK),
> > > + errmsg("transaction aborted during system catalog scan")));
> > >
> > > Why here we can't use TransactionIdDidAbort?  If we can't use it, then
> > > can you add comments stating the reason of the same.
> > Done
>
> + * If CheckXidAlive is valid, then we check if it aborted. If it did, we
> + * error out.  Instead of directly checking the abort status we do check
> + * if it is not in progress transaction and no committed. Because if there
> + * were a system crash then status of the the transaction which were running
> + * at that time might not have marked.  So we need to consider them as
> + * aborted.  Refer detailed comments at snapmgr.c where the variable is
> + * declared.
>
>
> How about replacing the above comment with below one:
>
> If CheckXidAlive is valid, then we check if it aborted. If it did, we
> error out.  We can't directly use TransactionIdDidAbort as after crash
> such transaction might not have been marked as aborted.  See detailed
> comments at snapmgr.c where the variable is declared.
Done
>
> I am not able to understand the change in
> v8-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.  Do you have
> any explanation for the same?

It appears that in ReorderBufferCommitChild we always set the
final_lsn of the subxacts, so it should not be invalid.  For testing, I
changed this to an assert and checked, but it never hit.  So maybe we
can remove this change.

Apart from that, I have fixed the toast tuple streaming bug by setting
a flag bit in the WAL (attached as 0012).  I have also extended this
solution to handle the speculative insert bug, so the old patch for the
speculative-insert bug fix is removed.  I am also exploring how we can
do this without setting the flag in the WAL, as we discussed upthread.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Fri, Jan 31, 2020 at 8:08 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 30, 2020 at 6:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Jan 30, 2020 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Also, if we need to copy the snapshot here, then do we need to again
> > > copy it in ReorderBufferProcessTXN(in below code and in catch block in
> > > the same function).
> > I think so because as part of the
> > "REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT" change, we might directly
> > point to the snapshot and that will get truncated when we truncate all
> > the changes of the ReorderBufferTXN.   So I think we can check if
> > snapshot_now->copied is true then we can avoid copying otherwise we
> > can copy?
> >
>
> Yeah, that makes sense, but I think then we also need to ensure that
> ReorderBufferStreamTXN frees the snapshot only when it is copied.  It
> seems to me it should be always copied in the place where we are
> trying to free it, so probably we should have an Assert there.
>
> One more thing:
> ReorderBufferProcessTXN()
> {
> ..
> + if (streaming)
> + {
> + /*
> + * While streaming an in-progress transaction there is a
> + * possibility that the (sub)transaction might get aborted
> + * concurrently.  In such case if the (sub)transaction has
> + * catalog update then we might decode the tuple using wrong
> + * catalog version.  So for detecting the concurrent abort we
> + * set CheckXidAlive to the current (sub)transaction's xid for
> + * which this change belongs to.  And, during catalog scan we
> + * can check the status of the xid and if it is aborted we will
> + * report an specific error which we can ignore.  We might have
> + * already streamed some of the changes for the aborted
> + * (sub)transaction, but that is fine because when we decode the
> + * abort we will stream abort message to truncate the changes in
> + * the subscriber.
> + */
> + CheckXidAlive = change->txn->xid;
> + }
> ..
> }
>
> I think it is better to move the above code into an inline function
> (something like SetXidAlive).  It will make the code in function
> ReorderBufferProcessTXN look cleaner and easier to understand.
>
Fixed in the latest version sent upthread.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Feb 5, 2020 at 9:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 4, 2020 at 11:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > One more thing we can do is to identify whether the tuple belongs to
> > > toast relation while decoding it.  However, I think to do that we need
> > > to have access to relcache at that time and that might add some
> > > overhead as we need to do that for each tuple.  Can we investigate
> > > what it will take to do that and if it is better than setting a bit
> > > during WAL logging.
> > >
> > I have done some more analysis on this and it appears that there are
> > few problems in doing this.  Basically, once we get the confirmed
> > flush location, we advance the replication_slot_catalog_xmin so that
> > vacuum can garbage collect the old tuple.  So the problem is that
> > while we are collecting the changes in the ReorderBuffer our catalog
> > version might have removed,  and we might not find any relation entry
> > with that relfilenodeid (because it is dropped or altered in the
> > future).
> >
>
> Hmm, this means this can also occur while streaming the changes.  The
> main reason as I understand is that it is because before decoding
> commit, we don't know whether these changes are already sent to the
> subscriber (based on confirmed_flush_location/start_decoding_at).
Right.

>I think it is better to skip streaming such transactions as we can't
> make the right decision about these and as this can happen generally
> after the crash for the first few transactions, it shouldn't matter
> much if we serialize such transactions instead of streaming them.

The idea makes sense to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Fixed in the latest version sent upthread.
>

Okay, thanks.  I haven't looked at the latest version of the patch
series, as I was reviewing the previous version, and I think all of
these comments apply to parts of the patch that were not modified.
Here are my comments:

I think we don't need to maintain
v8-0007-Support-logical_decoding_work_mem-set-from-create as per
discussion in one of the above emails [1] as its usage is not clear.

v8-0008-Add-support-for-streaming-to-built-in-replication
1.
-      information.  The allowed options are <literal>slot_name</literal> and
-      <literal>synchronous_commit</literal>
+      information.  The allowed options are <literal>slot_name</literal>,
+      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
+      and <literal>streaming</literal>.

As per the discussion above [1], I don't think we need work_mem here.
You might want to remove the other usage from the patch as well.

2.
@@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool
*connect, bool *enabled_given,
     bool *slot_name_given, char **slot_name,
     bool *copy_data, char **synchronous_commit,
     bool *refresh, int *logical_wm,
-    bool *logical_wm_given)
+    bool *logical_wm_given, bool *streaming,
+    bool *streaming_given)

It is not clear to me why we need two parameters 'streaming' and
'streaming_given' in this API.  Can't we handle it similarly to the
'refresh' parameter?

3.
diff --git a/src/backend/replication/logical/launcher.c
b/src/backend/replication/logical/launcher.c
index aec885e..e80d00c 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,6 +14,8 @@
  *
  *-------------------------------------------------------------------------
  */
+#include <sys/types.h>
+#include <unistd.h>

 #include "postgres.h"

I see only the above change in launcher.c.  Why do we need to include
these headers if there is no other change (at least not in this patch)?

4.
stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
  /* Push callback + info on the error context stack */
  state.ctx = ctx;
  state.callback_name = "stream_start";
- /* state.report_location = apply_lsn; */
+ state.report_location = InvalidXLogRecPtr;
  errcallback.callback = output_plugin_error_callback;
  errcallback.arg = (void *) &state;
  errcallback.previous = error_context_stack;
@@ -1194,7 +1194,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache,
ReorderBufferTXN *txn)
  /* Push callback + info on the error context stack */
  state.ctx = ctx;
  state.callback_name = "stream_stop";
- /* state.report_location = apply_lsn; */
+ state.report_location = InvalidXLogRecPtr;
  errcallback.callback = output_plugin_error_callback;
  errcallback.arg = (void *) &state;
  errcallback.previous = error_context_stack;

Don't we want to set txn->final_lsn as the report location, as we do
at a few other places?

5.
-logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
+logicalrep_write_delete(StringInfo out, TransactionId xid,
+ Relation rel, HeapTuple oldtuple)
 {
+ pq_sendbyte(out, 'D'); /* action DELETE */
+
  Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
     rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
     rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);

- pq_sendbyte(out, 'D'); /* action DELETE */

Why does this patch need to change the above code?

6.
+void
+logicalrep_write_stream_start(StringInfo out,
+   TransactionId xid, bool first_segment)
+{
+ pq_sendbyte(out, 'S'); /* action STREAM START */
+
+ Assert(TransactionIdIsValid(xid));
+
+ /* transaction ID (we're starting to stream, so must be valid) */
+ pq_sendint32(out, xid);
+
+ /* 1 if this is the first streaming segment for this xid */
+ pq_sendint32(out, first_segment ? 1 : 0);
+}
+
+TransactionId
+logicalrep_read_stream_start(StringInfo in, bool *first_segment)
+{
+ TransactionId xid;
+
+ Assert(first_segment);
+
+ xid = pq_getmsgint(in, 4);
+ *first_segment = (pq_getmsgint(in, 4) == 1);
+
+ return xid;
+}

In these functions, pq_sendint32 is used to send a bool.  Can't we use
pq_sendbyte, similar to what we do in boolsend?
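
A sketch of how the write/read pair might look with a single byte for
the flag (an illustrative rewrite of the functions quoted above):

    void
    logicalrep_write_stream_start(StringInfo out,
                                  TransactionId xid, bool first_segment)
    {
        pq_sendbyte(out, 'S');  /* action STREAM START */

        Assert(TransactionIdIsValid(xid));

        /* transaction ID (we're starting to stream, so must be valid) */
        pq_sendint32(out, xid);

        /* 1 if this is the first streaming segment for this xid */
        pq_sendbyte(out, first_segment ? 1 : 0);
    }

    TransactionId
    logicalrep_read_stream_start(StringInfo in, bool *first_segment)
    {
        TransactionId xid;

        Assert(first_segment);

        xid = pq_getmsgint(in, 4);
        *first_segment = (pq_getmsgbyte(in) == 1);

        return xid;
    }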

7.
+void
+logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
+{
+ pq_sendbyte(out, 'E'); /* action STREAM END */
+
+ Assert(TransactionIdIsValid(xid));
+
+ /* transaction ID (we're starting to stream, so must be valid) */
+ pq_sendint32(out, xid);
+}

The comment says 'starting to stream', whereas this function is for
stopping the stream.

8.
+void
+logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
+{
+ pq_sendbyte(out, 'E'); /* action STREAM END */
+
+ Assert(TransactionIdIsValid(xid));
+
+ /* transaction ID (we're starting to stream, so must be valid) */
+ pq_sendint32(out, xid);
+}
+
+TransactionId
+logicalrep_read_stream_stop(StringInfo in)
+{
+ TransactionId xid;
+
+ xid = pq_getmsgint(in, 4);
+
+ return xid;
+}

Is there a reason to send the xid when stopping the stream?  I don't
see any use of the function logicalrep_read_stream_stop.

9.
+ * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)
{
..
+ pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
..
+ pgstat_report_wait_end();
..
}

I see the calls to pgstat_report_wait_start/pgstat_report_wait_end in
this function, so I am not sure the above comment makes sense.

10.
+ * The files are placed in /tmp by default, and the filenames include both
+ * the XID of the toplevel transaction and OID of the subscription.

Are we keeping the files in /tmp or in pg's temp tablespace directory?
Looking at the code below, it doesn't seem that we place them in /tmp.
If I am correct, can you update the comment?
+static void
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+ char tempdirpath[MAXPGPATH];
+
+ TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);

11.
+ * The change is serialied in a simple format, with length (not including
+ * the length), action code (identifying the message type) and message
+ * contents (without the subxact TransactionId value).
+ *
..
+ */
+static void
+stream_write_change(char action, StringInfo s)

The part of the comment which says "with length (not including the
length) .." is not clear to me.  What does "not including the length"
mean?
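
Presumably the framing is: a 4-byte length that covers the action byte
plus the payload, but not the length field itself.  A guess at the
layout (the write() calls are only illustrative; the patch presumably
uses BufFile routines):

    int  len = 1 + s->len;            /* action byte + message contents */

    write(fd, &len, sizeof(len));     /* length field, not counted in 'len' */
    write(fd, &action, 1);            /* action code, e.g. 'I', 'U', 'D' */
    write(fd, s->data, s->len);       /* message contents */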

12.
+ * TODO: Add missing_ok flag to specify in which cases it's OK not to
+ * find the files, and when it's an error.
+ */
+static void
+stream_cleanup_files(Oid subid, TransactionId xid)

I think we can implement this TODO.  It is clear that when this
function is called from apply_handle_stream_commit, the file must
exist.  We can analyze the other callers of this API similarly.

13.
+apply_handle_stream_abort(StringInfo s)
{
..
+ /* FIXME optimize the search by bsearch on sorted data */
+ for (i = nsubxacts; i > 0; i--)
..

I am not sure how important this optimization is, so instead of FIXME
it is better to keep it as an XXX comment.  In the future, if we hit
any performance issue due to this, we can revisit our decision.

[1] - https://www.postgresql.org/message-id/CAA4eK1LH7xzF%2B-qHRv9EDXQTFYjPUYZw5B7FSK9QLEg7F603OQ%40mail.gmail.com


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Feb 5, 2020 at 9:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> >
> > I am not able to understand the change in
> > v8-0011-BUGFIX-set-final_lsn-for-subxacts-before-cleanup.  Do you have
> > any explanation for the same?
>
> It appears that in ReorderBufferCommitChild we are always setting the
> final_lsn of the subxacts so it should not be invalid.  For testing, I
> have changed this as an assert and checked but it never hit.  So maybe
> we can remove this change.
>

Tomas, do you remember anything about this change?  We are talking
about below change:

From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 26 Sep 2019 19:14:45 +0200
Subject: [PATCH v8 11/13] BUGFIX: set final_lsn for subxacts before cleanup

---
 src/backend/replication/logical/reorderbuffer.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/replication/logical/reorderbuffer.c
b/src/backend/replication/logical/reorderbuffer.c
index fe4e57c..beb6cd2 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1327,6 +1327,10 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn)

  subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);

+ /* make sure subtxn has final_lsn */
+ if (subtxn->final_lsn == InvalidXLogRecPtr)
+ subtxn->final_lsn = txn->final_lsn;
+

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Feb 5, 2020 at 4:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Fixed in the latest version sent upthread.
> >
>
> Okay, thanks.  I haven't looked at the latest version of patch series
> as I was reviewing the previous version and I think all of these
> comments are in the patch which is not modified.  Here are my
> comments:
>
> I think we don't need to maintain
> v8-0007-Support-logical_decoding_work_mem-set-from-create as per
> discussion in one of the above emails [1] as its usage is not clear.
>
> v8-0008-Add-support-for-streaming-to-built-in-replication
> 1.
> -      information.  The allowed options are <literal>slot_name</literal> and
> -      <literal>synchronous_commit</literal>
> +      information.  The allowed options are <literal>slot_name</literal>,
> +      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
> +      and <literal>streaming</literal>.
>
> As per the discussion above [1], I don't think we need work_mem here.
> You might want to remove the other usage from the patch as well.

After putting more thought into this, it appears that there could be
some use cases for setting the work_mem from the subscription.  Assume
a case where data are coming from two different origins, and based on
the origin ids different slots might collect different types of
changes.  So isn't it good to have a different work_mem for different
slots?  I am not saying that the current way of implementing it is the
best one, but we can improve it.  First, we need to decide whether we
have a use case for this or not.  Please let me know your thoughts on
the same.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Feb 7, 2020 at 4:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Feb 5, 2020 at 4:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > Fixed in the latest version sent upthread.
> > >
> >
> > Okay, thanks.  I haven't looked at the latest version of patch series
> > as I was reviewing the previous version and I think all of these
> > comments are in the patch which is not modified.  Here are my
> > comments:
> >
> > I think we don't need to maintain
> > v8-0007-Support-logical_decoding_work_mem-set-from-create as per
> > discussion in one of the above emails [1] as its usage is not clear.
> >
> > v8-0008-Add-support-for-streaming-to-built-in-replication
> > 1.
> > -      information.  The allowed options are <literal>slot_name</literal> and
> > -      <literal>synchronous_commit</literal>
> > +      information.  The allowed options are <literal>slot_name</literal>,
> > +      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
> > +      and <literal>streaming</literal>.
> >
> > As per the discussion above [1], I don't think we need work_mem here.
> > You might want to remove the other usage from the patch as well.
>
> After putting more thought on this it appears that there could be some
> use cases for setting the work_mem from the subscription,  Assume a
> case where data are coming from two different origins and based on the
> origin ids different slots might collect different type of changes,
> So isn't it good to have different work_mem for different slots?  I am
> not saying that the current way of implementing is the best one but
> that we can improve.  First, we need to decide whether we have a use
> case for this or not.
>

That is the whole point.  I don't see a very clear usage of this, and
nobody has explained clearly how it would be useful.  I am not saying
that what you are describing has no use, but as you said we might need
to invent an entirely new way even if we have such a use case.  I think
it is better to avoid requirements which are not essential for this
patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, Feb 10, 2020 at 1:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Feb 7, 2020 at 4:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Feb 5, 2020 at 4:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > Fixed in the latest version sent upthread.
> > > >
> > >
> > > Okay, thanks.  I haven't looked at the latest version of patch series
> > > as I was reviewing the previous version and I think all of these
> > > comments are in the patch which is not modified.  Here are my
> > > comments:
> > >
> > > I think we don't need to maintain
> > > v8-0007-Support-logical_decoding_work_mem-set-from-create as per
> > > discussion in one of the above emails [1] as its usage is not clear.
> > >
> > > v8-0008-Add-support-for-streaming-to-built-in-replication
> > > 1.
> > > -      information.  The allowed options are <literal>slot_name</literal> and
> > > -      <literal>synchronous_commit</literal>
> > > +      information.  The allowed options are <literal>slot_name</literal>,
> > > +      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
> > > +      and <literal>streaming</literal>.
> > >
> > > As per the discussion above [1], I don't think we need work_mem here.
> > > You might want to remove the other usage from the patch as well.
> >
> > After putting more thought on this it appears that there could be some
> > use cases for setting the work_mem from the subscription,  Assume a
> > case where data are coming from two different origins and based on the
> > origin ids different slots might collect different type of changes,
> > So isn't it good to have different work_mem for different slots?  I am
> > not saying that the current way of implementing is the best one but
> > that we can improve.  First, we need to decide whether we have a use
> > case for this or not.
> >
>
> That is the whole point.  I don't see a very clear usage of this and
> neither did anybody explained clearly how it will be useful.  I am not
> denying that what you are describing has no use, but as you said we
> might need to invent an entirely new way even if we have such a use.
> I think it is better to avoid the requirements which are not essential
> for this patch.

Ok, I will include this change in the next patch set.



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Feb 5, 2020 at 4:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Feb 5, 2020 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I think we don't need to maintain
> v8-0007-Support-logical_decoding_work_mem-set-from-create as per
> discussion in one of the above emails [1] as its usage is not clear.

Done

> v8-0008-Add-support-for-streaming-to-built-in-replication
> 1.
> -      information.  The allowed options are <literal>slot_name</literal> and
> -      <literal>synchronous_commit</literal>
> +      information.  The allowed options are <literal>slot_name</literal>,
> +      <literal>synchronous_commit</literal>, <literal>work_mem</literal>
> +      and <literal>streaming</literal>.
>
> As per the discussion above [1], I don't think we need work_mem here.
> You might want to remove the other usage from the patch as well.

Done

> 2.
> @@ -59,7 +59,8 @@ parse_subscription_options(List *options, bool
> *connect, bool *enabled_given,
>      bool *slot_name_given, char **slot_name,
>      bool *copy_data, char **synchronous_commit,
>      bool *refresh, int *logical_wm,
> -    bool *logical_wm_given)
> +    bool *logical_wm_given, bool *streaming,
> +    bool *streaming_given)
>
> It is not clear to me why we need two parameters 'streaming' and
> 'streaming_given' in this API.  Can't we handle similar to parameter
> 'refresh'?

We need to update the streaming option in the system table, so if we
don't remember whether the user has given its value, how will we know
whether to update this column or not?  Or are you suggesting that we
should always mark this as updated?  IMHO that is not a good idea.
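
To illustrate why both values are needed, here is a minimal sketch of
the option-parsing pattern being described (the helper name and error
wording are assumptions for illustration; the real
parse_subscription_options() takes many more output parameters):

    #include "postgres.h"
    #include "commands/defrem.h"     /* defGetBoolean() */
    #include "nodes/parsenodes.h"    /* DefElem */
    #include "nodes/pg_list.h"       /* List, foreach */

    /*
     * Sketch: parse the "streaming" option, remembering both its value and
     * whether the user specified it at all, so that ALTER SUBSCRIPTION
     * knows whether the catalog column should be updated.
     */
    static void
    parse_streaming_option(List *options, bool *streaming, bool *streaming_given)
    {
        ListCell   *lc;

        *streaming = false;
        *streaming_given = false;

        foreach(lc, options)
        {
            DefElem    *defel = (DefElem *) lfirst(lc);

            if (strcmp(defel->defname, "streaming") == 0)
            {
                if (*streaming_given)
                    ereport(ERROR,
                            (errcode(ERRCODE_SYNTAX_ERROR),
                             errmsg("conflicting or redundant options")));

                *streaming_given = true;    /* the user did set it */
                *streaming = defGetBoolean(defel);
            }
        }
    }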

> 3.
> diff --git a/src/backend/replication/logical/launcher.c
> b/src/backend/replication/logical/launcher.c
> index aec885e..e80d00c 100644
> --- a/src/backend/replication/logical/launcher.c
> +++ b/src/backend/replication/logical/launcher.c
> @@ -14,6 +14,8 @@
>   *
>   *-------------------------------------------------------------------------
>   */
> +#include <sys/types.h>
> +#include <unistd.h>
>
>  #include "postgres.h"
>
> I see only the above change in launcher.c.  Why we need to include
> these if there is no other change (at least not in this patch).

Removed

> 4.
> stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
>   /* Push callback + info on the error context stack */
>   state.ctx = ctx;
>   state.callback_name = "stream_start";
> - /* state.report_location = apply_lsn; */
> + state.report_location = InvalidXLogRecPtr;
>   errcallback.callback = output_plugin_error_callback;
>   errcallback.arg = (void *) &state;
>   errcallback.previous = error_context_stack;
> @@ -1194,7 +1194,7 @@ stream_stop_cb_wrapper(ReorderBuffer *cache,
> ReorderBufferTXN *txn)
>   /* Push callback + info on the error context stack */
>   state.ctx = ctx;
>   state.callback_name = "stream_stop";
> - /* state.report_location = apply_lsn; */
> + state.report_location = InvalidXLogRecPtr;
>   errcallback.callback = output_plugin_error_callback;
>   errcallback.arg = (void *) &state;
>   errcallback.previous = error_context_stack;
>
> Don't we want to set txn->final_lsn in report location as we do at few
> other places?

Fixed

> 5.
> -logicalrep_write_delete(StringInfo out, Relation rel, HeapTuple oldtuple)
> +logicalrep_write_delete(StringInfo out, TransactionId xid,
> + Relation rel, HeapTuple oldtuple)
>  {
> + pq_sendbyte(out, 'D'); /* action DELETE */
> +
>   Assert(rel->rd_rel->relreplident == REPLICA_IDENTITY_DEFAULT ||
>      rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL ||
>      rel->rd_rel->relreplident == REPLICA_IDENTITY_INDEX);
>
> - pq_sendbyte(out, 'D'); /* action DELETE */
>
> Why this patch need to change the above code?

Fixed

> 6.
> +void
> +logicalrep_write_stream_start(StringInfo out,
> +   TransactionId xid, bool first_segment)
> +{
> + pq_sendbyte(out, 'S'); /* action STREAM START */
> +
> + Assert(TransactionIdIsValid(xid));
> +
> + /* transaction ID (we're starting to stream, so must be valid) */
> + pq_sendint32(out, xid);
> +
> + /* 1 if this is the first streaming segment for this xid */
> + pq_sendint32(out, first_segment ? 1 : 0);
> +}
> +
> +TransactionId
> +logicalrep_read_stream_start(StringInfo in, bool *first_segment)
> +{
> + TransactionId xid;
> +
> + Assert(first_segment);
> +
> + xid = pq_getmsgint(in, 4);
> + *first_segment = (pq_getmsgint(in, 4) == 1);
> +
> + return xid;
> +}
>
> In these functions for sending bool, pq_sendint32 is used.  Can't we
> use pq_sendbyte similar to what we do in boolsend?

Done
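
For reference, a sketch of what the adjusted functions might look like
with the flag sent as a single byte (details may differ from the actual
patch):

    #include "postgres.h"
    #include "access/transam.h"      /* TransactionIdIsValid() */
    #include "libpq/pqformat.h"

    void
    logicalrep_write_stream_start(StringInfo out,
                                  TransactionId xid, bool first_segment)
    {
        pq_sendbyte(out, 'S');      /* action STREAM START */

        Assert(TransactionIdIsValid(xid));

        /* transaction ID (we're starting to stream, so must be valid) */
        pq_sendint32(out, xid);

        /* 1 if this is the first streaming segment for this xid */
        pq_sendbyte(out, first_segment ? 1 : 0);
    }

    TransactionId
    logicalrep_read_stream_start(StringInfo in, bool *first_segment)
    {
        TransactionId xid;

        Assert(first_segment);

        xid = pq_getmsgint(in, 4);
        *first_segment = (pq_getmsgbyte(in) == 1);

        return xid;
    }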

> 7.
> +void
> +logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
> +{
> + pq_sendbyte(out, 'E'); /* action STREAM END */
> +
> + Assert(TransactionIdIsValid(xid));
> +
> + /* transaction ID (we're starting to stream, so must be valid) */
> + pq_sendint32(out, xid);
> +}
>
> In comments, 'starting to stream' is mentioned whereas this function
> is to stop it.

Fixed

> 8.
> +void
> +logicalrep_write_stream_stop(StringInfo out, TransactionId xid)
> +{
> + pq_sendbyte(out, 'E'); /* action STREAM END */
> +
> + Assert(TransactionIdIsValid(xid));
> +
> + /* transaction ID (we're starting to stream, so must be valid) */
> + pq_sendint32(out, xid);
> +}
> +
> +TransactionId
> +logicalrep_read_stream_stop(StringInfo in)
> +{
> + TransactionId xid;
> +
> + xid = pq_getmsgint(in, 4);
> +
> + return xid;
> +}
>
> Is there a reason to send xid on stopping stream?  I don't see any use
> of function logicalrep_read_stream_stop.

Removed

> 9.
> + * XXX Add calls to pgstat_report_wait_start/pgstat_report_wait_end.
> + */
> +static void
> +subxact_info_write(Oid subid, TransactionId xid)
> {
> ..
> + pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);
> ..
> + pgstat_report_wait_end();
> ..
> }
>
> I see the calls to pgstat_report_wait_start/pgstat_report_wait_end in
> this function, so not sure if the above comment makes sense.

Fixed

> 10.
> + * The files are placed in /tmp by default, and the filenames include both
> + * the XID of the toplevel transaction and OID of the subscription.
>
> Are we keeping files in /tmp or pg's temp tablespace dir.  Seeing
> below code, it doesn't seem that we place them in /tmp.  If I am
> correct, then can you update the comment.
> +static void
> +subxact_filename(char *path, Oid subid, TransactionId xid)
> +{
> + char tempdirpath[MAXPGPATH];
> +
> + TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);

Done
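
For readers following along, the files end up under the database's
temporary-file directory rather than /tmp; roughly like the sketch below
(the exact file-name format here is an assumption for illustration, not
necessarily what the patch uses):

    #include "postgres.h"
    #include "catalog/pg_tablespace_d.h"   /* DEFAULTTABLESPACE_OID */
    #include "storage/fd.h"                /* TempTablespacePath() */

    static void
    subxact_filename(char *path, Oid subid, TransactionId xid)
    {
        char        tempdirpath[MAXPGPATH];

        /* for the default tablespace this yields "base/pgsql_tmp" */
        TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);

        /* illustrative name: subscription OID plus toplevel XID */
        snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
                 tempdirpath, subid, xid);
    }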

> 11.
> + * The change is serialied in a simple format, with length (not including
> + * the length), action code (identifying the message type) and message
> + * contents (without the subxact TransactionId value).
> + *
> ..
> + */
> +static void
> +stream_write_change(char action, StringInfo s)
>
> The part of the comment which says "with length (not including the
> length) .." is not clear to me.  What does "not including the length"
> mean?

Basically, it says that the 4 bytes used for storing the length of the
total data are not themselves included in that length.
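
In other words, the on-disk framing is <length><action><payload>, with
<length> covering the action byte and payload only. A rough sketch (the
file handle and the use of plain fwrite() are assumptions for
illustration; error handling is omitted):

    #include "postgres.h"
    #include "lib/stringinfo.h"

    static FILE *stream_fd;         /* file the streamed changes go to */

    static void
    stream_write_change(char action, StringInfo s)
    {
        int         len;

        /* length of action byte + payload, excluding the length word itself */
        len = sizeof(char) + (s->len - s->cursor);

        fwrite(&len, sizeof(len), 1, stream_fd);
        fwrite(&action, sizeof(action), 1, stream_fd);
        fwrite(s->data + s->cursor, s->len - s->cursor, 1, stream_fd);
    }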

> 12.
> + * TODO: Add missing_ok flag to specify in which cases it's OK not to
> + * find the files, and when it's an error.
> + */
> +static void
> +stream_cleanup_files(Oid subid, TransactionId xid)
>
> I think we can implement this TODO.  It is clear when this function is
> called from apply_handle_stream_commit, the file must exist.  We can
> similarly analyze other callers of this API.

Done

> 13.
> +apply_handle_stream_abort(StringInfo s)
> {
> ..
> + /* FIXME optimize the search by bsearch on sorted data */
> + for (i = nsubxacts; i > 0; i--)
> ..
>
> I am not sure how important this optimization is, so instead of FIXME,
> it is better to keep it as a XXX comment.  In the future, if we hit
> any performance issue due to this, we can revisit our decision.

Done

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Feb 13, 2020 at 8:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Feb 11, 2020 at 8:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> The patch set was not applying on the head so I have rebased it.

I have changed patch 0002 so that instead of logging WAL for each
invalidation, we now log the accumulated invalidations at each command
end, as discussed upthread [1].
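
For illustration, the shape of such a per-command-end invalidation
record might be roughly as below (the record type, struct, and caller
wiring are assumptions here, not necessarily the names used in the
patch):

    #include "postgres.h"
    #include "access/rmgr.h"         /* RM_XACT_ID */
    #include "access/xloginsert.h"
    #include "storage/sinval.h"      /* SharedInvalidationMessage */

    /* hypothetical WAL record header: a batch of invalidations */
    typedef struct xl_xact_invals
    {
        int         nmsgs;          /* number of messages that follow */
    } xl_xact_invals;

    #define XLOG_XACT_INVALIDATIONS 0x60    /* assumed info code */

    static void
    LogCommandEndInvalidations(SharedInvalidationMessage *msgs, int nmsgs)
    {
        xl_xact_invals xlrec;

        if (nmsgs == 0)
            return;                 /* nothing accumulated for this command */

        xlrec.nmsgs = nmsgs;

        XLogBeginInsert();
        XLogRegisterData((char *) &xlrec, sizeof(xl_xact_invals));
        XLogRegisterData((char *) msgs,
                         nmsgs * sizeof(SharedInvalidationMessage));
        (void) XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
    }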

We will evaluate the performance of this change soon and post the results.

[1] https://www.postgresql.org/message-id/CAA4eK1LOa%2B2KqNX%3Dm%3D1qMBDW%2Bo50AuwjAOX6ZqL-rWGiH1F9MQ%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
Hi,

I started looking at this patch series again, hoping to get it moving
for PG13. There's been a tremendous amount of work done since I last
worked on it, and a lot was discussed on this thread, so it'll take a
while to get familiar with the new code ...

The first thing I realized is that WAL-logging of assignments in v12
does both the "old" logging (using a dedicated message) and the "new"
way with the toplevel-XID embedded in the first message. Yes, the
patch was wrong,
because it eliminated all calls to ProcArrayApplyXidAssignment() and so
it was trivial to crash the replica due to KnownAssignedXids overflow.
But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
right fix.

I actually proposed doing this (having both ways to log assignments) so
that there's no regression risk with (wal_level < logical). But IIRC
Andres objected to it, arguing that we should not log the same piece
of information in two very different ways at the same time (IIRC it was
discussed at the FOSDEM dev meeting, so I don't have a link to share).
And I do agree with him ...

The question is, why couldn't the replica use the same assignment info
we already write for logical decoding? The main challenge is that now
the assignment can be sent in many different xlog messages, from a bunch
of resource managers (essentially, any xlog message with a xid can have
embedded XID of the toplevel xact). So the handling would either need to
happen in every rmgr, or we need to move it before we call the rmgr.

For example, we might do this in StartupXLOG() I think, per the
attached patch (FWIW this particular fix was written by Masahiko Sawada,
not me). This does the trick for me - I'm no longer able to reproduce
the KnownAssignedXids overflow.

The one difference is that we used to call ProcArrayApplyXidAssignment
for larger groups of XIDs, as sent in the assignment message. Now we
call it for each individual assignment. I don't know if this is an
issue, but I suppose we might introduce some sort of local caching
(accumulate the assignments into a local array, call the function only
when we have enough of them).
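
As a minimal sketch of such local caching (the array size and flush
points are assumptions; a real version would also need to flush at
suitable WAL boundaries):

    #include "postgres.h"
    #include "access/transam.h"      /* InvalidTransactionId */
    #include "storage/procarray.h"   /* ProcArrayApplyXidAssignment() */

    #define MAX_CACHED_ASSIGNMENTS  64      /* assumed batch size */

    static TransactionId cached_topxid = InvalidTransactionId;
    static TransactionId cached_subxids[MAX_CACHED_ASSIGNMENTS];
    static int  ncached_subxids = 0;

    static void
    FlushXidAssignments(void)
    {
        if (ncached_subxids > 0)
            ProcArrayApplyXidAssignment(cached_topxid,
                                        ncached_subxids, cached_subxids);
        ncached_subxids = 0;
    }

    static void
    CacheXidAssignment(TransactionId topxid, TransactionId subxid)
    {
        /* a different top-level xact, or a full batch, flushes the cache */
        if (cached_topxid != topxid ||
            ncached_subxids == MAX_CACHED_ASSIGNMENTS)
            FlushXidAssignments();

        cached_topxid = topxid;
        cached_subxids[ncached_subxids++] = subxid;
    }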

Aside from that, I think there's a minor bug in xact.c - the patch adds
a "assigned" field to TransactionStateData, but then it fails to add a
default value into TopTransactionStateData. We probably interpret NULL
as false, but then there's nothing for the pointer. I suspect it might
leave some random garbage there, leading to strange things later.

Another thing I noticed is LogicalDecodingProcessRecord() extracts the
toplevel XID using a macro

   txid = XLogRecGetTopXid(record);

but then it just starts accessing the fields directly again in the
ReorderBufferAssignChild call. I think we should do this instead:

     ReorderBufferAssignChild(ctx->reorder,
                              txid,
                 XLogRecGetXid(record),
                              buf.origptr);


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
D'oh! As usual I forgot to actually attach the patch I mentioned. So
here it is ...

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> Hi,
>
> I started looking at this patch series again, hoping to get it moving
> for PG13.

Nice.

 There's been a tremendous amount of work done since I last
> worked on it, and a lot was discussed on this thread, so it'll take a
> while to get familiar with the new code ...
>
> The first thing I realized that WAL-logging of assignments in v12 does
> both the "old" logging (using dedicated message) and "new" with
> toplevel-XID embedded in the first message. Yes, the patch was wrong,
> because it eliminated all calls to ProcArrayApplyXidAssignment() and so
> it was trivial to crash the replica due to KnownAssignedXids overflow.
> But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
> right fix.
>
> I actually proposed doing this (having both ways to log assignments) so
> that there's no regression risk with (wal_level < logical). But IIRC
> Andres objected to it, argumenting that we should not log the same piece
> of information in two very different ways at the same time (IIRC it was
> discussed on the FOSDEM dev meeting, so I don't have a link to share).
> And I do agree with him ...
>
> The question is, why couldn't the replica use the same assignment info
> we already write for logical decoding? The main challenge is that now
> the assignment can be sent in many different xlog messages, from a bunch
> of resource managers (essentially, any xlog message with a xid can have
> embedded XID of the toplevel xact). So the handling would either need to
> happen in every rmgr, or we need to move it before we call the rmgr.
>
> For exampple, we might do this e.g. in StartupXLOG() I think, per the
> attached patch (FWIW this particular fix was written by Masahiko Sawada,
> not me). This does the trick for me - I'm no longer able to reproduce
> the KnownAssignedXids overflow.
>
> The one difference is that we used to call ProcArrayApplyXidAssignment
> for larger groups of XIDs, as sent in the assignment message. Now we
> call it for each individual assignment. I don't know if this is an
> issue, but I suppose we might introduce some sort of local caching
> (accumulate the assignments into a local array, call the function only
> when we have enough of them).

Thanks for the pointers,  I will think over these points.

>
> Aside from that, I think there's a minor bug in xact.c - the patch adds
> a "assigned" field to TransactionStateData, but then it fails to add a
> default value into TopTransactionStateData. We probably interpret NULL
> as false, but then there's nothing for the pointer. I suspect it might
> leave some random garbage there, leading to strange things later.

Actually, we will never access that field for
TopTransactionStateData, right?
See the code below: we check IsSubTransaction() first, and only then
access the "assigned" field.

+bool
+IsSubTransactionAssignmentPending(void)
+{
+ if (!XLogLogicalInfoActive())
+ return false;
+
+ /* we need to be in a transaction state */
+ if (!IsTransactionState())
+ return false;
+
+ /* it has to be a subtransaction */
+ if (!IsSubTransaction())
+ return false;
+
+ /* the subtransaction has to have a XID assigned */
+ if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+ return false;
+
+ /* and it needs to have 'assigned' */
+ return !CurrentTransactionState->assigned;
+
+}

>
> Another thing I noticed is LogicalDecodingProcessRecord() extracts the
> toplevel XID using a macro
>
>    txid = XLogRecGetTopXid(record);
>
> but then it just starts accessing the fields directly again in the
> ReorderBufferAssignChild call. I think we should do this instead:
>
>      ReorderBufferAssignChild(ctx->reorder,
>                               txid,
>                              XLogRecGetXid(record),
>                               buf.origptr);

Makes sense.  I will change this in the patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> Hi,
>
> I started looking at this patch series again, hoping to get it moving
> for PG13.
>

It is good to keep moving this forward, but there are quite a few
problems with the design which need a broader discussion.  Some that
I recall are:
a. Handling of abort of concurrent transactions.  There is some code
in the patch which might work, but there is not much discussion when
it was posted.
b. Handling of partial tuples (while streaming, we came to know that
toast tuple is not complete or speculative insert is incomplete).  For
this also, we have proposed a few solutions which need further
discussion.  One of those is implemented in the patch series.
c. We might also need some handling for replication origins.
d. Try to minimize the performance overhead of WAL logging for
invalidations.  We discussed different solutions for this and
implemented one of those.
e. How to skip already streamed transactions.

There might be a few more which I can't recall now.  Apart from this,
I haven't done any detailed review of subscriber-side implementation
where we write streamed transactions to file.  All of this will need
much more discussion and review before we can say it is ready to
commit, so I thought it might be better to pick it up for PG14 and
focus on other things that have a better chance for PG13 especially
because all the problems were not solved/discussed before last CF.
However, it is a good idea to keep moving this and have a discussion
on some of these issues.

> There's been a tremendous amount of work done since I last
> worked on it, and a lot was discussed on this thread, so it'll take a
> while to get familiar with the new code ...
>
> The first thing I realized that WAL-logging of assignments in v12 does
> both the "old" logging (using dedicated message) and "new" with
> toplevel-XID embedded in the first message. Yes, the patch was wrong,
> because it eliminated all calls to ProcArrayApplyXidAssignment() and so
> it was trivial to crash the replica due to KnownAssignedXids overflow.
> But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
> right fix.
>
> I actually proposed doing this (having both ways to log assignments) so
> that there's no regression risk with (wal_level < logical). But IIRC
> Andres objected to it, argumenting that we should not log the same piece
> of information in two very different ways at the same time (IIRC it was
> discussed on the FOSDEM dev meeting, so I don't have a link to share).
> And I do agree with him ...
>

So, aren't we worried about the overhead of the amount of WAL and
performance impact for the transactions?  We might want to check the
pgbench read-write test to see if that will add any significant
overhead.

> The question is, why couldn't the replica use the same assignment info
> we already write for logical decoding?
>

I haven't thought about it in detail, but we can think on those lines
if the performance overhead is in the acceptable range.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Mar 4, 2020 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> > The first thing I realized that WAL-logging of assignments in v12 does
> > both the "old" logging (using dedicated message) and "new" with
> > toplevel-XID embedded in the first message. Yes, the patch was wrong,
> > because it eliminated all calls to ProcArrayApplyXidAssignment() and so
> > it was trivial to crash the replica due to KnownAssignedXids overflow.
> > But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
> > right fix.
> >
> > I actually proposed doing this (having both ways to log assignments) so
> > that there's no regression risk with (wal_level < logical). But IIRC
> > Andres objected to it, argumenting that we should not log the same piece
> > of information in two very different ways at the same time (IIRC it was
> > discussed on the FOSDEM dev meeting, so I don't have a link to share).
> > And I do agree with him ...
> >
>
> So, aren't we worried about the overhead of the amount of WAL and
> performance impact for the transactions?  We might want to check the
> pgbench read-write test to see if that will add any significant
> overhead.
>

I have briefly looked at the original patch and it seems the
additional overhead occurs only when subtransactions are involved, so
ideally it shouldn't impact default pgbench, but there is no harm in
checking.  It might be that we need to build a custom script with
subtransactions involved to measure the impact, but I think it is
worth doing.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Mar 4, 2020 at 2:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Mar 4, 2020 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
> > <tomas.vondra@2ndquadrant.com> wrote:
> > >
> > > The first thing I realized that WAL-logging of assignments in v12 does
> > > both the "old" logging (using dedicated message) and "new" with
> > > toplevel-XID embedded in the first message. Yes, the patch was wrong,
> > > because it eliminated all calls to ProcArrayApplyXidAssignment() and so
> > > it was trivial to crash the replica due to KnownAssignedXids overflow.
> > > But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
> > > right fix.
> > >
> > > I actually proposed doing this (having both ways to log assignments) so
> > > that there's no regression risk with (wal_level < logical). But IIRC
> > > Andres objected to it, argumenting that we should not log the same piece
> > > of information in two very different ways at the same time (IIRC it was
> > > discussed on the FOSDEM dev meeting, so I don't have a link to share).
> > > And I do agree with him ...
> > >
> >
> > So, aren't we worried about the overhead of the amount of WAL and
> > performance impact for the transactions?  We might want to check the
> > pgbench read-write test to see if that will add any significant
> > overhead.
> >
>
> I have briefly looked at the original patch and it seems the
> additional overhead is only when subtransactions are involved, so
> ideally, it shouldn't impact default pgbench, but there is no harm in
> checking.  It might be that we need to build a custom script with
> subtransactions involved to measure the impact, but I think it is
> worth checking

I agree.  I will test the same and post the results.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Wed, Mar 04, 2020 at 10:28:32AM +0530, Amit Kapila wrote:
>On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> Hi,
>>
>> I started looking at this patch series again, hoping to get it moving
>> for PG13.
>>
>
>It is good to keep moving this forward, but there are quite a few
>problems with the design which need a broader discussion.  Some of
>what I recall are:
>a. Handling of abort of concurrent transactions.  There is some code
>in the patch which might work, but there is not much discussion when
>it was posted.
>b. Handling of partial tuples (while streaming, we came to know that
>toast tuple is not complete or speculative insert is incomplete).  For
>this also, we have proposed a few solutions which need further
>discussion.  One of those is implemented in the patch series.
>c. We might also need some handling for replication origins.
>d. Try to minimize the performance overhead of WAL logging for
>invalidations.  We discussed different solutions for this and
>implemented one of those.
>e. How to skip already streamed transactions.
>
>There might be a few more which I can't recall now.  Apart from this,
>I haven't done any detailed review of subscriber-side implementation
>where we write streamed transactions to file.  All of this will need
>much more discussion and review before we can say it is ready to
>commit, so I thought it might be better to pick it up for PG14 and
>focus on other things that have a better chance for PG13 especially
>because all the problems were not solved/discussed before last CF.
>However, it is a good idea to keep moving this and have a discussion
>on some of these issues.
>

Sure, there's a lot to discuss. And it's possible (likely) it's not
feasible to get this into PG13. But I think it's still worth discussing
it, instead of just punting it into the next CF right away.

>> There's been a tremendous amount of work done since I last
>> worked on it, and a lot was discussed on this thread, so it'll take a
>> while to get familiar with the new code ...
>>
>> The first thing I realized that WAL-logging of assignments in v12 does
>> both the "old" logging (using dedicated message) and "new" with
>> toplevel-XID embedded in the first message. Yes, the patch was wrong,
>> because it eliminated all calls to ProcArrayApplyXidAssignment() and so
>> it was trivial to crash the replica due to KnownAssignedXids overflow.
>> But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
>> right fix.
>>
>> I actually proposed doing this (having both ways to log assignments) so
>> that there's no regression risk with (wal_level < logical). But IIRC
>> Andres objected to it, argumenting that we should not log the same piece
>> of information in two very different ways at the same time (IIRC it was
>> discussed on the FOSDEM dev meeting, so I don't have a link to share).
>> And I do agree with him ...
>>
>
>So, aren't we worried about the overhead of the amount of WAL and
>performance impact for the transactions?  We might want to check the
>pgbench read-write test to see if that will add any significant
>overhead.
>

Well, sure. I agree we need to see how this affects performance, and
I'll do some benchmarks (I think I did that when submitting the patch,
but I don't recall the numbers / details).

Isn't it a bit strange to log stuff twice, though, if we worry about
performance? Surely that's more expensive than logging it just once. Of
course, it might be useful if most systems need just the "old" way.

I know it's going to be a bit hand-wavy, but I think embedding the
assignments into existing WAL messages is about the cheapest way to log
this. I would not expect this to be measurably more expensive than what
we have now, but I might be wrong.

>> The question is, why couldn't the replica use the same assignment info
>> we already write for logical decoding?
>>
>
>I haven't thought about it in detail, but we can think on those lines
>if the performance overhead is in the acceptable range.
>

OK, let me do some measurements ...


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Wed, Mar 04, 2020 at 09:13:49AM +0530, Dilip Kumar wrote:
>On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> Hi,
>>
>> I started looking at this patch series again, hoping to get it moving
>> for PG13.
>
>Nice.
>
> There's been a tremendous amount of work done since I last
>> worked on it, and a lot was discussed on this thread, so it'll take a
>> while to get familiar with the new code ...
>>
>> The first thing I realized that WAL-logging of assignments in v12 does
>> both the "old" logging (using dedicated message) and "new" with
>> toplevel-XID embedded in the first message. Yes, the patch was wrong,
>> because it eliminated all calls to ProcArrayApplyXidAssignment() and so
>> it was trivial to crash the replica due to KnownAssignedXids overflow.
>> But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
>> right fix.
>>
>> I actually proposed doing this (having both ways to log assignments) so
>> that there's no regression risk with (wal_level < logical). But IIRC
>> Andres objected to it, argumenting that we should not log the same piece
>> of information in two very different ways at the same time (IIRC it was
>> discussed on the FOSDEM dev meeting, so I don't have a link to share).
>> And I do agree with him ...
>>
>> The question is, why couldn't the replica use the same assignment info
>> we already write for logical decoding? The main challenge is that now
>> the assignment can be sent in many different xlog messages, from a bunch
>> of resource managers (essentially, any xlog message with a xid can have
>> embedded XID of the toplevel xact). So the handling would either need to
>> happen in every rmgr, or we need to move it before we call the rmgr.
>>
>> For exampple, we might do this e.g. in StartupXLOG() I think, per the
>> attached patch (FWIW this particular fix was written by Masahiko Sawada,
>> not me). This does the trick for me - I'm no longer able to reproduce
>> the KnownAssignedXids overflow.
>>
>> The one difference is that we used to call ProcArrayApplyXidAssignment
>> for larger groups of XIDs, as sent in the assignment message. Now we
>> call it for each individual assignment. I don't know if this is an
>> issue, but I suppose we might introduce some sort of local caching
>> (accumulate the assignments into a local array, call the function only
>> when we have enough of them).
>
>Thanks for the pointers,  I will think over these points.
>
>>
>> Aside from that, I think there's a minor bug in xact.c - the patch adds
>> a "assigned" field to TransactionStateData, but then it fails to add a
>> default value into TopTransactionStateData. We probably interpret NULL
>> as false, but then there's nothing for the pointer. I suspect it might
>> leave some random garbage there, leading to strange things later.
>
>Actually, we will never access that field for the
>TopTransactionStateData, right?
>See below code,  we have a check that only if IsSubTransaction(), then
>we access the "assigned" filed.
>
>+bool
>+IsSubTransactionAssignmentPending(void)
>+{
>+ if (!XLogLogicalInfoActive())
>+ return false;
>+
>+ /* we need to be in a transaction state */
>+ if (!IsTransactionState())
>+ return false;
>+
>+ /* it has to be a subtransaction */
>+ if (!IsSubTransaction())
>+ return false;
>+
>+ /* the subtransaction has to have a XID assigned */
>+ if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
>+ return false;
>+
>+ /* and it needs to have 'assigned' */
>+ return !CurrentTransactionState->assigned;
>+
>+}
>

The problem is not with the "assigned" field, really. AFAICS we probably
initialize it to false because we interpret NULL as false. My concern
was that we essentially leave the last pointer uninitialized. That
seems like a bug, though I'm not sure it breaks anything in practice.

>>
>> Another thing I noticed is LogicalDecodingProcessRecord() extracts the
>> toplevel XID using a macro
>>
>>    txid = XLogRecGetTopXid(record);
>>
>> but then it just starts accessing the fields directly again in the
>> ReorderBufferAssignChild call. I think we should do this instead:
>>
>>      ReorderBufferAssignChild(ctx->reorder,
>>                               txid,
>>                              XLogRecGetXid(record),
>>                               buf.origptr);
>
>Make sense.  I will change this in the patch.
>

+1, thanks


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Mar 5, 2020 at 11:20 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Wed, Mar 04, 2020 at 10:28:32AM +0530, Amit Kapila wrote:
> >
>
> Sure, there's a lot to discuss. And it's possible (likely) it's not
> feasible to get this into PG13. But I think it's still worth discussing
> it, instead of just punting it into the next CF right away.
>

That makes sense to me.

> >> There's been a tremendous amount of work done since I last
> >> worked on it, and a lot was discussed on this thread, so it'll take a
> >> while to get familiar with the new code ...
> >>
> >> The first thing I realized that WAL-logging of assignments in v12 does
> >> both the "old" logging (using dedicated message) and "new" with
> >> toplevel-XID embedded in the first message. Yes, the patch was wrong,
> >> because it eliminated all calls to ProcArrayApplyXidAssignment() and so
> >> it was trivial to crash the replica due to KnownAssignedXids overflow.
> >> But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
> >> right fix.
> >>
> >> I actually proposed doing this (having both ways to log assignments) so
> >> that there's no regression risk with (wal_level < logical). But IIRC
> >> Andres objected to it, argumenting that we should not log the same piece
> >> of information in two very different ways at the same time (IIRC it was
> >> discussed on the FOSDEM dev meeting, so I don't have a link to share).
> >> And I do agree with him ...
> >>
> >
> >So, aren't we worried about the overhead of the amount of WAL and
> >performance impact for the transactions?  We might want to check the
> >pgbench read-write test to see if that will add any significant
> >overhead.
> >
>
> Well, sure. I agree we need to see how this affects performance, and
> I'll do some benchmarks (I think I did that when submitting the patch,
> but I don't recall the numbers / details).
>
> Isn't it a bit strange to log stuff twice, though, if we worry about
> performance? Surely that's more expensive than logging it just once. Of
> course, it might be useful if most systems need just the "old" way.
>
> I know it's going to be a bit hand-wavy, but I think embedding the
> assignments into existing WAL messages is about the cheapest way to log
> this. I would not expect this to be mesurably more expensive than what
> we have now, but I might be wrong.
>

I agree that this shouldn't be very expensive, but it is better to be
sure in that regard.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Mar 4, 2020 at 9:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> >
> > The first thing I realized that WAL-logging of assignments in v12 does
> > both the "old" logging (using dedicated message) and "new" with
> > toplevel-XID embedded in the first message. Yes, the patch was wrong,
> > because it eliminated all calls to ProcArrayApplyXidAssignment() and so
> > it was trivial to crash the replica due to KnownAssignedXids overflow.
> > But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
> > right fix.
> >
> > I actually proposed doing this (having both ways to log assignments) so
> > that there's no regression risk with (wal_level < logical). But IIRC
> > Andres objected to it, argumenting that we should not log the same piece
> > of information in two very different ways at the same time (IIRC it was
> > discussed on the FOSDEM dev meeting, so I don't have a link to share).
> > And I do agree with him ...
> >
> > The question is, why couldn't the replica use the same assignment info
> > we already write for logical decoding? The main challenge is that now
> > the assignment can be sent in many different xlog messages, from a bunch
> > of resource managers (essentially, any xlog message with a xid can have
> > embedded XID of the toplevel xact). So the handling would either need to
> > happen in every rmgr, or we need to move it before we call the rmgr.
> >
> > For exampple, we might do this e.g. in StartupXLOG() I think, per the
> > attached patch (FWIW this particular fix was written by Masahiko Sawada,
> > not me). This does the trick for me - I'm no longer able to reproduce
> > the KnownAssignedXids overflow.
> >
> > The one difference is that we used to call ProcArrayApplyXidAssignment
> > for larger groups of XIDs, as sent in the assignment message. Now we
> > call it for each individual assignment. I don't know if this is an
> > issue, but I suppose we might introduce some sort of local caching
> > (accumulate the assignments into a local array, call the function only
> > when we have enough of them).
>
> Thanks for the pointers,  I will think over these points.
>

I have looked at the proposed solution and I would like to share my
findings.  I think calling ProcArrayApplyXidAssignment for each
subtransaction is not a good idea for a couple of reasons:
(a) It will just defeat the purpose of maintaining the KnownAssignedXids
array, which is to avoid looking at pg_subtrans in
TransactionIdIsInProgress() on the standby.  Basically, if we remove it
for each subXid, it will consider KnownAssignedXids to be
overflowed and check pg_subtrans frequently.
(b) Calling ProcArrayApplyXidAssignment() for each subtransaction can
be costly from the perspective of concurrency because it acquires
ProcArrayLock in Exclusive mode, so concurrently running transactions
might start blocking on this lock.  Also, I see that
SubTransSetParent() makes the page dirty, so it might lead to more
writes if we spread out setting it by calling it separately for each
sub-transaction.

Apart from this, I don't see how the proposed fix is correct, because
as far as I can see it tries to remove the Xid before we even record
it via RecordKnownAssignedTransactionIds().  It seems that after the
patch, RecordKnownAssignedTransactionIds() will be called after
ProcArrayApplyXidAssignment(); how could that be correct?

Thoughts?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Sat, Mar 28, 2020 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Mar 4, 2020 at 9:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Mar 4, 2020 at 3:16 AM Tomas Vondra
> > <tomas.vondra@2ndquadrant.com> wrote:
> > >
> > >
> > > The first thing I realized that WAL-logging of assignments in v12 does
> > > both the "old" logging (using dedicated message) and "new" with
> > > toplevel-XID embedded in the first message. Yes, the patch was wrong,
> > > because it eliminated all calls to ProcArrayApplyXidAssignment() and so
> > > it was trivial to crash the replica due to KnownAssignedXids overflow.
> > > But I don't think re-introducing XLOG_XACT_ASSIGNMENT message is the
> > > right fix.
> > >
> > > I actually proposed doing this (having both ways to log assignments) so
> > > that there's no regression risk with (wal_level < logical). But IIRC
> > > Andres objected to it, argumenting that we should not log the same piece
> > > of information in two very different ways at the same time (IIRC it was
> > > discussed on the FOSDEM dev meeting, so I don't have a link to share).
> > > And I do agree with him ...
> > >
> > > The question is, why couldn't the replica use the same assignment info
> > > we already write for logical decoding? The main challenge is that now
> > > the assignment can be sent in many different xlog messages, from a bunch
> > > of resource managers (essentially, any xlog message with a xid can have
> > > embedded XID of the toplevel xact). So the handling would either need to
> > > happen in every rmgr, or we need to move it before we call the rmgr.
> > >
> > > For exampple, we might do this e.g. in StartupXLOG() I think, per the
> > > attached patch (FWIW this particular fix was written by Masahiko Sawada,
> > > not me). This does the trick for me - I'm no longer able to reproduce
> > > the KnownAssignedXids overflow.
> > >
> > > The one difference is that we used to call ProcArrayApplyXidAssignment
> > > for larger groups of XIDs, as sent in the assignment message. Now we
> > > call it for each individual assignment. I don't know if this is an
> > > issue, but I suppose we might introduce some sort of local caching
> > > (accumulate the assignments into a local array, call the function only
> > > when we have enough of them).
> >
> > Thanks for the pointers,  I will think over these points.
> >
>
> I have looked at the solution proposed and I would like to share my
> findings.  I think calling ProcArrayApplyXidAssignment for each
> subtransaction is not a good idea for a couple of reasons:
> (a) It will just beat the purpose of maintaining KnowAssignedXids
> array which is to avoid looking at pg_subtrans in
> TransactionIdIsInProgress() on standby.  Basically, if we remove it
> for each subXid, it will consider the KnowAssignedXids to be
> overflowed and check pg_subtrans frequently.

Right, I also think this is a problem with this solution.  I think we
may try to avoid this by caching the information.  But then we will
have to maintain this in some multi-dimensional array which stores
sub-transaction ids per top transaction, or we can maintain a list of
sub-transactions for each transaction.  I haven't thought about how
much complexity this solution will add.

> (b)  Calling ProcArrayApplyXidAssignment() for each subtransaction can
> be costly from the perspective of concurrency because it acquires
> ProcArrayLock in Exclusive mode, so concurrently running transactions
> might start blocking at this lock.

Right

 Also, I see that
> SubTransSetParent() makes the page dirty, so it might lead to more
> writes if we spread out setting that by calling it separately for each
> sub-transaction.

Right.

>
> Apart from this, I don't see how the proposed fix is correct because
> as far as I can see it tries to remove the Xid before we even record
> it via RecordKnownAssignedTransactionIds().  It seems after patch
> RecordKnownAssignedTransactionIds() will be called after
> ProcArrayApplyXidAssignment(), how could that be correct.

Valid point.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Sat, Mar 28, 2020 at 2:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Mar 28, 2020 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > I have looked at the solution proposed and I would like to share my
> > findings.  I think calling ProcArrayApplyXidAssignment for each
> > subtransaction is not a good idea for a couple of reasons:
> > (a) It will just beat the purpose of maintaining KnowAssignedXids
> > array which is to avoid looking at pg_subtrans in
> > TransactionIdIsInProgress() on standby.  Basically, if we remove it
> > for each subXid, it will consider the KnowAssignedXids to be
> > overflowed and check pg_subtrans frequently.
>
> Right, I also think this is a problem with this solution.  I think we
> may try to avoid this by caching this information.  But, then we will
> have to maintain this in some dimensional array which stores
> sub-transaction ids per top transaction or we can maintain a list of
> sub-transaction for each transaction.  I haven't thought about how
> much complexity this solution will add.
>

How about if, instead of writing an XLOG_XACT_ASSIGNMENT WAL record, we
set a flag in TransactionStateData and then log that as special
information whenever we write the next WAL record for a new
subtransaction?  Then during recovery, we call
ProcArrayApplyXidAssignment only when we find that special flag set in
a WAL record.  One idea could be to use a flag bit in
XLogRecord.xl_info.  If that is feasible, then the solution can work as
it is now, without any overhead or change in the way we maintain
KnownAssignedXids.
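
Roughly, the idea might look like the fragment below (the bit value, its
availability in xl_info, and the surrounding recovery code are
assumptions for illustration):

    /* hypothetical spare generic bit in XLogRecord.xl_info */
    #define XLR_SUBXACT_ASSIGNMENT  0x04

    /* in the recovery loop, before handing the record to the rmgr: */
    if (XLogRecGetInfo(xlogreader) & XLR_SUBXACT_ASSIGNMENT)
    {
        TransactionId subxid = XLogRecGetXid(xlogreader);
        TransactionId topxid = XLogRecGetTopXid(xlogreader);  /* from the patch */

        /* fold this subxact assignment into the procarray */
        ProcArrayApplyXidAssignment(topxid, 1, &subxid);
    }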

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Sat, Mar 28, 2020 at 03:29:34PM +0530, Amit Kapila wrote:
>On Sat, Mar 28, 2020 at 2:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> On Sat, Mar 28, 2020 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >
>> >
>> > I have looked at the solution proposed and I would like to share my
>> > findings.  I think calling ProcArrayApplyXidAssignment for each
>> > subtransaction is not a good idea for a couple of reasons:
>> > (a) It will just beat the purpose of maintaining KnowAssignedXids
>> > array which is to avoid looking at pg_subtrans in
>> > TransactionIdIsInProgress() on standby.  Basically, if we remove it
>> > for each subXid, it will consider the KnowAssignedXids to be
>> > overflowed and check pg_subtrans frequently.
>>
>> Right, I also think this is a problem with this solution.  I think we
>> may try to avoid this by caching this information.  But, then we will
>> have to maintain this in some dimensional array which stores
>> sub-transaction ids per top transaction or we can maintain a list of
>> sub-transaction for each transaction.  I haven't thought about how
>> much complexity this solution will add.
>>
>
>How about if instead of writing an XLOG_XACT_ASSIGNMENT WAL, we set a
>flag in TransactionStateData and then log that as special information
>whenever we write next WAL record for a new subtransaction?  Then
>during recovery, we can only call ProcArrayApplyXidAssignment when we
>find that special flag is set in a WAL record.  One idea could be to
>use a flag bit in XLogRecord.xl_info.  If that is feasible then the
>solution can work as it is now, without any overhead or change in the
>way we maintain KnownAssignedXids.
>

Ummm, how is that different from what the patch is doing now? I mean, we
only write the top-level XID for the first WAL record in each subxact,
right? Or what would be the difference with your approach?

Anyway, I think you're right that the ProcArrayApplyXidAssignment call
was done too early, but I think that can be fixed by moving it to after
the RecordKnownAssignedTransactionIds call, no? Essentially, right
before rm_redo().

You're right that calling ProcArrayApplyXidAssignment() may be an issue,
because it exclusively acquires the ProcArrayLock. I actually hinted
that might be an issue in my original message, suggesting we might add a
local cache of assigned XIDs (a small static array, doing essentially
the same thing we used to do on the upstream node). I haven't done that
in my WIP patch to keep it simple, but AFAICS it'd work.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Sun, Mar 29, 2020 at 6:29 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Sat, Mar 28, 2020 at 03:29:34PM +0530, Amit Kapila wrote:
> >On Sat, Mar 28, 2020 at 2:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> >How about if instead of writing an XLOG_XACT_ASSIGNMENT WAL, we set a
> >flag in TransactionStateData and then log that as special information
> >whenever we write next WAL record for a new subtransaction?  Then
> >during recovery, we can only call ProcArrayApplyXidAssignment when we
> >find that special flag is set in a WAL record.  One idea could be to
> >use a flag bit in XLogRecord.xl_info.  If that is feasible then the
> >solution can work as it is now, without any overhead or change in the
> >way we maintain KnownAssignedXids.
> >
>
> Ummm, how is that different from what the patch is doing now? I mean, we
> only write the top-level XID for the first WAL record in each subxact,
> right? Or what would be the difference with your approach?
>

We have to do what the patch is currently doing, and additionally we
will set this flag after PGPROC_MAX_CACHED_SUBXIDS subxacts, which
would allow us to call ProcArrayApplyXidAssignment during WAL replay
only after PGPROC_MAX_CACHED_SUBXIDS subxacts.  It will help us clear
the KnownAssignedXids at the same time as we do now, so there is no
additional performance overhead.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Sun, Mar 29, 2020 at 11:19:21AM +0530, Amit Kapila wrote:
>On Sun, Mar 29, 2020 at 6:29 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Sat, Mar 28, 2020 at 03:29:34PM +0530, Amit Kapila wrote:
>> >On Sat, Mar 28, 2020 at 2:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> >
>> >How about if instead of writing an XLOG_XACT_ASSIGNMENT WAL, we set a
>> >flag in TransactionStateData and then log that as special information
>> >whenever we write next WAL record for a new subtransaction?  Then
>> >during recovery, we can only call ProcArrayApplyXidAssignment when we
>> >find that special flag is set in a WAL record.  One idea could be to
>> >use a flag bit in XLogRecord.xl_info.  If that is feasible then the
>> >solution can work as it is now, without any overhead or change in the
>> >way we maintain KnownAssignedXids.
>> >
>>
>> Ummm, how is that different from what the patch is doing now? I mean, we
>> only write the top-level XID for the first WAL record in each subxact,
>> right? Or what would be the difference with your approach?
>>
>
>We have to do what the patch is currently doing and additionally, we
>will set this flag after PGPROC_MAX_CACHED_SUBXIDS which would allow
>us to call ProcArrayApplyXidAssignment during WAL replay only after
>PGPROC_MAX_CACHED_SUBXIDS number of subxacts.  It will help us in
>clearing the KnownAssignedXids at the same time as we do now, so no
>additional performance overhead.
>

Hmmm. So we'd still log assignment twice? Or would we keep just the
immediate assignments (embedded into xlog records), and cache the
subxids on the replica somehow?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Sun, Mar 29, 2020 at 9:01 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Sun, Mar 29, 2020 at 11:19:21AM +0530, Amit Kapila wrote:
> >On Sun, Mar 29, 2020 at 6:29 AM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> Ummm, how is that different from what the patch is doing now? I mean, we
> >> only write the top-level XID for the first WAL record in each subxact,
> >> right? Or what would be the difference with your approach?
> >>
> >
> >We have to do what the patch is currently doing and additionally, we
> >will set this flag after PGPROC_MAX_CACHED_SUBXIDS which would allow
> >us to call ProcArrayApplyXidAssignment during WAL replay only after
> >PGPROC_MAX_CACHED_SUBXIDS number of subxacts.  It will help us in
> >clearing the KnownAssignedXids at the same time as we do now, so no
> >additional performance overhead.
> >
>
> Hmmm. So we'd still log assignment twice? Or would we keep just the
> immediate assignments (embedded into xlog records), and cache the
> subxids on the replica somehow?
>

I think we need to cache the subxids on the replica somehow, but I
don't have a very good idea for it.  Basically, there are two ways to
do it: (a) change KnownAssignedXids in some way so that we can
easily find this information without losing its current benefits.
I can't think of a good way to do that, and even if we come up
with something, it could easily be a lot of work; (b) cache the
subxids for a particular transaction in local memory along with
KnownAssignedXids.  This is doable, but then we have two data
structures (one in shared memory and the other in local memory)
managing the same information in different ways.

Do you have any other ideas?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
On Mon, Mar 30, 2020 at 11:47:57AM +0530, Amit Kapila wrote:
>On Sun, Mar 29, 2020 at 9:01 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Sun, Mar 29, 2020 at 11:19:21AM +0530, Amit Kapila wrote:
>> >On Sun, Mar 29, 2020 at 6:29 AM Tomas Vondra
>> ><tomas.vondra@2ndquadrant.com> wrote:
>> >>
>> >> Ummm, how is that different from what the patch is doing now? I mean, we
>> >> only write the top-level XID for the first WAL record in each subxact,
>> >> right? Or what would be the difference with your approach?
>> >>
>> >
>> >We have to do what the patch is currently doing and additionally, we
>> >will set this flag after PGPROC_MAX_CACHED_SUBXIDS which would allow
>> >us to call ProcArrayApplyXidAssignment during WAL replay only after
>> >PGPROC_MAX_CACHED_SUBXIDS number of subxacts.  It will help us in
>> >clearing the KnownAssignedXids at the same time as we do now, so no
>> >additional performance overhead.
>> >
>>
>> Hmmm. So we'd still log assignment twice? Or would we keep just the
>> immediate assignments (embedded into xlog records), and cache the
>> subxids on the replica somehow?
>>
>
>I think we need to cache the subxids on the replica somehow but I
>don't have a very good idea for it.  Basically, there are two ways to
>do it (a) Change the KnownAssignedXids in some way so that we can
>easily find this information without losing on the current benefits of
>it.  I can't think of a good way to do that and even if we come up
>with something, it could easily be a lot of work, (b) Cache the
>subxids for a particular transaction in local memory along with
>KnownAssignedXids.  This is doable but now we have two data-structures
>(one in shared memory and other in local memory) managing the same
>information in different ways.
>
>Do you have any other ideas?

I don't follow. Why couldn't we have a simple cache on the standby? It
could be either a simple array or a hash table (with the top-level xid
as the hash key).

I think a single array would be sufficient, but the hash table would
allow keeping the apply logic more or less as it is today. See the
attached patch that adds such a cache - I do admit I haven't tested
this, but hopefully it's a sufficient illustration of the idea.

It does not handle cleanup of the cache, but I think that should not be
difficult - we simply need to remove entries for transactions that got
committed or rolled back. And do something about transactions without an
explicit commit/rollback record, but that can be done by also handling
XLOG_RUNNING_XACTS (by removing anything preceding oldestRunningXid).
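
For illustration only, here is a minimal sketch of what such a cache
might look like, using the backend's dynahash API. This is not the
attached patch; names like SubXidCacheEntry and the fixed-size subxid
array are made up just to keep the sketch short:

#include "postgres.h"

#include "storage/proc.h"		/* PGPROC_MAX_CACHED_SUBXIDS */
#include "utils/hsearch.h"

/* one entry per top-level transaction (fixed-size array just for the sketch) */
typedef struct SubXidCacheEntry
{
	TransactionId toplevel_xid;	/* hash key */
	int			nsubxids;
	TransactionId subxids[PGPROC_MAX_CACHED_SUBXIDS];
} SubXidCacheEntry;

static HTAB *SubXidCache = NULL;

static void
SubXidCacheInit(void)
{
	HASHCTL		ctl;

	memset(&ctl, 0, sizeof(ctl));
	ctl.keysize = sizeof(TransactionId);
	ctl.entrysize = sizeof(SubXidCacheEntry);

	SubXidCache = hash_create("standby subxid assignment cache", 64, &ctl,
							  HASH_ELEM | HASH_BLOBS);
}

/* remember that subxid belongs to toplevel_xid, as decoded from a WAL record */
static void
SubXidCacheAdd(TransactionId toplevel_xid, TransactionId subxid)
{
	SubXidCacheEntry *entry;
	bool		found;

	entry = hash_search(SubXidCache, &toplevel_xid, HASH_ENTER, &found);
	if (!found)
		entry->nsubxids = 0;

	if (entry->nsubxids < PGPROC_MAX_CACHED_SUBXIDS)
		entry->subxids[entry->nsubxids++] = subxid;
}

/* drop the entry once the top-level transaction commits or aborts */
static void
SubXidCacheRemove(TransactionId toplevel_xid)
{
	(void) hash_search(SubXidCache, &toplevel_xid, HASH_REMOVE, NULL);
}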

I don't think this is particularly complicated or a lot of code, and I
don't see why it would require data structures in shared memory. Only
the walreceiver on the standby needs to worry about this, no?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Mar 30, 2020 at 8:58 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Mon, Mar 30, 2020 at 11:47:57AM +0530, Amit Kapila wrote:
> >
> >I think we need to cache the subxids on the replica somehow but I
> >don't have a very good idea for it.  Basically, there are two ways to
> >do it (a) Change the KnownAssignedXids in some way so that we can
> >easily find this information without losing on the current benefits of
> >it.  I can't think of a good way to do that and even if we come up
> >with something, it could easily be a lot of work, (b) Cache the
> >subxids for a particular transaction in local memory along with
> >KnownAssignedXids.  This is doable but now we have two data-structures
> >(one in shared memory and other in local memory) managing the same
> >information in different ways.
> >
> >Do you have any other ideas?
>
> I don't follow. Why couldn't we have a simple cache on the standby? It
> could be either a simple array or a hash table (with the top-level xid
> as hash key)?
>

I think having something like we discussed, or what you have in the
patch, won't be sufficient to clean the KnownAssignedXids array. The
point is that we won't write a WAL record for the xid-subxid
association of unlogged relations in the "Immediately WAL-log
assignments" patch; however, KnownAssignedXids would have both kinds
of Xids, as we autofill it with gaps (see
RecordKnownAssignedTransactionIds).  If my understanding is correct,
to make it work we might need major surgery in the code, or we would
have to maintain the KnownAssignedXids array differently.

>
> I don't think this is particularly complicated or a lot of code, and I
> don't see why would it require data structures in shared memory. Only
> the walreceiver on standby needs to worry about this, no?
>

Not a new data structure in shared memory, but we already have the
KnownTransactionId structure in shared memory. So, after having a
local cache, we will have xidAssignmentsHash and KnownTransactionId
maintaining the same information in different ways, and we need to
ensure both are cleaned up properly. That is what I was pointing out
above about maintaining two structures.  However, I think before
discussing this further, we need to think about the above problem.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Tomas Vondra
Дата:
On Tue, Apr 07, 2020 at 12:17:44PM +0530, Amit Kapila wrote:
>On Mon, Mar 30, 2020 at 8:58 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Mon, Mar 30, 2020 at 11:47:57AM +0530, Amit Kapila wrote:
>> >
>> >I think we need to cache the subxids on the replica somehow but I
>> >don't have a very good idea for it.  Basically, there are two ways to
>> >do it (a) Change the KnownAssignedXids in some way so that we can
>> >easily find this information without losing on the current benefits of
>> >it.  I can't think of a good way to do that and even if we come up
>> >with something, it could easily be a lot of work, (b) Cache the
>> >subxids for a particular transaction in local memory along with
>> >KnownAssignedXids.  This is doable but now we have two data-structures
>> >(one in shared memory and other in local memory) managing the same
>> >information in different ways.
>> >
>> >Do you have any other ideas?
>>
>> I don't follow. Why couldn't we have a simple cache on the standby? It
>> could be either a simple array or a hash table (with the top-level xid
>> as hash key)?
>>
>
>I think having something like we discussed or what you have in the
>patch won't be sufficient to clean the KnownAssignedXid array. The
>point is that we won't write a WAL for xid-subxid association for
>unlogged relations in the "Immediately WAL-log assignments" patch,
>however, the KnownAssignedXid would have both kinds of Xids as we
>autofill it with gaps (see RecordKnownAssignedTransactionIds).  I
>think if my understanding is correct to make it work we might need
>major surgery in the code or have to maintain KnownAssignedXid array
>differently.

Hmm, that's a good point. If I understand correctly, the issue is
that if we create a new subxact, write something into an unlogged
table, and then create another subxact, the XID of the first subxact
will be "known assigned" but we won't know that it's a subxact, or to
which parent xact it belongs (because there will be no WAL records
that could encode it).

I wonder if there's a simple solution (e.g. when creating the second
subxact we might notice the xid-subxid assignment was not logged, and
write some "dummy" WAL record). But I admit it seems a bit ugly.

>>
>> I don't think this is particularly complicated or a lot of code, and I
>> don't see why would it require data structures in shared memory. Only
>> the walreceiver on standby needs to worry about this, no?
>>
>
>Not a new data structure in shared memory, but we already have a
>KnownTransactionId structure in shared memory. So, after having a
>local cache, we will have xidAssignmentsHash and KnownTransactionId
>maintaining the same information in different ways.  And, we need to
>ensure both are cleaned up properly. That was what I was pointing
>above related to maintaining two structures.  However, I think before
>discussing more on this, we need to think about the above problem.
>

Sure.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Wed, Apr 8, 2020 at 6:29 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Apr 07, 2020 at 12:17:44PM +0530, Amit Kapila wrote:
> >On Mon, Mar 30, 2020 at 8:58 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> On Mon, Mar 30, 2020 at 11:47:57AM +0530, Amit Kapila wrote:
> >> >
> >> >I think we need to cache the subxids on the replica somehow but I
> >> >don't have a very good idea for it.  Basically, there are two ways to
> >> >do it (a) Change the KnownAssignedXids in some way so that we can
> >> >easily find this information without losing on the current benefits of
> >> >it.  I can't think of a good way to do that and even if we come up
> >> >with something, it could easily be a lot of work, (b) Cache the
> >> >subxids for a particular transaction in local memory along with
> >> >KnownAssignedXids.  This is doable but now we have two data-structures
> >> >(one in shared memory and other in local memory) managing the same
> >> >information in different ways.
> >> >
> >> >Do you have any other ideas?
> >>
> >> I don't follow. Why couldn't we have a simple cache on the standby? It
> >> could be either a simple array or a hash table (with the top-level xid
> >> as hash key)?
> >>
> >
> >I think having something like we discussed or what you have in the
> >patch won't be sufficient to clean the KnownAssignedXid array. The
> >point is that we won't write a WAL for xid-subxid association for
> >unlogged relations in the "Immediately WAL-log assignments" patch,
> >however, the KnownAssignedXid would have both kinds of Xids as we
> >autofill it with gaps (see RecordKnownAssignedTransactionIds).  I
> >think if my understanding is correct to make it work we might need
> >major surgery in the code or have to maintain KnownAssignedXid array
> >differently.
>
> Hmm, that's a good point. If I understand correctly, the issue is
> that if we create new subxact, write something into an unlogged table,
> and then create new subxact, the XID of the first subxact will be "known
> assigned" but we won't know it's a subxact or to which parent xact it
> belongs (because there will be no WAL records that could encode it).
>
> I wonder if there's a simple solution (e.g. when creating the second
> subxact we might notice the xid-subxid assignment was not logged, and
> write some "dummy" WAL record). But I admit it seems a bit ugly.
>
> >>
> >> I don't think this is particularly complicated or a lot of code, and I
> >> don't see why would it require data structures in shared memory. Only
> >> the walreceiver on standby needs to worry about this, no?
> >>
> >
> >Not a new data structure in shared memory, but we already have a
> >KnownTransactionId structure in shared memory. So, after having a
> >local cache, we will have xidAssignmentsHash and KnownTransactionId
> >maintaining the same information in different ways.  And, we need to
> >ensure both are cleaned up properly. That was what I was pointing
> >above related to maintaining two structures.  However, I think before
> >discussing more on this, we need to think about the above problem.

I have rebased the patch on the latest head.  I haven't yet changed
anything for the xid assignment thing because that discussion is not
yet concluded.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Kuntal Ghosh
Дата:
On Thu, Apr 9, 2020 at 2:40 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have rebased the patch on the latest head.  I haven't yet changed
> anything for xid assignment thing because it is not yet concluded.
>
Some review comments from 0001-Immediately-WAL-log-*.patch,

+bool
+IsSubTransactionAssignmentPending(void)
+{
+ if (!XLogLogicalInfoActive())
+ return false;
+
+ /* we need to be in a transaction state */
+ if (!IsTransactionState())
+ return false;
+
+ /* it has to be a subtransaction */
+ if (!IsSubTransaction())
+ return false;
+
+ /* the subtransaction has to have a XID assigned */
+ if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+ return false;
+
+ /* and it needs to have 'assigned' */
+ return !CurrentTransactionState->assigned;
+
+}
IMHO, it's important to reduce the complexity of this function since
it's called for every WAL insertion. During the lifespan of a
transaction, any of these if conditions will only be evaluated if the
previous conditions are true. So, we could maintain some state machine
to avoid evaluating a condition multiple times inside a transaction.
But if the overhead is not much, it's probably not worth it.

+#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
This looks wrong. We should change the name of this macro, or we can
add the 1 byte directly to HEADER_SCRATCH_SIZE with some comments.

@@ -195,6 +197,10 @@ XLogResetInsertion(void)
 {
  int i;

+ /* reset the subxact assignment flag (if needed) */
+ if (curinsert_flags & XLOG_INCLUDE_XID)
+ MarkSubTransactionAssigned();
The comment looks contradictory.

 XLogSetRecordFlags(uint8 flags)
 {
  Assert(begininsert_called);
- curinsert_flags = flags;
+ curinsert_flags |= flags;
 }
 I didn't understand why we need this change in this patch.

+ txid = XLogRecGetTopXid(record);
+
+ /*
+ * If the toplevel_xid is valid, we need to assign the subxact to the
+ * toplevel transaction. We need to do this for all records, hence we
+ * do it before the switch.
+ */
s/toplevel_xid/toplevel xid or s/toplevel_xid/txid

  if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
- info != XLOG_XACT_ASSIGNMENT)
+ !TransactionIdIsValid(r->toplevel_xid))
Perhaps, XLogRecGetTopXid() can be used.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
> On Thu, Apr 9, 2020 at 2:40 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have rebased the patch on the latest head.  I haven't yet changed
> > anything for xid assignment thing because it is not yet concluded.
> >
> Some review comments from 0001-Immediately-WAL-log-*.patch,
>
> +bool
> +IsSubTransactionAssignmentPending(void)
> +{
> + if (!XLogLogicalInfoActive())
> + return false;
> +
> + /* we need to be in a transaction state */
> + if (!IsTransactionState())
> + return false;
> +
> + /* it has to be a subtransaction */
> + if (!IsSubTransaction())
> + return false;
> +
> + /* the subtransaction has to have a XID assigned */
> + if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
> + return false;
> +
> + /* and it needs to have 'assigned' */
> + return !CurrentTransactionState->assigned;
> +
> +}
> IMHO, it's important to reduce the complexity of this function since
> it's been called for every WAL insertion. During the lifespan of a
> transaction, any of these if conditions will only be evaluated if
> previous conditions are true. So, we can maintain some state machine
> to avoid multiple evaluation of a condition inside a transaction. But,
> if the overhead is not much, it's not worth I guess.

Yeah, maybe in some cases we can avoid checking multiple conditions by
maintaining that state.  But that state would have to be at the
transaction level, and I am not sure it is worth adding one extra
condition just to skip a few if checks; it would also add code
complexity.  And in cases where logical decoding is not enabled, it
may even add one extra check - I mean, we first check the state, and
that only takes us to the first if check anyway.

>
> +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> This looks wrong. We should change the name of this Macro or we can
> add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments.

I think this is in sync with the code below (SizeOfXlogOrigin), so it
doesn't make much sense to use different terminology, no?
#define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char))
+#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))

>
> @@ -195,6 +197,10 @@ XLogResetInsertion(void)
>  {
>   int i;
>
> + /* reset the subxact assignment flag (if needed) */
> + if (curinsert_flags & XLOG_INCLUDE_XID)
> + MarkSubTransactionAssigned();
> The comment looks contradictory.
>
>  XLogSetRecordFlags(uint8 flags)
>  {
>   Assert(begininsert_called);
> - curinsert_flags = flags;
> + curinsert_flags |= flags;
>  }
>  I didn't understand why we need this change in this patch.

I think it was changed so that the code below could use it, but we set
the flag directly there.  I will change it in the next version.

@@ -748,6 +754,18 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
  scratch += sizeof(replorigin_session_origin);
  }

+ /* followed by toplevel XID, if not already included in previous record */
+ if (IsSubTransactionAssignmentPending())
+ {
+ TransactionId xid = GetTopTransactionIdIfAny();
+
+ /* update the flag (later used by XLogInsertRecord) */
+ curinsert_flags |= XLOG_INCLUDE_XID;

>
> + txid = XLogRecGetTopXid(record);
> +
> + /*
> + * If the toplevel_xid is valid, we need to assign the subxact to the
> + * toplevel transaction. We need to do this for all records, hence we
> + * do it before the switch.
> + */
> s/toplevel_xid/toplevel xid or s/toplevel_xid/txid

Okay, we can change that.

>   if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
> - info != XLOG_XACT_ASSIGNMENT)
> + !TransactionIdIsValid(r->toplevel_xid))
> Perhaps, XLogRecGetTopXid() can be used.

ok

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Kuntal Ghosh
Дата:
On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> >
> > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> > This looks wrong. We should change the name of this Macro or we can
> > add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments.
>
> I think this is in sync with below code (SizeOfXlogOrigin),  SO doen't
> make much sense to add different terminology no?
> #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char))
> +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
>
In that case, we can rename this, for example, SizeOfXLogTransactionId.

Some review comments from 0002-Issue-individual-*.path,

+void
+ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr lsn, int nmsgs,
+ SharedInvalidationMessage *msgs)
+{
+ MemoryContext oldcontext;
+ ReorderBufferChange *change;
+
+ /* XXX Should we even write invalidations without valid XID? */
+ if (xid == InvalidTransactionId)
+ return;
+
+ Assert(xid != InvalidTransactionId);

It seems we don't call the function if xid is not valid. In fact,

@@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf)
  }
  case XLOG_XACT_ASSIGNMENT:
  break;
+ case XLOG_XACT_INVALIDATIONS:
+ {
+ TransactionId xid;
+ xl_xact_invalidations *invals;
+
+ xid = XLogRecGetXid(r);
+ invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+ if (!TransactionIdIsValid(xid))
+ break;
+
+ ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+ invals->nmsgs, invals->msgs);

Why should we insert a WAL record in such cases?

+ * When wal_level=logical, write invalidations into WAL at each command end to
+ *  support the decoding of the in-progress transaction.  As of now it was
+ *  enough to log invalidation only at commit because we are only decoding the
+ *  transaction at the commit time.   We only need to log the catalog cache and
+ *  relcache invalidation.  There can not be any active MVCC scan in logical
+ *  decoding so we don't need to log the snapshot invalidation.
The alignment is not right.

 /*
  * CommandEndInvalidationMessages
- * Process queued-up invalidation messages at end of one command
- * in a transaction.
+ *              Process queued-up invalidation messages at end of one command
+ *              in a transaction.
Looks unnecessary changes.

  * Note:
- * This should be called during CommandCounterIncrement(),
- * after we have advanced the command ID.
+ *              This should be called during CommandCounterIncrement(),
+ *              after we have advanced the command ID.
  */
Looks unnecessary changes.

  if (transInvalInfo == NULL)
- return;
+ return;
Looks unnecessary changes.

+ /* prepare record */
+ memset(&xlrec, 0, sizeof(xlrec));
We should use MinSizeOfXactInvalidations, no?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Apr 13, 2020 at 6:12 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
> On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> > >
> > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> > > This looks wrong. We should change the name of this Macro or we can
> > > add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments.
> >
> > I think this is in sync with below code (SizeOfXlogOrigin),  SO doen't
> > make much sense to add different terminology no?
> > #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char))
> > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> >
> In that case, we can rename this, for example, SizeOfXLogTransactionId.

Make sense.

>
> Some review comments from 0002-Issue-individual-*.path,
>
> +void
> +ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
> + XLogRecPtr lsn, int nmsgs,
> + SharedInvalidationMessage *msgs)
> +{
> + MemoryContext oldcontext;
> + ReorderBufferChange *change;
> +
> + /* XXX Should we even write invalidations without valid XID? */
> + if (xid == InvalidTransactionId)
> + return;
> +
> + Assert(xid != InvalidTransactionId);
>
> It seems we don't call the function if xid is not valid. In fact,
>
> @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
> XLogRecordBuffer *buf)
>   }
>   case XLOG_XACT_ASSIGNMENT:
>   break;
> + case XLOG_XACT_INVALIDATIONS:
> + {
> + TransactionId xid;
> + xl_xact_invalidations *invals;
> +
> + xid = XLogRecGetXid(r);
> + invals = (xl_xact_invalidations *) XLogRecGetData(r);
> +
> + if (!TransactionIdIsValid(xid))
> + break;
> +
> + ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
> + invals->nmsgs, invals->msgs);
>
> Why should we insert an WAL record for such cases?

I think we can avoid this.  I will analyze and send update in my next patch.

>
> + * When wal_level=logical, write invalidations into WAL at each command end to
> + *  support the decoding of the in-progress transaction.  As of now it was
> + *  enough to log invalidation only at commit because we are only decoding the
> + *  transaction at the commit time.   We only need to log the catalog cache and
> + *  relcache invalidation.  There can not be any active MVCC scan in logical
> + *  decoding so we don't need to log the snapshot invalidation.
> The alignment is not right.

Will fix.

>  /*
>   * CommandEndInvalidationMessages
> - * Process queued-up invalidation messages at end of one command
> - * in a transaction.
> + *              Process queued-up invalidation messages at end of one command
> + *              in a transaction.
> Looks unnecessary changes.

Will fix.

>
>   * Note:
> - * This should be called during CommandCounterIncrement(),
> - * after we have advanced the command ID.
> + *              This should be called during CommandCounterIncrement(),
> + *              after we have advanced the command ID.
>   */
> Looks unnecessary changes.

Will fix.

>   if (transInvalInfo == NULL)
> - return;
> + return;
> Looks unnecessary changes.
>
> + /* prepare record */
> + memset(&xlrec, 0, sizeof(xlrec));
> We should use MinSizeOfXactInvalidations, no?

Right.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Kuntal Ghosh
Дата:
On Mon, Apr 13, 2020 at 6:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
Skipping 0003 for now. Review comments from 0004-Gracefully-handle-*.patch

@@ -5490,6 +5523,14 @@ heap_finish_speculative(Relation relation,
ItemPointer tid)
  ItemId lp = NULL;
  HeapTupleHeader htup;

+ /*
+ * We don't expect direct calls to heap_hot_search with
+ * valid CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+ elog(ERROR, "unexpected heap_hot_search call during logical decoding");
The call is to heap_finish_speculative.

@@ -481,6 +482,19 @@ systable_getnext(SysScanDesc sysscan)
  }
  }

+ if (TransactionIdIsValid(CheckXidAlive) &&
+ !TransactionIdIsInProgress(CheckXidAlive) &&
+ !TransactionIdDidCommit(CheckXidAlive))
+ ereport(ERROR,
+ (errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ errmsg("transaction aborted during system catalog scan")));
s/transaction aborted/transaction aborted concurrently perhaps? Also,
can we move this check to the beginning of the function? If the
condition fails, we can skip the sys scan.

Some of the checks look repetitive in the same file. Should we
declare them as inline functions?
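
For instance, a hypothetical inline helper wrapping that repeated
check could look like this (the name and placement are only
illustrative, not taken from the patch, and it assumes the usual
transam/procarray/elog headers are already included):

/* hypothetical helper for the repeated concurrent-abort check */
static inline void
CheckConcurrentAbort(void)
{
	if (TransactionIdIsValid(CheckXidAlive) &&
		!TransactionIdIsInProgress(CheckXidAlive) &&
		!TransactionIdDidCommit(CheckXidAlive))
		ereport(ERROR,
				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
				 errmsg("transaction aborted during system catalog scan")));
}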

Review comments from 0005-Implement-streaming*.patch

+static void
+AssertChangeLsnOrder(ReorderBufferTXN *txn)
+{
+#ifdef USE_ASSERT_CHECKING
+ dlist_iter iter;
...
+#endif
+}

We can implement the same as following:
#ifdef USE_ASSERT_CHECKING
static void
AssertChangeLsnOrder(ReorderBufferTXN *txn)
{
dlist_iter iter;
...
}
#else
#define AssertChangeLsnOrder(txn) ((void)true)
#endif

+ * if it is aborted we will report an specific error which we can ignore.  We
s/an specific/a specific

+ * Set the last last of the stream as the final lsn before calling
+ * stream stop.
s/last last/last

  PG_CATCH();
  {
+ MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+ ErrorData  *errdata = CopyErrorData();
When we don't re-throw, the errdata should be freed by calling
FreeErrorData(errdata), right?

+ /*
+ * Set the last last of the stream as the final lsn before
+ * calling stream stop.
+ */
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+
+ FlushErrorState();
+ }
stream_stop() can still throw some error, right? In that case, we
should flush the error state before calling stream_stop().

+ /*
+ * Remember the command ID and snapshot if transaction is streaming
+ * otherwise free the snapshot if we have copied it.
+ */
+ if (streaming)
+ {
+ txn->command_id = command_id;
+
+ /* Avoid copying if it's already copied. */
+ if (snapshot_now->copied)
+ txn->snapshot_now = snapshot_now;
+ else
+ txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+  txn, command_id);
+ }
+ else if (snapshot_now->copied)
+ ReorderBufferFreeSnap(rb, snapshot_now);
Hmm, it seems this part relies on the assumption that after copying
the snapshot, no subsequent step can throw any error. If one does,
then we may again create a copy of the snapshot in the catch block,
which will leak some memory. Is my understanding correct?

+ }
+ else
+ {
+ ReorderBufferCleanupTXN(rb, txn);
+ PG_RE_THROW();
+ }
Shouldn't we switch back to the previously created error memory
context before re-throwing?

+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn)
+{
+ ReorderBufferTXN *txn;
+ volatile Snapshot snapshot_now;
+ volatile CommandId command_id = FirstCommandId;
In the modified ReorderBufferCommit(), why is it necessary to declare
the above two variables as volatile? There is no try-catch block here.

@@ -1946,6 +2284,13 @@ ReorderBufferAbort(ReorderBuffer *rb,
TransactionId xid, XLogRecPtr lsn)
  if (txn == NULL)
  return;

+ /*
+ * When the (sub)transaction was streamed, notify the remote node
+ * about the abort only if we have sent any data for this transaction.
+ */
+ if (rbtxn_is_streamed(txn) && txn->any_data_sent)
+ rb->stream_abort(rb, txn, lsn);
+
s/When/If

+ /*
+ * When the (sub)transaction was streamed, notify the remote node
+ * about the abort.
+ */
+ if (rbtxn_is_streamed(txn))
+ rb->stream_abort(rb, txn, lsn);
s/When/If. And, in this case, if we've not sent any data, why should
we send the abort message (similar to the previous one)?

+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)
Should we put any assert (not necessarily here) to validate the above comment?

+ txn = ReorderBufferLargestTopTXN(rb);
+
+ /* we know there has to be one, because the size is not zero */
+ Assert(txn && !txn->toptxn);
+ Assert(txn->size > 0);
+ Assert(rb->size >= txn->size);
The same three assertions are already there in ReorderBufferLargestTopTXN().

+static bool
+ReorderBufferCanStream(ReorderBuffer *rb)
+{
+ LogicalDecodingContext *ctx = rb->private_data;
+
+ return ctx->streaming;
+}
Potential inline function.

+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+{
+ volatile Snapshot snapshot_now;
+ volatile CommandId command_id;
Here also, do we need to declare these two variables as volatile?


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Wed, Apr 8, 2020 at 6:29 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> On Tue, Apr 07, 2020 at 12:17:44PM +0530, Amit Kapila wrote:
> >On Mon, Mar 30, 2020 at 8:58 PM Tomas Vondra
> ><tomas.vondra@2ndquadrant.com> wrote:
> >
> >I think having something like we discussed or what you have in the
> >patch won't be sufficient to clean the KnownAssignedXid array. The
> >point is that we won't write a WAL for xid-subxid association for
> >unlogged relations in the "Immediately WAL-log assignments" patch,
> >however, the KnownAssignedXid would have both kinds of Xids as we
> >autofill it with gaps (see RecordKnownAssignedTransactionIds).  I
> >think if my understanding is correct to make it work we might need
> >major surgery in the code or have to maintain KnownAssignedXid array
> >differently.
>
> Hmm, that's a good point. If I understand correctly, the issue is
> that if we create new subxact, write something into an unlogged table,
> and then create new subxact, the XID of the first subxact will be "known
> assigned" but we won't know it's a subxact or to which parent xact it
> belongs (because there will be no WAL records that could encode it).
>

Yeah, there could be multiple such missing subxacts.

> I wonder if there's a simple solution (e.g. when creating the second
> subxact we might notice the xid-subxid assignment was not logged, and
> write some "dummy" WAL record).
>

That WAL record can have multiple xids.

> But I admit it seems a bit ugly.
>

Yeah, I guess it could be tricky as well, because while assembling
some WAL record we would need to generate an additional dummy record,
or might need to add additional information to the current record
being formed.  I think the handling of such WAL records during hot
standby and in logical decoding could vary.  During logical decoding,
we currently don't form an association for a subtransaction if it
doesn't have any changes (see ReorderBufferCommitChild), and with this
new type of record, I think we need to ensure that we don't form such
an association either.

I think after quite a few changes, tweaks and a lot of testing, we
might be able to remove XLOG_XACT_ASSIGNMENT, but I am not sure it is
worth doing along with this patch.  It would have been good to do this
if this patch were adding any visible overhead, or if it were easy to
do.  However, neither of those seems to be true, so it might be better
to write good comments in the code indicating what we would need to do
to remove XLOG_XACT_ASSIGNMENT, so that if we feel it is important to
do in the future we can do so.  I am not against spending effort on
this, but I don't see the urgency of doing it along with this patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Apr 13, 2020 at 6:12 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
> On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> > >
> > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> > > This looks wrong. We should change the name of this Macro or we can
> > > add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments.
> >
> > I think this is in sync with below code (SizeOfXlogOrigin),  SO doen't
> > make much sense to add different terminology no?
> > #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char))
> > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> >
> In that case, we can rename this, for example, SizeOfXLogTransactionId.
>
> Some review comments from 0002-Issue-individual-*.path,
>
> +void
> +ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
> + XLogRecPtr lsn, int nmsgs,
> + SharedInvalidationMessage *msgs)
> +{
> + MemoryContext oldcontext;
> + ReorderBufferChange *change;
> +
> + /* XXX Should we even write invalidations without valid XID? */
> + if (xid == InvalidTransactionId)
> + return;
> +
> + Assert(xid != InvalidTransactionId);
>
> It seems we don't call the function if xid is not valid. In fact,
>

You have a valid point.  Also, it is not clear: if we are first
checking (xid == InvalidTransactionId) and returning from the
function, how can the Assert even be hit?

> @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
> XLogRecordBuffer *buf)
>   }
>   case XLOG_XACT_ASSIGNMENT:
>   break;
> + case XLOG_XACT_INVALIDATIONS:
> + {
> + TransactionId xid;
> + xl_xact_invalidations *invals;
> +
> + xid = XLogRecGetXid(r);
> + invals = (xl_xact_invalidations *) XLogRecGetData(r);
> +
> + if (!TransactionIdIsValid(xid))
> + break;
> +
> + ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
> + invals->nmsgs, invals->msgs);
>
> Why should we insert an WAL record for such cases?
>

Right, if there is any such case, we should avoid it.

One more point about this patch: the commit message needs to be updated:

> The new invalidations are written to WAL immediately, without any
such caching. Perhaps it would be possible to add similar caching,
> e.g. at the command level, or something like that?

I think the above part of the commit message is no longer right, as
the patch already does such caching at the command level.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Apr 13, 2020 at 11:43 PM Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
>
> On Mon, Apr 13, 2020 at 6:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> Skipping 0003 for now. Review comments from 0004-Gracefully-handle-*.patch
>
> @@ -5490,6 +5523,14 @@ heap_finish_speculative(Relation relation,
> ItemPointer tid)
>   ItemId lp = NULL;
>   HeapTupleHeader htup;
>
> + /*
> + * We don't expect direct calls to heap_hot_search with
> + * valid CheckXidAlive for regular tables. Track that below.
> + */
> + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
> + elog(ERROR, "unexpected heap_hot_search call during logical decoding");
> The call is to heap_finish_speculative.

Fixed

> @@ -481,6 +482,19 @@ systable_getnext(SysScanDesc sysscan)
>   }
>   }
>
> + if (TransactionIdIsValid(CheckXidAlive) &&
> + !TransactionIdIsInProgress(CheckXidAlive) &&
> + !TransactionIdDidCommit(CheckXidAlive))
> + ereport(ERROR,
> + (errcode(ERRCODE_TRANSACTION_ROLLBACK),
> + errmsg("transaction aborted during system catalog scan")));
> s/transaction aborted/transaction aborted concurrently perhaps? Also,
> can we move this check at the begining of the function? If the
> condition fails, we can skip the sys scan.

We must check this after we get the tuple, because our goal is not to
decode based on a wrong tuple.  If we move the check earlier, what if
the transaction aborts after the check?  Once we have got the tuple,
and the transaction was still alive at that time, it doesn't matter
even if it aborts afterwards, because we already have the right tuple.

>
> Some of the checks looks repetative in the same file. Should we
> declare them as inline functions?
>
> Review comments from 0005-Implement-streaming*.patch
>
> +static void
> +AssertChangeLsnOrder(ReorderBufferTXN *txn)
> +{
> +#ifdef USE_ASSERT_CHECKING
> + dlist_iter iter;
> ...
> +#endif
> +}
>
> We can implement the same as following:
> #ifdef USE_ASSERT_CHECKING
> static void
> AssertChangeLsnOrder(ReorderBufferTXN *txn)
> {
> dlist_iter iter;
> ...
> }
> #else
> #define AssertChangeLsnOrder(txn) ((void)true)
> #endif

I am not sure; this doesn't look clean to me.  Moreover, the other
similar functions are defined in the same way, e.g. AssertTXNLsnOrder.

>
> + * if it is aborted we will report an specific error which we can ignore.  We
> s/an specific/a specific

Done

>
> + * Set the last last of the stream as the final lsn before calling
> + * stream stop.
> s/last last/last
>
>   PG_CATCH();
>   {
> + MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
> + ErrorData  *errdata = CopyErrorData();
> When we don't re-throw, the errdata should be freed by calling
> FreeErrorData(errdata), right?

Done


>
> + /*
> + * Set the last last of the stream as the final lsn before
> + * calling stream stop.
> + */
> + txn->final_lsn = prev_lsn;
> + rb->stream_stop(rb, txn);
> +
> + FlushErrorState();
> + }
> stream_stop() can still throw some error, right? In that case, we
> should flush the error state before calling stream_stop().

Done

>
> + /*
> + * Remember the command ID and snapshot if transaction is streaming
> + * otherwise free the snapshot if we have copied it.
> + */
> + if (streaming)
> + {
> + txn->command_id = command_id;
> +
> + /* Avoid copying if it's already copied. */
> + if (snapshot_now->copied)
> + txn->snapshot_now = snapshot_now;
> + else
> + txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
> +  txn, command_id);
> + }
> + else if (snapshot_now->copied)
> + ReorderBufferFreeSnap(rb, snapshot_now);
> Hmm, it seems this part needs an assumption that after copying the
> snapshot, no subsequent step can throw any error. If they do, then we
> can again create a copy of the snapshot in catch block, which will
> leak some memory. Is my understanding correct?

Actually, in the CATCH block we copy only if the error is
ERRCODE_TRANSACTION_ROLLBACK, and that can only occur during a
systable scan.  Basically, in the TRY block we copy the snapshot after
we have streamed all the changes, i.e. the systable scans are done, so
any error raised after that will not be ERRCODE_TRANSACTION_ROLLBACK
and we will not copy again.

>
> + }
> + else
> + {
> + ReorderBufferCleanupTXN(rb, txn);
> + PG_RE_THROW();
> + }
> Shouldn't we switch back to previously created error memory context
> before re-throwing?

Fixed.

>
> +ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> + XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
> + TimestampTz commit_time,
> + RepOriginId origin_id, XLogRecPtr origin_lsn)
> +{
> + ReorderBufferTXN *txn;
> + volatile Snapshot snapshot_now;
> + volatile CommandId command_id = FirstCommandId;
> In the modified ReorderBufferCommit(), why is it necessary to declare
> the above two variable as volatile? There is no try-catch block here.

Fixed
>
> @@ -1946,6 +2284,13 @@ ReorderBufferAbort(ReorderBuffer *rb,
> TransactionId xid, XLogRecPtr lsn)
>   if (txn == NULL)
>   return;
>
> + /*
> + * When the (sub)transaction was streamed, notify the remote node
> + * about the abort only if we have sent any data for this transaction.
> + */
> + if (rbtxn_is_streamed(txn) && txn->any_data_sent)
> + rb->stream_abort(rb, txn, lsn);
> +
> s/When/If
>
> + /*
> + * When the (sub)transaction was streamed, notify the remote node
> + * about the abort.
> + */
> + if (rbtxn_is_streamed(txn))
> + rb->stream_abort(rb, txn, lsn);
> s/When/If. And, in this case, if we've not sent any data, why should
> we send the abort message (similar to the previous one)?

Fixed

>
> + * Note: We never do both stream and serialize a transaction (we only spill
> + * to disk when streaming is not supported by the plugin), so only one of
> + * those two flags may be set at any given time.
> + */
> +#define rbtxn_is_streamed(txn) \
> +( \
> + ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
> +)
> Should we put any assert (not necessarily here) to validate the above comment?

Because of toast handling, this assumption has changed now, so I will
remove this note in that patch (0010).

>
> + txn = ReorderBufferLargestTopTXN(rb);
> +
> + /* we know there has to be one, because the size is not zero */
> + Assert(txn && !txn->toptxn);
> + Assert(txn->size > 0);
> + Assert(rb->size >= txn->size);
> The same three assertions are already there in ReorderBufferLargestTopTXN().
>
> +static bool
> +ReorderBufferCanStream(ReorderBuffer *rb)
> +{
> + LogicalDecodingContext *ctx = rb->private_data;
> +
> + return ctx->streaming;
> +}
> Potential inline function.

Done

> +static void
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> +{
> + volatile Snapshot snapshot_now;
> + volatile CommandId command_id;
> Here also, do we need to declare these two variables as volatile?

Done

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, Apr 14, 2020 at 2:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Apr 13, 2020 at 6:12 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> >
> > On Mon, Apr 13, 2020 at 5:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> > > >
> > > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> > > > This looks wrong. We should change the name of this Macro or we can
> > > > add the 1 byte directly in HEADER_SCRATCH_SIZE and some comments.
> > >
> > > I think this is in sync with below code (SizeOfXlogOrigin),  SO doen't
> > > make much sense to add different terminology no?
> > > #define SizeOfXlogOrigin (sizeof(RepOriginId) + sizeof(char))
> > > +#define SizeOfTransactionId (sizeof(TransactionId) + sizeof(char))
> > >
> > In that case, we can rename this, for example, SizeOfXLogTransactionId.
> >
> > Some review comments from 0002-Issue-individual-*.path,
> >
> > +void
> > +ReorderBufferAddInvalidation(ReorderBuffer *rb, TransactionId xid,
> > + XLogRecPtr lsn, int nmsgs,
> > + SharedInvalidationMessage *msgs)
> > +{
> > + MemoryContext oldcontext;
> > + ReorderBufferChange *change;
> > +
> > + /* XXX Should we even write invalidations without valid XID? */
> > + if (xid == InvalidTransactionId)
> > + return;
> > +
> > + Assert(xid != InvalidTransactionId);
> >
> > It seems we don't call the function if xid is not valid. In fact,
> >
>
> You have a valid point.  Also, it is not clear if we are first
> checking (xid == InvalidTransactionId) and returning from the
> function, how can even Assert hit.

I have changed the code; now we only have an assert.

>
> > @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
> > XLogRecordBuffer *buf)
> >   }
> >   case XLOG_XACT_ASSIGNMENT:
> >   break;
> > + case XLOG_XACT_INVALIDATIONS:
> > + {
> > + TransactionId xid;
> > + xl_xact_invalidations *invals;
> > +
> > + xid = XLogRecGetXid(r);
> > + invals = (xl_xact_invalidations *) XLogRecGetData(r);
> > +
> > + if (!TransactionIdIsValid(xid))
> > + break;
> > +
> > + ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
> > + invals->nmsgs, invals->msgs);
> >
> > Why should we insert an WAL record for such cases?
> >
>
> Right, if there is any such case, we should avoid it.

I think we don't have any such case because we are logging at the
command end.  So I have created an assert instead of the check.
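
Just to illustrate what I mean, the revised hunk could look roughly
like this (a sketch only, not the exact code in the new version):

 case XLOG_XACT_INVALIDATIONS:
 	{
 		TransactionId xid = XLogRecGetXid(r);
 		xl_xact_invalidations *invals;

 		invals = (xl_xact_invalidations *) XLogRecGetData(r);

 		/* invalidations are logged at command end, so the xid must be valid */
 		Assert(TransactionIdIsValid(xid));

 		ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
 									 invals->nmsgs, invals->msgs);
 		break;
 	}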

> One more point about this patch, the commit message needs to be updated:
>
> > The new invalidations are written to WAL immediately, without any
> such caching. Perhaps it would be possible to add similar caching,
> > e.g. at the command level, or something like that?
>
> I think the above part of commit message is not right as the patch
> already does such a caching now at the command level.

Right, I have removed that.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>
> >
> > > @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
> > > XLogRecordBuffer *buf)
> > >   }
> > >   case XLOG_XACT_ASSIGNMENT:
> > >   break;
> > > + case XLOG_XACT_INVALIDATIONS:
> > > + {
> > > + TransactionId xid;
> > > + xl_xact_invalidations *invals;
> > > +
> > > + xid = XLogRecGetXid(r);
> > > + invals = (xl_xact_invalidations *) XLogRecGetData(r);
> > > +
> > > + if (!TransactionIdIsValid(xid))
> > > + break;
> > > +
> > > + ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
> > > + invals->nmsgs, invals->msgs);
> > >
> > > Why should we insert an WAL record for such cases?
> > >
> >
> > Right, if there is any such case, we should avoid it.
>
> I think we don't have any such case because we are logging at the
> command end.  So I have created an assert instead of the check.
>

Have you tried to ensure this in some way?  One idea could be to add
an Assert (to check that the transaction id is assigned) in the new
code where you are writing WAL for this action, and then run make
check-world and/or make installcheck-world.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, Apr 14, 2020 at 3:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> >
> > >
> > > > @@ -281,6 +281,24 @@ DecodeXactOp(LogicalDecodingContext *ctx,
> > > > XLogRecordBuffer *buf)
> > > >   }
> > > >   case XLOG_XACT_ASSIGNMENT:
> > > >   break;
> > > > + case XLOG_XACT_INVALIDATIONS:
> > > > + {
> > > > + TransactionId xid;
> > > > + xl_xact_invalidations *invals;
> > > > +
> > > > + xid = XLogRecGetXid(r);
> > > > + invals = (xl_xact_invalidations *) XLogRecGetData(r);
> > > > +
> > > > + if (!TransactionIdIsValid(xid))
> > > > + break;
> > > > +
> > > > + ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
> > > > + invals->nmsgs, invals->msgs);
> > > >
> > > > Why should we insert an WAL record for such cases?
> > > >
> > >
> > > Right, if there is any such case, we should avoid it.
> >
> > I think we don't have any such case because we are logging at the
> > command end.  So I have created an assert instead of the check.
> >
>
> Have you tried to ensure this in some way?  One idea could be to add
> an Assert (to check if transaction id is assigned) in the new code
> where you are writing WAL for this action and then run make
> check-world and or make installcheck-world.

Yeah, I had already tested that.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Erik Rijkers
Дата:
On 2020-04-14 12:10, Dilip Kumar wrote:

> v14-0001-Immediately-WAL-log-assignments.patch                 +
> v14-0002-Issue-individual-invalidations-with.patch             +
> v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+
> v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+
> v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch       +
> v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+
> v14-0007-Track-statistics-for-streaming.patch                  +
> v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch +
> v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch              +
> v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch

applied on top of 8128b0c (a few hours ago)

Hi,

I haven't followed this thread and maybe this instability is 
known/expected; just thought I'd let you know.

When doing a pgbench run over logical replication (cascading 
down two replicas), I get this segmentation fault.

2020-04-14 17:27:28.135 CEST [8118] DETAIL:  Streaming transactions 
committing after 0/5FA2A38, reading WAL from 0/5FA2A00.
2020-04-14 17:27:28.135 CEST [8118] LOG:  logical decoding found 
consistent point at 0/5FA2A00
2020-04-14 17:27:28.135 CEST [8118] DETAIL:  There are no running 
transactions.
2020-04-14 17:27:28.138 CEST [8006] LOG:  server process (PID 8118) was 
terminated by signal 11: Segmentation fault
2020-04-14 17:27:28.138 CEST [8006] DETAIL:  Failed process was running: 
COMMIT
2020-04-14 17:27:28.138 CEST [8006] LOG:  terminating any other active 
server processes
2020-04-14 17:27:28.138 CEST [8163] WARNING:  terminating connection 
because of crash of another server process
2020-04-14 17:27:28.138 CEST [8163] DETAIL:  The postmaster has 
commanded this server process to roll back the current transaction and 
exit, because another server process exited abnormally and possibly 
corrupted shared memory.
2020-04-14 17:27:28.138 CEST [8163] HINT:  In a moment you should be 
able to reconnect to the database and repeat your command.


This error happens somewhat buried away in my test-stuff; I can dig it 
out and make it into a repeatable test if you need it. (debian 
stretch/gcc 9.3.0)


Erik Rijkers




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, Apr 14, 2020 at 9:14 PM Erik Rijkers <er@xs4all.nl> wrote:
>
> On 2020-04-14 12:10, Dilip Kumar wrote:
>
> > v14-0001-Immediately-WAL-log-assignments.patch                 +
> > v14-0002-Issue-individual-invalidations-with.patch             +
> > v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+
> > v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+
> > v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch       +
> > v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+
> > v14-0007-Track-statistics-for-streaming.patch                  +
> > v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch +
> > v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch              +
> > v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
>
> applied on top of 8128b0c (a few hours ago)
>
> Hi,
>
> I haven't followed this thread and maybe this instabilty is
> known/expected; just thought I'd let you know.
>
> When doing running a pgbench run over logical replication (cascading
> down two replicas), I get this segmentation fault.

Thanks for the testing.  Is it possible to share the call stack?

>
> 2020-04-14 17:27:28.135 CEST [8118] DETAIL:  Streaming transactions
> committing after 0/5FA2A38, reading WAL from 0/5FA2A00.
> 2020-04-14 17:27:28.135 CEST [8118] LOG:  logical decoding found
> consistent point at 0/5FA2A00
> 2020-04-14 17:27:28.135 CEST [8118] DETAIL:  There are no running
> transactions.
> 2020-04-14 17:27:28.138 CEST [8006] LOG:  server process (PID 8118) was
> terminated by signal 11: Segmentation fault
> 2020-04-14 17:27:28.138 CEST [8006] DETAIL:  Failed process was running:
> COMMIT
> 2020-04-14 17:27:28.138 CEST [8006] LOG:  terminating any other active
> server processes
> 2020-04-14 17:27:28.138 CEST [8163] WARNING:  terminating connection
> because of crash of another server process
> 2020-04-14 17:27:28.138 CEST [8163] DETAIL:  The postmaster has
> commanded this server process to roll back the current transaction and
> exit, because another server process exited abnormally and possibly
> corrupted shared memory.
> 2020-04-14 17:27:28.138 CEST [8163] HINT:  In a moment you should be
> able to reconnect to the database and repeat your command.
>
>
> This error happens somewhat buried away in my test-stuff; I can dig it
> out and make it into a repeatable test if you need it. (debian
> stretch/gcc 9.3.0)

Yeah, that will be great.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, Apr 14, 2020 at 9:14 PM Erik Rijkers <er@xs4all.nl> wrote:
>
> On 2020-04-14 12:10, Dilip Kumar wrote:
>
> > v14-0001-Immediately-WAL-log-assignments.patch                 +
> > v14-0002-Issue-individual-invalidations-with.patch             +
> > v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+
> > v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+
> > v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch       +
> > v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+
> > v14-0007-Track-statistics-for-streaming.patch                  +
> > v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch +
> > v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch              +
> > v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
>
> applied on top of 8128b0c (a few hours ago)


Hi Erik,

While setting up the cascading replication I have hit one issue on the
base code[1].  After fixing that, I got one crash with streaming
enabled on the patch.  I am not sure whether you are facing either of
these 2 issues or some other issue.  If your issue is not one of these
then please share the call stack and steps to reproduce.

[1] https://www.postgresql.org/message-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w%40mail.gmail.com


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Erik Rijkers
Дата:
On 2020-04-16 11:33, Dilip Kumar wrote:
> On Tue, Apr 14, 2020 at 9:14 PM Erik Rijkers <er@xs4all.nl> wrote:
>> 
>> On 2020-04-14 12:10, Dilip Kumar wrote:
>> 
>> > v14-0001-Immediately-WAL-log-assignments.patch                 +
>> > v14-0002-Issue-individual-invalidations-with.patch             +
>> > v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+
>> > v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+
>> > v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch       +
>> > v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+
>> > v14-0007-Track-statistics-for-streaming.patch                  +
>> > v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch +
>> > v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch              +
>> > v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
>> 
>> applied on top of 8128b0c (a few hours ago)
> 

I've added your new patch

[bugfix_replica_identity_full_on_subscriber.patch]

on top of all those above but the crash (apparently the same crash) that 
I had earlier still occurs (and pretty soon).

server process (PID 1721) was terminated by signal 11: Segmentation 
fault

I'll try to isolate it better and get a stacktrace


> Hi Erik,
> 
> While setting up the cascading replication I have hit one issue on
> base code[1].  After fixing that I have got one crash with streaming
> on patch.  I am not sure whether you are facing any of these 2 issues
> or any other issue.  If your issue is not any of these then plese
> share the callstack and steps to reproduce.
> 
> [1]
> https://www.postgresql.org/message-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w%40mail.gmail.com
> 
> 
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Kuntal Ghosh
Дата:
On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>

Few review comments from 0006-Add-support-for-streaming*.patch

+ subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
lseek can return a negative value in case of error, right?
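
For example, something along these lines (only a sketch; the error
message wording is made up):

 off_t		endoff;

 endoff = lseek(stream_fd, 0, SEEK_END);
 if (endoff < 0)
 	ereport(ERROR,
 			(errcode_for_file_access(),
 			 errmsg("could not seek to end of streaming temporary file: %m")));

 subxacts[nsubxacts].offset = endoff;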

+ /*
+ * We might need to create the tablespace's tempfile directory, if no
+ * one has yet done so.
+ *
+ * Don't check for error from mkdir; it could fail if the directory
+ * already exists (maybe someone else just did the same thing).  If
+ * it doesn't work then we'll bomb out when opening the file
+ */
+ mkdir(tempdirpath, S_IRWXU);
If that's the only reason, perhaps we can use something like following:

if (mkdir(tempdirpath, S_IRWXU) < 0 && errno != EEXIST)
throw error;

+
+ CloseTransientFile(stream_fd);
This might fail to close the file; we should handle that case.
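
To make these suggestions concrete, here is a minimal sketch of the kind of
error handling meant, in the usual backend style.  stream_fd, subxacts,
nsubxacts and tempdirpath are taken from the quoted patch code; "path" is a
placeholder for whatever filename the patch has at hand, and the exact
placement and wording are only illustrative:

    /* check lseek's return value instead of storing it blindly */
    off_t       endpos = lseek(stream_fd, 0, SEEK_END);

    if (endpos < 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not seek in file \"%s\": %m", path)));
    subxacts[nsubxacts].offset = endpos;

    /* EEXIST is expected if someone else created the directory already */
    if (mkdir(tempdirpath, S_IRWXU) < 0 && errno != EEXIST)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not create directory \"%s\": %m", tempdirpath)));

    /* CloseTransientFile() returns close()'s result, so check it as well */
    if (CloseTransientFile(stream_fd) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not close file \"%s\": %m", path)));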

Also, I think we need some implementations in dumpSubscription() to
dump the (streaming = 'on') option.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Tomas Vondra
Дата:
On Mon, Apr 13, 2020 at 05:20:39PM +0530, Dilip Kumar wrote:
>On Mon, Apr 13, 2020 at 4:14 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>>
>> On Thu, Apr 9, 2020 at 2:40 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> >
>> > I have rebased the patch on the latest head.  I haven't yet changed
>> > anything for xid assignment thing because it is not yet concluded.
>> >
>> Some review comments from 0001-Immediately-WAL-log-*.patch,
>>
>> +bool
>> +IsSubTransactionAssignmentPending(void)
>> +{
>> + if (!XLogLogicalInfoActive())
>> + return false;
>> +
>> + /* we need to be in a transaction state */
>> + if (!IsTransactionState())
>> + return false;
>> +
>> + /* it has to be a subtransaction */
>> + if (!IsSubTransaction())
>> + return false;
>> +
>> + /* the subtransaction has to have a XID assigned */
>> + if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
>> + return false;
>> +
>> + /* and it needs to have 'assigned' */
>> + return !CurrentTransactionState->assigned;
>> +
>> +}
>> IMHO, it's important to reduce the complexity of this function since
>> it's been called for every WAL insertion. During the lifespan of a
>> transaction, any of these if conditions will only be evaluated if
>> previous conditions are true. So, we can maintain some state machine
>> to avoid multiple evaluation of a condition inside a transaction. But,
>> if the overhead is not much, it's not worth I guess.
>
>Yeah maybe, in some cases we can avoid checking multiple conditions by
>maintaining that state.  But, that state will have to be at the
>transaction level.  But, I am not sure how much worth it will be to
>add one extra condition to skip a few if checks and it will also add
>the code complexity.  And, in some cases where logical decoding is not
>enabled, it may add one extra check? I mean first check the state and
>that will take you to the first if check.
>

Perhaps. I think we should only do that if we can demonstrate it's an
issue in practice. Otherwise it's just unnecessary complexity.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Erik Rijkers
Дата:
On 2020-04-16 11:46, Erik Rijkers wrote:
> On 2020-04-16 11:33, Dilip Kumar wrote:
>> On Tue, Apr 14, 2020 at 9:14 PM Erik Rijkers <er@xs4all.nl> wrote:
>>> 
>>> On 2020-04-14 12:10, Dilip Kumar wrote:
>>> 
>>> > v14-0001-Immediately-WAL-log-assignments.patch                 +
>>> > v14-0002-Issue-individual-invalidations-with.patch             +
>>> > v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch+
>>> > v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+
>>> > v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch       +
>>> > v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch+
>>> > v14-0007-Track-statistics-for-streaming.patch                  +
>>> > v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch +
>>> > v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch              +
>>> > v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
>>> 
>>> applied on top of 8128b0c (a few hours ago)
>> 
> 
> I've added your new patch
> 
> [bugfix_replica_identity_full_on_subscriber.patch]
> 
> on top of all those above but the crash (apparently the same crash)
> that I had earlier still occurs (and pretty soon).
> 
> server process (PID 1721) was terminated by signal 11: Segmentation 
> fault
> 
> I'll try to isolate it better and get a stacktrace
> 
> 
>> Hi Erik,
>> 
>> While setting up the cascading replication I have hit one issue on
>> base code[1].  After fixing that I have got one crash with streaming
>> on patch.  I am not sure whether you are facing any of these 2 issues
>> or any other issue.  If your issue is not any of these then plese
>> share the callstack and steps to reproduce.

I figured out a few things about this. Attached is a bash script 
test.sh, to reproduce:

There is a variable  CRASH_IT  that determines whether the whole thing 
will fail (with a segmentation fault) or not.  As attached it has 
CRASH_IT=0 and does not crash.  When you change that to CRASH_IT=1, then 
it will crash.  It turns out that this just depends on a short wait 
state (3 seconds, on my machine) between setting up the replication and 
the running of pgbench.  It's possible that on very fast machines it 
does not occur; we've had such differences between hardware before. 
This is an i5-3330S.

It deletes files, so look it over before you run it.  It may also depend 
on some of my local set-up, but I guess that should be easily fixed.

Can you let me know if you can reproduce the problem with this?

thanks,

Erik Rijkers



>> 
>> [1]
>> https://www.postgresql.org/message-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w%40mail.gmail.com
>> 
>> 
>> --
>> Regards,
>> Dilip Kumar
>> EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Erik Rijkers
Дата:
On 2020-04-18 11:07, Erik Rijkers wrote:
>>> Hi Erik,
>>> 
>>> While setting up the cascading replication I have hit one issue on
>>> base code[1].  After fixing that I have got one crash with streaming
>>> on patch.  I am not sure whether you are facing any of these 2 issues
>>> or any other issue.  If your issue is not any of these then plese
>>> share the callstack and steps to reproduce.
> 
> I figured out a few things about this. Attached is a bash script
> test.sh, to reproduce:

And the attached file, test.sh.  (sorry)

> There is a variable  CRASH_IT  that determines whether the whole thing
> will fail (with a segmentation fault) or not.  As attached it has
> CRASH_IT=0 and does not crash.  When you change that to CRASH_IT=1,
> then it will crash.  It turns out that this just depends on a short
> wait state (3 seconds, on my machine) between setting up de
> replication, and the running of pgbench.  It's possible that on very
> fast machines maybe it does not occur; we've had such difference
> between hardware before. This is a i5-3330S.
> 
> It deletes files so look it over before you run it.  It may also
> depend on some of my local set-up but I guess that should be easily
> fixed.
> 
> Can you let me know if you can reproduce the problem with this?
> 
> thanks,
> 
> Erik Rijkers
> 
> 
> 
>>> 
>>> [1]
>>> https://www.postgresql.org/message-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w%40mail.gmail.com
>>> 
>>> 
>>> --
>>> Regards,
>>> Dilip Kumar
>>> EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Erik Rijkers
Дата:
On 2020-04-18 11:10, Erik Rijkers wrote:
> On 2020-04-18 11:07, Erik Rijkers wrote:
>>>> Hi Erik,
>>>> 
>>>> While setting up the cascading replication I have hit one issue on
>>>> base code[1].  After fixing that I have got one crash with streaming
>>>> on patch.  I am not sure whether you are facing any of these 2 
>>>> issues
>>>> or any other issue.  If your issue is not any of these then plese
>>>> share the callstack and steps to reproduce.
>> 
>> I figured out a few things about this. Attached is a bash script
>> test.sh, to reproduce:
> 
> And the attached file, test.sh.  (sorry)

It turns out I must have been mistaken somewhere.  I probably missed 
bugfix_in_schema_sent.patch.

I have just now rebuilt all the instances on top of master with these 
patches:

> [v14-0001-Immediately-WAL-log-assignments.patch]
> [v14-0002-Issue-individual-invalidations-with.patch]
> [v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch]
> [v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch]
> [v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch]
> [v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch]
> [v14-0007-Track-statistics-for-streaming.patch]
> [v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch]
> [v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch]
> [v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch]
> [bugfix_in_schema_sent.patch]

    (by the way: this build's regression tests  'ddl', 'toast', and 
'spill' fail)

I seem now able to run all my test programs on these instances without 
errors.

Sorry, I seem to have raised a false alarm (although there was initially 
certainly a problem).


Erik Rijkers



>> There is a variable  CRASH_IT  that determines whether the whole thing
>> will fail (with a segmentation fault) or not.  As attached it has
>> CRASH_IT=0 and does not crash.  When you change that to CRASH_IT=1,
>> then it will crash.  It turns out that this just depends on a short
>> wait state (3 seconds, on my machine) between setting up de
>> replication, and the running of pgbench.  It's possible that on very
>> fast machines maybe it does not occur; we've had such difference
>> between hardware before. This is a i5-3330S.
>> 
>> It deletes files so look it over before you run it.  It may also
>> depend on some of my local set-up but I guess that should be easily
>> fixed.
>> 
>> Can you let me know if you can reproduce the problem with this?
>> 
>> thanks,
>> 
>> Erik Rijkers
>> 
>> 
>> 
>>>> 
>>>> [1]
>>>> https://www.postgresql.org/message-id/CAFiTN-u64S5bUiPL1q5kwpHNd0hRnf1OE-bzxNiOs5zo84i51w%40mail.gmail.com
>>>> 
>>>> 
>>>> --
>>>> Regards,
>>>> Dilip Kumar
>>>> EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Sat, Apr 18, 2020 at 6:12 PM Erik Rijkers <er@xs4all.nl> wrote:
>
> On 2020-04-18 11:10, Erik Rijkers wrote:
> > On 2020-04-18 11:07, Erik Rijkers wrote:
> >>>> Hi Erik,
> >>>>
> >>>> While setting up the cascading replication I have hit one issue on
> >>>> base code[1].  After fixing that I have got one crash with streaming
> >>>> on patch.  I am not sure whether you are facing any of these 2
> >>>> issues
> >>>> or any other issue.  If your issue is not any of these then plese
> >>>> share the callstack and steps to reproduce.
> >>
> >> I figured out a few things about this. Attached is a bash script
> >> test.sh, to reproduce:
> >
> > And the attached file, test.sh.  (sorry)
>
> It turns out I must have been mistaken somewhere.  I probably missed
> bugfix_in_schema_sent.patch)
>
> I have just now rebuilt all the instances on top of master with these
> patches:
>
> > [v14-0001-Immediately-WAL-log-assignments.patch]
> > [v14-0002-Issue-individual-invalidations-with.patch]
> > [v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch]
> > [v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch]
> > [v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch]
> > [v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch]
> > [v14-0007-Track-statistics-for-streaming.patch]
> > [v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch]
> > [v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch]
> > [v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch]
> > [bugfix_in_schema_sent.patch]
>
>     (by the way: this build's regression tests  'ddl', 'toast', and
> 'spill' fail)
>
> I seem now able to run all my test programs on these instances without
> errors.
>
> Sorry, I seem to have raised a false alarm (although there was initially
> certainly a problem).

No problem.  Thanks for confirming.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Sat, Apr 18, 2020 at 6:12 PM Erik Rijkers <er@xs4all.nl> wrote:
>
> On 2020-04-18 11:10, Erik Rijkers wrote:
> > On 2020-04-18 11:07, Erik Rijkers wrote:
> >>>> Hi Erik,
> >>>>
> >>>> While setting up the cascading replication I have hit one issue on
> >>>> base code[1].  After fixing that I have got one crash with streaming
> >>>> on patch.  I am not sure whether you are facing any of these 2
> >>>> issues
> >>>> or any other issue.  If your issue is not any of these then plese
> >>>> share the callstack and steps to reproduce.
> >>
> >> I figured out a few things about this. Attached is a bash script
> >> test.sh, to reproduce:
> >
> > And the attached file, test.sh.  (sorry)
>
> It turns out I must have been mistaken somewhere.  I probably missed
> bugfix_in_schema_sent.patch)
>
> I have just now rebuilt all the instances on top of master with these
> patches:
>
> > [v14-0001-Immediately-WAL-log-assignments.patch]
> > [v14-0002-Issue-individual-invalidations-with.patch]
> > [v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch]
> > [v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch]
> > [v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch]
> > [v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch]
> > [v14-0007-Track-statistics-for-streaming.patch]
> > [v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch]
> > [v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch]
> > [v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch]
> > [bugfix_in_schema_sent.patch]
>
>     (by the way: this build's regression tests  'ddl', 'toast', and
> 'spill' fail)

Yeah, this is a known issue: while streaming the transaction the output
message is changed.  I have a plan to work on this part.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Tue, Apr 21, 2020 at 5:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Apr 18, 2020 at 6:12 PM Erik Rijkers <er@xs4all.nl> wrote:
> >
> > On 2020-04-18 11:10, Erik Rijkers wrote:
> > > On 2020-04-18 11:07, Erik Rijkers wrote:
> > >>>> Hi Erik,
> > >>>>
> > >>>> While setting up the cascading replication I have hit one issue on
> > >>>> base code[1].  After fixing that I have got one crash with streaming
> > >>>> on patch.  I am not sure whether you are facing any of these 2
> > >>>> issues
> > >>>> or any other issue.  If your issue is not any of these then plese
> > >>>> share the callstack and steps to reproduce.
> > >>
> > >> I figured out a few things about this. Attached is a bash script
> > >> test.sh, to reproduce:
> > >
> > > And the attached file, test.sh.  (sorry)
> >
> > It turns out I must have been mistaken somewhere.  I probably missed
> > bugfix_in_schema_sent.patch)
> >
> > I have just now rebuilt all the instances on top of master with these
> > patches:
> >
> > > [v14-0001-Immediately-WAL-log-assignments.patch]
> > > [v14-0002-Issue-individual-invalidations-with.patch]
> > > [v14-0003-Extend-the-output-plugin-API-with-stream-methods.patch]
> > > [v14-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch]
> > > [v14-0005-Implement-streaming-mode-in-ReorderBuffer.patch]
> > > [v14-0006-Add-support-for-streaming-to-built-in-replicatio.patch]
> > > [v14-0007-Track-statistics-for-streaming.patch]
> > > [v14-0008-Enable-streaming-for-all-subscription-TAP-tests.patch]
> > > [v14-0009-Add-TAP-test-for-streaming-vs.-DDL.patch]
> > > [v14-0010-Bugfix-handling-of-incomplete-toast-tuple.patch]
> > > [bugfix_in_schema_sent.patch]
> >
> >     (by the way: this build's regression tests  'ddl', 'toast', and
> > 'spill' fail)
>
> Yeah, this is a. known issue, actually, while streaming the
> transaction the output message is changed.  I have a plan to work on
> this part.

I have fixed this part.  Basically, I have now created a separate
function, 'pg_logical_slot_get_streaming_changes', to get the streaming
changes.  The default function pg_logical_slot_get_changes works as
before, so the test_decoding test cases will not fail.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Erik Rijkers
Дата:
On 2020-04-22 16:49, Dilip Kumar wrote:
> On Tue, Apr 21, 2020 at 5:30 PM Dilip Kumar <dilipbalaut@gmail.com> 
> wrote:
>> 
>> >
>> >     (by the way: this build's regression tests  'ddl', 'toast', and
>> > 'spill' fail)
>> 
>> Yeah, this is a. known issue, actually, while streaming the
>> transaction the output message is changed.  I have a plan to work on
>> this part.
> 
> I have fixed this part.  Basically, now, I have created a separate
> function to get the streaming changes
> 'pg_logical_slot_get_streaming_changes'.  So the default function
> pg_logical_slot_get_changes will work as it is and test decoding test
> cases will not fail.

The 'ddl' one is apparently not quite fixed - I get this from '(cd 
contrib; make check)' (in both assert-enabled and non-assert-enabled 
builds).

grep -A7 -B7 make.check_contrib.out:

contrib/make.check_contrib.out-============== initializing database 
system           ==============
contrib/make.check_contrib.out-============== starting postmaster        
             ==============
contrib/make.check_contrib.out-running on port 64464 with PID 9175
contrib/make.check_contrib.out-============== creating database 
"contrib_regression" ==============
contrib/make.check_contrib.out-CREATE DATABASE
contrib/make.check_contrib.out-ALTER DATABASE
contrib/make.check_contrib.out-============== running regression test 
queries        ==============
contrib/make.check_contrib.out:test ddl                          ... 
FAILED      840 ms
contrib/make.check_contrib.out-test xact                         ... ok  
          24 ms
contrib/make.check_contrib.out-test rewrite                      ... ok  
         187 ms
contrib/make.check_contrib.out-test toast                        ... ok  
         851 ms
contrib/make.check_contrib.out-test permissions                  ... ok  
          26 ms
contrib/make.check_contrib.out-test decoding_in_xact             ... ok  
          31 ms
contrib/make.check_contrib.out-test decoding_into_rel            ... ok  
          25 ms
contrib/make.check_contrib.out-test binary                       ... ok  
          12 ms

Otherwise patches apply and build OK so will go run some tests...






Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er@xs4all.nl> wrote:
>
> On 2020-04-22 16:49, Dilip Kumar wrote:
> > On Tue, Apr 21, 2020 at 5:30 PM Dilip Kumar <dilipbalaut@gmail.com>
> > wrote:
> >>
> >> >
> >> >     (by the way: this build's regression tests  'ddl', 'toast', and
> >> > 'spill' fail)
> >>
> >> Yeah, this is a. known issue, actually, while streaming the
> >> transaction the output message is changed.  I have a plan to work on
> >> this part.
> >
> > I have fixed this part.  Basically, now, I have created a separate
> > function to get the streaming changes
> > 'pg_logical_slot_get_streaming_changes'.  So the default function
> > pg_logical_slot_get_changes will work as it is and test decoding test
> > cases will not fail.
>
> The 'ddl' one is apparently not quite fixed  - I get this in (cd
> contrib; make check)' (in both assert-enabled and non-assert-enabled
> build)

Can you send me the contrib/test_decoding/regression.diffs file?

> grep -A7 -B7 make.check_contrib.out:
>
> contrib/make.check_contrib.out-============== initializing database
> system           ==============
> contrib/make.check_contrib.out-============== starting postmaster
>              ==============
> contrib/make.check_contrib.out-running on port 64464 with PID 9175
> contrib/make.check_contrib.out-============== creating database
> "contrib_regression" ==============
> contrib/make.check_contrib.out-CREATE DATABASE
> contrib/make.check_contrib.out-ALTER DATABASE
> contrib/make.check_contrib.out-============== running regression test
> queries        ==============
> contrib/make.check_contrib.out:test ddl                          ...
> FAILED      840 ms
> contrib/make.check_contrib.out-test xact                         ... ok
>           24 ms
> contrib/make.check_contrib.out-test rewrite                      ... ok
>          187 ms
> contrib/make.check_contrib.out-test toast                        ... ok
>          851 ms
> contrib/make.check_contrib.out-test permissions                  ... ok
>           26 ms
> contrib/make.check_contrib.out-test decoding_in_xact             ... ok
>           31 ms
> contrib/make.check_contrib.out-test decoding_into_rel            ... ok
>           25 ms
> contrib/make.check_contrib.out-test binary                       ... ok
>           12 ms
>
> Otherwise patches apply and build OK so will go run some tests...

Thanks


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Erik Rijkers
Дата:
On 2020-04-23 05:24, Dilip Kumar wrote:
> On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er@xs4all.nl> wrote:
>> 
>> The 'ddl' one is apparently not quite fixed  - I get this in (cd
>> contrib; make check)' (in both assert-enabled and non-assert-enabled
>> build)
> 
> Can you send me the contrib/test_decoding/regression.diffs file?

Attached.


Below is the patch list, in case that was unclear

20200422/v15-0001-Immediately-WAL-log-assignments.patch                 +
20200422/v15-0002-Issue-individual-invalidations-with-wal_level-lo.patch+
20200422/v15-0003-Extend-the-output-plugin-API-with-stream-methods.patch+
20200422/v15-0004-Gracefully-handle-concurrent-aborts-of-uncommitt.patch+
20200422/v15-0005-Implement-streaming-mode-in-ReorderBuffer.patch       +
20200422/v15-0006-Add-support-for-streaming-to-built-in-replicatio.patch+
20200422/v15-0007-Track-statistics-for-streaming.patch                  +
20200422/v15-0008-Enable-streaming-for-all-subscription-TAP-tests.patch +
20200422/v15-0009-Add-TAP-test-for-streaming-vs.-DDL.patch              +
20200422/v15-0010-Bugfix-handling-of-incomplete-toast-tuple.patch       +
20200422/v15-0011-Provide-new-api-to-get-the-streaming-changes.patch    +
20200414/bugfix_in_schema_sent.patch



>> grep -A7 -B7 make.check_contrib.out:
>> 
>> contrib/make.check_contrib.out-============== initializing database
>> system           ==============
>> contrib/make.check_contrib.out-============== starting postmaster
>>              ==============
>> contrib/make.check_contrib.out-running on port 64464 with PID 9175
>> contrib/make.check_contrib.out-============== creating database
>> "contrib_regression" ==============
>> contrib/make.check_contrib.out-CREATE DATABASE
>> contrib/make.check_contrib.out-ALTER DATABASE
>> contrib/make.check_contrib.out-============== running regression test
>> queries        ==============
>> contrib/make.check_contrib.out:test ddl                          ...
>> FAILED      840 ms
>> contrib/make.check_contrib.out-test xact                         ... 
>> ok
>>           24 ms
>> contrib/make.check_contrib.out-test rewrite                      ... 
>> ok
>>          187 ms
>> contrib/make.check_contrib.out-test toast                        ... 
>> ok
>>          851 ms
>> contrib/make.check_contrib.out-test permissions                  ... 
>> ok
>>           26 ms
>> contrib/make.check_contrib.out-test decoding_in_xact             ... 
>> ok
>>           31 ms
>> contrib/make.check_contrib.out-test decoding_into_rel            ... 
>> ok
>>           25 ms
>> contrib/make.check_contrib.out-test binary                       ... 
>> ok
>>           12 ms
>> 
>> Otherwise patches apply and build OK so will go run some tests...
> 
> Thanks
> 
> 
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Thu, Apr 23, 2020 at 2:28 PM Erik Rijkers <er@xs4all.nl> wrote:
>
> On 2020-04-23 05:24, Dilip Kumar wrote:
> > On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er@xs4all.nl> wrote:
> >>
> >> The 'ddl' one is apparently not quite fixed  - I get this in (cd
> >> contrib; make check)' (in both assert-enabled and non-assert-enabled
> >> build)
> >
> > Can you send me the contrib/test_decoding/regression.diffs file?
>
> Attached.

So from regression.diffs, it appears that it is failing in a memory
allocation (+ERROR:  invalid memory alloc request size
94119198201896).  My colleague tried to reproduce this in a different
environment but has had no success so far.  One more thing that
surprises me is that after
(v15-0011-Provide-new-api-to-get-the-streaming-changes.patch)
it should never take the streaming path at all.  However, we cannot
ignore the fact that some of the changes might impact the
non-streaming path as well.  Is it possible for you to somehow stop or
break the code and send the stack trace?  One idea: from the log we can
see where the error is raised, i.e. MemoryContextAlloc or palloc or
some other similar function.  Once we know that, we can
convert that error to an assert and find the call stack.
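
(To make that trick concrete: it amounts to a temporary, local hack in
MemoryContextAlloc, src/backend/utils/mmgr/mcxt.c, along these lines; it
needs an assert-enabled build and is only meant to make the backend dump
core at the offending call site so a backtrace can be captured.)

    if (!AllocSizeIsValid(size))
    {
        /* temporary debugging hack: crash here to get a usable backtrace */
        Assert(false);
        elog(ERROR, "invalid memory alloc request size %zu", size);
    }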

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Fri, Apr 17, 2020 at 1:40 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
> On Tue, Apr 14, 2020 at 3:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
>
> Few review comments from 0006-Add-support-for-streaming*.patch
>
> + subxacts[nsubxacts].offset = lseek(stream_fd, 0, SEEK_END);
> lseek can return (-)ve value in case of error, right?
>
> + /*
> + * We might need to create the tablespace's tempfile directory, if no
> + * one has yet done so.
> + *
> + * Don't check for error from mkdir; it could fail if the directory
> + * already exists (maybe someone else just did the same thing).  If
> + * it doesn't work then we'll bomb out when opening the file
> + */
> + mkdir(tempdirpath, S_IRWXU);
> If that's the only reason, perhaps we can use something like following:
>
> if (mkdir(tempdirpath, S_IRWXU) < 0 && errno != EEXIST)
> throw error;

Done

>
> +
> + CloseTransientFile(stream_fd);
> Might failed to close the file. We should handle the case.

Changed

One place is still pending because I don't have the filename there to
report in the error.  One option is to just raise the error without the
filename.  I will think about this part some more.

> Also, I think we need some implementations in dumpSubscription() to
> dump the (streaming = 'on') option.

Right; I have created another patch for that and attached it.

I have also fixed a couple of bugs internally reported by my colleague
Neha Sharma.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Amit Kapila
Дата:
On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have also fixed a couple of bugs internally reported by my colleague
> Neha Sharma.
>

I think it would be good if you could briefly explain what the bugs
were and how you fixed them.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Mon, Apr 27, 2020 at 4:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have also fixed a couple of bugs internally reported by my colleague
> > Neha Sharma.
> >
>
> I think it would be good if you can briefly explain what were the bugs
> and how you fixed those?

Issue 1: If the concurrent transaction was aborted, then in the CATCH
block we were not freeing the memory of the toast hash, and that was
tripping the assertion that txn->size is 0 after the stream is complete.

Issue 2: After streaming is complete we set txn->final_lsn from a value
we remember in a local variable.  But mistakenly it was remembered in a
variable local to the TRY block, so on a concurrent abort the CATCH
block always saw the value as zero.  As a result final_lsn ended up as 0
after streaming, and that was asserting.
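
Roughly, the fix for Issue 2 has the following shape.  This is a
simplified sketch rather than the patch's actual code; the streaming body
is elided and last_streamed_lsn is a placeholder name, but the point is
that the variable feeding txn->final_lsn must live outside the PG_TRY
block (and be volatile, since it is modified inside it) so the PG_CATCH
path still sees the value:

    volatile XLogRecPtr last_streamed_lsn = InvalidXLogRecPtr;

    PG_TRY();
    {
        /* ... decode and stream the changes, advancing last_streamed_lsn ... */
    }
    PG_CATCH();
    {
        /*
         * Concurrent abort: we still know how far we streamed.  A variable
         * declared inside the TRY block would always read as zero here.
         */
        txn->final_lsn = last_streamed_lsn;
        /* ... handle the expected concurrent-abort error, or re-throw ... */
    }
    PG_END_TRY();

    txn->final_lsn = last_streamed_lsn;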

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Amit Kapila
Дата:
On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
[latest patches]

v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
-     Any actions leading to transaction ID assignment are prohibited.
That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the
<literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment
are prohibited. That, among others,
..
@@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
  bool valid;

  /*
+ * We don't expect direct calls to heap_fetch with valid
+ * CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
+ elog(ERROR, "unexpected heap_fetch call during logical decoding");
+

I think the comments and code don't match.  In the comment, we say that
access from output plugins to user catalog tables or regular system
catalog tables won't be allowed via the heap_* APIs, but the code
doesn't seem to reflect that.  I feel only the
TransactionIdIsValid(CheckXidAlive) check is sufficient here.  See the
original discussion about this point [1] (refer to "I think it'd also be
good to add assertions to codepaths not going through systable_*
asserting that ...").

Isn't it better to block scans of user catalog tables or regular system
catalog tables at the tableam scan API level rather than at the heap
level?  There might be some APIs like heap_getnext where such a check is
still required, but I guess it is still better to block at the tableam
level.

[1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de
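
To illustrate the suggestion (this is not the patch itself), a guard at
the tableam level could look roughly like the following, reusing the
patch's CheckXidAlive and the existing table_scan_getnextslot() wrapper
in src/include/access/tableam.h, so the check does not have to be
duplicated in every heap-level routine:

    static inline bool
    table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction,
                           TupleTableSlot *slot)
    {
        /* sketch: block non-systable access while decoding in-progress xacts */
        if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
                     !(IsCatalogRelation(sscan->rs_rd) ||
                       RelationIsUsedAsCatalogTable(sscan->rs_rd))))
            elog(ERROR, "unexpected table scan during logical decoding");

        return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
    }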

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> [latest patches]
>
> v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> -     Any actions leading to transaction ID assignment are prohibited.
> That, among others,
> +     Note that access to user catalog tables or regular system catalog tables
> +     in the output plugins has to be done via the
> <literal>systable_*</literal> scan APIs only.
> +     Access via the <literal>heap_*</literal> scan APIs will error out.
> +     Additionally, any actions leading to transaction ID assignment
> are prohibited. That, among others,
> ..
> @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
>   bool valid;
>
>   /*
> + * We don't expect direct calls to heap_fetch with valid
> + * CheckXidAlive for regular tables. Track that below.
> + */
> + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
> + elog(ERROR, "unexpected heap_fetch call during logical decoding");
> +
>
> I think comments and code don't match.  In the comment, we are saying
> that via output plugins access to user catalog tables or regular
> system catalog tables won't be allowed via heap_* APIs but code
> doesn't seem to reflect it.  I feel only
> TransactionIdIsValid(CheckXidAlive) is sufficient here.  See, the
> original discussion about this point [1] (Refer "I think it'd also be
> good to add assertions to codepaths not going through systable_*
> asserting that ...").

Right.  So I think we can just add an assert in these functions:
Assert(!TransactionIdIsValid(CheckXidAlive))?

>
> Isn't it better to block the scan to user catalog tables or regular
> system catalog tables for tableam scan APIs rather than at the heap
> level?  There might be some APIs like heap_getnext where such a check
> might still be required but I guess it is still better to block at
> tableam level.
>
> [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de

Okay, let me analyze this part.  Somewhere we have to keep the check at
the heap level, like heap_getnext, and at other places at the tableam
level, so it seems a bit inconsistent.  Also, I think the number of
checks might increase because some of the heap functions, like
heap_hot_search_buffer, are called from multiple tableam calls, so we
would need to put the check in every such place.

Another point is that I feel some of the checks we have today might not
be required.  For example, heap_finish_speculative is not fetching any
tuple for us, so why do we need to care about that function?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Amit Kapila
Дата:
On Tue, Apr 28, 2020 at 3:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > [latest patches]
> >
> > v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> > -     Any actions leading to transaction ID assignment are prohibited.
> > That, among others,
> > +     Note that access to user catalog tables or regular system catalog tables
> > +     in the output plugins has to be done via the
> > <literal>systable_*</literal> scan APIs only.
> > +     Access via the <literal>heap_*</literal> scan APIs will error out.
> > +     Additionally, any actions leading to transaction ID assignment
> > are prohibited. That, among others,
> > ..
> > @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
> >   bool valid;
> >
> >   /*
> > + * We don't expect direct calls to heap_fetch with valid
> > + * CheckXidAlive for regular tables. Track that below.
> > + */
> > + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> > + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
> > + elog(ERROR, "unexpected heap_fetch call during logical decoding");
> > +
> >
> > I think comments and code don't match.  In the comment, we are saying
> > that via output plugins access to user catalog tables or regular
> > system catalog tables won't be allowed via heap_* APIs but code
> > doesn't seem to reflect it.  I feel only
> > TransactionIdIsValid(CheckXidAlive) is sufficient here.  See, the
> > original discussion about this point [1] (Refer "I think it'd also be
> > good to add assertions to codepaths not going through systable_*
> > asserting that ...").
>
> Right,  So I think we can just add an assert in these function that
> Assert(!TransactionIdIsValid(CheckXidAlive)) ?
>

I am fine with an Assertion, but update the documentation accordingly.
However, I think you should first cross-verify whether any existing
output plugins are already using such APIs.  There is a list of "Logical
Decoding Plugins" on the wiki [1]; just look through those once.

> >
> > Isn't it better to block the scan to user catalog tables or regular
> > system catalog tables for tableam scan APIs rather than at the heap
> > level?  There might be some APIs like heap_getnext where such a check
> > might still be required but I guess it is still better to block at
> > tableam level.
> >
> > [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de
>
> Okay, let me analyze this part.  Because someplace we have to keep at
> heap level like heap_getnext and other places at tableam level so it
> seems a bit inconsistent.  Also, I think the number of checks might
> going to increase because some of the heap functions like
> heap_hot_search_buffer are being called from multiple tableam calls,
> so we need to put check at every place.
>
> Another point is that I feel some of the checks what we have today
> might not be required like heap_finish_speculative, is not fetching
> any tuple for us so why do we need to care about this function?
>

Yeah, I don't see the need for such a check (or Assertion) in
heap_finish_speculative.

One additional comment:
---------------------------------------
-     Any actions leading to transaction ID assignment are prohibited.
That, among others,
+     Note that access to user catalog tables or regular system catalog tables
+     in the output plugins has to be done via the
<literal>systable_*</literal> scan APIs only.
+     Access via the <literal>heap_*</literal> scan APIs will error out.
+     Additionally, any actions leading to transaction ID assignment
are prohibited. That, among others,

The above text doesn't seem to be aligned properly, and you will need to
update it if we change the error to an Assertion for the heap APIs.

[1] - https://wiki.postgresql.org/wiki/Logical_Decoding_Plugins

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Mahendra Singh Thalor
Дата:
On Fri, 24 Apr 2020 at 11:55, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Apr 23, 2020 at 2:28 PM Erik Rijkers <er@xs4all.nl> wrote:
> >
> > On 2020-04-23 05:24, Dilip Kumar wrote:
> > > On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er@xs4all.nl> wrote:
> > >>
> > >> The 'ddl' one is apparently not quite fixed  - I get this in (cd
> > >> contrib; make check)' (in both assert-enabled and non-assert-enabled
> > >> build)
> > >
> > > Can you send me the contrib/test_decoding/regression.diffs file?
> >
> > Attached.
>
> So from regression.diff, it appears that in failing in memory
> allocation (+ERROR:  invalid memory alloc request size
> 94119198201896).  My colleague tried to reproduce this in a different
> environment but there is no success so far.  One more thing surprises
> me is that after
> (v15-0011-Provide-new-api-to-get-the-streaming-changes.patch)
> actually, it should never go for the streaming path. However, we can
> not ignore the fact that some of the changes might impact the
> non-streaming path as well.  Is it possible for you to somehow stop or
> break the code and send the stack trace?  One idea is by seeing the
> log we can see from where the error is raised i.e MemoryContextAlloc
> or palloc or some other similar function.  Once we know that we can
> convert that error to an assert and find the call stack.
>
> --

Thanks, Erik, for reporting this issue.

I am able to reproduce this issue (+ERROR:  invalid memory alloc
request size) on top of the v16 patch set.  I applied all 12 patches of
the v16 series and then ran "make check -i" from the
"contrib/test_decoding" folder.  Below is the stack trace of the error:

#0 0x0000560b1350902d in MemoryContextAlloc (context=0x560b14188d70,
size=94605581787992) at mcxt.c:806
#1 0x0000560b130f0ad5 in ReorderBufferRestoreChange
(rb=0x560b14188e90, txn=0x560b141baf08, data=0x560b1418a5e8 "K") at
reorderbuffer.c:3680
#2 0x0000560b130f0662 in ReorderBufferRestoreChanges
(rb=0x560b14188e90, txn=0x560b141baf08, file=0x560b1418ad10,
segno=0x560b1418ad20) at reorderbuffer.c:3564
#3 0x0000560b130e918a in ReorderBufferIterTXNInit (rb=0x560b14188e90,
txn=0x560b141baf08, iter_state=0x7ffef18b1600) at reorderbuffer.c:1186
#4 0x0000560b130eaee1 in ReorderBufferProcessTXN (rb=0x560b14188e90,
txn=0x560b141baf08, commit_lsn=25986584, snapshot_now=0x560b141b74d8,
command_id=0, streaming=false)
at reorderbuffer.c:1785
#5 0x0000560b130ecae1 in ReorderBufferCommit (rb=0x560b14188e90,
xid=508, commit_lsn=25986584, end_lsn=25989088,
commit_time=641449268431600, origin_id=0, origin_lsn=0)
at reorderbuffer.c:2315
#6 0x0000560b130d14a1 in DecodeCommit (ctx=0x560b1416ea80,
buf=0x7ffef18b19b0, parsed=0x7ffef18b1850, xid=508) at decode.c:654
#7 0x0000560b130cff98 in DecodeXactOp (ctx=0x560b1416ea80,
buf=0x7ffef18b19b0) at decode.c:261
#8 0x0000560b130cf99a in LogicalDecodingProcessRecord
(ctx=0x560b1416ea80, record=0x560b1416ee00) at decode.c:130
#9 0x0000560b130dbbbc in pg_logical_slot_get_changes_guts
(fcinfo=0x560b1417ee50, confirm=true, binary=false, streaming=false)
at logicalfuncs.c:285
#10 0x0000560b130dbe71 in pg_logical_slot_get_changes
(fcinfo=0x560b1417ee50) at logicalfuncs.c:354
#11 0x0000560b12e294d4 in ExecMakeTableFunctionResult
(setexpr=0x560b14177838, econtext=0x560b14177748,
argContext=0x560b1417ed30, expectedDesc=0x560b141814a0,
randomAccess=false) at execSRF.c:234
#12 0x0000560b12e5490f in FunctionNext (node=0x560b14177630) at
nodeFunctionscan.c:94
#13 0x0000560b12e2c108 in ExecScanFetch (node=0x560b14177630,
accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15
<FunctionRecheck>) at execScan.c:133
#14 0x0000560b12e2c227 in ExecScan (node=0x560b14177630,
accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15
<FunctionRecheck>) at execScan.c:199
#15 0x0000560b12e54e9b in ExecFunctionScan (pstate=0x560b14177630) at
nodeFunctionscan.c:270
#16 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14177630) at
execProcnode.c:450
#17 0x0000560b12e3e172 in ExecProcNode (node=0x560b14177630) at
../../../src/include/executor/executor.h:245
#18 0x0000560b12e3e998 in fetch_input_tuple (aggstate=0x560b14176f40)
at nodeAgg.c:566
#19 0x0000560b12e4398f in agg_fill_hash_table
(aggstate=0x560b14176f40) at nodeAgg.c:2518
#20 0x0000560b12e42c9a in ExecAgg (pstate=0x560b14176f40) at nodeAgg.c:2139
#21 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176f40) at
execProcnode.c:450
#22 0x0000560b12e8bb58 in ExecProcNode (node=0x560b14176f40) at
../../../src/include/executor/executor.h:245
#23 0x0000560b12e8bd59 in ExecSort (pstate=0x560b14176d28) at nodeSort.c:108
#24 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176d28) at
execProcnode.c:450
#25 0x0000560b12e10e71 in ExecProcNode (node=0x560b14176d28) at
../../../src/include/executor/executor.h:245
#26 0x0000560b12e15c4c in ExecutePlan (estate=0x560b14176af0,
planstate=0x560b14176d28, use_parallel_mode=false,
operation=CMD_SELECT, sendTuples=true, numberTuples=0,
direction=ForwardScanDirection, dest=0x560b1419d188,
execute_once=true) at execMain.c:1646
#27 0x0000560b12e11a19 in standard_ExecutorRun
(queryDesc=0x560b1412db10, direction=ForwardScanDirection, count=0,
execute_once=true) at execMain.c:364
#28 0x0000560b12e116e1 in ExecutorRun (queryDesc=0x560b1412db10,
direction=ForwardScanDirection, count=0, execute_once=true) at
execMain.c:308
#29 0x0000560b131f2177 in PortalRunSelect (portal=0x560b140db860,
forward=true, count=0, dest=0x560b1419d188) at pquery.c:912
#30 0x0000560b131f1b14 in PortalRun (portal=0x560b140db860,
count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x560b1419d188, altdest=0x560b1419d188,
qc=0x7ffef18b2350) at pquery.c:756
#31 0x0000560b131e550b in exec_simple_query (
query_string=0x560b14076720 "/ display results, but hide most of the
output /\nSELECT count(*), min(data), max(data)\nFROM
pg_logical_slot_get_changes('regression_slot', NULL, NULL,
'include-xids', '0', 'skip-empty-xacts', '1')\nG"...) at
postgres.c:1239
#32 0x0000560b131ee343 in PostgresMain (argc=1, argv=0x560b1409faa0,
dbname=0x560b1409f858 "contrib_regression", username=0x560b1409f830
"mahendrathalor") at postgres.c:4315
#33 0x0000560b130a325b in BackendRun (port=0x560b14096880) at postmaster.c:4510
#34 0x0000560b130a22c3 in BackendStartup (port=0x560b14096880) at
postmaster.c:4202
#35 0x0000560b1309a5cc in ServerLoop () at postmaster.c:1727
#36 0x0000560b130997c9 in PostmasterMain (argc=8, argv=0x560b1406f010)
at postmaster.c:1400
#37 0x0000560b12ee9530 in main (argc=8, argv=0x560b1406f010) at main.c:210

I have an Ubuntu setup.  I think this is reproducing only on Ubuntu.  I
am looking into this issue with Dilip.

-- 
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Mahendra Singh Thalor
Дата:
On Wed, 29 Apr 2020 at 11:15, Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
>
> On Fri, 24 Apr 2020 at 11:55, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Apr 23, 2020 at 2:28 PM Erik Rijkers <er@xs4all.nl> wrote:
> > >
> > > On 2020-04-23 05:24, Dilip Kumar wrote:
> > > > On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er@xs4all.nl> wrote:
> > > >>
> > > >> The 'ddl' one is apparently not quite fixed  - I get this in (cd
> > > >> contrib; make check)' (in both assert-enabled and non-assert-enabled
> > > >> build)
> > > >
> > > > Can you send me the contrib/test_decoding/regression.diffs file?
> > >
> > > Attached.
> >
> > So from regression.diff, it appears that in failing in memory
> > allocation (+ERROR:  invalid memory alloc request size
> > 94119198201896).  My colleague tried to reproduce this in a different
> > environment but there is no success so far.  One more thing surprises
> > me is that after
> > (v15-0011-Provide-new-api-to-get-the-streaming-changes.patch)
> > actually, it should never go for the streaming path. However, we can
> > not ignore the fact that some of the changes might impact the
> > non-streaming path as well.  Is it possible for you to somehow stop or
> > break the code and send the stack trace?  One idea is by seeing the
> > log we can see from where the error is raised i.e MemoryContextAlloc
> > or palloc or some other similar function.  Once we know that we can
> > convert that error to an assert and find the call stack.
> >
> > --
>
> Thanks Erik for reporting this issue.
>
> I am able to reproduce this issue(+ERROR:  invalid memory alloc
> request size) on the top of v16 patch set. I applied all patches(12
> patches) of v16 series and then I fired "make check -i" from
> "contrib/test_decoding" folder. Below is stack trace of error:
>
> #0 0x0000560b1350902d in MemoryContextAlloc (context=0x560b14188d70,
> size=94605581787992) at mcxt.c:806
> #1 0x0000560b130f0ad5 in ReorderBufferRestoreChange
> (rb=0x560b14188e90, txn=0x560b141baf08, data=0x560b1418a5e8 "K") at
> reorderbuffer.c:3680
> #2 0x0000560b130f0662 in ReorderBufferRestoreChanges
> (rb=0x560b14188e90, txn=0x560b141baf08, file=0x560b1418ad10,
> segno=0x560b1418ad20) at reorderbuffer.c:3564
> #3 0x0000560b130e918a in ReorderBufferIterTXNInit (rb=0x560b14188e90,
> txn=0x560b141baf08, iter_state=0x7ffef18b1600) at reorderbuffer.c:1186
> #4 0x0000560b130eaee1 in ReorderBufferProcessTXN (rb=0x560b14188e90,
> txn=0x560b141baf08, commit_lsn=25986584, snapshot_now=0x560b141b74d8,
> command_id=0, streaming=false)
> at reorderbuffer.c:1785
> #5 0x0000560b130ecae1 in ReorderBufferCommit (rb=0x560b14188e90,
> xid=508, commit_lsn=25986584, end_lsn=25989088,
> commit_time=641449268431600, origin_id=0, origin_lsn=0)
> at reorderbuffer.c:2315
> #6 0x0000560b130d14a1 in DecodeCommit (ctx=0x560b1416ea80,
> buf=0x7ffef18b19b0, parsed=0x7ffef18b1850, xid=508) at decode.c:654
> #7 0x0000560b130cff98 in DecodeXactOp (ctx=0x560b1416ea80,
> buf=0x7ffef18b19b0) at decode.c:261
> #8 0x0000560b130cf99a in LogicalDecodingProcessRecord
> (ctx=0x560b1416ea80, record=0x560b1416ee00) at decode.c:130
> #9 0x0000560b130dbbbc in pg_logical_slot_get_changes_guts
> (fcinfo=0x560b1417ee50, confirm=true, binary=false, streaming=false)
> at logicalfuncs.c:285
> #10 0x0000560b130dbe71 in pg_logical_slot_get_changes
> (fcinfo=0x560b1417ee50) at logicalfuncs.c:354
> #11 0x0000560b12e294d4 in ExecMakeTableFunctionResult
> (setexpr=0x560b14177838, econtext=0x560b14177748,
> argContext=0x560b1417ed30, expectedDesc=0x560b141814a0,
> randomAccess=false) at execSRF.c:234
> #12 0x0000560b12e5490f in FunctionNext (node=0x560b14177630) at
> nodeFunctionscan.c:94
> #13 0x0000560b12e2c108 in ExecScanFetch (node=0x560b14177630,
> accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15
> <FunctionRecheck>) at execScan.c:133
> #14 0x0000560b12e2c227 in ExecScan (node=0x560b14177630,
> accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15
> <FunctionRecheck>) at execScan.c:199
> #15 0x0000560b12e54e9b in ExecFunctionScan (pstate=0x560b14177630) at
> nodeFunctionscan.c:270
> #16 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14177630) at
> execProcnode.c:450
> #17 0x0000560b12e3e172 in ExecProcNode (node=0x560b14177630) at
> ../../../src/include/executor/executor.h:245
> #18 0x0000560b12e3e998 in fetch_input_tuple (aggstate=0x560b14176f40)
> at nodeAgg.c:566
> #19 0x0000560b12e4398f in agg_fill_hash_table
> (aggstate=0x560b14176f40) at nodeAgg.c:2518
> #20 0x0000560b12e42c9a in ExecAgg (pstate=0x560b14176f40) at nodeAgg.c:2139
> #21 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176f40) at
> execProcnode.c:450
> #22 0x0000560b12e8bb58 in ExecProcNode (node=0x560b14176f40) at
> ../../../src/include/executor/executor.h:245
> #23 0x0000560b12e8bd59 in ExecSort (pstate=0x560b14176d28) at nodeSort.c:108
> #24 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176d28) at
> execProcnode.c:450
> #25 0x0000560b12e10e71 in ExecProcNode (node=0x560b14176d28) at
> ../../../src/include/executor/executor.h:245
> #26 0x0000560b12e15c4c in ExecutePlan (estate=0x560b14176af0,
> planstate=0x560b14176d28, use_parallel_mode=false,
> operation=CMD_SELECT, sendTuples=true, numberTuples=0,
> direction=ForwardScanDirection, dest=0x560b1419d188,
> execute_once=true) at execMain.c:1646
> #27 0x0000560b12e11a19 in standard_ExecutorRun
> (queryDesc=0x560b1412db10, direction=ForwardScanDirection, count=0,
> execute_once=true) at execMain.c:364
> #28 0x0000560b12e116e1 in ExecutorRun (queryDesc=0x560b1412db10,
> direction=ForwardScanDirection, count=0, execute_once=true) at
> execMain.c:308
> #29 0x0000560b131f2177 in PortalRunSelect (portal=0x560b140db860,
> forward=true, count=0, dest=0x560b1419d188) at pquery.c:912
> #30 0x0000560b131f1b14 in PortalRun (portal=0x560b140db860,
> count=9223372036854775807, isTopLevel=true, run_once=true,
> dest=0x560b1419d188, altdest=0x560b1419d188,
> qc=0x7ffef18b2350) at pquery.c:756
> #31 0x0000560b131e550b in exec_simple_query (
> query_string=0x560b14076720 "/ display results, but hide most of the
> output /\nSELECT count(*), min(data), max(data)\nFROM
> pg_logical_slot_get_changes('regression_slot', NULL, NULL,
> 'include-xids', '0', 'skip-empty-xacts', '1')\nG"...) at
> postgres.c:1239
> #32 0x0000560b131ee343 in PostgresMain (argc=1, argv=0x560b1409faa0,
> dbname=0x560b1409f858 "contrib_regression", username=0x560b1409f830
> "mahendrathalor") at postgres.c:4315
> #33 0x0000560b130a325b in BackendRun (port=0x560b14096880) at postmaster.c:4510
> #34 0x0000560b130a22c3 in BackendStartup (port=0x560b14096880) at
> postmaster.c:4202
> #35 0x0000560b1309a5cc in ServerLoop () at postmaster.c:1727
> #36 0x0000560b130997c9 in PostmasterMain (argc=8, argv=0x560b1406f010)
> at postmaster.c:1400
> #37 0x0000560b12ee9530 in main (argc=8, argv=0x560b1406f010) at main.c:210
>
> I have Ubuntu setup. I think, this is reproducing into Ubuntu only. I
> am looking into this issue with Dilip.

This error is due to an invalid allocation size: when restoring an
invalidation change from disk, the code allocated
change->data.msg.message_size bytes (a field of a different union
member) instead of the inval_size that was just read.

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index eed9a5048b..487c1b4252 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -3678,7 +3678,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 				change->data.inval.invalidations =
 					MemoryContextAlloc(rb->context,
-									   change->data.msg.message_size);
+									   inval_size);
 				/* read the message */
 				memcpy(change->data.inval.invalidations, data, inval_size);
 				data += inval_size;

The above change fixes the error.  Thanks, Dilip, for helping.

-- 
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Wed, Apr 29, 2020 at 12:37 PM Mahendra Singh Thalor
<mahi6run@gmail.com> wrote:
>
> On Wed, 29 Apr 2020 at 11:15, Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
> >
> > On Fri, 24 Apr 2020 at 11:55, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Thu, Apr 23, 2020 at 2:28 PM Erik Rijkers <er@xs4all.nl> wrote:
> > > >
> > > > On 2020-04-23 05:24, Dilip Kumar wrote:
> > > > > On Wed, Apr 22, 2020 at 9:31 PM Erik Rijkers <er@xs4all.nl> wrote:
> > > > >>
> > > > >> The 'ddl' one is apparently not quite fixed  - I get this in (cd
> > > > >> contrib; make check)' (in both assert-enabled and non-assert-enabled
> > > > >> build)
> > > > >
> > > > > Can you send me the contrib/test_decoding/regression.diffs file?
> > > >
> > > > Attached.
> > >
> > > So from regression.diff, it appears that in failing in memory
> > > allocation (+ERROR:  invalid memory alloc request size
> > > 94119198201896).  My colleague tried to reproduce this in a different
> > > environment but there is no success so far.  One more thing surprises
> > > me is that after
> > > (v15-0011-Provide-new-api-to-get-the-streaming-changes.patch)
> > > actually, it should never go for the streaming path. However, we can
> > > not ignore the fact that some of the changes might impact the
> > > non-streaming path as well.  Is it possible for you to somehow stop or
> > > break the code and send the stack trace?  One idea is by seeing the
> > > log we can see from where the error is raised i.e MemoryContextAlloc
> > > or palloc or some other similar function.  Once we know that we can
> > > convert that error to an assert and find the call stack.
> > >
> > > --
> >
> > Thanks Erik for reporting this issue.
> >
> > I am able to reproduce this issue(+ERROR:  invalid memory alloc
> > request size) on the top of v16 patch set. I applied all patches(12
> > patches) of v16 series and then I fired "make check -i" from
> > "contrib/test_decoding" folder. Below is stack trace of error:
> >
> > #0 0x0000560b1350902d in MemoryContextAlloc (context=0x560b14188d70,
> > size=94605581787992) at mcxt.c:806
> > #1 0x0000560b130f0ad5 in ReorderBufferRestoreChange
> > (rb=0x560b14188e90, txn=0x560b141baf08, data=0x560b1418a5e8 "K") at
> > reorderbuffer.c:3680
> > #2 0x0000560b130f0662 in ReorderBufferRestoreChanges
> > (rb=0x560b14188e90, txn=0x560b141baf08, file=0x560b1418ad10,
> > segno=0x560b1418ad20) at reorderbuffer.c:3564
> > #3 0x0000560b130e918a in ReorderBufferIterTXNInit (rb=0x560b14188e90,
> > txn=0x560b141baf08, iter_state=0x7ffef18b1600) at reorderbuffer.c:1186
> > #4 0x0000560b130eaee1 in ReorderBufferProcessTXN (rb=0x560b14188e90,
> > txn=0x560b141baf08, commit_lsn=25986584, snapshot_now=0x560b141b74d8,
> > command_id=0, streaming=false)
> > at reorderbuffer.c:1785
> > #5 0x0000560b130ecae1 in ReorderBufferCommit (rb=0x560b14188e90,
> > xid=508, commit_lsn=25986584, end_lsn=25989088,
> > commit_time=641449268431600, origin_id=0, origin_lsn=0)
> > at reorderbuffer.c:2315
> > #6 0x0000560b130d14a1 in DecodeCommit (ctx=0x560b1416ea80,
> > buf=0x7ffef18b19b0, parsed=0x7ffef18b1850, xid=508) at decode.c:654
> > #7 0x0000560b130cff98 in DecodeXactOp (ctx=0x560b1416ea80,
> > buf=0x7ffef18b19b0) at decode.c:261
> > #8 0x0000560b130cf99a in LogicalDecodingProcessRecord
> > (ctx=0x560b1416ea80, record=0x560b1416ee00) at decode.c:130
> > #9 0x0000560b130dbbbc in pg_logical_slot_get_changes_guts
> > (fcinfo=0x560b1417ee50, confirm=true, binary=false, streaming=false)
> > at logicalfuncs.c:285
> > #10 0x0000560b130dbe71 in pg_logical_slot_get_changes
> > (fcinfo=0x560b1417ee50) at logicalfuncs.c:354
> > #11 0x0000560b12e294d4 in ExecMakeTableFunctionResult
> > (setexpr=0x560b14177838, econtext=0x560b14177748,
> > argContext=0x560b1417ed30, expectedDesc=0x560b141814a0,
> > randomAccess=false) at execSRF.c:234
> > #12 0x0000560b12e5490f in FunctionNext (node=0x560b14177630) at
> > nodeFunctionscan.c:94
> > #13 0x0000560b12e2c108 in ExecScanFetch (node=0x560b14177630,
> > accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15
> > <FunctionRecheck>) at execScan.c:133
> > #14 0x0000560b12e2c227 in ExecScan (node=0x560b14177630,
> > accessMtd=0x560b12e54836 <FunctionNext>, recheckMtd=0x560b12e54e15
> > <FunctionRecheck>) at execScan.c:199
> > #15 0x0000560b12e54e9b in ExecFunctionScan (pstate=0x560b14177630) at
> > nodeFunctionscan.c:270
> > #16 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14177630) at
> > execProcnode.c:450
> > #17 0x0000560b12e3e172 in ExecProcNode (node=0x560b14177630) at
> > ../../../src/include/executor/executor.h:245
> > #18 0x0000560b12e3e998 in fetch_input_tuple (aggstate=0x560b14176f40)
> > at nodeAgg.c:566
> > #19 0x0000560b12e4398f in agg_fill_hash_table
> > (aggstate=0x560b14176f40) at nodeAgg.c:2518
> > #20 0x0000560b12e42c9a in ExecAgg (pstate=0x560b14176f40) at nodeAgg.c:2139
> > #21 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176f40) at
> > execProcnode.c:450
> > #22 0x0000560b12e8bb58 in ExecProcNode (node=0x560b14176f40) at
> > ../../../src/include/executor/executor.h:245
> > #23 0x0000560b12e8bd59 in ExecSort (pstate=0x560b14176d28) at nodeSort.c:108
> > #24 0x0000560b12e24e23 in ExecProcNodeFirst (node=0x560b14176d28) at
> > execProcnode.c:450
> > #25 0x0000560b12e10e71 in ExecProcNode (node=0x560b14176d28) at
> > ../../../src/include/executor/executor.h:245
> > #26 0x0000560b12e15c4c in ExecutePlan (estate=0x560b14176af0,
> > planstate=0x560b14176d28, use_parallel_mode=false,
> > operation=CMD_SELECT, sendTuples=true, numberTuples=0,
> > direction=ForwardScanDirection, dest=0x560b1419d188,
> > execute_once=true) at execMain.c:1646
> > #27 0x0000560b12e11a19 in standard_ExecutorRun
> > (queryDesc=0x560b1412db10, direction=ForwardScanDirection, count=0,
> > execute_once=true) at execMain.c:364
> > #28 0x0000560b12e116e1 in ExecutorRun (queryDesc=0x560b1412db10,
> > direction=ForwardScanDirection, count=0, execute_once=true) at
> > execMain.c:308
> > #29 0x0000560b131f2177 in PortalRunSelect (portal=0x560b140db860,
> > forward=true, count=0, dest=0x560b1419d188) at pquery.c:912
> > #30 0x0000560b131f1b14 in PortalRun (portal=0x560b140db860,
> > count=9223372036854775807, isTopLevel=true, run_once=true,
> > dest=0x560b1419d188, altdest=0x560b1419d188,
> > qc=0x7ffef18b2350) at pquery.c:756
> > #31 0x0000560b131e550b in exec_simple_query (
> > query_string=0x560b14076720 "/* display results, but hide most of the
> > output */\nSELECT count(*), min(data), max(data)\nFROM
> > pg_logical_slot_get_changes('regression_slot', NULL, NULL,
> > 'include-xids', '0', 'skip-empty-xacts', '1')\nG"...) at
> > postgres.c:1239
> > #32 0x0000560b131ee343 in PostgresMain (argc=1, argv=0x560b1409faa0,
> > dbname=0x560b1409f858 "contrib_regression", username=0x560b1409f830
> > "mahendrathalor") at postgres.c:4315
> > #33 0x0000560b130a325b in BackendRun (port=0x560b14096880) at postmaster.c:4510
> > #34 0x0000560b130a22c3 in BackendStartup (port=0x560b14096880) at
> > postmaster.c:4202
> > #35 0x0000560b1309a5cc in ServerLoop () at postmaster.c:1727
> > #36 0x0000560b130997c9 in PostmasterMain (argc=8, argv=0x560b1406f010)
> > at postmaster.c:1400
> > #37 0x0000560b12ee9530 in main (argc=8, argv=0x560b1406f010) at main.c:210
> >
> > I have an Ubuntu setup.  I think this is reproducible only on Ubuntu.  I
> > am looking into this issue with Dilip.
>
> This error is due to an invalid allocation size.
>
> diff --git a/src/backend/replication/logical/reorderbuffer.c
> b/src/backend/replication/logical/reorderbuffer.c
> index eed9a5048b..487c1b4252 100644
> --- a/src/backend/replication/logical/reorderbuffer.c
> +++ b/src/backend/replication/logical/reorderbuffer.c
> @@ -3678,7 +3678,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
>
>                                 change->data.inval.invalidations =
>                                                 MemoryContextAlloc(rb->context,
> -                                                                  change->data.msg.message_size);
> +                                                                  inval_size);
>                                 /* read the message */
>                                 memcpy(change->data.inval.invalidations, data, inval_size);
>                                 data += inval_size;
> The above change fixes the error.  Thanks Dilip for helping.

Thanks, Mahendra, for reproducing this and helping to fix it.  I will
include this change in my next patch set.

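For anyone following along, after the fix the invalidation branch of
ReorderBufferRestoreChange() reads roughly as below (a paraphrased
sketch, not the exact patch text).  The bug was that the allocation used
the size field of the message union member, while the actual data length
is inval_size, computed from the number of invalidation messages:

    case REORDER_BUFFER_CHANGE_INVALIDATION:
        {
            Size        inval_size = sizeof(SharedInvalidationMessage) *
                change->data.inval.ninvalidations;

            change->data.inval.invalidations =
                MemoryContextAlloc(rb->context, inval_size);

            /* read the invalidation messages */
            memcpy(change->data.inval.invalidations, data, inval_size);
            data += inval_size;
            break;
        }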


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, Apr 28, 2020 at 3:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > [latest patches]
> >
> > v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> > -     Any actions leading to transaction ID assignment are prohibited.
> > That, among others,
> > +     Note that access to user catalog tables or regular system catalog tables
> > +     in the output plugins has to be done via the
> > <literal>systable_*</literal> scan APIs only.
> > +     Access via the <literal>heap_*</literal> scan APIs will error out.
> > +     Additionally, any actions leading to transaction ID assignment
> > are prohibited. That, among others,
> > ..
> > @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
> >   bool valid;
> >
> >   /*
> > + * We don't expect direct calls to heap_fetch with valid
> > + * CheckXidAlive for regular tables. Track that below.
> > + */
> > + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> > + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
> > + elog(ERROR, "unexpected heap_fetch call during logical decoding");
> > +
> >
> > I think comments and code don't match.  In the comment, we are saying
> > that via output plugins access to user catalog tables or regular
> > system catalog tables won't be allowed via heap_* APIs but code
> > doesn't seem to reflect it.  I feel only
> > TransactionIdIsValid(CheckXidAlive) is sufficient here.  See, the
> > original discussion about this point [1] (Refer "I think it'd also be
> > good to add assertions to codepaths not going through systable_*
> > asserting that ...").
>
> Right,  So I think we can just add an assert in these function that
> Assert(!TransactionIdIsValid(CheckXidAlive)) ?
>
> >
> > Isn't it better to block the scan to user catalog tables or regular
> > system catalog tables for tableam scan APIs rather than at the heap
> > level?  There might be some APIs like heap_getnext where such a check
> > might still be required but I guess it is still better to block at
> > tableam level.
> >
> > [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de
>
> Okay, let me analyze this part.  Because someplace we have to keep at
> heap level like heap_getnext and other places at tableam level so it
> seems a bit inconsistent.  Also, I think the number of checks might
> going to increase because some of the heap functions like
> heap_hot_search_buffer are being called from multiple tableam calls,
> so we need to put check at every place.
>
> Another point is that I feel some of the checks what we have today
> might not be required like heap_finish_speculative, is not fetching
> any tuple for us so why do we need to care about this function?

While testing these changes, I have noticed that the systable_* APIs
internally call the tableam APIs, so if we just put
Assert(!TransactionIdIsValid(CheckXidAlive)) then it will always hit
that assert, regardless of whether we put the assert in the heap APIs
or the tableam APIs, because systable_* always accesses the heap
through the tableam APIs.

Refer to the call stack below:
#0  table_index_fetch_tuple (scan=0x2392558, tid=0x2392270,
snapshot=0x2392178, slot=0x2391f60, call_again=0x2392276,
all_dead=0x7fff4b6cc89e)
    at ../../../../src/include/access/tableam.h:1035
#1  0x00000000005100b6 in index_fetch_heap (scan=0x2392210,
slot=0x2391f60) at indexam.c:577
#2  0x00000000005101ea in index_getnext_slot (scan=0x2392210,
direction=ForwardScanDirection, slot=0x2391f60) at indexam.c:637
#3  0x000000000050e8f9 in systable_getnext (sysscan=0x2391f08) at genam.c:474
#4  0x0000000000aa44a2 in RelidByRelfilenode (reltablespace=0,
relfilenode=16593) at relfilenodemap.c:213
#5  0x00000000008a64da in ReorderBufferProcessTXN (rb=0x23734b0,
txn=0x2398e28, commit_lsn=23953600, snapshot_now=0x237b168,
command_id=0, streaming=false)
    at reorderbuffer.c:1823
#6  0x00000000008a7201 in ReorderBufferCommit (rb=0x23734b0, xid=518,
commit_lsn=23953600, end_lsn=23953648, commit_time=641466886013448,
origin_id=0, origin_lsn=0)
    at reorderbuffer.c:2315
#7  0x00000000008985b1 in DecodeCommit (ctx=0x22e16a0,
buf=0x7fff4b6cce30, parsed=0x7fff4b6ccca0, xid=518) at decode.c:654
#8  0x0000000000897a76 in DecodeXactOp (ctx=0x22e16a0,
buf=0x7fff4b6cce30) at decode.c:261
#9  0x0000000000897739 in LogicalDecodingProcessRecord (ctx=0x22e16a0,
record=0x22e19a0) at decode.c:130

So basically, the problem is that we cannot distinguish whether the
tableam/heap routine is called directly or via systable_*.

Now I understand that the current code was actually raising the error
for user tables, not system tables, on the assumption that a system
table only reaches this function via systable_*, and only a user table
can reach it directly.  So if it is not a system table, we must have got
here directly, and we error out.  But if the check does not cover system
tables, then I am not sure what the purpose of throwing that error is.

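To illustrate the problem: a bare assert in the tableam wrapper, as in
the sketch below (this is the naive approach, not what the patch does),
would fire even for perfectly legitimate systable_* access, because
systable_getnext() reaches the heap through the very same wrapper (see
the call stack above):

static inline bool
table_index_fetch_tuple(struct IndexFetchTableData *scan,
                        ItemPointer tid,
                        Snapshot snapshot,
                        TupleTableSlot *slot,
                        bool *call_again, bool *all_dead)
{
    /* too strong: also trips for systable_* callers */
    Assert(!TransactionIdIsValid(CheckXidAlive));

    return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
                                                    slot, call_again,
                                                    all_dead);
}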

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Apr 29, 2020 at 2:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Apr 28, 2020 at 3:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > [latest patches]
> > >
> > > v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> > > -     Any actions leading to transaction ID assignment are prohibited.
> > > That, among others,
> > > +     Note that access to user catalog tables or regular system catalog tables
> > > +     in the output plugins has to be done via the
> > > <literal>systable_*</literal> scan APIs only.
> > > +     Access via the <literal>heap_*</literal> scan APIs will error out.
> > > +     Additionally, any actions leading to transaction ID assignment
> > > are prohibited. That, among others,
> > > ..
> > > @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
> > >   bool valid;
> > >
> > >   /*
> > > + * We don't expect direct calls to heap_fetch with valid
> > > + * CheckXidAlive for regular tables. Track that below.
> > > + */
> > > + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> > > + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
> > > + elog(ERROR, "unexpected heap_fetch call during logical decoding");
> > > +
> > >
> > > I think comments and code don't match.  In the comment, we are saying
> > > that via output plugins access to user catalog tables or regular
> > > system catalog tables won't be allowed via heap_* APIs but code
> > > doesn't seem to reflect it.  I feel only
> > > TransactionIdIsValid(CheckXidAlive) is sufficient here.  See, the
> > > original discussion about this point [1] (Refer "I think it'd also be
> > > good to add assertions to codepaths not going through systable_*
> > > asserting that ...").
> >
> > Right,  So I think we can just add an assert in these function that
> > Assert(!TransactionIdIsValid(CheckXidAlive)) ?
> >
> > >
> > > Isn't it better to block the scan to user catalog tables or regular
> > > system catalog tables for tableam scan APIs rather than at the heap
> > > level?  There might be some APIs like heap_getnext where such a check
> > > might still be required but I guess it is still better to block at
> > > tableam level.
> > >
> > > [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de
> >
> > Okay, let me analyze this part.  Because someplace we have to keep at
> > heap level like heap_getnext and other places at tableam level so it
> > seems a bit inconsistent.  Also, I think the number of checks might
> > going to increase because some of the heap functions like
> > heap_hot_search_buffer are being called from multiple tableam calls,
> > so we need to put check at every place.
> >
> > Another point is that I feel some of the checks what we have today
> > might not be required like heap_finish_speculative, is not fetching
> > any tuple for us so why do we need to care about this function?
>
> While testing these changes, I have noticed that the systable_* APIs
> internally, calls tableam apis and so if we just put assert
> Assert(!TransactionIdIsValid(CheckXidAlive)) then it will always hit
> that assert.  Whether we put these assert in heap APIs or the tableam
> APIs because systable_ always access heap through tableam APIs.
>
> Refer below callstack
> #0  table_index_fetch_tuple (scan=0x2392558, tid=0x2392270,
> snapshot=0x2392178, slot=0x2391f60, call_again=0x2392276,
> all_dead=0x7fff4b6cc89e)
>     at ../../../../src/include/access/tableam.h:1035
> #1  0x00000000005100b6 in index_fetch_heap (scan=0x2392210,
> slot=0x2391f60) at indexam.c:577
> #2  0x00000000005101ea in index_getnext_slot (scan=0x2392210,
> direction=ForwardScanDirection, slot=0x2391f60) at indexam.c:637
> #3  0x000000000050e8f9 in systable_getnext (sysscan=0x2391f08) at genam.c:474
> #4  0x0000000000aa44a2 in RelidByRelfilenode (reltablespace=0,
> relfilenode=16593) at relfilenodemap.c:213
> #5  0x00000000008a64da in ReorderBufferProcessTXN (rb=0x23734b0,
> txn=0x2398e28, commit_lsn=23953600, snapshot_now=0x237b168,
> command_id=0, streaming=false)
>     at reorderbuffer.c:1823
> #6  0x00000000008a7201 in ReorderBufferCommit (rb=0x23734b0, xid=518,
> commit_lsn=23953600, end_lsn=23953648, commit_time=641466886013448,
> origin_id=0, origin_lsn=0)
>     at reorderbuffer.c:2315
> #7  0x00000000008985b1 in DecodeCommit (ctx=0x22e16a0,
> buf=0x7fff4b6cce30, parsed=0x7fff4b6ccca0, xid=518) at decode.c:654
> #8  0x0000000000897a76 in DecodeXactOp (ctx=0x22e16a0,
> buf=0x7fff4b6cce30) at decode.c:261
> #9  0x0000000000897739 in LogicalDecodingProcessRecord (ctx=0x22e16a0,
> record=0x22e19a0) at decode.c:130
>
> So basically, the problem is that we can not distinguish whether the
> tableam/heap routine is called directly or via systable_*.
>
> Now I understand the current code was actually giving error for the
> user table not the system table with the assumption that the system
> table will come to this function only via systable_*.  Only user table
> can come directly.  So if this is not a system table i.e. we reach
> here directly so error out.  Now, I am not sure if it is not for the
> system table then what is the purpose of throwing that error?

Putting some more thought into this, I am wondering whether we really
want any such check at all, because we always get the relation
descriptor from the reorder buffer code, not from the pgoutput plugin.
And our main issue with a concurrent abort is that we must not use a
wrong catalog entry for decoding our tuple.  So if we always get our
relation entry using RelationIdGetRelation, why should we bother about
how the output plugin accesses system/user relations?

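For context, the lookup I am referring to in the reorder buffer code is
roughly the following (a paraphrased sketch of ReorderBufferProcessTXN(),
not an exact copy):

    /* map the relfilenode back to an OID; itself a catalog lookup */
    reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
                                change->data.tp.relnode.relNode);
    if (reloid == InvalidOid)
        elog(ERROR, "could not map filenode to relation OID");

    /* open the relation descriptor through the relcache */
    relation = RelationIdGetRelation(reloid);
    if (!RelationIsValid(relation))
        elog(ERROR, "could not open relation with OID %u", reloid);
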
-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Apr 29, 2020 at 3:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Apr 29, 2020 at 2:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Apr 28, 2020 at 3:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > >
> > > > [latest patches]
> > > >
> > > > v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> > > > -     Any actions leading to transaction ID assignment are prohibited.
> > > > That, among others,
> > > > +     Note that access to user catalog tables or regular system catalog tables
> > > > +     in the output plugins has to be done via the
> > > > <literal>systable_*</literal> scan APIs only.
> > > > +     Access via the <literal>heap_*</literal> scan APIs will error out.
> > > > +     Additionally, any actions leading to transaction ID assignment
> > > > are prohibited. That, among others,
> > > > ..
> > > > @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
> > > >   bool valid;
> > > >
> > > >   /*
> > > > + * We don't expect direct calls to heap_fetch with valid
> > > > + * CheckXidAlive for regular tables. Track that below.
> > > > + */
> > > > + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> > > > + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
> > > > + elog(ERROR, "unexpected heap_fetch call during logical decoding");
> > > > +
> > > >
> > > > I think comments and code don't match.  In the comment, we are saying
> > > > that via output plugins access to user catalog tables or regular
> > > > system catalog tables won't be allowed via heap_* APIs but code
> > > > doesn't seem to reflect it.  I feel only
> > > > TransactionIdIsValid(CheckXidAlive) is sufficient here.  See, the
> > > > original discussion about this point [1] (Refer "I think it'd also be
> > > > good to add assertions to codepaths not going through systable_*
> > > > asserting that ...").
> > >
> > > Right,  So I think we can just add an assert in these function that
> > > Assert(!TransactionIdIsValid(CheckXidAlive)) ?
> > >
> > > >
> > > > Isn't it better to block the scan to user catalog tables or regular
> > > > system catalog tables for tableam scan APIs rather than at the heap
> > > > level?  There might be some APIs like heap_getnext where such a check
> > > > might still be required but I guess it is still better to block at
> > > > tableam level.
> > > >
> > > > [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de
> > >
> > > Okay, let me analyze this part.  Because someplace we have to keep at
> > > heap level like heap_getnext and other places at tableam level so it
> > > seems a bit inconsistent.  Also, I think the number of checks might
> > > going to increase because some of the heap functions like
> > > heap_hot_search_buffer are being called from multiple tableam calls,
> > > so we need to put check at every place.
> > >
> > > Another point is that I feel some of the checks what we have today
> > > might not be required like heap_finish_speculative, is not fetching
> > > any tuple for us so why do we need to care about this function?
> >
> > While testing these changes, I have noticed that the systable_* APIs
> > internally, calls tableam apis and so if we just put assert
> > Assert(!TransactionIdIsValid(CheckXidAlive)) then it will always hit
> > that assert.  Whether we put these assert in heap APIs or the tableam
> > APIs because systable_ always access heap through tableam APIs.
> >
..
..
>
> Putting some more thought upon this, I am just wondering what do we
> really want any such check because, we are always getting relation
> description from the reorder buffer code, not from the pgoutput
> plugin.
>

But can't they access other catalogs like pg_publication*?  I think
the basic thing we want to ensure here is that all historic accesses
always use the systable_* APIs to access catalogs.  We can ensure that
by having Asserts (or elog(ERROR, ...)) in the heap/tableam APIs.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Apr 30, 2020 at 12:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Apr 29, 2020 at 3:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Apr 29, 2020 at 2:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Tue, Apr 28, 2020 at 3:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > > >
> > > > > [latest patches]
> > > > >
> > > > > v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> > > > > -     Any actions leading to transaction ID assignment are prohibited.
> > > > > That, among others,
> > > > > +     Note that access to user catalog tables or regular system catalog tables
> > > > > +     in the output plugins has to be done via the
> > > > > <literal>systable_*</literal> scan APIs only.
> > > > > +     Access via the <literal>heap_*</literal> scan APIs will error out.
> > > > > +     Additionally, any actions leading to transaction ID assignment
> > > > > are prohibited. That, among others,
> > > > > ..
> > > > > @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
> > > > >   bool valid;
> > > > >
> > > > >   /*
> > > > > + * We don't expect direct calls to heap_fetch with valid
> > > > > + * CheckXidAlive for regular tables. Track that below.
> > > > > + */
> > > > > + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> > > > > + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation))))
> > > > > + elog(ERROR, "unexpected heap_fetch call during logical decoding");
> > > > > +
> > > > >
> > > > > I think comments and code don't match.  In the comment, we are saying
> > > > > that via output plugins access to user catalog tables or regular
> > > > > system catalog tables won't be allowed via heap_* APIs but code
> > > > > doesn't seem to reflect it.  I feel only
> > > > > TransactionIdIsValid(CheckXidAlive) is sufficient here.  See, the
> > > > > original discussion about this point [1] (Refer "I think it'd also be
> > > > > good to add assertions to codepaths not going through systable_*
> > > > > asserting that ...").
> > > >
> > > > Right,  So I think we can just add an assert in these function that
> > > > Assert(!TransactionIdIsValid(CheckXidAlive)) ?
> > > >
> > > > >
> > > > > Isn't it better to block the scan to user catalog tables or regular
> > > > > system catalog tables for tableam scan APIs rather than at the heap
> > > > > level?  There might be some APIs like heap_getnext where such a check
> > > > > might still be required but I guess it is still better to block at
> > > > > tableam level.
> > > > >
> > > > > [1] - https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de
> > > >
> > > > Okay, let me analyze this part.  Because someplace we have to keep at
> > > > heap level like heap_getnext and other places at tableam level so it
> > > > seems a bit inconsistent.  Also, I think the number of checks might
> > > > going to increase because some of the heap functions like
> > > > heap_hot_search_buffer are being called from multiple tableam calls,
> > > > so we need to put check at every place.
> > > >
> > > > Another point is that I feel some of the checks what we have today
> > > > might not be required like heap_finish_speculative, is not fetching
> > > > any tuple for us so why do we need to care about this function?
> > >
> > > While testing these changes, I have noticed that the systable_* APIs
> > > internally, calls tableam apis and so if we just put assert
> > > Assert(!TransactionIdIsValid(CheckXidAlive)) then it will always hit
> > > that assert.  Whether we put these assert in heap APIs or the tableam
> > > APIs because systable_ always access heap through tableam APIs.
> > >
> ..
> ..
> >
> > Putting some more thought upon this, I am just wondering what do we
> > really want any such check because, we are always getting relation
> > description from the reorder buffer code, not from the pgoutput
> > plugin.
> >
>
> But can't they access other catalogs like pg_publication*?  I think
> the basic thing we want to ensure here is that all historic accesses
> always use systable* APIs to access catalogs.  We can ensure that via
> having Asserts (or elog(ERROR, ..) in heap/tableam APIs.

Yeah, it can.  So I have changed it now: along with CheckXidAlive, I
have kept one more flag, which is set whenever CheckXidAlive is set and
we pass through systable_beginscan.  Then, while accessing the tableam
APIs, we check that if CheckXidAlive is set the other flag must also be
set; otherwise we throw an error.

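In code, the idea is roughly the following (an illustrative sketch, not
the exact patch text):

    /* in systable_beginscan() (genam.c): remember that we are inside a
     * system table scan while CheckXidAlive is set */
    if (TransactionIdIsValid(CheckXidAlive))
        sysbegin_called = true;

    /* in the heap/tableam entry points: direct access (i.e. not via
     * systable_*) while decoding an in-progress transaction is an error */
    if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called))
        elog(ERROR, "unexpected direct table access during logical decoding");
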
Apart from this, I have also fixed one defect raised by my colleague
Neha Sharma.  The issue was that the incomplete-toast-tuple flag was not
reset when the main table tuple was inserted through a speculative
insert, and because of that the data was not streamed even when we later
got the speculative confirm, since the incomplete-toast flag was never
reset.  This patch also includes the fix for the issue raised by Erik.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Apr 30, 2020 at 12:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > But can't they access other catalogs like pg_publication*?  I think
> > the basic thing we want to ensure here is that all historic accesses
> > always use systable* APIs to access catalogs.  We can ensure that via
> > having Asserts (or elog(ERROR, ..) in heap/tableam APIs.
>
> Yeah, it can.  So I have changed it now, actually along with
> CheckXidLive, I have kept one more flag so whenever CheckXidLive is
> set and we pass through systable_beginscan we will set that flag.  So
> while accessing the tableam API we will set if CheckXidLive is set
> then another flag must also be set otherwise we through an error.
>

Okay, I have reviewed these changes and below are my comments:

Review of  v17-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
--------------------------------------------------------------------
1.
+ /*
+ * If CheckXidAlive is set then set a flag that this call is passed through
+ * systable_beginscan.  See detailed  comments at snapmgr.c where these
+ * variables are declared.
+ */
+ if (TransactionIdIsValid(CheckXidAlive))
+ sysbegin_called = true;

a. How about calling this variable as bsysscan or sysscan instead of
sysbegin_called?
b. There is an extra space between detailed and comments.  A similar
change is required at other places where this comment is used.
c. How about writing the first line as "If CheckXidAlive is set then
set a flag to indicate that system table scan is in-progress."

2.
-     Any actions leading to transaction ID assignment are prohibited.
That, among others,
-     includes writing to tables, performing DDL changes, and
-     calling <literal>pg_current_xact_id()</literal>.
+     Note that access to user catalog tables or regular system
catalog tables in
+     the output plugins has to be done via the
<literal>systable_*</literal> scan
+     APIs only. The user tables should not be accesed in the output
plugins anyways.
+     Access via the <literal>heap_*</literal> scan APIs will error out.

The line "The user tables should not be accesed in the output plugins
anyways." seems a bit of out of place.  I don't think this is required
here.  If you read the previous paragraph in the same document it is
written: "Read only access to relations is permitted as long as only
relations are accessed that either have been created by
<command>initdb</command> in the <literal>pg_catalog</literal> schema,
or have been marked as user provided catalog tables using ...".  I
think that is sufficient to convey the information that the newly
added line by you is trying to convey.

3.
+ /*
+ * We don't expect direct calls to this routine when CheckXidAlive is a
+ * valid transaction id, this should only come through systable_* call.
+ * CheckXidAlive is set during logical decoding of a transactions.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called))
+ elog(ERROR, "unexpected heap_getnext call during logical decoding");

How about changing this comment as "We don't expect direct calls to
heap_getnext with valid CheckXidAlive for catalog or regular tables.
See detailed comments at snapmgr.c where these variables are
declared."?  Change the similar comment used in other places in the
patch.

For this specific API, we can also say "Normally we have such a check
at tableam level API but this is called from many places so we need to
ensure it here."

4.
+ * If CheckXidAlive is valid, then we check if it aborted. If it did, we error
+ * out.  We can't directly use TransactionIdDidAbort as after crash such
+ * transaction might not have been marked as aborted.  See detailed  comments
+ * at snapmgr.c where the variable is declared.
+ */
+static inline void
+HandleConcurrentAbort()

Can we change the comments as "Error out, if CheckXidAlive is aborted.
We can't directly use TransactionIdDidAbort as after crash such
transaction might not have been marked as aborted."

After this add one empty line and then we can say something like:
"This is a special API to check if CheckXidAlive is aborted in system
table scan APIs.  See detailed comments at snapmgr.c where the
variable is declared."

5. Shouldn't we add a check in table_scan_sample_next_block and
table_scan_sample_next_tuple APIs as well?

6.
/*
+ * An xid value pointing to a possibly ongoing (sub)transaction.
+ * Currently used in logical decoding.  It's possible that such transactions
+ * can get aborted while the decoding is ongoing.  If CheckXidAlive is set
+ * then we will set sysbegin_called flag when we call systable_beginscan.  This
+ * is to ensure that from the pgoutput plugin we should never directly access
+ * the tableam or heap apis because we are checking for the concurrent abort
+ * only in systable_* apis.
+ */
+TransactionId CheckXidAlive = InvalidTransactionId;
+bool sysbegin_called = false;

Can we change the above comment as "CheckXidAlive is a xid value
pointing to a possibly ongoing (sub)transaction.  Currently, it is
used in logical decoding.  It's possible that such transactions can
get aborted while the decoding is ongoing in which case we skip
decoding that particular transaction. To ensure that we check whether
the CheckXidAlive is aborted after fetching the tuple from system
tables.  We also ensure that during logical decoding we never directly
access the tableam or heap APIs because we are checking for the
concurrent aborts only in systable_* APIs."

> Apart from this, I have also fixed one defect raised by my colleague
> Neha Sharma.  That issue is the incomplete toast tuple flag was not
> reset when the main table tuple was inserted through speculative
> insert and due to that data was not streamed even if later we were
> getting speculative confirm because incomplete toast flag was never
> reset.  This patch also includes the fix for the issue raised by Erik.
>

It would be better if you can mention which all patches contain the
changes as it will be easier to review the fix.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >

> 5. Shouldn't we add a check in table_scan_sample_next_block and
> table_scan_sample_next_tuple APIs as well?

I am not sure that we need to do that, because generally we want to
avoid getting any wrong system table tuple which we could use for taking
some decision or decoding a tuple.  But I don't think that
table_scan_sample falls under that category.


> > Apart from this, I have also fixed one defect raised by my colleague
> > Neha Sharma.  That issue is the incomplete toast tuple flag was not
> > reset when the main table tuple was inserted through speculative
> > insert and due to that data was not streamed even if later we were
> > getting speculative confirm because incomplete toast flag was never
> > reset.  This patch also includes the fix for the issue raised by Erik.
> >
>
> It would be better if you can mention which all patches contain the
> changes as it will be easier to review the fix.

Fix1: v17-0010-Bugfix-handling-of-incomplete-toast-tuple.patch
Fix2: v17-0002-Issue-individual-invalidations-with-wal_level-lo.patch

I will work on other comments and send the updated patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, May 5, 2020 at 9:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
>
> > 5. Shouldn't we add a check in table_scan_sample_next_block and
> > table_scan_sample_next_tuple APIs as well?
>
> I am not sure that we need to do that,  Because generally, we want to
> avoid getting any wrong system table tuple which we can use for taking
> some decision or decode tuple.  But, I don't think that
> table_scan_sample falls under that category.
>

Hmm, I am asking for a check similar to what you have in the function
table_scan_bitmap_next_block(); can't we have that one?  BTW, I noticed
a spurious line removal below in the patch we are talking about.

+/*
  * These are updated by GetSnapshotData.  We initialize them this way
  * for the convenience of TransactionIdIsInProgress: even in bootstrap
  * mode, we don't want it to say that BootstrapTransactionId is in progress.
@@ -2043,7 +2055,6 @@ SetupHistoricSnapshot(Snapshot
historic_snapshot, HTAB *tuplecids)
  tuplecid_data = tuplecids;
 }

-



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 5, 2020 at 10:25 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, May 5, 2020 at 9:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> >
> > > 5. Shouldn't we add a check in table_scan_sample_next_block and
> > > table_scan_sample_next_tuple APIs as well?
> >
> > I am not sure that we need to do that,  Because generally, we want to
> > avoid getting any wrong system table tuple which we can use for taking
> > some decision or decode tuple.  But, I don't think that
> > table_scan_sample falls under that category.
> >
>
> Hmm, I am asking a check similar to what you have in function
> table_scan_bitmap_next_block(), can't we have that one?

Yeah, we can put that, and there is no harm in it.  But my point is that
table_scan_bitmap_next_block and the other functions where I have put
the check are used for fetching tuples which can be used for decoding a
tuple or taking some decision, whereas IMHO table_scan_sample_next_tuple
is only used for analyzing the table.  So do we really need to do that?
Am I missing something here?


>   BTW, I
> noticed a below spurious line removal in the patch we are talking
> about.
>
> +/*
>   * These are updated by GetSnapshotData.  We initialize them this way
>   * for the convenience of TransactionIdIsInProgress: even in bootstrap
>   * mode, we don't want it to say that BootstrapTransactionId is in progress.
> @@ -2043,7 +2055,6 @@ SetupHistoricSnapshot(Snapshot
> historic_snapshot, HTAB *tuplecids)
>   tuplecid_data = tuplecids;
>  }
>
> -


Okay, I will take care of this.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, May 5, 2020 at 10:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, May 5, 2020 at 10:25 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, May 5, 2020 at 9:27 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > >
> > >
> > > > 5. Shouldn't we add a check in table_scan_sample_next_block and
> > > > table_scan_sample_next_tuple APIs as well?
> > >
> > > I am not sure that we need to do that,  Because generally, we want to
> > > avoid getting any wrong system table tuple which we can use for taking
> > > some decision or decode tuple.  But, I don't think that
> > > table_scan_sample falls under that category.
> > >
> >
> > Hmm, I am asking a check similar to what you have in function
> > table_scan_bitmap_next_block(), can't we have that one?
>
> Yeah we can put that and there is no harm in that,  but my point is
> the table_scan_bitmap_next_block and other functions where I have put
> the check are used for fetching the tuple which can be used for
> decoding tuple or taking some decision, but IMHO,
> table_scan_sample_next_tuple is only used for analyzing the table.
>

These will be used in a TABLESAMPLE scan.  Try something like "select c1
from t1 TABLESAMPLE BERNOULLI(30);".  So, I guess these APIs can also
be used to fetch the tuple.
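
Something along these lines is what I have in mind, i.e. the same guard
used in table_scan_bitmap_next_block() applied to the sample scan
wrapper (just a sketch, not the exact patch text):

static inline bool
table_scan_sample_next_tuple(TableScanDesc scan,
                             struct SampleScanState *scanstate,
                             TupleTableSlot *slot)
{
    /*
     * We don't expect direct calls to this with a valid CheckXidAlive;
     * during logical decoding such access must go through systable_*.
     */
    if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
        elog(ERROR, "unexpected table_scan_sample_next_tuple call during logical decoding");

    return scan->rs_rd->rd_tableam->scan_sample_next_tuple(scan, scanstate,
                                                           slot);
}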

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Apr 30, 2020 at 12:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > But can't they access other catalogs like pg_publication*?  I think
> > > the basic thing we want to ensure here is that all historic accesses
> > > always use systable* APIs to access catalogs.  We can ensure that via
> > > having Asserts (or elog(ERROR, ..) in heap/tableam APIs.
> >
> > Yeah, it can.  So I have changed it now, actually along with
> > CheckXidLive, I have kept one more flag so whenever CheckXidLive is
> > set and we pass through systable_beginscan we will set that flag.  So
> > while accessing the tableam API we will set if CheckXidLive is set
> > then another flag must also be set otherwise we through an error.
> >
>
> Okay, I have reviewed these changes and below are my comments:
>
> Review of  v17-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> --------------------------------------------------------------------
> 1.
> + /*
> + * If CheckXidAlive is set then set a flag that this call is passed through
> + * systable_beginscan.  See detailed  comments at snapmgr.c where these
> + * variables are declared.
> + */
> + if (TransactionIdIsValid(CheckXidAlive))
> + sysbegin_called = true;
>
> a. How about calling this variable as bsysscan or sysscan instead of
> sysbegin_called?

Done

> b. There is an extra space between detailed and comments.  A similar
> change is required at other place where this comment is used.

Done

> c. How about writing the first line as "If CheckXidAlive is set then
> set a flag to indicate that system table scan is in-progress."
>
> 2.
> -     Any actions leading to transaction ID assignment are prohibited.
> That, among others,
> -     includes writing to tables, performing DDL changes, and
> -     calling <literal>pg_current_xact_id()</literal>.
> +     Note that access to user catalog tables or regular system
> catalog tables in
> +     the output plugins has to be done via the
> <literal>systable_*</literal> scan
> +     APIs only. The user tables should not be accesed in the output
> plugins anyways.
> +     Access via the <literal>heap_*</literal> scan APIs will error out.
>
> The line "The user tables should not be accesed in the output plugins
> anyways." seems a bit of out of place.  I don't think this is required
> here.  If you read the previous paragraph in the same document it is
> written: "Read only access to relations is permitted as long as only
> relations are accessed that either have been created by
> <command>initdb</command> in the <literal>pg_catalog</literal> schema,
> or have been marked as user provided catalog tables using ...".  I
> think that is sufficient to convey the information that the newly
> added line by you is trying to convey.

Right.

>
> 3.
> + /*
> + * We don't expect direct calls to this routine when CheckXidAlive is a
> + * valid transaction id, this should only come through systable_* call.
> + * CheckXidAlive is set during logical decoding of a transactions.
> + */
> + if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called))
> + elog(ERROR, "unexpected heap_getnext call during logical decoding");
>
> How about changing this comment as "We don't expect direct calls to
> heap_getnext with valid CheckXidAlive for catalog or regular tables.
> See detailed comments at snapmgr.c where these variables are
> declared."?  Change the similar comment used in other places in the
> patch.
>
> For this specific API, we can also say "Normally we have such a check
> at tableam level API but this is called from many places so we need to
> ensure it here."

Done

>
> 4.
> + * If CheckXidAlive is valid, then we check if it aborted. If it did, we error
> + * out.  We can't directly use TransactionIdDidAbort as after crash such
> + * transaction might not have been marked as aborted.  See detailed  comments
> + * at snapmgr.c where the variable is declared.
> + */
> +static inline void
> +HandleConcurrentAbort()
>
> Can we change the comments as "Error out, if CheckXidAlive is aborted.
> We can't directly use TransactionIdDidAbort as after crash such
> transaction might not have been marked as aborted."
>
> After this add one empty line and then we can say something like:
> "This is a special API to check if CheckXidAlive is aborted in system
> table scan APIs.  See detailed comments at snapmgr.c where the
> variable is declared."
>
> 5. Shouldn't we add a check in table_scan_sample_next_block and
> table_scan_sample_next_tuple APIs as well?

Done

> 6.
> /*
> + * An xid value pointing to a possibly ongoing (sub)transaction.
> + * Currently used in logical decoding.  It's possible that such transactions
> + * can get aborted while the decoding is ongoing.  If CheckXidAlive is set
> + * then we will set sysbegin_called flag when we call systable_beginscan.  This
> + * is to ensure that from the pgoutput plugin we should never directly access
> + * the tableam or heap apis because we are checking for the concurrent abort
> + * only in systable_* apis.
> + */
> +TransactionId CheckXidAlive = InvalidTransactionId;
> +bool sysbegin_called = false;
>
> Can we change the above comment as "CheckXidAlive is a xid value
> pointing to a possibly ongoing (sub)transaction.  Currently, it is
> used in logical decoding.  It's possible that such transactions can
> get aborted while the decoding is ongoing in which case we skip
> decoding that particular transaction. To ensure that we check whether
> the CheckXidAlive is aborted after fetching the tuple from system
> tables.  We also ensure that during logical decoding we never directly
> access the tableam or heap APIs because we are checking for the
> concurrent aborts only in systable_* APIs."

Done

I have also fixed one issue in the patch
v18-0010-Bugfix-handling-of-incomplete-toast-tuple.patch.

Basically, the check in ReorderBufferLargestTopTXN for selecting the
largest top transaction was incorrect, so I have fixed that.

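For reference, the selection is supposed to work roughly like the sketch
below, i.e. pick the top-level transaction with the largest total_size
(which, with the accounting added earlier in the series, includes the
changes of its subtransactions).  This is an illustrative sketch, not
the exact patch text:

static ReorderBufferTXN *
ReorderBufferLargestTopTXN(ReorderBuffer *rb)
{
    dlist_iter  iter;
    Size        largest_size = 0;
    ReorderBufferTXN *largest = NULL;

    /* scan only the top-level transactions */
    dlist_foreach(iter, &rb->toplevel_by_lsn)
    {
        ReorderBufferTXN *txn;

        txn = dlist_container(ReorderBufferTXN, node, iter.cur);

        /* remember the transaction with the largest accumulated size */
        if (txn->total_size > largest_size)
        {
            largest = txn;
            largest_size = txn->total_size;
        }
    }

    return largest;
}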

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 5, 2020 at 4:06 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, May 4, 2020 at 5:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, May 1, 2020 at 8:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Thu, Apr 30, 2020 at 12:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > >
> > > > But can't they access other catalogs like pg_publication*?  I think
> > > > the basic thing we want to ensure here is that all historic accesses
> > > > always use systable* APIs to access catalogs.  We can ensure that via
> > > > having Asserts (or elog(ERROR, ..) in heap/tableam APIs.
> > >
> > > Yeah, it can.  So I have changed it now, actually along with
> > > CheckXidLive, I have kept one more flag so whenever CheckXidLive is
> > > set and we pass through systable_beginscan we will set that flag.  So
> > > while accessing the tableam API we will set if CheckXidLive is set
> > > then another flag must also be set otherwise we through an error.
> > >
> >
> > Okay, I have reviewed these changes and below are my comments:
> >
> > Review of  v17-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> > --------------------------------------------------------------------
> > 1.
> > + /*
> > + * If CheckXidAlive is set then set a flag that this call is passed through
> > + * systable_beginscan.  See detailed  comments at snapmgr.c where these
> > + * variables are declared.
> > + */
> > + if (TransactionIdIsValid(CheckXidAlive))
> > + sysbegin_called = true;
> >
> > a. How about calling this variable as bsysscan or sysscan instead of
> > sysbegin_called?
>
> Done
>
> > b. There is an extra space between detailed and comments.  A similar
> > change is required at other place where this comment is used.
>
> Done
>
> > c. How about writing the first line as "If CheckXidAlive is set then
> > set a flag to indicate that system table scan is in-progress."
> >
> > 2.
> > -     Any actions leading to transaction ID assignment are prohibited.
> > That, among others,
> > -     includes writing to tables, performing DDL changes, and
> > -     calling <literal>pg_current_xact_id()</literal>.
> > +     Note that access to user catalog tables or regular system
> > catalog tables in
> > +     the output plugins has to be done via the
> > <literal>systable_*</literal> scan
> > +     APIs only. The user tables should not be accesed in the output
> > plugins anyways.
> > +     Access via the <literal>heap_*</literal> scan APIs will error out.
> >
> > The line "The user tables should not be accesed in the output plugins
> > anyways." seems a bit of out of place.  I don't think this is required
> > here.  If you read the previous paragraph in the same document it is
> > written: "Read only access to relations is permitted as long as only
> > relations are accessed that either have been created by
> > <command>initdb</command> in the <literal>pg_catalog</literal> schema,
> > or have been marked as user provided catalog tables using ...".  I
> > think that is sufficient to convey the information that the newly
> > added line by you is trying to convey.
>
> Right.
>
> >
> > 3.
> > + /*
> > + * We don't expect direct calls to this routine when CheckXidAlive is a
> > + * valid transaction id, this should only come through systable_* call.
> > + * CheckXidAlive is set during logical decoding of a transactions.
> > + */
> > + if (unlikely(TransactionIdIsValid(CheckXidAlive) && !sysbegin_called))
> > + elog(ERROR, "unexpected heap_getnext call during logical decoding");
> >
> > How about changing this comment as "We don't expect direct calls to
> > heap_getnext with valid CheckXidAlive for catalog or regular tables.
> > See detailed comments at snapmgr.c where these variables are
> > declared."?  Change the similar comment used in other places in the
> > patch.
> >
> > For this specific API, we can also say "Normally we have such a check
> > at tableam level API but this is called from many places so we need to
> > ensure it here."
>
> Done
>
> >
> > 4.
> > + * If CheckXidAlive is valid, then we check if it aborted. If it did, we error
> > + * out.  We can't directly use TransactionIdDidAbort as after crash such
> > + * transaction might not have been marked as aborted.  See detailed  comments
> > + * at snapmgr.c where the variable is declared.
> > + */
> > +static inline void
> > +HandleConcurrentAbort()
> >
> > Can we change the comments as "Error out, if CheckXidAlive is aborted.
> > We can't directly use TransactionIdDidAbort as after crash such
> > transaction might not have been marked as aborted."
> >
> > After this add one empty line and then we can say something like:
> > "This is a special API to check if CheckXidAlive is aborted in system
> > table scan APIs.  See detailed comments at snapmgr.c where the
> > variable is declared."
> >
> > 5. Shouldn't we add a check in table_scan_sample_next_block and
> > table_scan_sample_next_tuple APIs as well?
>
> Done
>
> > 6.
> > /*
> > + * An xid value pointing to a possibly ongoing (sub)transaction.
> > + * Currently used in logical decoding.  It's possible that such transactions
> > + * can get aborted while the decoding is ongoing.  If CheckXidAlive is set
> > + * then we will set sysbegin_called flag when we call systable_beginscan.  This
> > + * is to ensure that from the pgoutput plugin we should never directly access
> > + * the tableam or heap apis because we are checking for the concurrent abort
> > + * only in systable_* apis.
> > + */
> > +TransactionId CheckXidAlive = InvalidTransactionId;
> > +bool sysbegin_called = false;
> >
> > Can we change the above comment as "CheckXidAlive is a xid value
> > pointing to a possibly ongoing (sub)transaction.  Currently, it is
> > used in logical decoding.  It's possible that such transactions can
> > get aborted while the decoding is ongoing in which case we skip
> > decoding that particular transaction. To ensure that we check whether
> > the CheckXidAlive is aborted after fetching the tuple from system
> > tables.  We also ensure that during logical decoding we never directly
> > access the tableam or heap APIs because we are checking for the
> > concurrent aborts only in systable_* APIs."
>
> Done
>
> I have also fixed one issue in the patch
> v18-0010-Bugfix-handling-of-incomplete-toast-tuple.patch.
>
> Basically, the check, in ReorderBufferLargestTopTXN for selecting the
> largest top transaction was incorrect so I have fixed that.

There was one unrelated bug fix in the v18-0010 patch, reported by Neha
Sharma off-list, so I am sending the updated version.



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, May 7, 2020 at 6:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, May 5, 2020 at 7:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have fixed one more issue in 0010 patch.  The issue was that once
> the transaction is serialized due to the incomplete toast after
> streaming the serialized store was not cleaned up so it was streaming
> the same tuple multiple times.
>

I have reviewed a few patches (003, 004, and 005) and below are my comments.

v20-0003-Extend-the-output-plugin-API-with-stream-methods
----------------------------------------------------------------------------------------
1.
+static void
+pg_decode_stream_change(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ Relation relation,
+ ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}
+
+static void
+pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+   int nrelations, Relation relations[],
+   ReorderBufferChange *change)
+{
+ OutputPluginPrepareWrite(ctx, true);
+ appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
+ OutputPluginWrite(ctx, true);
+}

In the above and similar APIs, there are parameters like relation
which are not used.  I think you should add some comments atop these
APIs to explain why that is so.  I guess it is because we want to keep
them similar to the non-stream versions of the APIs, and we can't
display the relation or other information as the transaction is still
in progress.

2.
+   <para>
+    Similar to spill-to-disk behavior, streaming is triggered when the total
+    amount of changes decoded from the WAL (for all in-progress transactions)
+    exceeds limit defined by
<varname>logical_decoding_work_mem</varname> setting.
+    At that point the largest toplevel transaction (measured by
amount of memory
+    currently used for decoded changes) is selected and streamed.
+   </para>

I think we need to explain here the cases/exceptions where we need to
spill even when streaming is enabled, and check whether this matches the
latest implementation; otherwise, update it.

3.
+ * To support streaming, we require change/commit/abort callbacks. The
+ * message callback is optional, similarly to regular output plugins.

/similarly/similar

4.
+static void
+stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ Assert(!ctx->fast_forward);
+
+ /* We're only supposed to call this when streaming is supported. */
+ Assert(ctx->streaming);
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "stream_start";
+ /* state.report_location = apply_lsn; */

Why can't we supply the report_location here?  I think here we need to
report txn->first_lsn if this is the very first stream and
txn->final_lsn if it is any consecutive one.

5.
+static void
+stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
+{
+ LogicalDecodingContext *ctx = cache->private_data;
+ LogicalErrorCallbackState state;
+ ErrorContextCallback errcallback;
+
+ Assert(!ctx->fast_forward);
+
+ /* We're only supposed to call this when streaming is supported. */
+ Assert(ctx->streaming);
+
+ /* Push callback + info on the error context stack */
+ state.ctx = ctx;
+ state.callback_name = "stream_stop";
+ /* state.report_location = apply_lsn; */

Can't we report txn->final_lsn here?

6. I think it will be good if we can provide an example of streaming
changes via test_decoding at
https://www.postgresql.org/docs/devel/test-decoding.html. I think we
can also explain there why the user is not expected to see the actual
data in the stream.


v20-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
----------------------------------------------------------------------------------------
7.
+ /*
+ * We don't expect direct calls to table_tuple_get_latest_tid with valid
+ * CheckXidAlive  for catalog or regular tables.

There is an extra space between 'CheckXidAlive' and 'for'.  I can see
similar problems in other places as well where this comment is used,
fix those as well.

8.
+/*
+ * CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
+ * transaction.  Currently, it is used in logical decoding.  It's possible
+ * that such transactions can get aborted while the decoding is ongoing in
+ * which case we skip decoding that particular transaction. To ensure that we
+ * check whether the CheckXidAlive is aborted after fetching the tuple from
+ * system tables.  We also ensure that during logical decoding we never
+ * directly access the tableam or heap APIs because we are checking for the
+ * concurrent aborts only in systable_* APIs.
+ */

In this comment, there is an inconsistency in the space used after
completing the sentence. In the part "transaction. To", single space
is used whereas at other places two spaces are used after a full stop.

v20-0005-Implement-streaming-mode-in-ReorderBuffer
-----------------------------------------------------------------------------
9.
Implement streaming mode in ReorderBuffer

Instead of serializing the transaction to disk after reaching the
maximum number of changes in memory (4096 changes), we consume the
changes we have in memory and invoke new stream API methods. This
happens in ReorderBufferStreamTXN() using about the same logic as
in ReorderBufferCommit() logic.

I think the above part of the commit message needs to be updated.

10.
Theoretically, we could get rid of the k-way merge, and append the
changes to the toplevel xact directly (and remember the position
in the list in case the subxact gets aborted later).

I don't think this part of the commit message is correct as we
sometimes need to spill even during streaming.  Please check the
entire commit message and update according to the latest
implementation.

11.
- * HeapTupleSatisfiesHistoricMVCC.
+ * tqual.c's HeapTupleSatisfiesHistoricMVCC.
+ *
+ * We do build the hash table even if there are no CIDs. That's
+ * because when streaming in-progress transactions we may run into
+ * tuples with the CID before actually decoding them. Think e.g. about
+ * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
+ * yet when applying the INSERT. So we build a hash table so that
+ * ResolveCminCmaxDuringDecoding does not segfault in this case.
+ *
+ * XXX We might limit this behavior to streaming mode, and just bail
+ * out when decoding transaction at commit time (at which point it's
+ * guaranteed to see all CIDs).
  */
 static void
 ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
@@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
*rb, ReorderBufferTXN *txn)
  dlist_iter iter;
  HASHCTL hash_ctl;

- if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
- return;
-

I don't understand this change.  Why could "INSERT followed by
TRUNCATE" lead to a tuple that comes up for decoding before its
CID?  The patch has made changes based on this assumption in
HeapTupleSatisfiesHistoricMVCC, which appears to be very risky, as the
behavior could depend on whether we are streaming the changes
for an in-progress xact or at the commit of a transaction.  We might
want to generate a test to validate this behavior once.

Also, the comment refers to tqual.c which is wrong as this API is now
in heapam_visibility.c.

12.
+ * setup CheckXidAlive if it's not committed yet. We don't check if the xid
+ * aborted. That will happen during catalog access.  Also reset the
+ * sysbegin_called flag.
  */
- if (txn->base_snapshot == NULL)
+ if (!TransactionIdDidCommit(xid))
  {
- Assert(txn->ninvalidations == 0);
- ReorderBufferCleanupTXN(rb, txn);
- return;
+ CheckXidAlive = xid;
+ bsysscan = false;
  }

In the comment, the flag name 'sysbegin_called' should be bsysscan.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, May 7, 2020 at 6:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 5, 2020 at 7:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have fixed one more issue in 0010 patch.  The issue was that once
> > the transaction is serialized due to the incomplete toast after
> > streaming the serialized store was not cleaned up so it was streaming
> > the same tuple multiple times.
> >
>
> I have reviewed a few patches (003, 004, and 005) and below are my comments.

Thanks for the review.  I am replying to some of the comments where I
have confusion; the others are fine.

>
> v20-0003-Extend-the-output-plugin-API-with-stream-methods
> ----------------------------------------------------------------------------------------
> 1.
> +static void
> +pg_decode_stream_change(LogicalDecodingContext *ctx,
> + ReorderBufferTXN *txn,
> + Relation relation,
> + ReorderBufferChange *change)
> +{
> + OutputPluginPrepareWrite(ctx, true);
> + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> + OutputPluginWrite(ctx, true);
> +}
> +
> +static void
> +pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
> +   int nrelations, Relation relations[],
> +   ReorderBufferChange *change)
> +{
> + OutputPluginPrepareWrite(ctx, true);
> + appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
> + OutputPluginWrite(ctx, true);
> +}
>
> In the above and similar APIs, there are parameters like relation
> which are not used.  I think you should add some comments atop these
> APIs to explain why it is so? I guess it is because we want to keep
> them similar to non-stream version of APIs and we can't display
> relation or other information as the transaction is still in-progress.

I think the interfaces are designed that way because other decoding
plugins might need those parameters, e.g. in pgoutput we need the change
and relation but not here.  We have other similar examples as well, e.g.
pg_decode_message has the txn parameter but does not use it.  Do you
think we still need to add comments?

> 4.
> +static void
> +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> +{
> + LogicalDecodingContext *ctx = cache->private_data;
> + LogicalErrorCallbackState state;
> + ErrorContextCallback errcallback;
> +
> + Assert(!ctx->fast_forward);
> +
> + /* We're only supposed to call this when streaming is supported. */
> + Assert(ctx->streaming);
> +
> + /* Push callback + info on the error context stack */
> + state.ctx = ctx;
> + state.callback_name = "stream_start";
> + /* state.report_location = apply_lsn; */
>
> Why can't we supply the report_location here?  I think here we need to
> report txn->first_lsn if this is the very first stream and
> txn->final_lsn if it is any consecutive one.

I am not sure about this, because for the very first stream we will
report the location of the first lsn of the stream, whereas for a
consecutive stream we will report the last lsn in the stream.

>
> 11.
> - * HeapTupleSatisfiesHistoricMVCC.
> + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> + *
> + * We do build the hash table even if there are no CIDs. That's
> + * because when streaming in-progress transactions we may run into
> + * tuples with the CID before actually decoding them. Think e.g. about
> + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> + * yet when applying the INSERT. So we build a hash table so that
> + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> + *
> + * XXX We might limit this behavior to streaming mode, and just bail
> + * out when decoding transaction at commit time (at which point it's
> + * guaranteed to see all CIDs).
>   */
>  static void
>  ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> *rb, ReorderBufferTXN *txn)
>   dlist_iter iter;
>   HASHCTL hash_ctl;
>
> - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> - return;
> -
>
> I don't understand this change.  Why would "INSERT followed by
> TRUNCATE" could lead to a tuple which can come for decode before its
> CID?

Actually, even if we haven't decoded the DDL operation yet, the tuple
might already have been deleted from the system table by a later
operation.  E.g. while we are streaming the INSERT, it is possible that
the TRUNCATE has already deleted that tuple and set its cmax.  Before
the streaming patch we only replayed the INSERT at commit time, so by
then we had seen all the operations that did DDL and we would have
already prepared the tuple CID hash.

>  The patch has made changes based on this assumption in
> HeapTupleSatisfiesHistoricMVCC which appears to be very risky as the
> behavior could be dependent on whether we are streaming the changes
> for in-progress xact or at the commit of a transaction.  We might want
> to generate a test to once validate this behavior.

We have already added a test case for this: 011_stream_ddl.pl in
test/subscription.

> Also, the comment refers to tqual.c which is wrong as this API is now
> in heapam_visibility.c.

Ok, will fix.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, May 13, 2020 at 11:35 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > v20-0003-Extend-the-output-plugin-API-with-stream-methods
> > ----------------------------------------------------------------------------------------
> > 1.
> > +static void
> > +pg_decode_stream_change(LogicalDecodingContext *ctx,
> > + ReorderBufferTXN *txn,
> > + Relation relation,
> > + ReorderBufferChange *change)
> > +{
> > + OutputPluginPrepareWrite(ctx, true);
> > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> > + OutputPluginWrite(ctx, true);
> > +}
> > +
> > +static void
> > +pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
> > +   int nrelations, Relation relations[],
> > +   ReorderBufferChange *change)
> > +{
> > + OutputPluginPrepareWrite(ctx, true);
> > + appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
> > + OutputPluginWrite(ctx, true);
> > +}
> >
> > In the above and similar APIs, there are parameters like relation
> > which are not used.  I think you should add some comments atop these
> > APIs to explain why it is so? I guess it is because we want to keep
> > them similar to non-stream version of APIs and we can't display
> > relation or other information as the transaction is still in-progress.
>
> I think because the interfaces are designed that way because other
> decoding plugins might need it e.g. in pgoutput we need change and
> relation but not here.  We have other similar examples also e.g.
> pg_decode_message has the parameter txn but not used.  Do you think we
> still need to add comments?
>

In that case, we can leave it as is, but let's ensure that we are not
exposing any parameter which is not used, and if there is one for some
reason, we should document it.  I will also look into this.

> > 4.
> > +static void
> > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> > +{
> > + LogicalDecodingContext *ctx = cache->private_data;
> > + LogicalErrorCallbackState state;
> > + ErrorContextCallback errcallback;
> > +
> > + Assert(!ctx->fast_forward);
> > +
> > + /* We're only supposed to call this when streaming is supported. */
> > + Assert(ctx->streaming);
> > +
> > + /* Push callback + info on the error context stack */
> > + state.ctx = ctx;
> > + state.callback_name = "stream_start";
> > + /* state.report_location = apply_lsn; */
> >
> > Why can't we supply the report_location here?  I think here we need to
> > report txn->first_lsn if this is the very first stream and
> > txn->final_lsn if it is any consecutive one.
>
> I am not sure about this,  Because for the very first stream we will
> report the location of the first lsn of the stream and for the
> consecutive stream we will report the last lsn in the stream.
>

Yeah, that doesn't seem to be consistent.  How about if we get it as an
additional parameter?  The caller can pass the lsn of the very first
change it is trying to decode in this stream.
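
To make the idea concrete, a rough sketch of such a signature change (the
first_lsn parameter and its use are assumptions, not the actual patch code)
could be:

static void
stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
                        XLogRecPtr first_lsn)
{
    LogicalDecodingContext *ctx = cache->private_data;
    LogicalErrorCallbackState state;
    ErrorContextCallback errcallback;

    Assert(!ctx->fast_forward);

    /* We're only supposed to call this when streaming is supported. */
    Assert(ctx->streaming);

    /* Push callback + info on the error context stack */
    state.ctx = ctx;
    state.callback_name = "stream_start";
    /* report the LSN of the first change decoded in this stream */
    state.report_location = first_lsn;

    /* (error context push, callback invocation and pop as in the other
     * *_cb_wrapper functions) */
}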

> >
> > 11.
> > - * HeapTupleSatisfiesHistoricMVCC.
> > + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> > + *
> > + * We do build the hash table even if there are no CIDs. That's
> > + * because when streaming in-progress transactions we may run into
> > + * tuples with the CID before actually decoding them. Think e.g. about
> > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> > + * yet when applying the INSERT. So we build a hash table so that
> > + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> > + *
> > + * XXX We might limit this behavior to streaming mode, and just bail
> > + * out when decoding transaction at commit time (at which point it's
> > + * guaranteed to see all CIDs).
> >   */
> >  static void
> >  ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> > *rb, ReorderBufferTXN *txn)
> >   dlist_iter iter;
> >   HASHCTL hash_ctl;
> >
> > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> > - return;
> > -
> >
> > I don't understand this change.  Why would "INSERT followed by
> > TRUNCATE" could lead to a tuple which can come for decode before its
> > CID?
>
> Actually, even if we haven't decoded the DDL operation but in the
> actual system table the tuple might have been deleted from the next
> operation.  e.g. while we are streaming the INSERT it is possible that
> the truncate has already deleted that tuple and set the max for the
> tuple.  So before streaming patch, we were only streaming the INSERT
> only on commit so by that time we had got all the operation which has
> done DDL and we would have already prepared tuple CID hash.
>

Okay, but for that case, how good is it that we always allow the CID
hash table to be built even if there are no catalog changes in the TXN
(see changes in ReorderBufferBuildTupleCidHash)?  Can't we detect that
while resolving the cmin/cmax?

Few more comments for v20-0005-Implement-streaming-mode-in-ReorderBuffer:
----------------------------------------------------------------------------------------------------------------
1.
/*
- * Binary heap comparison function.
+ * Binary heap comparison function (regular non-streaming iterator).
  */
 static int
 ReorderBufferIterCompare(Datum a, Datum b, void *arg)

It seems to me the above comment change is not required as per the latest patch.

2.
 * For subtransactions, we only mark them as streamed when there are
+ * any changes in them.
+ *
+ * We do it this way because of aborts - we don't want to send aborts
+ * for XIDs the downstream is not aware of. And of course, it always
+ * knows about the toplevel xact (we send the XID in all messages),
+ * but we never stream XIDs of empty subxacts.
+ */
+ if ((!txn->toptxn) || (txn->nentries_mem != 0))
+ txn->txn_flags |= RBTXN_IS_STREAMED;

/when there are any changes in them/when there are changes in them.  I
think we don't need 'any' in the above sentence.

3.
And, during catalog scan we can check the status of the xid and
+ * if it is aborted we will report a specific error that we can ignore.  We
+ * might have already streamed some of the changes for the aborted
+ * (sub)transaction, but that is fine because when we decode the abort we will
+ * stream abort message to truncate the changes in the subscriber.
+ */
+static inline void
+SetupCheckXidLive(TransactionId xid)

In the above comment, I don't think it is right to say that we ignore
the error raised due to the aborted transaction.  We need to say that
we discard the already streamed changes on such an error.

4.
+static inline void
+SetupCheckXidLive(TransactionId xid)
+{
  /*
- * If this transaction has no snapshot, it didn't make any changes to the
- * database, so there's nothing to decode.  Note that
- * ReorderBufferCommitChild will have transferred any snapshots from
- * subtransactions if there were any.
+ * setup CheckXidAlive if it's not committed yet. We don't check if the xid
+ * aborted. That will happen during catalog access.  Also reset the
+ * sysbegin_called flag.
  */
- if (txn->base_snapshot == NULL)
+ if (!TransactionIdDidCommit(xid))
  {
- Assert(txn->ninvalidations == 0);
- ReorderBufferCleanupTXN(rb, txn);
- return;
+ CheckXidAlive = xid;
+ bsysscan = false;
  }

I think this function is inline because it needs to be called for each
change.  If that is the case (and even otherwise), isn't it better to
check whether the passed xid is the same as CheckXidAlive before calling
TransactionIdDidCommit?  TransactionIdDidCommit can be costly, and
calling it for each change might not be a good idea.

5.
setup CheckXidAlive if it's not committed yet. We don't check if the xid
+ * aborted. That will happen during catalog access.  Also reset the
+ * sysbegin_called flag.

/if the xid aborted/if the xid is aborted.  missing comma after Also.

6.
ReorderBufferProcessTXN()
{
..
- /* build data to be able to lookup the CommandIds of catalog tuples */
+ /*
+ * build data to be able to lookup the CommandIds of catalog tuples
+ */
  ReorderBufferBuildTupleCidHash(rb, txn);
..
}

Is there a need to change the formatting of the comment?

7.
ReorderBufferProcessTXN()
{
..
  if (using_subtxn)
- BeginInternalSubTransaction("replay");
+ BeginInternalSubTransaction("stream");
  else
  StartTransactionCommand();
..
}

I am not sure unconditionally changing "replay" to "stream" is a good
idea.  How about something like BeginInternalSubTransaction(streaming
? "stream" : "replay");?

8.
@@ -1588,8 +1766,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
  * use as a normal record. It'll be cleaned up at the end
  * of INSERT processing.
  */
- if (specinsert == NULL)
- elog(ERROR, "invalid ordering of speculative insertion changes");

You have removed this check, but all other handling of specinsert is
the same as far as this patch is concerned.  Why so?

9.
@@ -1676,8 +1860,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
  * freed/reused while restoring spooled data from
  * disk.
  */
- Assert(change->data.tp.newtuple != NULL);
-
  dlist_delete(&change->node);

Why is this Assert removed?

10.
@@ -1753,7 +1935,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
  relations[nrelations++] = relation;
  }

- rb->apply_truncate(rb, txn, nrelations, relations, change);
+ if (streaming)
+ {
+ rb->stream_truncate(rb, txn, nrelations, relations, change);
+
+ /* Remember that we have sent some data. */
+ change->txn->any_data_sent = true;
+ }
+ else
+ rb->apply_truncate(rb, txn, nrelations, relations, change);

Can we encapsulate this in a separate function like
ReorderBufferApplyTruncate or something like that?  Basically, rather
than having the streaming check in this function, let's do it in some
other internal function.  And we can likewise do it for all the
streaming checks in this function, or at least wherever it is feasible.
That will make this function look clean.
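
For illustration, such a helper (the name and the streaming parameter are
assumptions) could look like:

static void
ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
                           int nrelations, Relation *relations,
                           ReorderBufferChange *change, bool streaming)
{
    if (streaming)
    {
        rb->stream_truncate(rb, txn, nrelations, relations, change);

        /* Remember that we have sent some data. */
        change->txn->any_data_sent = true;
    }
    else
        rb->apply_truncate(rb, txn, nrelations, relations, change);
}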

11.
+ * We currently can only decode a transaction's contents when its commit
+ * record is read because that's the only place where we know about cache
+ * invalidations. Thus, once a toplevel commit is read, we iterate over the top
+ * and subtransactions (using a k-way merge) and replay the changes in lsn
+ * order.
+ */
+void
+ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
{
..

I think the above comment needs to be updated after this patch.  This
API can now be used during the decode of both an in-progress and a
committed transaction.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, May 13, 2020 at 11:35 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > v20-0003-Extend-the-output-plugin-API-with-stream-methods
> > > ----------------------------------------------------------------------------------------
> > > 1.
> > > +static void
> > > +pg_decode_stream_change(LogicalDecodingContext *ctx,
> > > + ReorderBufferTXN *txn,
> > > + Relation relation,
> > > + ReorderBufferChange *change)
> > > +{
> > > + OutputPluginPrepareWrite(ctx, true);
> > > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> > > + OutputPluginWrite(ctx, true);
> > > +}
> > > +
> > > +static void
> > > +pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
> > > +   int nrelations, Relation relations[],
> > > +   ReorderBufferChange *change)
> > > +{
> > > + OutputPluginPrepareWrite(ctx, true);
> > > + appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
> > > + OutputPluginWrite(ctx, true);
> > > +}
> > >
> > > In the above and similar APIs, there are parameters like relation
> > > which are not used.  I think you should add some comments atop these
> > > APIs to explain why it is so? I guess it is because we want to keep
> > > them similar to non-stream version of APIs and we can't display
> > > relation or other information as the transaction is still in-progress.
> >
> > I think because the interfaces are designed that way because other
> > decoding plugins might need it e.g. in pgoutput we need change and
> > relation but not here.  We have other similar examples also e.g.
> > pg_decode_message has the parameter txn but not used.  Do you think we
> > still need to add comments?
> >
>
> In that case, we can leave but lets ensure that we are not exposing
> any parameter which is not used and if there is any due to some
> reason, we should document it. I will also look into this.
>
> > > 4.
> > > +static void
> > > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> > > +{
> > > + LogicalDecodingContext *ctx = cache->private_data;
> > > + LogicalErrorCallbackState state;
> > > + ErrorContextCallback errcallback;
> > > +
> > > + Assert(!ctx->fast_forward);
> > > +
> > > + /* We're only supposed to call this when streaming is supported. */
> > > + Assert(ctx->streaming);
> > > +
> > > + /* Push callback + info on the error context stack */
> > > + state.ctx = ctx;
> > > + state.callback_name = "stream_start";
> > > + /* state.report_location = apply_lsn; */
> > >
> > > Why can't we supply the report_location here?  I think here we need to
> > > report txn->first_lsn if this is the very first stream and
> > > txn->final_lsn if it is any consecutive one.
> >
> > I am not sure about this,  Because for the very first stream we will
> > report the location of the first lsn of the stream and for the
> > consecutive stream we will report the last lsn in the stream.
> >
>
> Yeah, that doesn't seem to be consistent.  How about if get it as an
> additional parameter?  The caller can pass the lsn of the very first
> change it is trying to decode in this stream.

Hmm, I think we need to call ReorderBufferIterTXNInit and
ReorderBufferIterTXNNext and get the first change of the stream; only
after that shall we call stream start, and then we can find out the
first LSN of the stream.  I will see how to do this so that it doesn't
look awkward.  Basically, as of now, our code has this layout:

1. stream_start;
2. ReorderBufferIterTXNInit(rb, txn, &iterstate);
while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
{
stream changes
}
3. stream stop

So if we want to know the first lsn of this stream, then we shall do
something like this:

1. ReorderBufferIterTXNInit(rb, txn, &iterstate);
while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
{
    2. if first_change
       stream_start;

   stream changes
}
3. stream stop
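
In C terms, layout 2 might look roughly like the sketch below (the
stream_started flag and the extra LSN argument to stream_start are
assumptions based on the suggestion above, not the current patch code):

bool        stream_started = false;
ReorderBufferChange *change;

ReorderBufferIterTXNInit(rb, txn, &iterstate);
while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
{
    /* Start the stream lazily, once the first change (and its LSN) is known. */
    if (!stream_started)
    {
        rb->stream_start(rb, txn, change->lsn);
        stream_started = true;
    }

    /* ... stream the individual change here ... */
}

/* Close the stream only if we actually opened one. */
if (stream_started)
    rb->stream_stop(rb, txn);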

> > >
> > > 11.
> > > - * HeapTupleSatisfiesHistoricMVCC.
> > > + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> > > + *
> > > + * We do build the hash table even if there are no CIDs. That's
> > > + * because when streaming in-progress transactions we may run into
> > > + * tuples with the CID before actually decoding them. Think e.g. about
> > > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> > > + * yet when applying the INSERT. So we build a hash table so that
> > > + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> > > + *
> > > + * XXX We might limit this behavior to streaming mode, and just bail
> > > + * out when decoding transaction at commit time (at which point it's
> > > + * guaranteed to see all CIDs).
> > >   */
> > >  static void
> > >  ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> > > *rb, ReorderBufferTXN *txn)
> > >   dlist_iter iter;
> > >   HASHCTL hash_ctl;
> > >
> > > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> > > - return;
> > > -
> > >
> > > I don't understand this change.  Why would "INSERT followed by
> > > TRUNCATE" could lead to a tuple which can come for decode before its
> > > CID?
> >
> > Actually, even if we haven't decoded the DDL operation but in the
> > actual system table the tuple might have been deleted from the next
> > operation.  e.g. while we are streaming the INSERT it is possible that
> > the truncate has already deleted that tuple and set the max for the
> > tuple.  So before streaming patch, we were only streaming the INSERT
> > only on commit so by that time we had got all the operation which has
> > done DDL and we would have already prepared tuple CID hash.
> >
>
> Okay, but I think for that case how good is that we always allow CID
> hash table to be built even if there are no catalog changes in TXN
> (see changes in ReorderBufferBuildTupleCidHash).  Can't we detect that
> while resolving the cmin/cmax?

Maybe in ResolveCminCmaxDuringDecoding we can check whether
tuplecid_data is NULL; if so, we can return it as unresolved, and then
the caller can decide what to do based on that.
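
A minimal sketch of that idea (not the actual patch code), inside
ResolveCminCmaxDuringDecoding:

    /*
     * If no CID hash was built (e.g. no catalog changes were decoded yet),
     * report the cmin/cmax as unresolved and let the caller decide.
     */
    if (tuplecid_data == NULL)
        return false;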


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, May 13, 2020 at 9:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > > > 4.
> > > > +static void
> > > > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> > > > +{
> > > > + LogicalDecodingContext *ctx = cache->private_data;
> > > > + LogicalErrorCallbackState state;
> > > > + ErrorContextCallback errcallback;
> > > > +
> > > > + Assert(!ctx->fast_forward);
> > > > +
> > > > + /* We're only supposed to call this when streaming is supported. */
> > > > + Assert(ctx->streaming);
> > > > +
> > > > + /* Push callback + info on the error context stack */
> > > > + state.ctx = ctx;
> > > > + state.callback_name = "stream_start";
> > > > + /* state.report_location = apply_lsn; */
> > > >
> > > > Why can't we supply the report_location here?  I think here we need to
> > > > report txn->first_lsn if this is the very first stream and
> > > > txn->final_lsn if it is any consecutive one.
> > >
> > > I am not sure about this,  Because for the very first stream we will
> > > report the location of the first lsn of the stream and for the
> > > consecutive stream we will report the last lsn in the stream.
> > >
> >
> > Yeah, that doesn't seem to be consistent.  How about if get it as an
> > additional parameter?  The caller can pass the lsn of the very first
> > change it is trying to decode in this stream.
>
> Hmm,  I think we need to call ReorderBufferIterTXNInit and
> ReorderBufferIterTXNNext and get the first change of the stream after
> that we shall call stream start then we can find out the first LSN of
> the stream.   I will see how to do so that it doesn't look awkward.
> Basically, as of now, our code is of this layout.
>
> 1. stream_start;
> 2. ReorderBufferIterTXNInit(rb, txn, &iterstate);
> while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
> {
> stream changes
> }
> 3. stream stop
>
> So if we want to know the first lsn of this stream then we shall do
> something like this
>
> 1. ReorderBufferIterTXNInit(rb, txn, &iterstate);
> while ((change = ReorderBufferIterTXNNext(rb, iterstate)) != NULL)
> {
>     2. if first_change
>        stream_start;
>
>    stream changes
> }
> 3. stream stop
>

Yeah, something like that would work.  I think you need to check that
it is the first change only for the 'streaming' mode.

> > > >
> > > > 11.
> > > > - * HeapTupleSatisfiesHistoricMVCC.
> > > > + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> > > > + *
> > > > + * We do build the hash table even if there are no CIDs. That's
> > > > + * because when streaming in-progress transactions we may run into
> > > > + * tuples with the CID before actually decoding them. Think e.g. about
> > > > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> > > > + * yet when applying the INSERT. So we build a hash table so that
> > > > + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> > > > + *
> > > > + * XXX We might limit this behavior to streaming mode, and just bail
> > > > + * out when decoding transaction at commit time (at which point it's
> > > > + * guaranteed to see all CIDs).
> > > >   */
> > > >  static void
> > > >  ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> > > > *rb, ReorderBufferTXN *txn)
> > > >   dlist_iter iter;
> > > >   HASHCTL hash_ctl;
> > > >
> > > > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> > > > - return;
> > > > -
> > > >
> > > > I don't understand this change.  Why would "INSERT followed by
> > > > TRUNCATE" could lead to a tuple which can come for decode before its
> > > > CID?
> > >
> > > Actually, even if we haven't decoded the DDL operation but in the
> > > actual system table the tuple might have been deleted from the next
> > > operation.  e.g. while we are streaming the INSERT it is possible that
> > > the truncate has already deleted that tuple and set the max for the
> > > tuple.  So before streaming patch, we were only streaming the INSERT
> > > only on commit so by that time we had got all the operation which has
> > > done DDL and we would have already prepared tuple CID hash.
> > >
> >
> > Okay, but I think for that case how good is that we always allow CID
> > hash table to be built even if there are no catalog changes in TXN
> > (see changes in ReorderBufferBuildTupleCidHash).  Can't we detect that
> > while resolving the cmin/cmax?
>
> Maybe in ResolveCminCmaxDuringDecoding we can see if tuplecid_data is
> NULL then we can return as unresolved and then caller can take a call
> based on that.
>

Yeah, and add appropriate comments about why we are doing so and in
what kind of scenario that can happen.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, May 7, 2020 at 6:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 5, 2020 at 7:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have fixed one more issue in 0010 patch.  The issue was that once
> > the transaction is serialized due to the incomplete toast after
> > streaming the serialized store was not cleaned up so it was streaming
> > the same tuple multiple times.
> >
>
> I have reviewed a few patches (003, 004, and 005) and below are my comments.
>
> v20-0003-Extend-the-output-plugin-API-with-stream-methods
> ----------------------------------------------------------------------------------------
> 2.
> +   <para>
> +    Similar to spill-to-disk behavior, streaming is triggered when the total
> +    amount of changes decoded from the WAL (for all in-progress transactions)
> +    exceeds limit defined by
> <varname>logical_decoding_work_mem</varname> setting.
> +    At that point the largest toplevel transaction (measured by
> amount of memory
> +    currently used for decoded changes) is selected and streamed.
> +   </para>
>
> I think we need to explain here the cases/exception where we need to
> spill even when stream is enabled and check if this is per latest
> implementation, otherwise, update it.

Done

> 3.
> + * To support streaming, we require change/commit/abort callbacks. The
> + * message callback is optional, similarly to regular output plugins.
>
> /similarly/similar

Done

> 4.
> +static void
> +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> +{
> + LogicalDecodingContext *ctx = cache->private_data;
> + LogicalErrorCallbackState state;
> + ErrorContextCallback errcallback;
> +
> + Assert(!ctx->fast_forward);
> +
> + /* We're only supposed to call this when streaming is supported. */
> + Assert(ctx->streaming);
> +
> + /* Push callback + info on the error context stack */
> + state.ctx = ctx;
> + state.callback_name = "stream_start";
> + /* state.report_location = apply_lsn; */
>
> Why can't we supply the report_location here?  I think here we need to
> report txn->first_lsn if this is the very first stream and
> txn->final_lsn if it is any consecutive one.

Done

> 5.
> +static void
> +stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> +{
> + LogicalDecodingContext *ctx = cache->private_data;
> + LogicalErrorCallbackState state;
> + ErrorContextCallback errcallback;
> +
> + Assert(!ctx->fast_forward);
> +
> + /* We're only supposed to call this when streaming is supported. */
> + Assert(ctx->streaming);
> +
> + /* Push callback + info on the error context stack */
> + state.ctx = ctx;
> + state.callback_name = "stream_stop";
> + /* state.report_location = apply_lsn; */
>
> Can't we report txn->final_lsn here

We are already setting this to txn->final_lsn in the 0006 patch, but I
have moved that into this patch now.

> 6. I think it will be good if we can provide an example of streaming
> changes via test_decoding at
> https://www.postgresql.org/docs/devel/test-decoding.html. I think we
> can also explain there why the user is not expected to see the actual
> data in the stream.

I have a few problems to solve here.
- For streaming transactions as well, shall we show the actual values,
or shall we keep it as it currently is in the patch
(appendStringInfo(ctx->out, "streaming change for TXN %u",
txn->xid);)?  I think we should show the actual values instead of what
we are doing now.
- In the example we cannot show a real case, because to make an
in-progress transaction stream its changes we might have to insert a
lot of tuples.  I think we can show partial output?

> v20-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> ----------------------------------------------------------------------------------------
> 7.
> + /*
> + * We don't expect direct calls to table_tuple_get_latest_tid with valid
> + * CheckXidAlive  for catalog or regular tables.
>
> There is an extra space between 'CheckXidAlive' and 'for'.  I can see
> similar problems in other places as well where this comment is used,
> fix those as well.

Done

> 8.
> +/*
> + * CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
> + * transaction.  Currently, it is used in logical decoding.  It's possible
> + * that such transactions can get aborted while the decoding is ongoing in
> + * which case we skip decoding that particular transaction. To ensure that we
> + * check whether the CheckXidAlive is aborted after fetching the tuple from
> + * system tables.  We also ensure that during logical decoding we never
> + * directly access the tableam or heap APIs because we are checking for the
> + * concurrent aborts only in systable_* APIs.
> + */
>
> In this comment, there is an inconsistency in the space used after
> completing the sentence. In the part "transaction. To", single space
> is used whereas at other places two spaces are used after a full stop.

Done


> v20-0005-Implement-streaming-mode-in-ReorderBuffer
> -----------------------------------------------------------------------------
> 9.
> Implement streaming mode in ReorderBuffer
>
> Instead of serializing the transaction to disk after reaching the
> maximum number of changes in memory (4096 changes), we consume the
> changes we have in memory and invoke new stream API methods. This
> happens in ReorderBufferStreamTXN() using about the same logic as
> in ReorderBufferCommit() logic.
>
> I think the above part of the commit message needs to be updated.

Done

> 10.
> Theoretically, we could get rid of the k-way merge, and append the
> changes to the toplevel xact directly (and remember the position
> in the list in case the subxact gets aborted later).
>
> I don't think this part of the commit message is correct as we
> sometimes need to spill even during streaming.  Please check the
> entire commit message and update according to the latest
> implementation.

Done

> 11.
> - * HeapTupleSatisfiesHistoricMVCC.
> + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> + *
> + * We do build the hash table even if there are no CIDs. That's
> + * because when streaming in-progress transactions we may run into
> + * tuples with the CID before actually decoding them. Think e.g. about
> + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> + * yet when applying the INSERT. So we build a hash table so that
> + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> + *
> + * XXX We might limit this behavior to streaming mode, and just bail
> + * out when decoding transaction at commit time (at which point it's
> + * guaranteed to see all CIDs).
>   */
>  static void
>  ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> *rb, ReorderBufferTXN *txn)
>   dlist_iter iter;
>   HASHCTL hash_ctl;
>
> - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> - return;
> -
>
> I don't understand this change.  Why would "INSERT followed by
> TRUNCATE" could lead to a tuple which can come for decode before its
> CID?  The patch has made changes based on this assumption in
> HeapTupleSatisfiesHistoricMVCC which appears to be very risky as the
> behavior could be dependent on whether we are streaming the changes
> for in-progress xact or at the commit of a transaction.  We might want
> to generate a test to once validate this behavior.
>
> Also, the comment refers to tqual.c which is wrong as this API is now
> in heapam_visibility.c.

Done.

> 12.
> + * setup CheckXidAlive if it's not committed yet. We don't check if the xid
> + * aborted. That will happen during catalog access.  Also reset the
> + * sysbegin_called flag.
>   */
> - if (txn->base_snapshot == NULL)
> + if (!TransactionIdDidCommit(xid))
>   {
> - Assert(txn->ninvalidations == 0);
> - ReorderBufferCleanupTXN(rb, txn);
> - return;
> + CheckXidAlive = xid;
> + bsysscan = false;
>   }
>
> In the comment, the flag name 'sysbegin_called' should be bsysscan.

Done



--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, May 13, 2020 at 11:35 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > v20-0003-Extend-the-output-plugin-API-with-stream-methods
> > > ----------------------------------------------------------------------------------------
> > > 1.
> > > +static void
> > > +pg_decode_stream_change(LogicalDecodingContext *ctx,
> > > + ReorderBufferTXN *txn,
> > > + Relation relation,
> > > + ReorderBufferChange *change)
> > > +{
> > > + OutputPluginPrepareWrite(ctx, true);
> > > + appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
> > > + OutputPluginWrite(ctx, true);
> > > +}
> > > +
> > > +static void
> > > +pg_decode_stream_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
> > > +   int nrelations, Relation relations[],
> > > +   ReorderBufferChange *change)
> > > +{
> > > + OutputPluginPrepareWrite(ctx, true);
> > > + appendStringInfo(ctx->out, "streaming truncate for TXN %u", txn->xid);
> > > + OutputPluginWrite(ctx, true);
> > > +}
> > >
> > > In the above and similar APIs, there are parameters like relation
> > > which are not used.  I think you should add some comments atop these
> > > APIs to explain why it is so? I guess it is because we want to keep
> > > them similar to non-stream version of APIs and we can't display
> > > relation or other information as the transaction is still in-progress.
> >
> > I think because the interfaces are designed that way because other
> > decoding plugins might need it e.g. in pgoutput we need change and
> > relation but not here.  We have other similar examples also e.g.
> > pg_decode_message has the parameter txn but not used.  Do you think we
> > still need to add comments?
> >
>
> In that case, we can leave but lets ensure that we are not exposing
> any parameter which is not used and if there is any due to some
> reason, we should document it. I will also look into this.

Ok

> > > 4.
> > > +static void
> > > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> > > +{
> > > + LogicalDecodingContext *ctx = cache->private_data;
> > > + LogicalErrorCallbackState state;
> > > + ErrorContextCallback errcallback;
> > > +
> > > + Assert(!ctx->fast_forward);
> > > +
> > > + /* We're only supposed to call this when streaming is supported. */
> > > + Assert(ctx->streaming);
> > > +
> > > + /* Push callback + info on the error context stack */
> > > + state.ctx = ctx;
> > > + state.callback_name = "stream_start";
> > > + /* state.report_location = apply_lsn; */
> > >
> > > Why can't we supply the report_location here?  I think here we need to
> > > report txn->first_lsn if this is the very first stream and
> > > txn->final_lsn if it is any consecutive one.
> >
> > I am not sure about this,  Because for the very first stream we will
> > report the location of the first lsn of the stream and for the
> > consecutive stream we will report the last lsn in the stream.
> >
>
> Yeah, that doesn't seem to be consistent.  How about if get it as an
> additional parameter?  The caller can pass the lsn of the very first
> change it is trying to decode in this stream.

Done

> > > 11.
> > > - * HeapTupleSatisfiesHistoricMVCC.
> > > + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> > > + *
> > > + * We do build the hash table even if there are no CIDs. That's
> > > + * because when streaming in-progress transactions we may run into
> > > + * tuples with the CID before actually decoding them. Think e.g. about
> > > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> > > + * yet when applying the INSERT. So we build a hash table so that
> > > + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> > > + *
> > > + * XXX We might limit this behavior to streaming mode, and just bail
> > > + * out when decoding transaction at commit time (at which point it's
> > > + * guaranteed to see all CIDs).
> > >   */
> > >  static void
> > >  ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> > > *rb, ReorderBufferTXN *txn)
> > >   dlist_iter iter;
> > >   HASHCTL hash_ctl;
> > >
> > > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> > > - return;
> > > -
> > >
> > > I don't understand this change.  Why would "INSERT followed by
> > > TRUNCATE" could lead to a tuple which can come for decode before its
> > > CID?
> >
> > Actually, even if we haven't decoded the DDL operation but in the
> > actual system table the tuple might have been deleted from the next
> > operation.  e.g. while we are streaming the INSERT it is possible that
> > the truncate has already deleted that tuple and set the max for the
> > tuple.  So before streaming patch, we were only streaming the INSERT
> > only on commit so by that time we had got all the operation which has
> > done DDL and we would have already prepared tuple CID hash.
> >
>
> Okay, but I think for that case how good is that we always allow CID
> hash table to be built even if there are no catalog changes in TXN
> (see changes in ReorderBufferBuildTupleCidHash).  Can't we detect that
> while resolving the cmin/cmax?

Done

>
> Few more comments for v20-0005-Implement-streaming-mode-in-ReorderBuffer:
> ----------------------------------------------------------------------------------------------------------------
> 1.
> /*
> - * Binary heap comparison function.
> + * Binary heap comparison function (regular non-streaming iterator).
>   */
>  static int
>  ReorderBufferIterCompare(Datum a, Datum b, void *arg)
>
> It seems to me the above comment change is not required as per the latest patch.

Done

> 2.
>  * For subtransactions, we only mark them as streamed when there are
> + * any changes in them.
> + *
> + * We do it this way because of aborts - we don't want to send aborts
> + * for XIDs the downstream is not aware of. And of course, it always
> + * knows about the toplevel xact (we send the XID in all messages),
> + * but we never stream XIDs of empty subxacts.
> + */
> + if ((!txn->toptxn) || (txn->nentries_mem != 0))
> + txn->txn_flags |= RBTXN_IS_STREAMED;
>
> /when there are any changes in them/when there are changes in them.  I
> think we don't need 'any' in the above sentence.

Done

> 3.
> And, during catalog scan we can check the status of the xid and
> + * if it is aborted we will report a specific error that we can ignore.  We
> + * might have already streamed some of the changes for the aborted
> + * (sub)transaction, but that is fine because when we decode the abort we will
> + * stream abort message to truncate the changes in the subscriber.
> + */
> +static inline void
> +SetupCheckXidLive(TransactionId xid)
>
> In the above comment, I don't think it is right to say that we ignore
> the error raised due to the aborted transaction.  We need to say that
> we discard the already streamed changes on such an error.

Done.

> 4.
> +static inline void
> +SetupCheckXidLive(TransactionId xid)
> +{
>   /*
> - * If this transaction has no snapshot, it didn't make any changes to the
> - * database, so there's nothing to decode.  Note that
> - * ReorderBufferCommitChild will have transferred any snapshots from
> - * subtransactions if there were any.
> + * setup CheckXidAlive if it's not committed yet. We don't check if the xid
> + * aborted. That will happen during catalog access.  Also reset the
> + * sysbegin_called flag.
>   */
> - if (txn->base_snapshot == NULL)
> + if (!TransactionIdDidCommit(xid))
>   {
> - Assert(txn->ninvalidations == 0);
> - ReorderBufferCleanupTXN(rb, txn);
> - return;
> + CheckXidAlive = xid;
> + bsysscan = false;
>   }
>
> I think this function is inline as it needs to be called for each
> change. If that is the case and otherwise also, isn't it better that
> we check if passed xid is the same as CheckXidAlive before checking
> TransactionIdDidCommit as TransactionIdDidCommit can be costly and
> calling it for each change might not be a good idea?

Done.  Also, I think it is good to check TransactionIdIsInProgress
instead of !TransactionIdDidCommit.  I have changed that as well.
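
Roughly, the combined change could look like the following sketch (not the
exact patch code):

static inline void
SetupCheckXidLive(TransactionId xid)
{
    /* Avoid the (potentially costly) status lookup if we are already
     * watching this xid. */
    if (TransactionIdEquals(CheckXidAlive, xid))
        return;

    /*
     * Set up CheckXidAlive if the xid is still in progress.  Whether it
     * later aborts is detected during catalog access.  Also reset the
     * bsysscan flag.
     */
    if (TransactionIdIsInProgress(xid))
    {
        CheckXidAlive = xid;
        bsysscan = false;
    }
}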

> 5.
> setup CheckXidAlive if it's not committed yet. We don't check if the xid
> + * aborted. That will happen during catalog access.  Also reset the
> + * sysbegin_called flag.
>
> /if the xid aborted/if the xid is aborted.  missing comma after Also.

Done

> 6.
> ReorderBufferProcessTXN()
> {
> ..
> - /* build data to be able to lookup the CommandIds of catalog tuples */
> + /*
> + * build data to be able to lookup the CommandIds of catalog tuples
> + */
>   ReorderBufferBuildTupleCidHash(rb, txn);
> ..
> }
>
> Is there a need to change the formatting of the comment?

No need; changed it back.

>
> 7.
> ReorderBufferProcessTXN()
> {
> ..
>   if (using_subtxn)
> - BeginInternalSubTransaction("replay");
> + BeginInternalSubTransaction("stream");
>   else
>   StartTransactionCommand();
> ..
> }
>
> I am not sure changing unconditionally "replay" to "stream" is a good
> idea.  How about something like BeginInternalSubTransaction(streaming
> ? "stream" : "replay");?

Done

> 8.
> @@ -1588,8 +1766,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
>   * use as a normal record. It'll be cleaned up at the end
>   * of INSERT processing.
>   */
> - if (specinsert == NULL)
> - elog(ERROR, "invalid ordering of speculative insertion changes");
>
> You have removed this check but all other handling of specinsert is
> same as far as this patch is concerned.  Why so?

Seems like a merge issue, or a leftover from the old design of the
toast handling where we were streaming with the partial tuple.
Fixed now.

> 9.
> @@ -1676,8 +1860,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
>   * freed/reused while restoring spooled data from
>   * disk.
>   */
> - Assert(change->data.tp.newtuple != NULL);
> -
>   dlist_delete(&change->node);
>
> Why is this Assert removed?

Same cause as above so fixed.

> 10.
> @@ -1753,7 +1935,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
>   relations[nrelations++] = relation;
>   }
>
> - rb->apply_truncate(rb, txn, nrelations, relations, change);
> + if (streaming)
> + {
> + rb->stream_truncate(rb, txn, nrelations, relations, change);
> +
> + /* Remember that we have sent some data. */
> + change->txn->any_data_sent = true;
> + }
> + else
> + rb->apply_truncate(rb, txn, nrelations, relations, change);
>
> Can we encapsulate this in a separate function like
> ReorderBufferApplyTruncate or something like that?  Basically, rather
> than having streaming check in this function, lets do it in some other
> internal function.  And we can likewise do it for all the streaming
> checks in this function or at least whereever it is feasible.  That
> will make this function look clean.

Done for truncate and change.  I think we can create a few more such
functions for
start/stop and cleanup handling on error.  I will work on that.

> 11.
> + * We currently can only decode a transaction's contents when its commit
> + * record is read because that's the only place where we know about cache
> + * invalidations. Thus, once a toplevel commit is read, we iterate over the top
> + * and subtransactions (using a k-way merge) and replay the changes in lsn
> + * order.
> + */
> +void
> +ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> {
> ..
>
> I think the above comment needs to be updated after this patch. This
> API can now be used during the decode of both a in-progress and a
> committed transaction.

Done


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > 6. I think it will be good if we can provide an example of streaming
> > changes via test_decoding at
> > https://www.postgresql.org/docs/devel/test-decoding.html. I think we
> > can also explain there why the user is not expected to see the actual
> > data in the stream.
>
> I have a few problems to solve here.
> -  With streaming transaction also shall we show the actual values or
> we shall do like it is currently in the patch
> (appendStringInfo(ctx->out, "streaming change for TXN %u",
> txn->xid);).  I think we should show the actual values instead of what
> we are doing now.
>

I think the reason we don't want to display the tuple at this stage is
that it is not clear by this time whether the transaction will commit or
abort.  I am not sure if displaying the contents of aborted
transactions is a good idea, but if there is a reason for doing so, we
can do it later as well.

> - In the example we can not show a real example, because of the
> in-progress transaction to show the changes, we might have to
> implement a lot of tuple.  I think we can show the partial output?
>

I think we can display what the API will actually display; what is the
confusion here?

I have a few more comments on the previous version of the patch
v20-0005-Implement-streaming-mode-in-ReorderBuffer.  If you have already
fixed any of them, then leave those and fix the others.

Review comments:
------------------------------
1.
@@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
TransactionId xid,
  }

  case REORDER_BUFFER_CHANGE_MESSAGE:
- rb->message(rb, txn, change->lsn, true,
- change->data.msg.prefix,
- change->data.msg.message_size,
- change->data.msg.message);
+ if (streaming)
+ rb->stream_message(rb, txn, change->lsn, true,
+    change->data.msg.prefix,
+    change->data.msg.message_size,
+    change->data.msg.message);
+ else
+ rb->message(rb, txn, change->lsn, true,
+    change->data.msg.prefix,
+    change->data.msg.message_size,
+    change->data.msg.message);

Don't we need to set any_data_sent flag while streaming messages as we
do for other types of changes?

2.
+ if (streaming)
+ {
+ /*
+ * Set the last of the stream as the final lsn before calling
+ * stream stop.
+ */
+ if (!XLogRecPtrIsInvalid(prev_lsn))
+ txn->final_lsn = prev_lsn;
+ rb->stream_stop(rb, txn);
+ }

I am not sure if it is good to use final_lsn for this purpose.  See
comments for this variable in reorderbuffer.h.  Basically, it is used
for a specific purpose on different occasions.  Now, if we want to
start using it for a new purpose, we need to study its interaction
with all other places and update the comments as well.  Can we pass an
additional parameter to stream_stop() instead?

3.
+ /* remember the command ID and snapshot for the streaming run */
+ txn->command_id = command_id;
+
+ /* Avoid copying if it's already copied. */
+ if (snapshot_now->copied)
+ txn->snapshot_now = snapshot_now;
+ else
+ txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
+   txn, command_id);

This code is used in two different places; can we try to keep it in
a single function?

4.
In ReorderBufferProcessTXN(), the patch is calling stream_stop in both
the try and the catch block.  If there is an error after calling it in
the try block, we might call it again via the catch block.  I think that
will lead to sending a stop message twice.  Won't that be a problem?
See the usage of iterstate in the catch block; we have made it safe from
a similar problem.
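
For reference, the iterstate handling follows roughly this pattern, and a
similar guard (sketched here, not taken from the patch) could prevent a
second stream_stop:

PG_TRY();
{
    /* ... apply or stream the changes ... */

    ReorderBufferIterTXNFinish(rb, iterstate);
    iterstate = NULL;       /* prevents a second finish in the catch block */

    /* ... stream_stop / commit callbacks ... */
}
PG_CATCH();
{
    /* Only clean up the iterator if it was not already finished above. */
    if (iterstate)
        ReorderBufferIterTXNFinish(rb, iterstate);

    /* A similar "already stopped" flag could ensure stream_stop is not
     * sent twice. */

    PG_RE_THROW();
}
PG_END_TRY();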

5.
+ if (streaming)
+ {
+ /* Discard the changes that we just streamed. */
+ ReorderBufferTruncateTXN(rb, txn);

- PG_RE_THROW();
+ /* Re-throw only if it's not an abort. */
+ if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
+ {
+ MemoryContextSwitchTo(ecxt);
+ PG_RE_THROW();
+ }
+ else
+ {
+ FlushErrorState();
+ FreeErrorData(errdata);
+ errdata = NULL;
+

I think here we can write a few comments on why we are doing error-code
specific handling; basically, explain a bit about concurrent abort
handling and/or refer to the part of the comments where it is explained.

6.
PG_CATCH();
  {
+ MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
+ ErrorData  *errdata = CopyErrorData();

I don't understand the usage of the memory context in this part of the
code.  Basically, you are switching to CurrentMemoryContext here, doing
some error handling, and then resetting back to some random context
before rethrowing the error.  If there is some purpose for it, then it
might be better if you can write a few comments to explain it.

7.
+ReorderBufferCommit()
{
..
+ /*
+ * If the transaction was (partially) streamed, we need to commit it in a
+ * 'streamed' way. That is, we first stream the remaining part of the
+ * transaction, and then invoke stream_commit message.
+ *
+ * XXX Called after everything (origin ID and LSN, ...) is stored in the
+ * transaction, so we don't pass that directly.
+ *
+ * XXX Somewhat hackish redirection, perhaps needs to be refactored?
+ */
+ if (rbtxn_is_streamed(txn))
+ {
+ ReorderBufferStreamCommit(rb, txn);
+ return;
+ }
+
..
}

"XXX Somewhat hackish redirection, perhaps needs to be refactored?"
What kind of refactoring we can do here?  To me, it looks okay.

8.
@@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
*rb, TransactionId xid,
  txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

  txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
+
+ /*
+ * TOCHECK: Mark toplevel transaction as having catalog changes too
+ * if one of its children has.
+ */
+ if (txn->toptxn != NULL)
+ txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
 }

Why are we marking the top transaction here?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > > 6. I think it will be good if we can provide an example of streaming
> > > changes via test_decoding at
> > > https://www.postgresql.org/docs/devel/test-decoding.html. I think we
> > > can also explain there why the user is not expected to see the actual
> > > data in the stream.
> >
> > I have a few problems to solve here.
> > -  With streaming transaction also shall we show the actual values or
> > we shall do like it is currently in the patch
> > (appendStringInfo(ctx->out, "streaming change for TXN %u",
> > txn->xid);).  I think we should show the actual values instead of what
> > we are doing now.
> >
>
> I think why we don't want to display the tuple at this stage is
> because it is not clear by this time if the transaction will commit or
> abort.  I am not sure if displaying the contents of aborted
> transactions is a good idea but if there is a reason for doing so, we
> can do it later as well.

Ok.

>
> > - In the example we can not show a real example, because of the
> > in-progress transaction to show the changes, we might have to
> > implement a lot of tuple.  I think we can show the partial output?
> >
>
> I think we can display what API will actually display, what is the
> confusion here.

What I meant is that even with logical_decoding_work_mem=64kB, we need
quite a few changes in a transaction to stream it, so the example
output will be quite big.  So I suggested that we might not show a
real example; instead we would just show a few lines and cut the rest.
But I got your point; we can just show how it will look.

>
> I have a few more comments on the previous version of patch
> v20-0005-Implement-streaming-mode-in-ReorderBuffer.  If you have fixed
> any, then leave those and fix others.
>
> Review comments:
> ------------------------------
> 1.
> @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
> TransactionId xid,
>   }
>
>   case REORDER_BUFFER_CHANGE_MESSAGE:
> - rb->message(rb, txn, change->lsn, true,
> - change->data.msg.prefix,
> - change->data.msg.message_size,
> - change->data.msg.message);
> + if (streaming)
> + rb->stream_message(rb, txn, change->lsn, true,
> +    change->data.msg.prefix,
> +    change->data.msg.message_size,
> +    change->data.msg.message);
> + else
> + rb->message(rb, txn, change->lsn, true,
> +    change->data.msg.prefix,
> +    change->data.msg.message_size,
> +    change->data.msg.message);
>
> Don't we need to set any_data_sent flag while streaming messages as we
> do for other types of changes?

Actually, the pgoutput plugin doesn't send any data on stream_message.
But I agree that it is unclear how other plugins will handle this.  I
will analyze this part again; maybe we have to keep such a flag at the
plugin level, and whether a stop is sent or not can also be handled at
the plugin level.

> 2.
> + if (streaming)
> + {
> + /*
> + * Set the last of the stream as the final lsn before calling
> + * stream stop.
> + */
> + if (!XLogRecPtrIsInvalid(prev_lsn))
> + txn->final_lsn = prev_lsn;
> + rb->stream_stop(rb, txn);
> + }
>
> I am not sure if it is good to use final_lsn for this purpose.  See
> comments for this variable in reorderbuffer.h.  Basically, it is used
> for a specific purpose on different occasions.  Now, if we want to
> start using it for a new purpose, we need to study its interaction
> with all other places and update the comments as well.  Can we pass an
> additional parameter to stream_stop() instead?

I think it was in sync with the spill code, right?  I mean the last
change we spill is set as the final_lsn, and the same is done here.

Other comments look fine, so I will work on them and reply separately.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Fri, May 15, 2020 at 4:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> >
> > > - In the example we can not show a real example, because of the
> > > in-progress transaction to show the changes, we might have to
> > > implement a lot of tuple.  I think we can show the partial output?
> > >
> >
> > I think we can display what API will actually display, what is the
> > confusion here.
>
> What, I meant is that even with the logical_decoding_work_mem=64kb, we
> need to have quite a few changes in a transaction to stream it so the
> example output will be quite big in size.  So I told we might not show
> the real example instead we will just show a few lines and cut the
> remaining.  But, I got your point we can just show how it will look
> like.
>

Right.

> >
> > I have a few more comments on the previous version of patch
> > v20-0005-Implement-streaming-mode-in-ReorderBuffer.  If you have fixed
> > any, then leave those and fix others.
> >
> > Review comments:
> > ------------------------------
> > 1.
> > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
> > TransactionId xid,
> >   }
> >
> >   case REORDER_BUFFER_CHANGE_MESSAGE:
> > - rb->message(rb, txn, change->lsn, true,
> > - change->data.msg.prefix,
> > - change->data.msg.message_size,
> > - change->data.msg.message);
> > + if (streaming)
> > + rb->stream_message(rb, txn, change->lsn, true,
> > +    change->data.msg.prefix,
> > +    change->data.msg.message_size,
> > +    change->data.msg.message);
> > + else
> > + rb->message(rb, txn, change->lsn, true,
> > +    change->data.msg.prefix,
> > +    change->data.msg.message_size,
> > +    change->data.msg.message);
> >
> > Don't we need to set any_data_sent flag while streaming messages as we
> > do for other types of changes?
>
> Actually, pgoutput plugin don't send any data on stream_message.  But,
> I agree that how other plugin will handle.  I will analyze this part
> again, maybe we have to such flag at the plugin level and whether stop
> is sent to not can also be handled at the plugin level.
>

Okay, lets discuss this after your analysis.

> > 2.
> > + if (streaming)
> > + {
> > + /*
> > + * Set the last of the stream as the final lsn before calling
> > + * stream stop.
> > + */
> > + if (!XLogRecPtrIsInvalid(prev_lsn))
> > + txn->final_lsn = prev_lsn;
> > + rb->stream_stop(rb, txn);
> > + }
> >
> > I am not sure if it is good to use final_lsn for this purpose.  See
> > comments for this variable in reorderbuffer.h.  Basically, it is used
> > for a specific purpose on different occasions.  Now, if we want to
> > start using it for a new purpose, we need to study its interaction
> > with all other places and update the comments as well.  Can we pass an
> > additional parameter to stream_stop() instead?
>
> I think it was in sycn with the spill code right? I mean the last
> change we spill is set as the final_lsn and same is done here.
>

But we use final_lsn in ReorderBufferRestoreCleanup() for serialized
changes.  Now, in some cases, if we first do serialization, then
perform streaming, and then try to call ReorderBufferRestoreCleanup(),
it might not work as intended.  This might not happen today, but I
don't think we have any protection to avoid it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Fri, May 15, 2020 at 4:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 15, 2020 at 4:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > >
> > > > - In the example we can not show a real example, because of the
> > > > in-progress transaction to show the changes, we might have to
> > > > implement a lot of tuple.  I think we can show the partial output?
> > > >
> > >
> > > I think we can display what API will actually display, what is the
> > > confusion here.
> >
> > What, I meant is that even with the logical_decoding_work_mem=64kb, we
> > need to have quite a few changes in a transaction to stream it so the
> > example output will be quite big in size.  So I told we might not show
> > the real example instead we will just show a few lines and cut the
> > remaining.  But, I got your point we can just show how it will look
> > like.
> >
>
> Right.
>
> > >
> > > I have a few more comments on the previous version of patch
> > > v20-0005-Implement-streaming-mode-in-ReorderBuffer.  If you have fixed
> > > any, then leave those and fix others.
> > >
> > > Review comments:
> > > ------------------------------
> > > 1.
> > > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
> > > TransactionId xid,
> > >   }
> > >
> > >   case REORDER_BUFFER_CHANGE_MESSAGE:
> > > - rb->message(rb, txn, change->lsn, true,
> > > - change->data.msg.prefix,
> > > - change->data.msg.message_size,
> > > - change->data.msg.message);
> > > + if (streaming)
> > > + rb->stream_message(rb, txn, change->lsn, true,
> > > +    change->data.msg.prefix,
> > > +    change->data.msg.message_size,
> > > +    change->data.msg.message);
> > > + else
> > > + rb->message(rb, txn, change->lsn, true,
> > > +    change->data.msg.prefix,
> > > +    change->data.msg.message_size,
> > > +    change->data.msg.message);
> > >
> > > Don't we need to set any_data_sent flag while streaming messages as we
> > > do for other types of changes?
> >
> > Actually, pgoutput plugin don't send any data on stream_message.  But,
> > I agree that how other plugin will handle.  I will analyze this part
> > again, maybe we have to such flag at the plugin level and whether stop
> > is sent to not can also be handled at the plugin level.
> >
>
> Okay, lets discuss this after your analysis.
>
> > > 2.
> > > + if (streaming)
> > > + {
> > > + /*
> > > + * Set the last of the stream as the final lsn before calling
> > > + * stream stop.
> > > + */
> > > + if (!XLogRecPtrIsInvalid(prev_lsn))
> > > + txn->final_lsn = prev_lsn;
> > > + rb->stream_stop(rb, txn);
> > > + }
> > >
> > > I am not sure if it is good to use final_lsn for this purpose.  See
> > > comments for this variable in reorderbuffer.h.  Basically, it is used
> > > for a specific purpose on different occasions.  Now, if we want to
> > > start using it for a new purpose, we need to study its interaction
> > > with all other places and update the comments as well.  Can we pass an
> > > additional parameter to stream_stop() instead?
> >
> > I think it was in sycn with the spill code right? I mean the last
> > change we spill is set as the final_lsn and same is done here.
> >
>
> But we use final_lsn in ReorderBufferRestoreCleanup() for serialized
> changes.  Now, in some case if we first do serialization, then perform
> streaming and then tried to call ReorderBufferRestoreCleanup(),it
> might not work as intended.  Now, this might not happen today but I
> don't think we have any protection to avoid that.

If streaming is complete then we will remove the serialized flag, so it
will not cause any issue.  However, we can avoid setting final_lsn
here and instead pass a parameter to stream_stop with the last LSN of
the stream.
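
To illustrate, something like this (just a sketch; the extra lsn
parameter is the proposal here, not the callback's current signature):

if (streaming)
{
    /*
     * prev_lsn is the LSN of the last change sent in this stream; pass
     * it to the stop callback instead of overwriting txn->final_lsn.
     */
    rb->stream_stop(rb, txn, prev_lsn);
}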

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > > 6. I think it will be good if we can provide an example of streaming
> > > changes via test_decoding at
> > > https://www.postgresql.org/docs/devel/test-decoding.html. I think we
> > > can also explain there why the user is not expected to see the actual
> > > data in the stream.
> >
> > I have a few problems to solve here.
> > -  With streaming transaction also shall we show the actual values or
> > we shall do like it is currently in the patch
> > (appendStringInfo(ctx->out, "streaming change for TXN %u",
> > txn->xid);).  I think we should show the actual values instead of what
> > we are doing now.
> >
>
> I think why we don't want to display the tuple at this stage is
> because it is not clear by this time if the transaction will commit or
> abort.  I am not sure if displaying the contents of aborted
> transactions is a good idea but if there is a reason for doing so, we
> can do it later as well.
>
> > - In the example we can not show a real example, because of the
> > in-progress transaction to show the changes, we might have to
> > implement a lot of tuple.  I think we can show the partial output?
> >
>
> I think we can display what API will actually display, what is the
> confusion here.

Added example in the v22-0011 patch where I have added the API to get
streaming changes.

> I have a few more comments on the previous version of patch
> v20-0005-Implement-streaming-mode-in-ReorderBuffer.  If you have fixed
> any, then leave those and fix others.
>
> Review comments:
> ------------------------------
> 1.
> @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
> TransactionId xid,
>   }
>
>   case REORDER_BUFFER_CHANGE_MESSAGE:
> - rb->message(rb, txn, change->lsn, true,
> - change->data.msg.prefix,
> - change->data.msg.message_size,
> - change->data.msg.message);
> + if (streaming)
> + rb->stream_message(rb, txn, change->lsn, true,
> +    change->data.msg.prefix,
> +    change->data.msg.message_size,
> +    change->data.msg.message);
> + else
> + rb->message(rb, txn, change->lsn, true,
> +    change->data.msg.prefix,
> +    change->data.msg.message_size,
> +    change->data.msg.message);
>
> Don't we need to set any_data_sent flag while streaming messages as we
> do for other types of changes?

I think any_data_sent was added to avoid sending an abort to the
subscriber if we haven't sent any data, but this is not complete, as
the output plugin can also decide not to send anything.  So I think
this should not be done as part of this patch and can be done
separately.  I think there is already a thread for handling the
same [1].


> 2.
> + if (streaming)
> + {
> + /*
> + * Set the last of the stream as the final lsn before calling
> + * stream stop.
> + */
> + if (!XLogRecPtrIsInvalid(prev_lsn))
> + txn->final_lsn = prev_lsn;
> + rb->stream_stop(rb, txn);
> + }
>
> I am not sure if it is good to use final_lsn for this purpose.  See
> comments for this variable in reorderbuffer.h.  Basically, it is used
> for a specific purpose on different occasions.  Now, if we want to
> start using it for a new purpose, we need to study its interaction
> with all other places and update the comments as well.  Can we pass an
> additional parameter to stream_stop() instead?

Done

> 3.
> + /* remember the command ID and snapshot for the streaming run */
> + txn->command_id = command_id;
> +
> + /* Avoid copying if it's already copied. */
> + if (snapshot_now->copied)
> + txn->snapshot_now = snapshot_now;
> + else
> + txn->snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
> +   txn, command_id);
>
> This code is used at two different places, can we try to keep this in
> a single function.

Done

> 4.
> In ReorderBufferProcessTXN(), the patch is calling stream_stop in both
> the try and catch block.  If there is an error after calling it in a
> try block, we might call it again via catch.  I think that will lead
> to sending a stop message twice.  Won't that be a problem?  See the
> usage of iterstate in the catch block, we have made it safe from a
> similar problem.

IMHO, we don't need that, because we only call stream_stop in the
catch block if the error type is ERRCODE_TRANSACTION_ROLLBACK.  So if
we have already stopped the stream in the TRY block, then we should
not get that error.  I have added comments to that effect.

> 5.
> + if (streaming)
> + {
> + /* Discard the changes that we just streamed. */
> + ReorderBufferTruncateTXN(rb, txn);
>
> - PG_RE_THROW();
> + /* Re-throw only if it's not an abort. */
> + if (errdata->sqlerrcode != ERRCODE_TRANSACTION_ROLLBACK)
> + {
> + MemoryContextSwitchTo(ecxt);
> + PG_RE_THROW();
> + }
> + else
> + {
> + FlushErrorState();
> + FreeErrorData(errdata);
> + errdata = NULL;
> +
>
> I think here we can write few comments on why we are doing error-code
> specific handling, basically, explain a bit about concurrent abort
> handling and or refer to the part of comments where it is explained.

Done

> 6.
> PG_CATCH();
>   {
> + MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
> + ErrorData  *errdata = CopyErrorData();
>
> I don't understand the usage of memory context in this part of the
> code.  Basically, you are switching to CurrentMemoryContext here, do
> some error handling and then again reset back to some random context
> before rethrowing the error.  If there is some purpose for it, then it
> might be better if you can write a few comments to explain the same.

Basically, ccxt is the CurrentMemoryContext when we started the
streaming and ecxt is the context when we catch the error.  So,
before this change, it would rethrow in the context in which we
caught the error, i.e. ecxt.  What we are trying to do is switch back
to the normal context (ccxt) and copy the error data into that
context.  And, if we are not handling the error gracefully, we put it
back into the context it was in, and rethrow.
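
To make that concrete, the intended pattern is roughly the following
(a simplified sketch of the catch block, not the exact patch code):

PG_CATCH();
{
    /* Switch back to the context we were in before entering the TRY block. */
    MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
    ErrorData  *errdata = CopyErrorData(); /* copied into ccxt */

    if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
    {
        /* Concurrent abort: handle it gracefully and drop the error. */
        FlushErrorState();
        FreeErrorData(errdata);
    }
    else
    {
        /* Not handled here: go back to the error context and rethrow. */
        MemoryContextSwitchTo(ecxt);
        PG_RE_THROW();
    }
}
PG_END_TRY();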

>
> 7.
> +ReorderBufferCommit()
> {
> ..
> + /*
> + * If the transaction was (partially) streamed, we need to commit it in a
> + * 'streamed' way. That is, we first stream the remaining part of the
> + * transaction, and then invoke stream_commit message.
> + *
> + * XXX Called after everything (origin ID and LSN, ...) is stored in the
> + * transaction, so we don't pass that directly.
> + *
> + * XXX Somewhat hackish redirection, perhaps needs to be refactored?
> + */
> + if (rbtxn_is_streamed(txn))
> + {
> + ReorderBufferStreamCommit(rb, txn);
> + return;
> + }
> +
> ..
> }
>
> "XXX Somewhat hackish redirection, perhaps needs to be refactored?"
> What kind of refactoring we can do here?  To me, it looks okay.

I think it looks fine to me also.  So I have removed this comment.

> 8.
> @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> *rb, TransactionId xid,
>   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
>
>   txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> +
> + /*
> + * TOCHECK: Mark toplevel transaction as having catalog changes too
> + * if one of its children has.
> + */
> + if (txn->toptxn != NULL)
> + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
>  }
>
> Why are we marking top transaction here?

We need to mark the top transaction to decide whether to build the
tuplecid hash or not.  In non-streaming mode, we send changes only at
commit time, and at commit time we know whether the top transaction
has any catalog changes based on the invalidation messages, so we mark
the top transaction there in DecodeCommit.  Since here we are not
waiting until commit, we need to mark the top transaction as soon as
we mark any of its child transactions.

[1] https://www.postgresql.org/message-id/CAMkU=1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q@mail.gmail.com


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Review comments:
> > ------------------------------
> > 1.
> > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
> > TransactionId xid,
> >   }
> >
> >   case REORDER_BUFFER_CHANGE_MESSAGE:
> > - rb->message(rb, txn, change->lsn, true,
> > - change->data.msg.prefix,
> > - change->data.msg.message_size,
> > - change->data.msg.message);
> > + if (streaming)
> > + rb->stream_message(rb, txn, change->lsn, true,
> > +    change->data.msg.prefix,
> > +    change->data.msg.message_size,
> > +    change->data.msg.message);
> > + else
> > + rb->message(rb, txn, change->lsn, true,
> > +    change->data.msg.prefix,
> > +    change->data.msg.message_size,
> > +    change->data.msg.message);
> >
> > Don't we need to set any_data_sent flag while streaming messages as we
> > do for other types of changes?
>
> I think any_data_sent, was added to avoid sending abort to the
> subscriber if we haven't sent any data,  but this is not complete as
> the output plugin can also take the decision not to send.  So I think
> this should not be done as part of this patch and can be done
> separately.  I think there is already a thread for handling the
> same[1]
>

Hmm, but prior to this patch, we never used to send (empty) aborts, and
now that will be possible.  It is probably okay to deal with that in
the other patch you mentioned, but I felt at least any_data_sent would
work for some cases.  OTOH, it appears to be a half-baked solution, so
we should probably refrain from adding it.  BTW, how does the pgoutput
plugin deal with it?  I see that apply_handle_stream_abort will
unconditionally try to unlink the file and it will probably fail.
Have you tested this scenario after your latest changes?

>
> > 4.
> > In ReorderBufferProcessTXN(), the patch is calling stream_stop in both
> > the try and catch block.  If there is an error after calling it in a
> > try block, we might call it again via catch.  I think that will lead
> > to sending a stop message twice.  Won't that be a problem?  See the
> > usage of iterstate in the catch block, we have made it safe from a
> > similar problem.
>
> IMHO, we don't need that, because we only call stream_stop in the
> catch block if the error type is ERRCODE_TRANSACTION_ROLLBACK.  So if
> in TRY block we have already stopped the stream then we should not get
> that error.  I have added the comments for the same.
>

I am still slightly nervous about it, as I don't see any solid
guarantee of that.  You are right as the code stands today, but code
added in the future might not keep it true.  I feel it is better to
have an Assert here to ensure that stream_stop won't be called a
second time.  I don't see any good way of doing it other than by
maintaining a flag or some state, but I think it will be good to
ensure this.
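
For example, something along these lines (just a sketch;
'stream_started' is an illustrative name, not a field in the current
patch):

/* Remember whether the current stream block is open. */
bool        stream_started = false;

if (streaming)
{
    rb->stream_start(rb, txn);
    stream_started = true;
}

/* ... decode and send the changes ... */

if (streaming && stream_started)
{
    rb->stream_stop(rb, txn);
    stream_started = false;
}

/* A second stop (e.g. from the catch block) would now be caught early. */
Assert(!streaming || !stream_started);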

>
> > 6.
> > PG_CATCH();
> >   {
> > + MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
> > + ErrorData  *errdata = CopyErrorData();
> >
> > I don't understand the usage of memory context in this part of the
> > code.  Basically, you are switching to CurrentMemoryContext here, do
> > some error handling and then again reset back to some random context
> > before rethrowing the error.  If there is some purpose for it, then it
> > might be better if you can write a few comments to explain the same.
>
> Basically, the ccxt is the CurrentMemoryContext when we started the
> streaming and ecxt it the context when we catch the error.  So
> ideally, before this change, it will rethrow in the context when we
> catch the error i.e. ecxt.  So what we are trying to do is put it back
> to normal context (ccxt) and copy the error data in the normal
> context.  And, if we are not handling it gracefully then put it back
> to the context it was in, and rethrow.
>

Okay, but when errorcode is *not* ERRCODE_TRANSACTION_ROLLBACK, don't
we need to clean up the reorderbuffer by calling
ReorderBufferCleanupTXN?  If so, then you can try to combine it with
the not-streaming else loop.

>
> > 8.
> > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> > *rb, TransactionId xid,
> >   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> >
> >   txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > +
> > + /*
> > + * TOCHECK: Mark toplevel transaction as having catalog changes too
> > + * if one of its children has.
> > + */
> > + if (txn->toptxn != NULL)
> > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> >  }
> >
> > Why are we marking top transaction here?
>
> We need to mark top transaction to decide whether to build tuplecid
> hash or not.  In non-streaming mode, we are only sending during the
> commit time, and during commit time we know whether the top
> transaction has any catalog changes or not based on the invalidation
> message so we are marking the top transaction there in DecodeCommit.
> Since here we are not waiting till commit so we need to mark the top
> transaction as soon as we mark any of its child transactions.
>

But how does it help?  We use this flag (via
ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn, which is
anyway done in DecodeCommit, and that too after setting this flag for
the top transaction if required.  So how will it help to set it while
processing the subxid?  Also, even if we have to do it, won't it
needlessly add the xid to the builder->committed.xip array?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, May 18, 2020 at 4:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >

Few comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple
1.
+ /*
+ * If this is a toast insert then set the corresponding bit.  Otherwise, if
+ * we have toast insert bit set and this is insert/update then clear the
+ * bit.
+ */
+ if (toast_insert)
+ toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
+ else if (rbtxn_has_toast_insert(txn) &&
+ ChangeIsInsertOrUpdate(change->action))
+ {

Here, it might be better to add a comment on why we expect only
Insert/Update.  Also, it might be better to add an assert for
other operations.

2.
@@ -1865,8 +1920,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
  * disk.
  */
  dlist_delete(&change->node);
- ReorderBufferToastAppendChunk(rb, txn, relation,
-   change);
+ ReorderBufferToastAppendChunk(rb, txn, relation,
+   change);
  }

This seems to be a spurious change.

3.
+ /*
+ * If streaming is enable and we have serialized this transaction because
+ * it had incomplete tuple.  So if now we have got the complete tuple we
+ * can stream it.
+ */
+ if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
+ && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
+ {

This comment is just saying what you are doing in the if-check.  I
think you need to explain the rationale behind it. I don't like the
variable name 'can_stream' because it matches ReorderBufferCanStream
whereas it is for a different purpose, how about naming it as
'change_complete' or something like that.  The check has many
conditions, can we move it to a separate function to make the code
here look clean?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>
> 3.
> + /*
> + * If streaming is enable and we have serialized this transaction because
> + * it had incomplete tuple.  So if now we have got the complete tuple we
> + * can stream it.
> + */
> + if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
> + && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
> + {
>
> This comment is just saying what you are doing in the if-check.  I
> think you need to explain the rationale behind it. I don't like the
> variable name 'can_stream' because it matches ReorderBufferCanStream
> whereas it is for a different purpose, how about naming it as
> 'change_complete' or something like that.  The check has many
> conditions, can we move it to a separate function to make the code
> here look clean?
>

Do we really need this?  Immediately after this check, we are calling
ReorderBufferCheckMemoryLimit which will anyway stream the changes if
required.  Can we move the changes related to the detection of
incomplete data to a separate function?

Another comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple:

+ else if (rbtxn_has_toast_insert(txn) &&
+ ChangeIsInsertOrUpdate(change->action))
+ {
+ toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
+ can_stream = true;
+ }
..
+#define ChangeIsInsertOrUpdate(action) \
+ (((action) == REORDER_BUFFER_CHANGE_INSERT) || \
+ ((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
+ ((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT))

How can we clear the RBTXN_HAS_TOAST_INSERT flag on
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT action?

IIUC, the basic idea used to handle incomplete changes (which is
possible in case of toast tuples and speculative inserts) is to mark
such TXNs as containing incomplete changes and then while finding the
largest top-level TXN for streaming, we ignore such TXN's and move to
next largest TXN.  If none of the TXNs have complete changes then we
choose the largest (sub)transaction and spill the same to make the
in-memory changes below logical_decoding_work_mem threshold.  This
idea can work but the strategy to choose the transaction is suboptimal
for cases where TXNs have some changes which are complete followed by
an incomplete toast or speculative tuple.  I was having an offlist
discussion with Robert on this problem and he suggested that it would
be better if we track the complete part of changes separately and then
we can avoid the drawback mentioned above.  I have thought about this
and I think it can work if we track the size and LSN of completed
changes.  I think we need to ensure that if there is a concurrent
abort then we discard all changes for the current (sub)transaction,
not only up to the completed-changes LSN, whereas if the streaming is
successful then we can truncate the changes only up to the
completed-changes LSN.  What do you think?
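
For example, the bookkeeping could look roughly like this (a sketch
only; the field names are illustrative, not from the patch):

/* In ReorderBufferTXN (hypothetical fields, for illustration only): */
Size        complete_size;  /* accumulated size of fully complete changes */
XLogRecPtr  complete_lsn;   /* LSN up to which the changes are complete */

/*
 * On streaming, truncate only the changes up to complete_lsn; on a
 * concurrent abort, discard all changes of the (sub)transaction, not
 * just the prefix up to complete_lsn.
 */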

I wonder why you have done this as 0010 in the patch series; it should
be 0006, after
0005-Implement-streaming-mode-in-ReorderBuffer.patch.  If we can do it
that way, it would be easier for me to review.  Is there a reason
for not doing so?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, May 19, 2020 at 2:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > 3.
> > + /*
> > + * If streaming is enable and we have serialized this transaction because
> > + * it had incomplete tuple.  So if now we have got the complete tuple we
> > + * can stream it.
> > + */
> > + if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
> > + && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
> > + {
> >
> > This comment is just saying what you are doing in the if-check.  I
> > think you need to explain the rationale behind it. I don't like the
> > variable name 'can_stream' because it matches ReorderBufferCanStream
> > whereas it is for a different purpose, how about naming it as
> > 'change_complete' or something like that.  The check has many
> > conditions, can we move it to a separate function to make the code
> > here look clean?
> >
>
> Do we really need this?  Immediately after this check, we are calling
> ReorderBufferCheckMemoryLimit which will anyway stream the changes if
> required.

Actually, ReorderBufferCheckMemoryLimit is only meant for checking
whether we need to stream the changes due to the memory limit.  But
suppose that when the memory limit was exceeded we could not stream
the transaction because there was only an incomplete toast insert, so
we serialized.  Now, when we get the tuple which makes the changes
complete, we are no longer crossing the memory limit, as the changes
were already serialized.  So I am not sure whether it is a good idea
to stream the transaction as soon as we get the complete changes, or
whether we should wait until the next time the memory limit is
exceeded and select the suitable candidate at that time.  Ideally, if
we are in streaming mode and the transaction is serialized, it means
it was already a candidate for streaming but could not be streamed
due to the incomplete changes, so shouldn't we stream it immediately
as soon as its changes are complete, even though we are now within
the memory limit?  Because our target is to stream, not spill, we
should try to stream the spilled changes at the first opportunity.

  Can we move the changes related to the detection of
> incomplete data to a separate function?

Ok.


>
> Another comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple:
>
> + else if (rbtxn_has_toast_insert(txn) &&
> + ChangeIsInsertOrUpdate(change->action))
> + {
> + toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
> + can_stream = true;
> + }
> ..
> +#define ChangeIsInsertOrUpdate(action) \
> + (((action) == REORDER_BUFFER_CHANGE_INSERT) || \
> + ((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
> + ((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT))
>
> How can we clear the RBTXN_HAS_TOAST_INSERT flag on
> REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT action?

A partial toast insert means we have inserted into the toast table but
not into the main table.  So even if it is a spec insert, we can form
the complete tuple; however, we still cannot stream it because we
haven't got the spec_confirm, but for that we are marking another
flag.  So if the insert is a spec insert, the toast insert will also
be a spec insert, and as part of those toast spec inserts we mark the
tuple as partial, so clearing that flag should happen when the spec
insert is done for the main table, right?


> IIUC, the basic idea used to handle incomplete changes (which is
> possible in case of toast tuples and speculative inserts) is to mark
> such TXNs as containing incomplete changes and then while finding the
> largest top-level TXN for streaming, we ignore such TXN's and move to
> next largest TXN.  If none of the TXNs have complete changes then we
> choose the largest (sub)transaction and spill the same to make the
> in-memory changes below logical_decoding_work_mem threshold.  This
> idea can work but the strategy to choose the transaction is suboptimal
> for cases where TXNs have some changes which are complete followed by
> an incomplete toast or speculative tuple.  I was having an offlist
> discussion with Robert on this problem and he suggested that it would
> be better if we track the complete part of changes separately and then
> we can avoid the drawback mentioned above.  I have thought about this
> and I think it can work if we track the size and LSN of completed
> changes.  I think we need to ensure that if there is concurrent abort
> then we discard all changes for current (sub)transaction not only up
> to completed changes LSN whereas if the streaming is successful then
> we can truncate the changes only up to completed changes LSN. What do
> you think?
>
> I wonder why you have done this as 0010 in the patch series, it should
> be as 0006 after the
> 0005-Implement-streaming-mode-in-ReorderBuffer.patch.  If we can do
> that way then it would be easier for me to review.  Is there a reason
> for not doing so?

No reason, I can do that.  Actually, later we can merge the changes
into 0005; I kept it separate for review.  Anyway, in the next version
I will make it 0006.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, May 19, 2020 at 3:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, May 19, 2020 at 2:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > 3.
> > > + /*
> > > + * If streaming is enable and we have serialized this transaction because
> > > + * it had incomplete tuple.  So if now we have got the complete tuple we
> > > + * can stream it.
> > > + */
> > > + if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
> > > + && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
> > > + {
> > >
> > > This comment is just saying what you are doing in the if-check.  I
> > > think you need to explain the rationale behind it. I don't like the
> > > variable name 'can_stream' because it matches ReorderBufferCanStream
> > > whereas it is for a different purpose, how about naming it as
> > > 'change_complete' or something like that.  The check has many
> > > conditions, can we move it to a separate function to make the code
> > > here look clean?
> > >
> >
> > Do we really need this?  Immediately after this check, we are calling
> > ReorderBufferCheckMemoryLimit which will anyway stream the changes if
> > required.
>
> Actually, ReorderBufferCheckMemoryLimit is only meant for checking
> whether we need to stream the changes due to the memory limit.  But
> suppose when memory limit exceeds that time we could not stream the
> transaction because there was only incomplete toast insert so we
> serialized.  Now,  when we get the tuple which makes the changes
> complete but now it is not crossing the memory limit as changes were
> already serialized.  So I am not sure whether it is a good idea to
> stream the transaction as soon as we get the complete changes or we
> shall wait till next time memory limit exceed and that time we select
> the suitable candidate.
>

I think it is better to wait till next time we exceed the memory threshold.

>  Ideally, we were are in streaming more and
> the transaction is serialized means it was already a candidate for
> streaming but could not stream due to the incomplete changes so
> shouldn't we stream it immediately as soon as its changes are complete
> even though now we are in memory limit.
>

The only time we need to stream or spill is when we exceed memory
threshold.  In the above case, it is possible that next time there is
some other candidate transaction that we can stream.

> >
> > Another comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple:
> >
> > + else if (rbtxn_has_toast_insert(txn) &&
> > + ChangeIsInsertOrUpdate(change->action))
> > + {
> > + toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
> > + can_stream = true;
> > + }
> > ..
> > +#define ChangeIsInsertOrUpdate(action) \
> > + (((action) == REORDER_BUFFER_CHANGE_INSERT) || \
> > + ((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
> > + ((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT))
> >
> > How can we clear the RBTXN_HAS_TOAST_INSERT flag on
> > REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT action?
>
> Partial toast insert means we have inserted in the toast but not in
> the main table.  So even if it is spec insert we can form the complete
> tuple, however, we can still not stream it because we haven't got
> spec_confirm but for that, we are marking another flag.  So if the
> insert is aspect insert the toast insert will also be spec insert and
> as part of that toast, spec inserts we are marking partial tuple so
> cleaning that flag should happen when the spec insert is done for the
> main table right?
>

Sounds reasonable.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > 4.
> > +static void
> > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> > +{
> > + LogicalDecodingContext *ctx = cache->private_data;
> > + LogicalErrorCallbackState state;
> > + ErrorContextCallback errcallback;
> > +
> > + Assert(!ctx->fast_forward);
> > +
> > + /* We're only supposed to call this when streaming is supported. */
> > + Assert(ctx->streaming);
> > +
> > + /* Push callback + info on the error context stack */
> > + state.ctx = ctx;
> > + state.callback_name = "stream_start";
> > + /* state.report_location = apply_lsn; */
> >
> > Why can't we supply the report_location here?  I think here we need to
> > report txn->first_lsn if this is the very first stream and
> > txn->final_lsn if it is any consecutive one.
>
> Done
>

Now, after your change in stream_start_cb_wrapper, we assign
report_location from the first_lsn passed as input to the function, but
write_location is still txn->first_lsn.  Shouldn't we assign the
passed-in first_lsn to write_location?  It seems assigning
txn->first_lsn won't be correct for streams other than the first one.
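
I mean something like this in the wrapper (a sketch based on the other
callback wrappers in logical.c; the write_location line is the point):

static void
stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
                        XLogRecPtr first_lsn)
{
    LogicalDecodingContext *ctx = cache->private_data;
    LogicalErrorCallbackState state;
    ErrorContextCallback errcallback;

    /* Push callback + info on the error context stack */
    state.ctx = ctx;
    state.callback_name = "stream_start";
    state.report_location = first_lsn;
    errcallback.callback = output_plugin_error_callback;
    errcallback.arg = (void *) &state;
    errcallback.previous = error_context_stack;
    error_context_stack = &errcallback;

    /* set output state */
    ctx->accept_writes = true;
    ctx->write_xid = txn->xid;
    ctx->write_location = first_lsn;    /* not txn->first_lsn */

    /* do the actual work: call the callback */
    ctx->callbacks.stream_start_cb(ctx, txn);

    /* Pop the error context stack */
    error_context_stack = errcallback.previous;
}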

> > 5.
> > +static void
> > +stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> > +{
> > + LogicalDecodingContext *ctx = cache->private_data;
> > + LogicalErrorCallbackState state;
> > + ErrorContextCallback errcallback;
> > +
> > + Assert(!ctx->fast_forward);
> > +
> > + /* We're only supposed to call this when streaming is supported. */
> > + Assert(ctx->streaming);
> > +
> > + /* Push callback + info on the error context stack */
> > + state.ctx = ctx;
> > + state.callback_name = "stream_stop";
> > + /* state.report_location = apply_lsn; */
> >
> > Can't we report txn->final_lsn here
>
> We are already setting this to the  txn->final_ls in 0006 patch, but I
> have moved it into this patch now.
>

Similar to the previous point, here also I think we need to assign the
report and write locations from the last_lsn passed to this API.

>
>
> > v20-0005-Implement-streaming-mode-in-ReorderBuffer
> > -----------------------------------------------------------------------------
> > 10.
> > Theoretically, we could get rid of the k-way merge, and append the
> > changes to the toplevel xact directly (and remember the position
> > in the list in case the subxact gets aborted later).
> >
> > I don't think this part of the commit message is correct as we
> > sometimes need to spill even during streaming.  Please check the
> > entire commit message and update according to the latest
> > implementation.
>
> Done
>

You seem to have forgotten to remove the other part of the message
("This adds a second iterator for the streaming case...."), which is
not relevant now.

> > 11.
> > - * HeapTupleSatisfiesHistoricMVCC.
> > + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> > + *
> > + * We do build the hash table even if there are no CIDs. That's
> > + * because when streaming in-progress transactions we may run into
> > + * tuples with the CID before actually decoding them. Think e.g. about
> > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> > + * yet when applying the INSERT. So we build a hash table so that
> > + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> > + *
> > + * XXX We might limit this behavior to streaming mode, and just bail
> > + * out when decoding transaction at commit time (at which point it's
> > + * guaranteed to see all CIDs).
> >   */
> >  static void
> >  ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> > *rb, ReorderBufferTXN *txn)
> >   dlist_iter iter;
> >   HASHCTL hash_ctl;
> >
> > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> > - return;
> > -
> >
> > I don't understand this change.  Why would "INSERT followed by
> > TRUNCATE" could lead to a tuple which can come for decode before its
> > CID?  The patch has made changes based on this assumption in
> > HeapTupleSatisfiesHistoricMVCC which appears to be very risky as the
> > behavior could be dependent on whether we are streaming the changes
> > for in-progress xact or at the commit of a transaction.  We might want
> > to generate a test to once validate this behavior.
> >
> > Also, the comment refers to tqual.c which is wrong as this API is now
> > in heapam_visibility.c.
>
> Done.
>

+ * INSERT.  So in such cases we assume the CIDs is from the future command
+ * and return as unresolve.
+ */
+ if (tuplecid_data == NULL)
+ return false;
+

Here let's reword the last line of the comment as: "So in such cases
we assume the CID is from the future command."

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > 3.
> > And, during catalog scan we can check the status of the xid and
> > + * if it is aborted we will report a specific error that we can ignore.  We
> > + * might have already streamed some of the changes for the aborted
> > + * (sub)transaction, but that is fine because when we decode the abort we will
> > + * stream abort message to truncate the changes in the subscriber.
> > + */
> > +static inline void
> > +SetupCheckXidLive(TransactionId xid)
> >
> > In the above comment, I don't think it is right to say that we ignore
> > the error raised due to the aborted transaction.  We need to say that
> > we discard the already streamed changes on such an error.
>
> Done.
>

In the same comment, there is a typo (/messageto/message to).

> > 4.
> > +static inline void
> > +SetupCheckXidLive(TransactionId xid)
> > +{
> >   /*
> > - * If this transaction has no snapshot, it didn't make any changes to the
> > - * database, so there's nothing to decode.  Note that
> > - * ReorderBufferCommitChild will have transferred any snapshots from
> > - * subtransactions if there were any.
> > + * setup CheckXidAlive if it's not committed yet. We don't check if the xid
> > + * aborted. That will happen during catalog access.  Also reset the
> > + * sysbegin_called flag.
> >   */
> > - if (txn->base_snapshot == NULL)
> > + if (!TransactionIdDidCommit(xid))
> >   {
> > - Assert(txn->ninvalidations == 0);
> > - ReorderBufferCleanupTXN(rb, txn);
> > - return;
> > + CheckXidAlive = xid;
> > + bsysscan = false;
> >   }
> >
> > I think this function is inline as it needs to be called for each
> > change. If that is the case and otherwise also, isn't it better that
> > we check if passed xid is the same as CheckXidAlive before checking
> > TransactionIdDidCommit as TransactionIdDidCommit can be costly and
> > calling it for each change might not be a good idea?
>
> Done,  Also I think it is good the check the TransactionIdIsInProgress
> instead of !TransactionIdDidCommit.  I have changed that as well.
>

What if it is aborted just before this check?  I think the decode API
won't be able to detect that and sys* API won't care to check because
CheckXidAlive won't be set for that case.

> > 5.
> > setup CheckXidAlive if it's not committed yet. We don't check if the xid
> > + * aborted. That will happen during catalog access.  Also reset the
> > + * sysbegin_called flag.
> >
> > /if the xid aborted/if the xid is aborted.  missing comma after Also.
>
> Done
>

You forgot to change as per the second part of the comment (missing
comma after Also).


>
> > 8.
> > @@ -1588,8 +1766,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> >   * use as a normal record. It'll be cleaned up at the end
> >   * of INSERT processing.
> >   */
> > - if (specinsert == NULL)
> > - elog(ERROR, "invalid ordering of speculative insertion changes");
> >
> > You have removed this check but all other handling of specinsert is
> > same as far as this patch is concerned.  Why so?
>
> Seems like a merge issue, or the leftover from the old design of the
> toast handling where we were streaming with the partial tuple.
> fixed now.
>
> > 9.
> > @@ -1676,8 +1860,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> >   * freed/reused while restoring spooled data from
> >   * disk.
> >   */
> > - Assert(change->data.tp.newtuple != NULL);
> > -
> >   dlist_delete(&change->node);
> >
> > Why is this Assert removed?
>
> Same cause as above so fixed.
>
> > 10.
> > @@ -1753,7 +1935,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> >   relations[nrelations++] = relation;
> >   }
> >
> > - rb->apply_truncate(rb, txn, nrelations, relations, change);
> > + if (streaming)
> > + {
> > + rb->stream_truncate(rb, txn, nrelations, relations, change);
> > +
> > + /* Remember that we have sent some data. */
> > + change->txn->any_data_sent = true;
> > + }
> > + else
> > + rb->apply_truncate(rb, txn, nrelations, relations, change);
> >
> > Can we encapsulate this in a separate function like
> > ReorderBufferApplyTruncate or something like that?  Basically, rather
> > than having streaming check in this function, lets do it in some other
> > internal function.  And we can likewise do it for all the streaming
> > checks in this function or at least whereever it is feasible.  That
> > will make this function look clean.
>
> Done for truncate and change.  I think we can create a few more such
> functions for
> start/stop and cleanup handling on error.  I will work on that.
>

Yeah, I think that would be better.

One minor comment change suggestion:
/*
+ * start stream or begin the transaction.  If this is the first
+ * change in the current stream.
+ */

We can write the above comment as "Start the stream or begin the
transaction for the first change in the current stream."

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, May 19, 2020 at 6:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >

I have further reviewed v22 and below are my comments:

v22-0005-Implement-streaming-mode-in-ReorderBuffer
--------------------------------------------------------------------------
1.
+ * Note: We never do both stream and serialize a transaction (we only spill
+ * to disk when streaming is not supported by the plugin), so only one of
+ * those two flags may be set at any given time.
+ */
+#define rbtxn_is_streamed(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
+)

The above 'Note' is not correct as per the latest implementation.

v22-0006-Add-support-for-streaming-to-built-in-replicatio
----------------------------------------------------------------------------
2.
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -14,7 +14,6 @@
  *
  *-------------------------------------------------------------------------
  */
-
 #include "postgres.h"

Spurious line removal.

3.
+void
+logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
+    XLogRecPtr commit_lsn)
+{
+ uint8 flags = 0;
+
+ pq_sendbyte(out, 'c'); /* action STREAM COMMIT */
+
+ Assert(TransactionIdIsValid(txn->xid));
+
+ /* transaction ID (we're starting to stream, so must be valid) */
+ pq_sendint32(out, txn->xid);

The part of the comment "we're starting to stream, so must be valid"
is not correct, as we are not at the start of the stream here.  The
patch has used the same incorrect sentence in a few places; kindly fix
those as well.

4.
+ * XXX Do we need to allocate it in TopMemoryContext?
+ */
+static void
+subxact_info_add(TransactionId xid)
{
..

For this and other places in the patch, like the function
stream_open_file(), instead of using TopMemoryContext, can we consider
using a new memory context, LogicalStreamingContext or something like
that?  We can create LogicalStreamingContext under TopMemoryContext.  I
don't see any need to use TopMemoryContext here.
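
Something like this, for example (a sketch; the context name follows
the suggestion above):

static MemoryContext LogicalStreamingContext = NULL;
MemoryContext oldctx;

/* Created lazily, once per worker, under TopMemoryContext. */
if (LogicalStreamingContext == NULL)
    LogicalStreamingContext = AllocSetContextCreate(TopMemoryContext,
                                                    "LogicalStreamingContext",
                                                    ALLOCSET_DEFAULT_SIZES);

/* Do the subxact/stream bookkeeping allocations in that context. */
oldctx = MemoryContextSwitchTo(LogicalStreamingContext);
/* palloc() the subxacts array, file name buffers, etc. here */
MemoryContextSwitchTo(oldctx);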

5.
+static void
+subxact_info_add(TransactionId xid)

This function assumes valid values for global variables like
stream_fd and stream_xid.  I think it is better to have Asserts for
those in this function before using them.  The Asserts are present in
handle_streamed_transaction, but I feel they should be in
subxact_info_add.
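
I.e. something like (a sketch, assuming stream_xid/stream_fd are the
globals the patch already uses):

static void
subxact_info_add(TransactionId xid)
{
    /* We must be inside an open stream for a known toplevel transaction. */
    Assert(TransactionIdIsValid(stream_xid));
    Assert(stream_fd != NULL);

    /* ... existing body ... */
}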

6.
+subxact_info_add(TransactionId xid)
/*
+ * In most cases we're checking the same subxact as we've already seen in
+ * the last call, so make ure just ignore it (this change comes later).
+ */
+ if (subxact_last == xid)
+ return;

Typo and minor correction, /ure just/sure to

7.
+subxact_info_write(Oid subid, TransactionId xid)
{
..
+ /*
+ * But we free the memory allocated for subxact info. There might be one
+ * exceptional transaction with many subxacts, and we don't want to keep
+ * the memory allocated forewer.
+ *
+ */

a. Typo, /forewer/forever
b. The extra line at the end of the comment is not required.

8.
+ * XXX Maybe we should only include the checksum when the cluster is
+ * initialized with checksums?
+ */
+static void
+subxact_info_write(Oid subid, TransactionId xid)

Do we really need to have a checksum for temporary files?  I have
checked a few other similar cases, like the SharedFileSet stuff for
parallel hash join, but didn't find them using checksums.  Can you
also check the other usages of temporary files and then let us decide
if we see any reason to have checksums for this?

Another point is that we don't seem to be doing this for the 'changes'
file, see stream_write_change.  So I am not sure there is any sense in
writing a checksum for the subxact file.

Tomas, do you see any reason for it?

9.
+subxact_filename(char *path, Oid subid, TransactionId xid)
+{
+ char tempdirpath[MAXPGPATH];
+
+ TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
+
+ /*
+ * We might need to create the tablespace's tempfile directory, if no
+ * one has yet done so.
+ */
+ if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create directory \"%s\": %m",
+ tempdirpath)));
+
+ snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
+ tempdirpath, subid, xid);
+}

Temporary files created in PGDATA/base/pgsql_tmp follow a certain
naming convention (see docs [1]) which is not followed here.  You can
also refer to SharedFileSetPath and OpenTemporaryFile.  I think we can
just try to follow that convention and then additionally append subid,
xid and .subxacts.  Also, a similar change is required for
changes_filename.  I would like to know if there is a reason why we
want to use a different naming convention here?
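
For instance, roughly along these lines (only a sketch; the exact name
layout is illustrative, the point is the pgsql_tmp-style prefix used by
the other temp-file code):

static void
subxact_filename(char *path, Oid subid, TransactionId xid)
{
    char        tempdirpath[MAXPGPATH];

    TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);

    /* Follow the temporary-file convention: prefix with "pgsql_tmp". */
    snprintf(path, MAXPGPATH, "%s/%s-logical-%u-%u.subxacts",
             tempdirpath, PG_TEMP_FILE_PREFIX, subid, xid);
}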

10.
+ * This can only be called at the beginning of a "streaming" block, i.e.
+ * between stream_start/stream_stop messages from the upstream.
+ */
+static void
+stream_close_file(void)

The comment seems to be wrong.  I think this can be only called at
stream end, so it should be "This can only be called at the end of a
"streaming" block, i.e. at stream_stop message from the upstream."

11.
+ * the order the transactions are sent in. So streamed trasactions are
+ * handled separately by using schema_sent flag in ReorderBufferTXN.
+ *
  * For partitions, 'pubactions' considers not only the table's own
  * publications, but also those of all of its ancestors.
  */
 typedef struct RelationSyncEntry
 {
  Oid relid; /* relation oid */
-
+ TransactionId xid; /* transaction that created the record */
  /*
  * Did we send the schema?  If ancestor relid is set, its schema must also
  * have been sent for this to be true.
  */
  bool schema_sent;
+ List    *streamed_txns; /* streamed toplevel transactions with this
+ * schema */

The part of the comment "So streamed trasactions are handled separately
by using schema_sent flag in ReorderBufferTXN." doesn't seem to match
what we are doing in the latest version of the patch.

12.
maybe_send_schema()
{
..
+ if (in_streaming)
+ {
+ /*
+ * TOCHECK: We have to send schema after each catalog change and it may
+ * occur when streaming already started, so we have to track new catalog
+ * changes somehow.
+ */
+ schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
..
..
}

I think it is good to verify/test once what this comment says, but as
per the code we should be sending the schema after each catalog change,
as we invalidate the streamed_txns list in rel_sync_cache_relation_cb,
which must be called during relcache invalidation.  Do we see any
problem with that mechanism?

13.
+/*
+ * Notify downstream to discard the streamed transaction (along with all
+ * it's subtransactions, if it's a toplevel transaction).
+ */
+static void
+pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
+    ReorderBufferTXN *txn,
+    XLogRecPtr commit_lsn)

This comment is copied from pgoutput_stream_abort, so doesn't match
what this function is doing.


[1] - https://www.postgresql.org/docs/devel/storage-file-layout.html

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> v22-0006-Add-support-for-streaming-to-built-in-replicatio
> ----------------------------------------------------------------------------
>
Few more comments on v22-0006 patch:

1.
+stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
+{
+ int i;
+ char path[MAXPGPATH];
+ bool found = false;
+
+ subxact_filename(path, subid, xid);
+
+ if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));

Here, we have unlinked the files containing information of subxacts
but don't we need to free the corresponding memory (memory for
subxacts) as well?

2.
apply_handle_stream_abort()
{
..
+ subxact_filename(path, MyLogicalRepWorker->subid, xid);
+
+ if (unlink(path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+
+ return;
..
}

Like the previous comment, it seems here also we need to free the
subxacts memory; additionally, we forgot to adjust the xids array.

3.
apply_handle_stream_abort()
{
..
+ /* XXX optimize the search by bsearch on sorted data */
+ for (i = nsubxacts; i > 0; i--)
+ {
+ if (subxacts[i - 1].xid == subxid)
+ {
+ subidx = (i - 1);
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ return;
..
}

Is it possible that we don't find the xid in the subxacts array?  If so,
I think we should mention the same in the comments; otherwise, we should
have an assert for found.

4.
apply_handle_stream_abort()
{
..
+ changes_filename(path, MyLogicalRepWorker->subid, xid);
+
+ if (truncate(path, subxacts[subidx].offset))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not truncate file \"%s\": %m", path)));
..
}

Will truncate work on Windows?  I see in the code we use ftruncate, which
is defined as chsize in win32.h and win32_port.h.  I have not tested
this so I am not very sure about it.  I got the below warning when I
tried to compile this code on Windows.  I think it is better to use
ftruncate as it is used at other places in the code as well.

worker.c(798): warning C4013: 'truncate' undefined; assuming extern
returning int

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 18, 2020 at 4:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
>
> Few comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple
> 1.
> + /*
> + * If this is a toast insert then set the corresponding bit.  Otherwise, if
> + * we have toast insert bit set and this is insert/update then clear the
> + * bit.
> + */
> + if (toast_insert)
> + toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
> + else if (rbtxn_has_toast_insert(txn) &&
> + ChangeIsInsertOrUpdate(change->action))
> + {
>
> Here, it might better to add a comment on why we expect only
> Insert/Update?  Also, it might be better that we add an assert for
> other operations.

I have added comments on why we clear the flag on Insert/Update.
But I don't think we only expect insert/update; we might get a
toast delete as well, because a toast update will do a toast delete +
toast insert.  So when we get a toast delete we just don't want to do
anything.

>
> 2.
> @@ -1865,8 +1920,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
> ReorderBufferTXN *txn,
>   * disk.
>   */
>   dlist_delete(&change->node);
> - ReorderBufferToastAppendChunk(rb, txn, relation,
> -   change);
> + ReorderBufferToastAppendChunk(rb, txn, relation,
> +   change);
>   }
>
> This seems to be a spurious change.

Done

> 3.
> + /*
> + * If streaming is enable and we have serialized this transaction because
> + * it had incomplete tuple.  So if now we have got the complete tuple we
> + * can stream it.
> + */
> + if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
> + && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
> + {
>
> This comment is just saying what you are doing in the if-check.  I
> think you need to explain the rationale behind it. I don't like the
> variable name 'can_stream' because it matches ReorderBufferCanStream
> whereas it is for a different purpose, how about naming it as
> 'change_complete' or something like that.  The check has many
> conditions, can we move it to a separate function to make the code
> here look clean?

As per the other comments we have removed this part in the latest patch set.

Apart from these comment fixes, there are 2 more changes:
1.  The handling of the toast tuple has changed as per the offlist
discussion with you.
Basically, now, instead of not streaming the txn with the incomplete
tuple, we are streaming it up to the last complete lsn.  So if the txn
has incomplete changes but its total size is the largest, then we will still
stream it.  And, after streaming, we will truncate the transaction up
to the last complete lsn.

2. There is a bug fix in handling the stream abort in 0008 (earlier it
was 0006).

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, May 18, 2020 at 4:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Review comments:
> > > ------------------------------
> > > 1.
> > > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
> > > TransactionId xid,
> > >   }
> > >
> > >   case REORDER_BUFFER_CHANGE_MESSAGE:
> > > - rb->message(rb, txn, change->lsn, true,
> > > - change->data.msg.prefix,
> > > - change->data.msg.message_size,
> > > - change->data.msg.message);
> > > + if (streaming)
> > > + rb->stream_message(rb, txn, change->lsn, true,
> > > +    change->data.msg.prefix,
> > > +    change->data.msg.message_size,
> > > +    change->data.msg.message);
> > > + else
> > > + rb->message(rb, txn, change->lsn, true,
> > > +    change->data.msg.prefix,
> > > +    change->data.msg.message_size,
> > > +    change->data.msg.message);
> > >
> > > Don't we need to set any_data_sent flag while streaming messages as we
> > > do for other types of changes?
> >
> > I think any_data_sent, was added to avoid sending abort to the
> > subscriber if we haven't sent any data,  but this is not complete as
> > the output plugin can also take the decision not to send.  So I think
> > this should not be done as part of this patch and can be done
> > separately.  I think there is already a thread for handling the
> > same[1]
> >
>
> Hmm, but prior to this patch, we never use to send (empty) aborts but
> now that will be possible. It is probably okay to deal that with
> another patch mentioned by you but I felt at least any_data_sent will
> work for some cases.  OTOH, it appears to be half-baked solution, so
> we should probably refrain from adding it.  BTW, how do the pgoutput
> plugin deal with it? I see that apply_handle_stream_abort will
> unconditionally try to unlink the file and it will probably fail.
> Have you tested this scenario after your latest changes?

Yeah, I see, I think this is a problem, but it exists without my
latest change as well: if pgoutput ignores some changes because they are
not published, then we will see a similar error.  Shall we handle the
ENOENT error case from unlink?  I think the best idea is that we should
track the empty transaction.

> > > 4.
> > > In ReorderBufferProcessTXN(), the patch is calling stream_stop in both
> > > the try and catch block.  If there is an error after calling it in a
> > > try block, we might call it again via catch.  I think that will lead
> > > to sending a stop message twice.  Won't that be a problem?  See the
> > > usage of iterstate in the catch block, we have made it safe from a
> > > similar problem.
> >
> > IMHO, we don't need that, because we only call stream_stop in the
> > catch block if the error type is ERRCODE_TRANSACTION_ROLLBACK.  So if
> > in TRY block we have already stopped the stream then we should not get
> > that error.  I have added the comments for the same.
> >
>
> I am still slightly nervous about it as I don't see any solid
> guarantee for the same.  You are right as the code stands today but
> due to any code that gets added in the future, it might not remain
> true. I feel it is better to have an Assert here to ensure that
> stream_stop won't be called the second time.  I don't see any good way
> of doing it other than by maintaining flag or some state but I think
> it will be good to ensure this.

Done

> > > 6.
> > > PG_CATCH();
> > >   {
> > > + MemoryContext ecxt = MemoryContextSwitchTo(ccxt);
> > > + ErrorData  *errdata = CopyErrorData();
> > >
> > > I don't understand the usage of memory context in this part of the
> > > code.  Basically, you are switching to CurrentMemoryContext here, do
> > > some error handling and then again reset back to some random context
> > > before rethrowing the error.  If there is some purpose for it, then it
> > > might be better if you can write a few comments to explain the same.
> >
> > Basically, the ccxt is the CurrentMemoryContext when we started the
> > streaming and ecxt it the context when we catch the error.  So
> > ideally, before this change, it will rethrow in the context when we
> > catch the error i.e. ecxt.  So what we are trying to do is put it back
> > to normal context (ccxt) and copy the error data in the normal
> > context.  And, if we are not handling it gracefully then put it back
> > to the context it was in, and rethrow.
> >
>
> Okay, but when errorcode is *not* ERRCODE_TRANSACTION_ROLLBACK, don't
> we need to clean up the reorderbuffer by calling
> ReorderBufferCleanupTXN?  If so, then you can try to combine it with
> the not-streaming else loop.

Done


> > > 8.
> > > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> > > *rb, TransactionId xid,
> > >   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> > >
> > >   txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > +
> > > + /*
> > > + * TOCHECK: Mark toplevel transaction as having catalog changes too
> > > + * if one of its children has.
> > > + */
> > > + if (txn->toptxn != NULL)
> > > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > >  }
> > >
> > > Why are we marking top transaction here?
> >
> > We need to mark top transaction to decide whether to build tuplecid
> > hash or not.  In non-streaming mode, we are only sending during the
> > commit time, and during commit time we know whether the top
> > transaction has any catalog changes or not based on the invalidation
> > message so we are marking the top transaction there in DecodeCommit.
> > Since here we are not waiting till commit so we need to mark the top
> > transaction as soon as we mark any of its child transactions.
> >
>
> But how does it help?  We use this flag (via
> ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn which is
> anyway done in DecodeCommit and that too after setting this flag for
> the top transaction if required.  So, how will it help in setting it
> while processing for subxid.  Also, even if we have to do it won't it
> add the xid needlessly in builder->committed.xip array?

In ReorderBufferBuildTupleCidHash, we use this flag to decide whether
to build the tuplecid hash or not based on whether it has catalog
changes or not.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 19, 2020 at 4:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, May 19, 2020 at 3:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 19, 2020 at 2:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > >
> > > > 3.
> > > > + /*
> > > > + * If streaming is enable and we have serialized this transaction because
> > > > + * it had incomplete tuple.  So if now we have got the complete tuple we
> > > > + * can stream it.
> > > > + */
> > > > + if (ReorderBufferCanStream(rb) && can_stream && rbtxn_is_serialized(toptxn)
> > > > + && !(rbtxn_has_toast_insert(txn)) && !(rbtxn_has_spec_insert(txn)))
> > > > + {
> > > >
> > > > This comment is just saying what you are doing in the if-check.  I
> > > > think you need to explain the rationale behind it. I don't like the
> > > > variable name 'can_stream' because it matches ReorderBufferCanStream
> > > > whereas it is for a different purpose, how about naming it as
> > > > 'change_complete' or something like that.  The check has many
> > > > conditions, can we move it to a separate function to make the code
> > > > here look clean?
> > > >
> > >
> > > Do we really need this?  Immediately after this check, we are calling
> > > ReorderBufferCheckMemoryLimit which will anyway stream the changes if
> > > required.
> >
> > Actually, ReorderBufferCheckMemoryLimit is only meant for checking
> > whether we need to stream the changes due to the memory limit.  But
> > suppose that when the memory limit was exceeded we could not stream the
> > transaction because there was only an incomplete toast insert, so we
> > serialized it.  Now we get the tuple which makes the changes
> > complete, but we are no longer crossing the memory limit as the changes were
> > already serialized.  So I am not sure whether it is a good idea to
> > stream the transaction as soon as we get the complete changes, or whether we
> > should wait till the next time the memory limit is exceeded and select
> > the suitable candidate at that time.
> >
>
> I think it is better to wait till next time we exceed the memory threshold.

Okay, done this way.


> >  Ideally, if we are in streaming mode and
> > the transaction is serialized, that means it was already a candidate for
> > streaming but could not be streamed due to the incomplete changes, so
> > shouldn't we stream it immediately as soon as its changes are complete
> > even though we are now within the memory limit?
> >
>
> The only time we need to stream or spill is when we exceed memory
> threshold.  In the above case, it is possible that next time there is
> some other candidate transaction that we can stream.
>
> > >
> > > Another comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple:
> > >
> > > + else if (rbtxn_has_toast_insert(txn) &&
> > > + ChangeIsInsertOrUpdate(change->action))
> > > + {
> > > + toptxn->txn_flags &= ~RBTXN_HAS_TOAST_INSERT;
> > > + can_stream = true;
> > > + }
> > > ..
> > > +#define ChangeIsInsertOrUpdate(action) \
> > > + (((action) == REORDER_BUFFER_CHANGE_INSERT) || \
> > > + ((action) == REORDER_BUFFER_CHANGE_UPDATE) || \
> > > + ((action) == REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT))
> > >
> > > How can we clear the RBTXN_HAS_TOAST_INSERT flag on
> > > REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT action?
> >
> > Partial toast insert means we have inserted into the toast table but not into
> > the main table.  So even if it is a spec insert we can form the complete
> > tuple; however, we still cannot stream it because we haven't got
> > spec_confirm, but for that we are marking another flag.  So if the
> > insert is a spec insert, the toast insert will also be a spec insert, and
> > as part of those toast spec inserts we are marking the partial tuple, so
> > clearing that flag should happen when the spec insert is done for the
> > main table, right?
> >
>
> Sounds reasonable.

ok




--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 19, 2020 at 5:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 15, 2020 at 2:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 12, 2020 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > > 4.
> > > +static void
> > > +stream_start_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> > > +{
> > > + LogicalDecodingContext *ctx = cache->private_data;
> > > + LogicalErrorCallbackState state;
> > > + ErrorContextCallback errcallback;
> > > +
> > > + Assert(!ctx->fast_forward);
> > > +
> > > + /* We're only supposed to call this when streaming is supported. */
> > > + Assert(ctx->streaming);
> > > +
> > > + /* Push callback + info on the error context stack */
> > > + state.ctx = ctx;
> > > + state.callback_name = "stream_start";
> > > + /* state.report_location = apply_lsn; */
> > >
> > > Why can't we supply the report_location here?  I think here we need to
> > > report txn->first_lsn if this is the very first stream and
> > > txn->final_lsn if it is any consecutive one.
> >
> > Done
> >
>
> Now after your change in stream_start_cb_wrapper, we assign
> report_location as the first_lsn passed as input to the function, but
> write_location is still txn->first_lsn.  Shouldn't we assign the passed-in
> first_lsn to write_location?  It seems assigning txn->first_lsn won't
> be correct for streams other than the first one.

Done

>
> > > 5.
> > > +static void
> > > +stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn)
> > > +{
> > > + LogicalDecodingContext *ctx = cache->private_data;
> > > + LogicalErrorCallbackState state;
> > > + ErrorContextCallback errcallback;
> > > +
> > > + Assert(!ctx->fast_forward);
> > > +
> > > + /* We're only supposed to call this when streaming is supported. */
> > > + Assert(ctx->streaming);
> > > +
> > > + /* Push callback + info on the error context stack */
> > > + state.ctx = ctx;
> > > + state.callback_name = "stream_stop";
> > > + /* state.report_location = apply_lsn; */
> > >
> > > Can't we report txn->final_lsn here
> >
> > We are already setting this to the  txn->final_ls in 0006 patch, but I
> > have moved it into this patch now.
> >
>
> Similar to previous point, here also, I think we need to assign report
> and write location as last_lsn passed to this API.

Done

> >
> > > v20-0005-Implement-streaming-mode-in-ReorderBuffer
> > > -----------------------------------------------------------------------------
> > > 10.
> > > Theoretically, we could get rid of the k-way merge, and append the
> > > changes to the toplevel xact directly (and remember the position
> > > in the list in case the subxact gets aborted later).
> > >
> > > I don't think this part of the commit message is correct as we
> > > sometimes need to spill even during streaming.  Please check the
> > > entire commit message and update according to the latest
> > > implementation.
> >
> > Done
> >
>
> You seem to forgot about removing the other part of message ("This
> adds a second iterator for the streaming case...." which is not
> relavant now.

Done


> > > 11.
> > > - * HeapTupleSatisfiesHistoricMVCC.
> > > + * tqual.c's HeapTupleSatisfiesHistoricMVCC.
> > > + *
> > > + * We do build the hash table even if there are no CIDs. That's
> > > + * because when streaming in-progress transactions we may run into
> > > + * tuples with the CID before actually decoding them. Think e.g. about
> > > + * INSERT followed by TRUNCATE, where the TRUNCATE may not be decoded
> > > + * yet when applying the INSERT. So we build a hash table so that
> > > + * ResolveCminCmaxDuringDecoding does not segfault in this case.
> > > + *
> > > + * XXX We might limit this behavior to streaming mode, and just bail
> > > + * out when decoding transaction at commit time (at which point it's
> > > + * guaranteed to see all CIDs).
> > >   */
> > >  static void
> > >  ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
> > > @@ -1350,9 +1498,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer
> > > *rb, ReorderBufferTXN *txn)
> > >   dlist_iter iter;
> > >   HASHCTL hash_ctl;
> > >
> > > - if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> > > - return;
> > > -
> > >
> > > I don't understand this change.  Why would "INSERT followed by
> > > TRUNCATE" could lead to a tuple which can come for decode before its
> > > CID?  The patch has made changes based on this assumption in
> > > HeapTupleSatisfiesHistoricMVCC which appears to be very risky as the
> > > behavior could be dependent on whether we are streaming the changes
> > > for in-progress xact or at the commit of a transaction.  We might want
> > > to generate a test to once validate this behavior.
> > >
> > > Also, the comment refers to tqual.c which is wrong as this API is now
> > > in heapam_visibility.c.
> >
> > Done.
> >
>
> + * INSERT.  So in such cases we assume the CIDs is from the future command
> + * and return as unresolve.
> + */
> + if (tuplecid_data == NULL)
> + return false;
> +
>
> Here lets reword the last line of comment as ".  So in such cases we
> assume the CID is from the future command."

Done


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 19, 2020 at 6:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, May 13, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > > 3.
> > > And, during catalog scan we can check the status of the xid and
> > > + * if it is aborted we will report a specific error that we can ignore.  We
> > > + * might have already streamed some of the changes for the aborted
> > > + * (sub)transaction, but that is fine because when we decode the abort we will
> > > + * stream abort message to truncate the changes in the subscriber.
> > > + */
> > > +static inline void
> > > +SetupCheckXidLive(TransactionId xid)
> > >
> > > In the above comment, I don't think it is right to say that we ignore
> > > the error raised due to the aborted transaction.  We need to say that
> > > we discard the already streamed changes on such an error.
> >
> > Done.
> >
>
> In the same comment, there is typo (/messageto/message to).

Done

> > > 4.
> > > +static inline void
> > > +SetupCheckXidLive(TransactionId xid)
> > > +{
> > >   /*
> > > - * If this transaction has no snapshot, it didn't make any changes to the
> > > - * database, so there's nothing to decode.  Note that
> > > - * ReorderBufferCommitChild will have transferred any snapshots from
> > > - * subtransactions if there were any.
> > > + * setup CheckXidAlive if it's not committed yet. We don't check if the xid
> > > + * aborted. That will happen during catalog access.  Also reset the
> > > + * sysbegin_called flag.
> > >   */
> > > - if (txn->base_snapshot == NULL)
> > > + if (!TransactionIdDidCommit(xid))
> > >   {
> > > - Assert(txn->ninvalidations == 0);
> > > - ReorderBufferCleanupTXN(rb, txn);
> > > - return;
> > > + CheckXidAlive = xid;
> > > + bsysscan = false;
> > >   }
> > >
> > > I think this function is inline as it needs to be called for each
> > > change. If that is the case and otherwise also, isn't it better that
> > > we check if passed xid is the same as CheckXidAlive before checking
> > > TransactionIdDidCommit as TransactionIdDidCommit can be costly and
> > > calling it for each change might not be a good idea?
> >
> > Done.  Also, I think it is good to check TransactionIdIsInProgress
> > instead of !TransactionIdDidCommit.  I have changed that as well.
> >
>
> What if it is aborted just before this check?  I think the decode API
> won't be able to detect that and sys* API won't care to check because
> CheckXidAlive won't be set for that case.

Yeah, that's the problem; I think it should be TransactionIdDidCommit only.
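
For clarity, a minimal sketch of the shape being discussed (not the exact patch code; CheckXidAlive and bsysscan are the globals referenced above):

static inline void
SetupCheckXidLive(TransactionId xid)
{
    /*
     * Fast path: if we have already armed the check for this xid while
     * processing a previous change, skip the costly clog lookup.
     */
    if (CheckXidAlive == xid)
        return;

    /*
     * Arm CheckXidAlive only for transactions not known to be committed;
     * an abort will be detected later, during catalog access.
     */
    if (!TransactionIdDidCommit(xid))
    {
        CheckXidAlive = xid;
        bsysscan = false;
    }
}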

> > > 5.
> > > setup CheckXidAlive if it's not committed yet. We don't check if the xid
> > > + * aborted. That will happen during catalog access.  Also reset the
> > > + * sysbegin_called flag.
> > >
> > > /if the xid aborted/if the xid is aborted.  missing comma after Also.
> >
> > Done
> >
>
> You forgot to change as per the second part of the comment (missing
> comma after Also).

Done


> > > 8.
> > > @@ -1588,8 +1766,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> > >   * use as a normal record. It'll be cleaned up at the end
> > >   * of INSERT processing.
> > >   */
> > > - if (specinsert == NULL)
> > > - elog(ERROR, "invalid ordering of speculative insertion changes");
> > >
> > > You have removed this check but all other handling of specinsert is
> > > same as far as this patch is concerned.  Why so?
> >
> > Seems like a merge issue, or the leftover from the old design of the
> > toast handling where we were streaming with the partial tuple.
> > fixed now.
> >
> > > 9.
> > > @@ -1676,8 +1860,6 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> > >   * freed/reused while restoring spooled data from
> > >   * disk.
> > >   */
> > > - Assert(change->data.tp.newtuple != NULL);
> > > -
> > >   dlist_delete(&change->node);
> > >
> > > Why is this Assert removed?
> >
> > Same cause as above so fixed.
> >
> > > 10.
> > > @@ -1753,7 +1935,15 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
> > >   relations[nrelations++] = relation;
> > >   }
> > >
> > > - rb->apply_truncate(rb, txn, nrelations, relations, change);
> > > + if (streaming)
> > > + {
> > > + rb->stream_truncate(rb, txn, nrelations, relations, change);
> > > +
> > > + /* Remember that we have sent some data. */
> > > + change->txn->any_data_sent = true;
> > > + }
> > > + else
> > > + rb->apply_truncate(rb, txn, nrelations, relations, change);
> > >
> > > Can we encapsulate this in a separate function like
> > > ReorderBufferApplyTruncate or something like that?  Basically, rather
> > > than having streaming check in this function, lets do it in some other
> > > internal function.  And we can likewise do it for all the streaming
> > > checks in this function or at least whereever it is feasible.  That
> > > will make this function look clean.
> >
> > Done for truncate and change.  I think we can create a few more such
> > functions for
> > start/stop and cleanup handling on error.  I will work on that.
> >
>
> Yeah, I think that would be better.

I have done some refactoring, please look into the latest version.
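
As an illustration of the direction (a sketch only; the helper name and exact signature are assumptions, not necessarily what the patch does), the truncate case could be wrapped like:

static void
ReorderBufferApplyTruncate(ReorderBuffer *rb, ReorderBufferTXN *txn,
                           int nrelations, Relation *relations,
                           ReorderBufferChange *change, bool streaming)
{
    /* Route the change to either the streaming or the regular callback. */
    if (streaming)
        rb->stream_truncate(rb, txn, nrelations, relations, change);
    else
        rb->apply_truncate(rb, txn, nrelations, relations, change);
}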

> One minor comment change suggestion:
> /*
> + * start stream or begin the transaction.  If this is the first
> + * change in the current stream.
> + */
>
> We can write the above comment as "Start the stream or begin the
> transaction for the first change in the current stream."

Done


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, May 19, 2020 at 6:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
>
> I have further reviewed v22 and below are my comments:
>
> v22-0005-Implement-streaming-mode-in-ReorderBuffer
> --------------------------------------------------------------------------
> 1.
> + * Note: We never do both stream and serialize a transaction (we only spill
> + * to disk when streaming is not supported by the plugin), so only one of
> + * those two flags may be set at any given time.
> + */
> +#define rbtxn_is_streamed(txn) \
> +( \
> + ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
> +)
>
> The above 'Note' is not correct as per the latest implementation.

That is removed in 0010 in the latest version you can see in 0006.

> v22-0006-Add-support-for-streaming-to-built-in-replicatio
> ----------------------------------------------------------------------------
> 2.
> --- a/src/backend/replication/logical/launcher.c
> +++ b/src/backend/replication/logical/launcher.c
> @@ -14,7 +14,6 @@
>   *
>   *-------------------------------------------------------------------------
>   */
> -
>  #include "postgres.h"
>
> Spurious line removal.

Fixed

> 3.
> +void
> +logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN *txn,
> +    XLogRecPtr commit_lsn)
> +{
> + uint8 flags = 0;
> +
> + pq_sendbyte(out, 'c'); /* action STREAM COMMIT */
> +
> + Assert(TransactionIdIsValid(txn->xid));
> +
> + /* transaction ID (we're starting to stream, so must be valid) */
> + pq_sendint32(out, txn->xid);
>
> The part of the comment "we're starting to stream, so must be valid"
> is not correct as we are not at the start of the stream here.  The
> patch has used the same incorrect sentence at few places, kindly fix
> those as well.

I have removed that part of the comment.

> 4.
> + * XXX Do we need to allocate it in TopMemoryContext?
> + */
> +static void
> +subxact_info_add(TransactionId xid)
> {
> ..
>
> For this and other places in a patch like in function
> stream_open_file(), instead of using TopMemoryContext, can we consider
> using a new memory context LogicalStreamingContext or something like
> that. We can create LogicalStreamingContext under TopMemoryContext.  I
> don't see any need of using TopMemoryContext here.

But when will we delete/reset the LogicalStreamingContext?  Because
we are planning to keep this memory as long as the worker is alive, it is
supposed to be in the top memory context.  If we create any other context
with the same life span as TopMemoryContext then what is the point?
Am I missing something?

> 5.
> +static void
> +subxact_info_add(TransactionId xid)
>
> This function has assumed a valid value for global variables like
> stream_fd and stream_xid.  I think it is better to have Assert for
> those in this function before using them.  The Assert for those are
> present in handle_streamed_transaction but I feel they should be in
> subxact_info_add.

Done

> 6.
> +subxact_info_add(TransactionId xid)
> /*
> + * In most cases we're checking the same subxact as we've already seen in
> + * the last call, so make ure just ignore it (this change comes later).
> + */
> + if (subxact_last == xid)
> + return;
>
> Typo and minor correction, /ure just/sure to

Done

> 7.
> +subxact_info_write(Oid subid, TransactionId xid)
> {
> ..
> + /*
> + * But we free the memory allocated for subxact info. There might be one
> + * exceptional transaction with many subxacts, and we don't want to keep
> + * the memory allocated forewer.
> + *
> + */
>
> a. Typo, /forewer/forever
> b. The extra line at the end of the comment is not required.

Done


> 8.
> + * XXX Maybe we should only include the checksum when the cluster is
> + * initialized with checksums?
> + */
> +static void
> +subxact_info_write(Oid subid, TransactionId xid)
>
> Do we really need to have the checksum for temporary files? I have
> checked a few other similar cases like SharedFileSet stuff for
> parallel hash join but didn't find them using checksums.  Can you also
> once see other usages of temporary files and then let us decide if we
> see any reason to have checksums for this?

Yeah, even I can see other places checksum is not used.

>
> Another point is we don't seem to be doing this for 'changes' file,
> see stream_write_change.  So, not sure, there is any sense to write
> checksum for subxact file.

I can see there is a comment atop this function:

* XXX The subxact file includes CRC32C of the contents. Maybe we should
* include something like that here too, but doing so will not be as
* straighforward, because we write the file in chunks.

>
> Tomas, do you see any reason for the same?


> 9.
> +subxact_filename(char *path, Oid subid, TransactionId xid)
> +{
> + char tempdirpath[MAXPGPATH];
> +
> + TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
> +
> + /*
> + * We might need to create the tablespace's tempfile directory, if no
> + * one has yet done so.
> + */
> + if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not create directory \"%s\": %m",
> + tempdirpath)));
> +
> + snprintf(path, MAXPGPATH, "%s/logical-%u-%u.subxacts",
> + tempdirpath, subid, xid);
> +}
>
> Temporary files created in PGDATA/base/pgsql_tmp follow a certain
> naming convention (see docs[1]) which is not followed here.  You can
> also refer SharedFileSetPath and OpenTemporaryFile.  I think we can
> just try to follow that convention and then additionally append subid,
> xid and .subxacts.  Also, a similar change is required for
> changes_filename.  I would like to know if there is a reason why we
> want to use different naming convention here?

I have changed it to this: pgsql_tmpPID-subid-xid.subxacts.
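
For illustration, a sketch of how the filename construction could follow that convention (the exact format string used in the patch may differ; PG_TEMP_FILE_PREFIX is the usual "pgsql_tmp" prefix):

static void
subxact_filename(char *path, Oid subid, TransactionId xid)
{
    char        tempdirpath[MAXPGPATH];

    TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);

    /* Follow the pgsql_tmp prefix convention used for other temp files. */
    snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.subxacts",
             tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
}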

> 10.
> + * This can only be called at the beginning of a "streaming" block, i.e.
> + * between stream_start/stream_stop messages from the upstream.
> + */
> +static void
> +stream_close_file(void)
>
> The comment seems to be wrong.  I think this can be only called at
> stream end, so it should be "This can only be called at the end of a
> "streaming" block, i.e. at stream_stop message from the upstream."

Right, I have fixed it.

> 11.
> + * the order the transactions are sent in. So streamed trasactions are
> + * handled separately by using schema_sent flag in ReorderBufferTXN.
> + *
>   * For partitions, 'pubactions' considers not only the table's own
>   * publications, but also those of all of its ancestors.
>   */
>  typedef struct RelationSyncEntry
>  {
>   Oid relid; /* relation oid */
> -
> + TransactionId xid; /* transaction that created the record */
>   /*
>   * Did we send the schema?  If ancestor relid is set, its schema must also
>   * have been sent for this to be true.
>   */
>   bool schema_sent;
> + List    *streamed_txns; /* streamed toplevel transactions with this
> + * schema */
>
> The part of comment "So streamed trasactions are handled separately by
> using schema_sent flag in ReorderBufferTXN." doesn't seem to match
> with what we are doing in the latest version of the patch.

Yeah, it's wrong,  I have fixed it.


> 12.
> maybe_send_schema()
> {
> ..
> + if (in_streaming)
> + {
> + /*
> + * TOCHECK: We have to send schema after each catalog change and it may
> + * occur when streaming already started, so we have to track new catalog
> + * changes somehow.
> + */
> + schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
> ..
> ..
> }
>
> I think it is good to once verify/test what this comment says but as
> per code we should be sending the schema after each catalog change as
> we invalidate the streamed_txns list in rel_sync_cache_relation_cb
> which must be called during relcache invalidation.  Do we see any
> problem with that mechanism?

I have tested this, I think we are already sending the schema after
each catalog change.
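
For reference, a minimal sketch of the per-streamed-transaction schema tracking (get_schema_sent_in_streamed_txn is the helper quoted above; the body here and the setter are only an illustration of the idea, and the patch may store the xids differently):

static bool
get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
{
    ListCell   *lc;

    foreach(lc, entry->streamed_txns)
    {
        if ((TransactionId) lfirst_int(lc) == xid)
            return true;
    }
    return false;
}

static void
set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
{
    /* The list must live as long as the cache entry itself. */
    MemoryContext oldctx = MemoryContextSwitchTo(CacheMemoryContext);

    entry->streamed_txns = lappend_int(entry->streamed_txns, (int) xid);
    MemoryContextSwitchTo(oldctx);
}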

> 13.
> +/*
> + * Notify downstream to discard the streamed transaction (along with all
> + * it's subtransactions, if it's a toplevel transaction).
> + */
> +static void
> +pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
> +    ReorderBufferTXN *txn,
> +    XLogRecPtr commit_lsn)
>
> This comment is copied from pgoutput_stream_abort, so doesn't match
> what this function is doing.

Done

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Fri, May 22, 2020 at 4:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > v22-0006-Add-support-for-streaming-to-built-in-replicatio
> > ----------------------------------------------------------------------------
> >
> Few more comments on v22-0006 patch:
>
> 1.
> +stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
> +{
> + int i;
> + char path[MAXPGPATH];
> + bool found = false;
> +
> + subxact_filename(path, subid, xid);
> +
> + if ((unlink(path) < 0) && (errno != ENOENT) && !missing_ok)
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not remove file \"%s\": %m", path)));
>
> Here, we have unlinked the files containing information of subxacts
> but don't we need to free the corresponding memory (memory for
> subxacts) as well?

Basically, stream_cleanup_files is used for
1) cleaning up the files on worker exit;
2) while writing the first segment of the xid, to ensure there are no
orphaned files with the same xid;
3) after apply commit, to clean up the files.

Whereas the subxacts memory is only used between stream start and stream
stop; as soon as we get stream stop we write the subxacts changes to the file
and free the memory.  So there is no case where we can have subxacts memory
at stream_cleanup_files, except on worker exit, but there we are
already exiting the worker.  IMHO we don't need to free the memory there.

> 2.
> apply_handle_stream_abort()
> {
> ..
> + subxact_filename(path, MyLogicalRepWorker->subid, xid);
> +
> + if (unlink(path) < 0)
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not remove file \"%s\": %m", path)));
> +
> + return;
> ..
> }
>
> Like the previous comment, it seems here also we need to free subxacts
> memory and additionally we forgot to adjust the xids array as well.

In this, we are allocating memory in subxact_info_read, but we are
again calling subxact_info_write which will free the memory.

> 3.
> apply_handle_stream_abort()
> {
> ..
> + /* XXX optimize the search by bsearch on sorted data */
> + for (i = nsubxacts; i > 0; i--)
> + {
> + if (subxacts[i - 1].xid == subxid)
> + {
> + subidx = (i - 1);
> + found = true;
> + break;
> + }
> + }
> +
> + if (!found)
> + return;
> ..
> }
>
> Is it possible that we didn't find the xid in subxacts array?  If so,
> I think we should mention the same in comments, otherwise, we should
> have an assert for found.

We may not find it in the case of an empty transaction; I have changed the comments.
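
In case it helps with the XXX above, a sketch of what a bsearch-style lookup could look like, assuming the subxacts array is kept ordered by xid (that ordering, and the SubXactInfo element type name, are assumptions to verify):

static int
subxact_find(SubXactInfo *subxacts, uint32 nsubxacts, TransactionId subxid)
{
    uint32      lo = 0;
    uint32      hi = nsubxacts;

    while (lo < hi)
    {
        uint32      mid = lo + (hi - lo) / 2;

        if (subxacts[mid].xid == subxid)
            return (int) mid;
        else if (TransactionIdPrecedes(subxacts[mid].xid, subxid))
            lo = mid + 1;
        else
            hi = mid;
    }

    return -1;                  /* not found, e.g. an empty subtransaction */
}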

> 4.
> apply_handle_stream_abort()
> {
> ..
> + changes_filename(path, MyLogicalRepWorker->subid, xid);
> +
> + if (truncate(path, subxacts[subidx].offset))
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not truncate file \"%s\": %m", path)));
> ..
> }
>
> Will truncate works on Windows?  I see in the code we ftruncate which
> is defined as chsize in win32.h and win32_port.h.  I have not tested
> this so I am not very sure about this.  I got a below warning when I
> tried to compile this code on Windows.  I think it is better to
> ftruncate as it is used at other places in the code as well.
>
> worker.c(798): warning C4013: 'truncate' undefined; assuming extern
> returning int

I have changed it to use ftruncate.
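
For illustration, one way the fd-based truncation could look (a sketch only; whether the patch opens a transient fd here or reuses an already-open handle is an assumption):

int         fd;

fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
if (fd < 0)
    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not open file \"%s\": %m", path)));

/* ftruncate() is mapped to chsize() on Windows, truncate() is not available. */
if (ftruncate(fd, subxacts[subidx].offset) != 0)
    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not truncate file \"%s\": %m", path)));

CloseTransientFile(fd);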

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Erik Rijkers
Date:
On 2020-05-25 16:37, Dilip Kumar wrote:
> On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> 
> wrote:
>> 
>> On Tue, May 19, 2020 at 6:01 PM Amit Kapila <amit.kapila16@gmail.com> 
>> wrote:
>> >
>> > On Fri, May 15, 2020 at 2:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> > >
>> 
>> I have further reviewed v22 and below are my comments:
>> 

>>    [v24.tar]

Hi,

I am not able to extract all files correctly from this tar.

The first file v24-0001-* seems to have some 'binary' junk at the top.

(The other 11 files seem normally readably)


Erik Rijkers





Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, May 25, 2020 at 8:48 PM Erik Rijkers <er@xs4all.nl> wrote:
>

> Hi,
>
> I am not able to extract all files correctly from this tar.
>
> The first file v24-0001-* seems to have some 'binary' junk at the top.
>
> (The other 11 files seem normally readably)

Okay, sending again.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, May 22, 2020 at 6:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Few comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple
> > 1.
> > + /*
> > + * If this is a toast insert then set the corresponding bit.  Otherwise, if
> > + * we have toast insert bit set and this is insert/update then clear the
> > + * bit.
> > + */
> > + if (toast_insert)
> > + toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
> > + else if (rbtxn_has_toast_insert(txn) &&
> > + ChangeIsInsertOrUpdate(change->action))
> > + {
> >
> > Here, it might better to add a comment on why we expect only
> > Insert/Update?  Also, it might be better that we add an assert for
> > other operations.
>
> I have added comments that why on Insert/Update we clean the flag.
> But I don't think we only expect insert/update,  we might get the
> toast delete right? because in toast update we will do toast delete +
> toast insert.  So when we get toast delete we just don't want to do
> anything.
>

Okay, that makes sense.

> >
> > 2.
> > @@ -1865,8 +1920,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
> > ReorderBufferTXN *txn,
> >   * disk.
> >   */
> >   dlist_delete(&change->node);
> > - ReorderBufferToastAppendChunk(rb, txn, relation,
> > -   change);
> > + ReorderBufferToastAppendChunk(rb, txn, relation,
> > +   change);
> >   }
> >
> > This seems to be a spurious change.
>
> Done
>
> 2. There is a bug fix in handling the stream abort in 0008 (earlier it
> was 0006).
>

The code changes look fine but it is not clear what was the exact
issue.  Can you explain?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, May 22, 2020 at 6:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, May 18, 2020 at 4:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > >
> > > > Review comments:
> > > > ------------------------------
> > > > 1.
> > > > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
> > > > TransactionId xid,
> > > >   }
> > > >
> > > >   case REORDER_BUFFER_CHANGE_MESSAGE:
> > > > - rb->message(rb, txn, change->lsn, true,
> > > > - change->data.msg.prefix,
> > > > - change->data.msg.message_size,
> > > > - change->data.msg.message);
> > > > + if (streaming)
> > > > + rb->stream_message(rb, txn, change->lsn, true,
> > > > +    change->data.msg.prefix,
> > > > +    change->data.msg.message_size,
> > > > +    change->data.msg.message);
> > > > + else
> > > > + rb->message(rb, txn, change->lsn, true,
> > > > +    change->data.msg.prefix,
> > > > +    change->data.msg.message_size,
> > > > +    change->data.msg.message);
> > > >
> > > > Don't we need to set any_data_sent flag while streaming messages as we
> > > > do for other types of changes?
> > >
> > > I think any_data_sent, was added to avoid sending abort to the
> > > subscriber if we haven't sent any data,  but this is not complete as
> > > the output plugin can also take the decision not to send.  So I think
> > > this should not be done as part of this patch and can be done
> > > separately.  I think there is already a thread for handling the
> > > same[1]
> > >
> >
> > Hmm, but prior to this patch, we never use to send (empty) aborts but
> > now that will be possible. It is probably okay to deal that with
> > another patch mentioned by you but I felt at least any_data_sent will
> > work for some cases.  OTOH, it appears to be half-baked solution, so
> > we should probably refrain from adding it.  BTW, how do the pgoutput
> > plugin deal with it? I see that apply_handle_stream_abort will
> > unconditionally try to unlink the file and it will probably fail.
> > Have you tested this scenario after your latest changes?
>
> Yeah, I see, I think this is a problem,  but this exists without my
> latest change as well, if pgoutput ignore some changes because it is
> not published then we will see a similar error.  Shall we handle the
> ENOENT error case from unlink?
>

Isn't this problem only for the subxact file, as we anyway create the changes
file as part of the start stream message, which should have come after the
abort?  If so, can't we detect whether the subxact file exists, probably by
using nsubxacts or something like that?  Can you please try to
reproduce this scenario once to ensure that we are not missing anything?

>
>
> > > > 8.
> > > > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> > > > *rb, TransactionId xid,
> > > >   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> > > >
> > > >   txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > > +
> > > > + /*
> > > > + * TOCHECK: Mark toplevel transaction as having catalog changes too
> > > > + * if one of its children has.
> > > > + */
> > > > + if (txn->toptxn != NULL)
> > > > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > >  }
> > > >
> > > > Why are we marking top transaction here?
> > >
> > > We need to mark top transaction to decide whether to build tuplecid
> > > hash or not.  In non-streaming mode, we are only sending during the
> > > commit time, and during commit time we know whether the top
> > > transaction has any catalog changes or not based on the invalidation
> > > message so we are marking the top transaction there in DecodeCommit.
> > > Since here we are not waiting till commit so we need to mark the top
> > > transaction as soon as we mark any of its child transactions.
> > >
> >
> > But how does it help?  We use this flag (via
> > ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn which is
> > anyway done in DecodeCommit and that too after setting this flag for
> > the top transaction if required.  So, how will it help in setting it
> > while processing for subxid.  Also, even if we have to do it won't it
> > add the xid needlessly in builder->committed.xip array?
>
> In ReorderBufferBuildTupleCidHash, we use this flag to decide whether
> to build the tuplecid hash or not based on whether it has catalog
> changes or not.
>

Okay, but you haven't answered the second part of the question: "won't
it add the xid of top transaction needlessly in builder->committed.xip
array, see function SnapBuildCommitTxn?"  IIUC, this can happen
without patch as well because DecodeCommit also sets the flags just
based on invalidation messages irrespective of whether the messages
are generated by top transaction or not, is that right?  If this is
correct, please explain why we are doing so in the comments.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 26, 2020 at 10:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 22, 2020 at 6:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, May 18, 2020 at 5:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Few comments on v20-0010-Bugfix-handling-of-incomplete-toast-tuple
> > > 1.
> > > + /*
> > > + * If this is a toast insert then set the corresponding bit.  Otherwise, if
> > > + * we have toast insert bit set and this is insert/update then clear the
> > > + * bit.
> > > + */
> > > + if (toast_insert)
> > > + toptxn->txn_flags |= RBTXN_HAS_TOAST_INSERT;
> > > + else if (rbtxn_has_toast_insert(txn) &&
> > > + ChangeIsInsertOrUpdate(change->action))
> > > + {
> > >
> > > Here, it might better to add a comment on why we expect only
> > > Insert/Update?  Also, it might be better that we add an assert for
> > > other operations.
> >
> > I have added comments that why on Insert/Update we clean the flag.
> > But I don't think we only expect insert/update,  we might get the
> > toast delete right? because in toast update we will do toast delete +
> > toast insert.  So when we get toast delete we just don't want to do
> > anything.
> >
>
> Okay, that makes sense.
>
> > >
> > > 2.
> > > @@ -1865,8 +1920,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb,
> > > ReorderBufferTXN *txn,
> > >   * disk.
> > >   */
> > >   dlist_delete(&change->node);
> > > - ReorderBufferToastAppendChunk(rb, txn, relation,
> > > -   change);
> > > + ReorderBufferToastAppendChunk(rb, txn, relation,
> > > +   change);
> > >   }
> > >
> > > This seems to be a spurious change.
> >
> > Done
> >
> > 2. There is a bug fix in handling the stream abort in 0008 (earlier it
> > was 0006).
> >
>
> The code changes look fine but it is not clear what was the exact
> issue.  Can you explain?

Basically, in the case of an empty subtransaction, we were reading the
subxacts info, but when we could not find the subxid in the subxacts
info we were not releasing the memory.  So the next subxact_info_read
expects that subxacts has already been freed, but we did not free it in
that !found case.
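
To make that concrete, a sketch of the kind of fix described (variable names follow the fragments quoted earlier in the thread; the exact cleanup in the patch may differ):

if (!found)
{
    /*
     * Empty subtransaction: nothing to truncate, but still release the
     * subxact bookkeeping so the next subxact_info_read() starts clean.
     */
    if (subxacts)
        pfree(subxacts);
    subxacts = NULL;
    nsubxacts = 0;
    return;
}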

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, May 25, 2020 at 8:07 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > 4.
> > + * XXX Do we need to allocate it in TopMemoryContext?
> > + */
> > +static void
> > +subxact_info_add(TransactionId xid)
> > {
> > ..
> >
> > For this and other places in a patch like in function
> > stream_open_file(), instead of using TopMemoryContext, can we consider
> > using a new memory context LogicalStreamingContext or something like
> > that. We can create LogicalStreamingContext under TopMemoryContext.  I
> > don't see any need of using TopMemoryContext here.
>
> But, when we will delete/reset the LogicalStreamingContext?
>

Why can't we reset it at each stream stop message?

>  because
> we are planning to keep this memory until the worker is alive so that
> supposed to be the top memory context.
>

Which part of the allocation do we want to keep till the worker is alive?
Why do we need the memory related to subxacts till the worker is alive?  As
we have it now, after reading the subxact info (subxact_info_read), we need
to ensure that it is freed after its usage, due to which we need to
remember and perform pfree at various places.

I think we should explore the possibility of switching to this new
context in the start stream message and resetting it in the
stop stream message.  That might help in avoiding
MemoryContextSwitchTo(TopMemoryContext) at various places.
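
For illustration, a minimal sketch of that idea (the context name comes from the discussion above; the function names and the exact create/reset points are assumptions):

static MemoryContext LogicalStreamingContext = NULL;

static void
stream_start_internal(void)
{
    /* Created once, under TopMemoryContext, on the first stream start. */
    if (LogicalStreamingContext == NULL)
        LogicalStreamingContext =
            AllocSetContextCreate(TopMemoryContext,
                                  "LogicalStreamingContext",
                                  ALLOCSET_DEFAULT_SIZES);
}

static void
stream_stop_internal(void)
{
    /* Release everything allocated while applying this stream block. */
    MemoryContextReset(LogicalStreamingContext);
}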

>  If we create any other context
> with the same life span as TopMemoryContext then what is the point?
>

It is helpful for debugging.  It is recommended that we don't use the
top memory context unless it is really required.  Read about it in
src/backend/utils/mmgr/README.

>
> > 8.
> > + * XXX Maybe we should only include the checksum when the cluster is
> > + * initialized with checksums?
> > + */
> > +static void
> > +subxact_info_write(Oid subid, TransactionId xid)
> >
> > Do we really need to have the checksum for temporary files? I have
> > checked a few other similar cases like SharedFileSet stuff for
> > parallel hash join but didn't find them using checksums.  Can you also
> > once see other usages of temporary files and then let us decide if we
> > see any reason to have checksums for this?
>
> Yeah, even I can see other places checksum is not used.
>

So, unless someone speaks up before you are ready for the next version
of the patch, can we remove it?

> >
> > Another point is we don't seem to be doing this for 'changes' file,
> > see stream_write_change.  So, not sure, there is any sense to write
> > checksum for subxact file.
>
> I can see there are comment atop this function
>
> * XXX The subxact file includes CRC32C of the contents. Maybe we should
> * include something like that here too, but doing so will not be as
> * straighforward, because we write the file in chunks.
>

You can remove this comment as well.  I don't know how advantageous it
is to checksum temporary files.  We can anyway add it later if there
is a reason for doing so.

>
>
> > 12.
> > maybe_send_schema()
> > {
> > ..
> > + if (in_streaming)
> > + {
> > + /*
> > + * TOCHECK: We have to send schema after each catalog change and it may
> > + * occur when streaming already started, so we have to track new catalog
> > + * changes somehow.
> > + */
> > + schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
> > ..
> > ..
> > }
> >
> > I think it is good to once verify/test what this comment says but as
> > per code we should be sending the schema after each catalog change as
> > we invalidate the streamed_txns list in rel_sync_cache_relation_cb
> > which must be called during relcache invalidation.  Do we see any
> > problem with that mechanism?
>
> I have tested this, I think we are already sending the schema after
> each catalog change.
>

Then remove "TOCHECK" in the above comment.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, May 26, 2020 at 2:44 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, May 26, 2020 at 10:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > >
> > > 2. There is a bug fix in handling the stream abort in 0008 (earlier it
> > > was 0006).
> > >
> >
> > The code changes look fine but it is not clear what was the exact
> > issue.  Can you explain?
>
> Basically, in case of an empty subtransaction, we were reading the
> subxacts info but when we could not find the subxid in the subxacts
> info we were not releasing the memory.  So on next subxact_info_read
> it will expect that subxacts should be freed but we did not free it in
> that !found case.
>

Okay, on looking at it again, the same code exists in
subxact_info_write as well.  It is better to have a function for it.
Can we have a structure like SubXactContext for all the variables used
for subxact?  As mentioned earlier I find the allocation/deallocation
of subxacts a bit ad-hoc, so there will always be a chance that we can
forget to free it.  Having it allocated in memory context which we can
reset later might reduce that risk.  One idea could be that we have a
special memory context for start and stop messages which can be used
to allocate the subxacts there.  In case of commit/abort, we can allow
subxacts information to be allocated in ApplyMessageContext which is
reset at the end of each protocol message.
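
As a rough sketch of what such a structure could group together (field names are illustrative, based on the variables discussed in this thread; SubXactInfo is assumed to be the per-entry struct holding the xid and file offset):

typedef struct SubXactContext
{
    uint32          nsubxacts;      /* number of subxacts recorded so far */
    uint32          nsubxacts_max;  /* allocated length of the array */
    TransactionId   subxact_last;   /* last subxact seen, to skip repeats */
    SubXactInfo    *subxacts;       /* per-subxact xid and file offset */
} SubXactContext;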

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Mahendra Singh Thalor
Date:
On Tue, 26 May 2020 at 16:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, May 26, 2020 at 2:44 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 26, 2020 at 10:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > >
> > > > 2. There is a bug fix in handling the stream abort in 0008 (earlier it
> > > > was 0006).
> > > >
> > >
> > > The code changes look fine but it is not clear what was the exact
> > > issue.  Can you explain?
> >
> > Basically, in case of an empty subtransaction, we were reading the
> > subxacts info but when we could not find the subxid in the subxacts
> > info we were not releasing the memory.  So on next subxact_info_read
> > it will expect that subxacts should be freed but we did not free it in
> > that !found case.
> >
>
> Okay, on looking at it again, the same code exists in
> subxact_info_write as well.  It is better to have a function for it.
> Can we have a structure like SubXactContext for all the variables used
> for subxact?  As mentioned earlier I find the allocation/deallocation
> of subxacts a bit ad-hoc, so there will always be a chance that we can
> forget to free it.  Having it allocated in memory context which we can
> reset later might reduce that risk.  One idea could be that we have a
> special memory context for start and stop messages which can be used
> to allocate the subxacts there.  In case of commit/abort, we can allow
> subxacts information to be allocated in ApplyMessageContext which is
> reset at the end of each protocol message.
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>
>

Hi all,
On top of the v16 patch set [1], I did some testing for DDLs and DMLs to measure WAL size and performance.  Below is the testing summary:

Test parameters:
wal_level = 'logical'
max_connections = '150'
wal_receiver_timeout = '600s'
max_wal_size = '2GB'
min_wal_size = '2GB'
autovacuum = 'off'
checkpoint_timeout = '1d'

Test results:

For each workload and operation type, the lines below show "LSN diff (in bytes) / time (in sec)", without patch -> with patch, followed by the % LSN change caused by the patch.

1.  1 DDL
    CREATE index:      17728 / 0.89116   -> 18016 / 0.804868   (+1.624548 %)
    Add col int(date): 976 / 0.764393    -> 1088 / 0.763602    (+11.475409 %)
    Add col text:      33904 / 0.80044   -> 34856 / 0.787108   (+2.80792 %)

2.  2 DDL
    CREATE index:      19872 / 0.860348  -> 20416 / 0.839065   (+2.73752 %)
    Add col int(date): 1632 / 0.763199   -> 1856 / 0.733147    (+13.7254902 %)
    Add col text:      34560 / 0.806086  -> 35624 / 0.829281   (+3.078703 %)

3.  3 DDL
    CREATE index:      22016 / 0.894891  -> 22816 / 0.828028   (+3.63372093 %)
    Add col int(date): 2288 / 0.776871   -> 2624 / 0.737177    (+14.685314 %)
    Add col text:      35216 / 0.803493  -> 36392 / 0.800194   (+3.339391186 %)

4.  4 DDL
    CREATE index:      24160 / 0.901686  -> 25240 / 0.887143   (+4.4701986 %)
    Add col int(date): 2944 / 0.768445   -> 3392 / 0.768382    (+15.217391 %)
    Add col text:      35872 / 0.77489   -> 37160 / 0.82777    (+3.590544 %)

5.  5 DDL
    CREATE index:      26328 / 0.901686  -> 27640 / 0.914078   (+4.9832877 %)
    Add col int(date): 3600 / 0.751879   -> 4160 / 0.74709     (+15.555555 %)
    Add col text:      36528 / 0.817928  -> 37928 / 0.820621   (+3.832676 %)

6.  6 DDL
    CREATE index:      28472 / 0.936385  -> 30040 / 0.958226   (+5.5071649 %)
    Add col int(date): 4256 / 0.745179   -> 4928 / 0.725321    (+15.78947368 %)
    Add col text:      37184 / 0.797043  -> 38696 / 0.814535   (+4.066265 %)

7.  8 DDL
    CREATE index:      32760 / 1.0022203 -> 34864 / 0.966777   (+6.422466 %)
    Add col int(date): 5568 / 0.757468   -> 6464 / 0.769072    (+16.091954 %)
    Add col text:      38496 / 0.83207   -> 40232 / 0.903604   (+4.509559 %)

8.  11 DDL
    CREATE index:      50296 / 1.0022203 -> 53144 / 0.966777   (+5.662478 %)
    Add col int(date): 7536 / 0.748332   -> 8792 / 0.750553    (+16.666666 %)
    Add col text:      40464 / 0.822266  -> 42560 / 0.797133   (+5.179913 %)

9.  15 DDL
    CREATE index:      58896 / 1.267253  -> 62768 / 1.27234    (+5.662478 %)
    Add col int(date): 10184 / 0.776875  -> 11864 / 0.746844   (+16.496465 %)
    Add col text:      43112 / 0.821916  -> 45632 / 0.812567   (+5.84524 %)

10. 1 DDL & 3 DML
    CREATE index:      18240 / 0.812551  -> 18536 / 0.819089   (+1.6228 %)
    Add col int(date): 1192 / 0.771993   -> 1312 / 0.785117    (+10.067114 %)
    Add col text:      34120 / 0.849467  -> 35080 / 0.855456   (+2.8113599 %)

11. 3 DDL & 5 DML
    CREATE index:      23656 / 0.926616  -> 24480 / 0.915517   (+3.4832606 %)
    Add col int(date): 2656 / 0.758029   -> 3016 / 0.797206    (+13.55421687 %)
    Add col text:      35584 / 0.829377  -> 36784 / 0.839176   (+3.372302 %)

12. 10 DDL & 5 DML
    CREATE index:      52760 / 1.101005  -> 55376 / 1.105241   (+4.958301744 %)
    Add col int(date): 7288 / 0.763065   -> 8456 / 0.779257    (+16.02634468 %)
    Add col text:      40216 / 0.837843  -> 42224 / 0.835206   (+4.993037 %)

13. 10 DML
    CREATE index:      1008 / 0.791091   -> 1072 / 0.807875    (+6.349206 %)
    Add col int(date): 1008 / 0.81105    -> 1072 / 0.771113    (+6.349206 %)
    Add col text:      1008 / 0.78817    -> 1072 / 0.759789    (+6.349206 %)

To see all operations, please see the test_results spreadsheet [2].

Summary:
Basically, the patch writes a per-command invalidation message, and to test that I ran different combinations of DDL and DML operations.  I have not observed any performance degradation with the patch.  For "create index" DDLs, the WAL %change is 1-7% for 1-15 DDLs.  For "add col int/date" DDLs, it is 11-17% for 1-15 DDLs, and for "add col text" DDLs, it is 2-6% for 1-15 DDLs.  For the mixed (DDL & DML) cases, it is 2-10%.

As for why the "add col int/date" case shows a comparatively high percentage of extra WAL: the absolute amount of extra WAL is not very high, but the WAL generated by an "add column int/date" DDL is only ~1000 bytes, so an additional ~100 bytes comes to around 10%.  For "add column text" the base is ~35000 bytes, so the percentage is smaller; those ~35000 bytes are mostly due to TOAST.
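
For reference, the "% LSN change" column above appears to be just the relative
WAL growth with the patch; a tiny throwaway program (mine, not part of the
patch set) reproduces the first "Add col int(date)" row:

#include <stdio.h>

/* (with - without) / without, in percent */
static double
lsn_pct_change(double without_patch, double with_patch)
{
    return (with_patch - without_patch) / without_patch * 100.0;
}

int
main(void)
{
    /* row 1, "Add col int(date)": 976 -> 1088 bytes */
    printf("%.6f\n", lsn_pct_change(976, 1088));    /* prints 11.475410 */
    return 0;
}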

[1]: https://www.postgresql.org/message-id/CAFiTN-vnnrk580ucZVYnub_UQ-ayROew8fQ2Yn5aFYMeF0U03w%40mail.gmail.com
[2]: https://docs.google.com/spreadsheets/d/1g11MrSd_I39505OnGoLFVslz3ykbZ1nmfR_gUiE_O9k/edit?usp=sharing

--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 26, 2020 at 7:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, May 25, 2020 at 8:48 PM Erik Rijkers <er@xs4all.nl> wrote:
> >
>
> > Hi,
> >
> > I am not able to extract all files correctly from this tar.
> >
> > The first file v24-0001-* seems to have some 'binary' junk at the top.
> >
> > (The other 11 files seem normally readably)
>
> Okay, sending again.

While reviewing/testing I have found a couple of problems in 0005 and
0006 which I have fixed in the attached version.

In 0005:  Basically, in the latest version we start a stream (or begin the
txn) only if there are any changes, because we do this inside the while
loop; so we also need to send stream_stop/commit only if we have actually
started the stream.
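
In other words (a sketch with a hypothetical iterator, not the actual 0005
hunk), the stream is opened lazily on the first change, so the stop has to
be guarded by the same flag:

/* Sketch only: open the stream lazily, close it only if it was opened. */
static void
stream_pending_changes(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
    ReorderBufferChange *change;
    bool        started = false;

    while ((change = get_next_change(txn)) != NULL)   /* hypothetical iterator */
    {
        if (!started)
        {
            rb->stream_start(rb, txn, change->lsn);   /* callback added by this series */
            started = true;
        }
        /* ... hand the change to the streaming output callbacks ... */
    }

    if (started)
        rb->stream_stop(rb, txn, txn->final_lsn);
}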

In 0006: If we are streaming the serialized changes and there are still a
few incomplete changes, then currently we are not deleting the spilled
file; but the spill file contains all the changes of the transaction,
because there is no way to partially truncate it.  So in the next stream
it would try to resend those.  I have fixed this by sending the spilled
transaction as soon as its changes are complete, so ideally we can always
delete the spilled file.  It is also a better solution because this
transaction was already spilled once, and that happened because we could
not stream it, so we had better stream it at the first opportunity; that
will reduce the replay lag, which is our whole purpose here.
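
The gist of that fix, as quoted later in this review, is a small check once
the pending change becomes complete:

/*
 * If the transaction was already serialized and its changes are now
 * complete, stream it immediately instead of waiting to hit the memory
 * limit again.
 */
if (rbtxn_is_serialized(txn))
    ReorderBufferStreamTXN(rb, toptxn);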

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 26, 2020 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 22, 2020 at 6:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, May 18, 2020 at 4:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Sun, May 17, 2020 at 12:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Fri, May 15, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > >
> > > > > Review comments:
> > > > > ------------------------------
> > > > > 1.
> > > > > @@ -1762,10 +1952,16 @@ ReorderBufferCommit(ReorderBuffer *rb,
> > > > > TransactionId xid,
> > > > >   }
> > > > >
> > > > >   case REORDER_BUFFER_CHANGE_MESSAGE:
> > > > > - rb->message(rb, txn, change->lsn, true,
> > > > > - change->data.msg.prefix,
> > > > > - change->data.msg.message_size,
> > > > > - change->data.msg.message);
> > > > > + if (streaming)
> > > > > + rb->stream_message(rb, txn, change->lsn, true,
> > > > > +    change->data.msg.prefix,
> > > > > +    change->data.msg.message_size,
> > > > > +    change->data.msg.message);
> > > > > + else
> > > > > + rb->message(rb, txn, change->lsn, true,
> > > > > +    change->data.msg.prefix,
> > > > > +    change->data.msg.message_size,
> > > > > +    change->data.msg.message);
> > > > >
> > > > > Don't we need to set any_data_sent flag while streaming messages as we
> > > > > do for other types of changes?
> > > >
> > > > I think any_data_sent, was added to avoid sending abort to the
> > > > subscriber if we haven't sent any data,  but this is not complete as
> > > > the output plugin can also take the decision not to send.  So I think
> > > > this should not be done as part of this patch and can be done
> > > > separately.  I think there is already a thread for handling the
> > > > same[1]
> > > >
> > >
> > > Hmm, but prior to this patch, we never use to send (empty) aborts but
> > > now that will be possible. It is probably okay to deal that with
> > > another patch mentioned by you but I felt at least any_data_sent will
> > > work for some cases.  OTOH, it appears to be half-baked solution, so
> > > we should probably refrain from adding it.  BTW, how do the pgoutput
> > > plugin deal with it? I see that apply_handle_stream_abort will
> > > unconditionally try to unlink the file and it will probably fail.
> > > Have you tested this scenario after your latest changes?
> >
> > Yeah, I see, I think this is a problem,  but this exists without my
> > latest change as well, if pgoutput ignore some changes because it is
> > not published then we will see a similar error.  Shall we handle the
> > ENOENT error case from unlink?
> Isn't this problem only for subxact file as we anyway create changes
> file as part of start stream message which should have come after
> abort?  If so, can't we detect whether subxact file exists probably by
> using nsubxacts or something like that?  Can you please once try to
> reproduce this scenario to ensure that we are not missing anything?

I have tested this, as of now, by default we create both changes and
subxact files irrespective of whether we get any subtransactions or
not.  Maybe this could be optimized that only if we have any subxact
then only create that file otherwise not?  What's your opinion on the
same.

> > > > > 8.
> > > > > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> > > > > *rb, TransactionId xid,
> > > > >   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> > > > >
> > > > >   txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > > > +
> > > > > + /*
> > > > > + * TOCHECK: Mark toplevel transaction as having catalog changes too
> > > > > + * if one of its children has.
> > > > > + */
> > > > > + if (txn->toptxn != NULL)
> > > > > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > > >  }
> > > > >
> > > > > Why are we marking top transaction here?
> > > >
> > > > We need to mark top transaction to decide whether to build tuplecid
> > > > hash or not.  In non-streaming mode, we are only sending during the
> > > > commit time, and during commit time we know whether the top
> > > > transaction has any catalog changes or not based on the invalidation
> > > > message so we are marking the top transaction there in DecodeCommit.
> > > > Since here we are not waiting till commit so we need to mark the top
> > > > transaction as soon as we mark any of its child transactions.
> > > >
> > >
> > > But how does it help?  We use this flag (via
> > > ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn which is
> > > anyway done in DecodeCommit and that too after setting this flag for
> > > the top transaction if required.  So, how will it help in setting it
> > > while processing for subxid.  Also, even if we have to do it won't it
> > > add the xid needlessly in builder->committed.xip array?
> >
> > In ReorderBufferBuildTupleCidHash, we use this flag to decide whether
> > to build the tuplecid hash or not based on whether it has catalog
> > changes or not.
> >
>
> Okay, but you haven't answered the second part of the question: "won't
> it add the xid of top transaction needlessly in builder->committed.xip
> array, see function SnapBuildCommitTxn?"  IIUC, this can happen
> without patch as well because DecodeCommit also sets the flags just
> based on invalidation messages irrespective of whether the messages
> are generated by top transaction or not, is that right?

Yes, with or without the patch it always adds the topxid.  I think
purpose for doing this with/without patch is not for the snapshot
instead we are marking the top itself that some of its subtxn has the
catalog changes, so that while building the tuplecid hash we can know
whether to build the hash or not.  But, having said that I feel in
ReorderBufferBuildTupleCidHash why do we need these two checks
if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
return;

I mean it should be enough to just have the check,  because if we have
added something to the tuplecids then catalog changes must be there
because that time we are setting the catalog changes to true.

if (dlist_is_empty(&txn->tuplecids))
return;

I think in the base code there are multiple things going on
1. If we get new CID we always set the catalog change in that
transaction but add the tuplecids in the top transaction.  So
basically, top transaction is so far not marked with catalog changes
but it has tuplecids.
2. Now, in DecodeCommit the top xid will be marked that it has catalog
changes based on the invalidation messages.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 26, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 25, 2020 at 8:07 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > 4.
> > > + * XXX Do we need to allocate it in TopMemoryContext?
> > > + */
> > > +static void
> > > +subxact_info_add(TransactionId xid)
> > > {
> > > ..
> > >
> > > For this and other places in a patch like in function
> > > stream_open_file(), instead of using TopMemoryContext, can we consider
> > > using a new memory context LogicalStreamingContext or something like
> > > that. We can create LogicalStreamingContext under TopMemoryContext.  I
> > > don't see any need of using TopMemoryContext here.
> >
> > But, when we will delete/reset the LogicalStreamingContext?
> >
>
> Why can't we reset it at each stream stop message?
> >  because
> > we are planning to keep this memory until the worker is alive so that
> > supposed to be the top memory context.
> >
>
> Which part of allocation do we want to keep till the worker is alive?

static TransactionId *xids = NULL; this we need to keep for the worker's lifespan.

> Why we need memory-related to subxacts till the worker is alive?  As
> we have now, after reading subxact info (subxact_info_read), we need
> to ensure that it is freed after its usage due to which we need to
> remember and perform pfree at various places.
>
> I think we should once see the possibility that such that we could
> switch to this new context in start stream message and reset it in
> stop stream message.  That might help in avoiding
> MemoryContextSwitchTo TopMemoryContext at various places.

Ok, I understand, I think subxacts can be allocated in new
LogicalStreamingContext which we can reset at the stream stop.  How
about xids?
shall we create another context that will stay until the worker lifespan?
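
For the subxacts part, a minimal sketch of the suggestion (assuming a context
created once under TopMemoryContext and reset on every stream stop; not the
actual patch code, and it needs utils/memutils.h):

static MemoryContext LogicalStreamingContext = NULL;

static void
stream_start_setup(void)
{
    if (LogicalStreamingContext == NULL)
        LogicalStreamingContext =
            AllocSetContextCreate(TopMemoryContext,
                                  "LogicalStreamingContext",
                                  ALLOCSET_DEFAULT_SIZES);
    /* per-stream allocations (e.g. subxacts) go into this context */
}

static void
stream_stop_cleanup(void)
{
    /* discard everything allocated while applying this stream */
    MemoryContextReset(LogicalStreamingContext);
}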

> >  If we create any other context
> > with the same life span as TopMemoryContext then what is the point?
> >
>
> It is helpful for debugging.  It is recommended that we don't use the
> top memory context unless it is really required.  Read about it in
> src/backend/utils/mmgr/README.

I see.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, May 28, 2020 at 12:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, May 26, 2020 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Isn't this problem only for subxact file as we anyway create changes
> > file as part of start stream message which should have come after
> > abort?  If so, can't we detect whether subxact file exists probably by
> > using nsubxacts or something like that?  Can you please once try to
> > reproduce this scenario to ensure that we are not missing anything?
>
> I have tested this, as of now, by default we create both changes and
> subxact files irrespective of whether we get any subtransactions or
> not.  Maybe this could be optimized that only if we have any subxact
> then only create that file otherwise not?  What's your opinion on the
> same.
>

Yeah, that makes sense.

> > > > > > 8.
> > > > > > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> > > > > > *rb, TransactionId xid,
> > > > > >   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> > > > > >
> > > > > >   txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > > > > +
> > > > > > + /*
> > > > > > + * TOCHECK: Mark toplevel transaction as having catalog changes too
> > > > > > + * if one of its children has.
> > > > > > + */
> > > > > > + if (txn->toptxn != NULL)
> > > > > > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > > > >  }
> > > > > >
> > > > > > Why are we marking top transaction here?
> > > > >
> > > > > We need to mark top transaction to decide whether to build tuplecid
> > > > > hash or not.  In non-streaming mode, we are only sending during the
> > > > > commit time, and during commit time we know whether the top
> > > > > transaction has any catalog changes or not based on the invalidation
> > > > > message so we are marking the top transaction there in DecodeCommit.
> > > > > Since here we are not waiting till commit so we need to mark the top
> > > > > transaction as soon as we mark any of its child transactions.
> > > > >
> > > >
> > > > But how does it help?  We use this flag (via
> > > > ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn which is
> > > > anyway done in DecodeCommit and that too after setting this flag for
> > > > the top transaction if required.  So, how will it help in setting it
> > > > while processing for subxid.  Also, even if we have to do it won't it
> > > > add the xid needlessly in builder->committed.xip array?
> > >
> > > In ReorderBufferBuildTupleCidHash, we use this flag to decide whether
> > > to build the tuplecid hash or not based on whether it has catalog
> > > changes or not.
> > >
> >
> > Okay, but you haven't answered the second part of the question: "won't
> > it add the xid of top transaction needlessly in builder->committed.xip
> > array, see function SnapBuildCommitTxn?"  IIUC, this can happen
> > without patch as well because DecodeCommit also sets the flags just
> > based on invalidation messages irrespective of whether the messages
> > are generated by top transaction or not, is that right?
>
> Yes, with or without the patch it always adds the topxid.  I think
> purpose for doing this with/without patch is not for the snapshot
> instead we are marking the top itself that some of its subtxn has the
> catalog changes so that while building the tuplecid has we can know
> whether to build the hash or not.  But, having said that I feel in
> ReorderBufferBuildTupleCidHash why do we need these two checks
> if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> return;
>
> I mean it should be enough to just have the check,  because if we have
> added something to the tuplecids then catalog changes must be there
> because that time we are setting the catalog changes to true.
>
> if (dlist_is_empty(&txn->tuplecids))
> return;
>
> I think in the base code there are multiple things going on
> 1. If we get new CID we always set the catalog change in that
> transaction but add the tuplecids in the top transaction.  So
> basically, top transaction is so far not marked with catalog changes
> but it has tuplecids.
> 2. Now, in DecodeCommit the top xid will be marked that it has catalog
> changes based on the invalidation messages.
>

I don't think it is advisable to remove that check from base code
unless we have a strong reason for doing so.  I think here you can
write better comments about why you are marking the flag for top
transaction and remove TOCHECK from the comment.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, May 28, 2020 at 12:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, May 26, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Why we need memory-related to subxacts till the worker is alive?  As
> > we have now, after reading subxact info (subxact_info_read), we need
> > to ensure that it is freed after its usage due to which we need to
> > remember and perform pfree at various places.
> >
> > I think we should once see the possibility that such that we could
> > switch to this new context in start stream message and reset it in
> > stop stream message.  That might help in avoiding
> > MemoryContextSwitchTo TopMemoryContext at various places.
>
> Ok, I understand, I think subxacts can be allocated in new
> LogicalStreamingContext which we can reset at the stream stop.  How
> about xids?
>

How about storing xids in ApplyContext?  We do store similar lifespan
things in that context, for ex. see store_flush_position.
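
For illustration, switching that allocation into ApplyContext might look like
this (a sketch; the helper name and max_xids are placeholders):

static TransactionId *xids = NULL;    /* lives for the worker's lifetime */

static void
ensure_xids_array(int max_xids)
{
    if (xids == NULL)
    {
        MemoryContext oldctx = MemoryContextSwitchTo(ApplyContext);

        xids = palloc0(max_xids * sizeof(TransactionId));
        MemoryContextSwitchTo(oldctx);
    }
}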

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, May 28, 2020 at 3:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, May 28, 2020 at 12:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 26, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Why we need memory-related to subxacts till the worker is alive?  As
> > > we have now, after reading subxact info (subxact_info_read), we need
> > > to ensure that it is freed after its usage due to which we need to
> > > remember and perform pfree at various places.
> > >
> > > I think we should once see the possibility that such that we could
> > > switch to this new context in start stream message and reset it in
> > > stop stream message.  That might help in avoiding
> > > MemoryContextSwitchTo TopMemoryContext at various places.
> >
> > Ok, I understand, I think subxacts can be allocated in new
> > LogicalStreamingContext which we can reset at the stream stop.  How
> > about xids?
> >
>
> How about storing xids in ApplyContext?  We do store similar lifespan
> things in that context, for ex. see store_flush_position.

That sounds good to me,   I will make this change in the next patch
set, along with other changes.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, May 28, 2020 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, May 28, 2020 at 12:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 26, 2020 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Isn't this problem only for subxact file as we anyway create changes
> > > file as part of start stream message which should have come after
> > > abort?  If so, can't we detect whether subxact file exists probably by
> > > using nsubxacts or something like that?  Can you please once try to
> > > reproduce this scenario to ensure that we are not missing anything?
> >
> > I have tested this, as of now, by default we create both changes and
> > subxact files irrespective of whether we get any subtransactions or
> > not.  Maybe this could be optimized that only if we have any subxact
> > then only create that file otherwise not?  What's your opinion on the
> > same.
> >
>
> Yeah, that makes sense.
>
> > > > > > > 8.
> > > > > > > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> > > > > > > *rb, TransactionId xid,
> > > > > > >   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> > > > > > >
> > > > > > >   txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > > > > > +
> > > > > > > + /*
> > > > > > > + * TOCHECK: Mark toplevel transaction as having catalog changes too
> > > > > > > + * if one of its children has.
> > > > > > > + */
> > > > > > > + if (txn->toptxn != NULL)
> > > > > > > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > > > > >  }
> > > > > > >
> > > > > > > Why are we marking top transaction here?
> > > > > >
> > > > > > We need to mark top transaction to decide whether to build tuplecid
> > > > > > hash or not.  In non-streaming mode, we are only sending during the
> > > > > > commit time, and during commit time we know whether the top
> > > > > > transaction has any catalog changes or not based on the invalidation
> > > > > > message so we are marking the top transaction there in DecodeCommit.
> > > > > > Since here we are not waiting till commit so we need to mark the top
> > > > > > transaction as soon as we mark any of its child transactions.
> > > > > >
> > > > >
> > > > > But how does it help?  We use this flag (via
> > > > > ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn which is
> > > > > anyway done in DecodeCommit and that too after setting this flag for
> > > > > the top transaction if required.  So, how will it help in setting it
> > > > > while processing for subxid.  Also, even if we have to do it won't it
> > > > > add the xid needlessly in builder->committed.xip array?
> > > >
> > > > In ReorderBufferBuildTupleCidHash, we use this flag to decide whether
> > > > to build the tuplecid hash or not based on whether it has catalog
> > > > changes or not.
> > > >
> > >
> > > Okay, but you haven't answered the second part of the question: "won't
> > > it add the xid of top transaction needlessly in builder->committed.xip
> > > array, see function SnapBuildCommitTxn?"  IIUC, this can happen
> > > without patch as well because DecodeCommit also sets the flags just
> > > based on invalidation messages irrespective of whether the messages
> > > are generated by top transaction or not, is that right?
> >
> > Yes, with or without the patch it always adds the topxid.  I think
> > purpose for doing this with/without patch is not for the snapshot
> > instead we are marking the top itself that some of its subtxn has the
> > catalog changes so that while building the tuplecid has we can know
> > whether to build the hash or not.  But, having said that I feel in
> > ReorderBufferBuildTupleCidHash why do we need these two checks
> > if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> > return;
> >
> > I mean it should be enough to just have the check,  because if we have
> > added something to the tuplecids then catalog changes must be there
> > because that time we are setting the catalog changes to true.
> >
> > if (dlist_is_empty(&txn->tuplecids))
> > return;
> >
> > I think in the base code there are multiple things going on
> > 1. If we get new CID we always set the catalog change in that
> > transaction but add the tuplecids in the top transaction.  So
> > basically, top transaction is so far not marked with catalog changes
> > but it has tuplecids.
> > 2. Now, in DecodeCommit the top xid will be marked that it has catalog
> > changes based on the invalidation messages.
> >
>
> I don't think it is advisable to remove that check from base code
> unless we have a strong reason for doing so.  I think here you can
> write better comments about why you are marking the flag for top
> transaction and remove TOCHECK from the comment.

Ok, I will do that.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, May 27, 2020 at 8:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, May 26, 2020 at 7:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> >
> > Okay, sending again.
>
> While reviewing/testing I have found a couple of problems in 0005 and
> 0006 which I have fixed in the attached version.
>

I haven't reviewed the new fixes yet but I have some comments on
0008-Add-support-for-streaming-to-built-in-replicatio.patch.
1.
I think the temporary files (and or handles) used for storing the
information of changes and subxacts are getting leaked in the patch.
At some places, it is taken care to close the file but cases like
apply_handle_stream_commit where if any error occurred in
apply_dispatch(), the file might not get closed.  The other place is
in apply_handle_stream_abort() where if there is an error in ftruncate
the file won't be closed.   Now, the bigger problem is with changes
related file which is opened in apply_handle_stream_start and closed
in apply_handle_stream_stop and if there is any error in-between, we
won't close it.

OTOH, I think the worker will exit on an error so it might not matter,
but then why are we closing it before the error at a few other places?
I think on error these temporary files should be removed instead of
relying on them to get removed the next time we receive changes for the
same transaction, which I feel is what we do in other cases where we
use temporary files, like for sorts or hash joins.

Also, what if the changes file size overflows "OS file size limit"?
If we agree that the above are problems then do you think we should
explore using BufFile interface (see storage/file/buffile.c) to avoid
all such problems?
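
For context, the BufFile interface looks roughly like this (a toy sketch; the
exact signatures may differ a bit between versions).  Because BufFile
transparently splits data across 1GB segment files and removes its temporary
files on close, it would sidestep both the size-limit and the leak concerns:

#include "postgres.h"
#include "storage/buffile.h"

static void
buffile_roundtrip(void)
{
    BufFile    *f = BufFileCreateTemp(false);   /* temp file, removed on close */
    int         v = 42;
    int         r = 0;

    BufFileWrite(f, &v, sizeof(v));
    BufFileSeek(f, 0, 0L, SEEK_SET);            /* rewind: fileno 0, offset 0 */
    BufFileRead(f, &r, sizeof(r));
    BufFileClose(f);
}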

2.
apply_handle_stream_abort()
{
..
+ /* discard the subxacts added later */
+ nsubxacts = subidx;
+
+ /* write the updated subxact list */
+ subxact_info_write(MyLogicalRepWorker->subid, xid);
..
}

Here, if subxacts becomes zero, then also subxact_info_write will
create a new file and write checksum.  I think subxact_info_write
should have a check for nsubxacts > 0 before writing to the file.

3.
apply_handle_stream_commit(StringInfo s)
{
..
+ /*
+ * send feedback to upstream
+ *
+ * XXX Probably should send a valid LSN. But which one?
+ */
+ send_feedback(InvalidXLogRecPtr, false, false);
..
}

Why do we need to send the feedback at this stage after applying each
message?  If we see a non-streamed case, we never send_feedback after
each message. So, following that, I don't see the need to send it here
but if you see any specific reason then do let me know?  And if we
have to send feedback, then we need to decide the appropriate values
as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, May 27, 2020 at 8:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>
> While reviewing/testing I have found a couple of problems in 0005 and
> 0006 which I have fixed in the attached version.
>
..
>
> In 0006: If we are streaming the serialized changed and there are
> still few incomplete changes, then currently we are not deleting the
> spilled file, but the spill file contains all the changes of the
> transaction because there is no way to partially truncate it.  So in
> the next stream, it will try to resend those.  I have fixed this by
> sending the spilled transaction as soon as its changes are complete so
> ideally, we can always delete the spilled file.  It is also a better
> solution because this transaction is already spilled once and that
> happened because we could not stream it,  so we better stream it on
> the first opportunity that will reduce the replay lag which is our
> whole purpose here.
>

I have reviewed these changes (in the patch
v25-0006-Bugfix-handling-of-incomplete-toast-spec-insert-) and below
are my comments.

1.
+ /*
+ * If the transaction is serialized and the the changes are complete in
+ * the top level transaction then immediately stream the transaction.
+ * The reason for not waiting for memory limit to get full is that in
+ * the streaming mode, if the transaction serialized that means we have
+ * already reached the memory limit but that time we could not stream
+ * this due to incomplete tuple so now stream it as soon as the tuple
+ * is complete.
+ */
+ if (rbtxn_is_serialized(txn))
+ ReorderBufferStreamTXN(rb, toptxn);

I think here it is important to explain why it is a must to stream a
prior serialized transaction as otherwise, later we won't be able to
know how to truncate a file.

2.
+ * If complete_truncate is set we completely truncate the transaction,
+ * otherwise we truncate upto last_complete_lsn if the transaction has
+ * incomplete changes.  Basically, complete_truncate is passed true only if
+ * concurrent abort is detected while processing the TXN.
  */
 static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+ bool partial_truncate)
 {

The description talks about complete_truncate flag whereas API is
using partial_truncate flag.  I think the description needs to be
changed.

3.
+ /* We have truncated upto last complete lsn so stop. */
+ if (partial_truncate && rbtxn_has_incomplete_tuple(toptxn) &&
+ (change->lsn > toptxn->last_complete_lsn))
+ {
+ /*
+ * If this is a top transaction then we can reset the
+ * last_complete_lsn and complete_size, because by now we would
+ * have stream all the changes upto last_complete_lsn.
+ */
+ if (txn->toptxn == NULL)
+ {
+ toptxn->last_complete_lsn = InvalidXLogRecPtr;
+ toptxn->complete_size = 0;
+ }
+ break;
+ }

I think here we can add an Assert to ensure that we don't partially
truncate when the transaction is serialized and add comments for the
same.
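
The suggested Assert could be as simple as (sketch):

/* A serialized transaction should never be truncated partially. */
Assert(!(partial_truncate && rbtxn_is_serialized(txn)));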

4.
+ /*
+ * Subtract the processed changes from the nentries/nentries_mem Refer
+ * detailed comment atop this variable in ReorderBufferTXN structure.
+ * We do this only ff we are truncating the partial changes otherwise
+ * reset these values directly to 0.
+ */
+ if (partial_truncate)
+ {
+ txn->nentries -= txn->nprocessed;
+ txn->nentries_mem -= txn->nprocessed;
+ }
+ else
+ {
+ txn->nentries = 0;
+ txn->nentries_mem = 0;
+ }

I think we can write this comment as "Adjust nentries/nentries_mem
based on the changes processed.  See comments where nprocessed is
declared."

5.
+ /*
+ * In streaming mode, sometime we can't stream all the changes due to the
+ * incomplete changes.  So we can not directly reset the values of
+ * nentries/nentries_mem to 0 after one stream is sent like we do in
+ * non-streaming mode.  So while sending one stream we keep count of the
+ * changes processed in thi stream and only those many changes we decrement
+ * from the nentries/nentries_mem.
+ */
+ uint64 nprocessed;

How about something like: "Number of changes processed.  This is used
to keep track of changes that remained to be streamed.  As of now,
this can happen either due to toast tuples or speculative insertions
where we need to wait for multiple changes before we can send them."

6.
+ /* Size of the commplete changes. */
+ Size complete_size;

Typo. /commplete/complete

7.
+ /*
+ * Increment the nprocessed count.  See the detailed comment
+ * for usage of this in ReorderBufferTXN structure.
+ */
+ change->txn->nprocessed++;

Ideally, this has to be incremented after processing the change.  So,
we can combine it with existing check in the patch as below:

if (streaming)
{
   change->txn->nprocessed++;

  if (rbtxn_has_incomplete_tuple(txn) &&
prev_lsn == txn->last_complete_lsn)
{
/* Only in streaming mode we should get here. */
Assert(streaming);
partial_truncate = true;
break;
}
}

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, May 27, 2020 at 5:19 PM Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
> On Tue, 26 May 2020 at 16:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Hi all,
> On top of the v16 patch set [1], I did some testing for DDLs and DMLs to measure WAL size and performance.
>
> [Test parameters and results table trimmed; see the previous message.]
>
> To see all operations, please see the test_results spreadsheet [2].
>

Why are you seeing any additional WAL in case-13 (10 DML) where there is no DDL?  I think it is because you have used savepoints in that case, which will add some additional WAL.  You seem to have 9 savepoints in that test, which should ideally generate 36 bytes of additional WAL (4 bytes per transaction id for each subtransaction).  Also, in the other cases where you took data for DDL and DML, you have used savepoints as well.  I suggest that for savepoints we do separate tests as you have done in case-13, but with 3, 5, 7, and 10 savepoints, and probably each transaction can update a row of 200 bytes or so.

I think you can take data for somewhat more realistic cases of DDL and DML combination like 3 DDL's with 10 DML and 3 DDL's with 15 DML operations.  In general, I think we will see many more DML's per DDL.  It is good to see the worst-case WAL and performance overhead as you have done.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Fri, May 29, 2020 at 2:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, May 27, 2020 at 8:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> >
> > While reviewing/testing I have found a couple of problems in 0005 and
> > 0006 which I have fixed in the attached version.
> >
> ..
> >
> > In 0006: If we are streaming the serialized changed and there are
> > still few incomplete changes, then currently we are not deleting the
> > spilled file, but the spill file contains all the changes of the
> > transaction because there is no way to partially truncate it.  So in
> > the next stream, it will try to resend those.  I have fixed this by
> > sending the spilled transaction as soon as its changes are complete so
> > ideally, we can always delete the spilled file.  It is also a better
> > solution because this transaction is already spilled once and that
> > happened because we could not stream it,  so we better stream it on
> > the first opportunity that will reduce the replay lag which is our
> > whole purpose here.
> >
>
> I have reviewed these changes (in the patch
> v25-0006-Bugfix-handling-of-incomplete-toast-spec-insert-) and below
> are my comments.
>
> 1.
> + /*
> + * If the transaction is serialized and the the changes are complete in
> + * the top level transaction then immediately stream the transaction.
> + * The reason for not waiting for memory limit to get full is that in
> + * the streaming mode, if the transaction serialized that means we have
> + * already reached the memory limit but that time we could not stream
> + * this due to incomplete tuple so now stream it as soon as the tuple
> + * is complete.
> + */
> + if (rbtxn_is_serialized(txn))
> + ReorderBufferStreamTXN(rb, toptxn);
>
> I think here it is important to explain why it is a must to stream a
> prior serialized transaction as otherwise, later we won't be able to
> know how to truncate a file.

Done

> 2.
> + * If complete_truncate is set we completely truncate the transaction,
> + * otherwise we truncate upto last_complete_lsn if the transaction has
> + * incomplete changes.  Basically, complete_truncate is passed true only if
> + * concurrent abort is detected while processing the TXN.
>   */
>  static void
> -ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> +ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
> + bool partial_truncate)
>  {
>
> The description talks about complete_truncate flag whereas API is
> using partial_truncate flag.  I think the description needs to be
> changed.

Fixed

> 3.
> + /* We have truncated upto last complete lsn so stop. */
> + if (partial_truncate && rbtxn_has_incomplete_tuple(toptxn) &&
> + (change->lsn > toptxn->last_complete_lsn))
> + {
> + /*
> + * If this is a top transaction then we can reset the
> + * last_complete_lsn and complete_size, because by now we would
> + * have stream all the changes upto last_complete_lsn.
> + */
> + if (txn->toptxn == NULL)
> + {
> + toptxn->last_complete_lsn = InvalidXLogRecPtr;
> + toptxn->complete_size = 0;
> + }
> + break;
> + }
>
> I think here we can add an Assert to ensure that we don't partially
> truncate when the transaction is serialized and add comments for the
> same.

Done

> 4.
> + /*
> + * Subtract the processed changes from the nentries/nentries_mem Refer
> + * detailed comment atop this variable in ReorderBufferTXN structure.
> + * We do this only ff we are truncating the partial changes otherwise
> + * reset these values directly to 0.
> + */
> + if (partial_truncate)
> + {
> + txn->nentries -= txn->nprocessed;
> + txn->nentries_mem -= txn->nprocessed;
> + }
> + else
> + {
> + txn->nentries = 0;
> + txn->nentries_mem = 0;
> + }
>
> I think we can write this comment as "Adjust nentries/nentries_mem
> based on the changes processed.  See comments where nprocessed is
> declared."
>
> 5.
> + /*
> + * In streaming mode, sometime we can't stream all the changes due to the
> + * incomplete changes.  So we can not directly reset the values of
> + * nentries/nentries_mem to 0 after one stream is sent like we do in
> + * non-streaming mode.  So while sending one stream we keep count of the
> + * changes processed in thi stream and only those many changes we decrement
> + * from the nentries/nentries_mem.
> + */
> + uint64 nprocessed;
>
> How about something like: "Number of changes processed.  This is used
> to keep track of changes that remained to be streamed.  As of now,
> this can happen either due to toast tuples or speculative insertions
> where we need to wait for multiple changes before we can send them."

Done

> 6.
> + /* Size of the commplete changes. */
> + Size complete_size;
>
> Typo. /commplete/complete
>
> 7.
> + /*
> + * Increment the nprocessed count.  See the detailed comment
> + * for usage of this in ReorderBufferTXN structure.
> + */
> + change->txn->nprocessed++;
>
> Ideally, this has to be incremented after processing the change.  So,
> we can combine it with existing check in the patch as below:
>
> if (streaming)
> {
>    change->txn->nprocessed++;
>
>   if (rbtxn_has_incomplete_tuple(txn) &&
> prev_lsn == txn->last_complete_lsn)
> {
> /* Only in streaming mode we should get here. */
> Assert(streaming);
> partial_truncate = true;
> break;
> }
> }

Done

Apart from this, there was one more issue in this patch
+ if (partial_truncate && rbtxn_has_incomplete_tuple(toptxn) &&
+ (change->lsn > toptxn->last_complete_lsn))
+ {
+ /*
+ * If this is a top transaction then we can reset the
+ * last_complete_lsn and complete_size, because by now we would
+ * have stream all the changes upto last_complete_lsn.
+ */
+ if (txn->toptxn == NULL)
+ {
+ toptxn->last_complete_lsn = InvalidXLogRecPtr;
+ toptxn->complete_size = 0;
+ }
+ break;

We should reset toptxn->last_complete_lsn and toptxn->complete_size
outside this (change->lsn > toptxn->last_complete_lsn) check, because
we might be in a subxact when we meet this condition; in that case we
never reach here for the toptxn and it would never get reset.  I have
fixed this.

Apart from this, there is one more fix in 0005: basically, CheckLiveXid
was never reset, so I have fixed that as well.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, May 28, 2020 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, May 28, 2020 at 12:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 26, 2020 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Isn't this problem only for subxact file as we anyway create changes
> > > file as part of start stream message which should have come after
> > > abort?  If so, can't we detect whether subxact file exists probably by
> > > using nsubxacts or something like that?  Can you please once try to
> > > reproduce this scenario to ensure that we are not missing anything?
> >
> > I have tested this, as of now, by default we create both changes and
> > subxact files irrespective of whether we get any subtransactions or
> > not.  Maybe this could be optimized that only if we have any subxact
> > then only create that file otherwise not?  What's your opinion on the
> > same.
> >
>
> Yeah, that makes sense.
>
> > > > > > > 8.
> > > > > > > @@ -2295,6 +2677,13 @@ ReorderBufferXidSetCatalogChanges(ReorderBuffer
> > > > > > > *rb, TransactionId xid,
> > > > > > >   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> > > > > > >
> > > > > > >   txn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > > > > > +
> > > > > > > + /*
> > > > > > > + * TOCHECK: Mark toplevel transaction as having catalog changes too
> > > > > > > + * if one of its children has.
> > > > > > > + */
> > > > > > > + if (txn->toptxn != NULL)
> > > > > > > + txn->toptxn->txn_flags |= RBTXN_HAS_CATALOG_CHANGES;
> > > > > > >  }
> > > > > > >
> > > > > > > Why are we marking top transaction here?
> > > > > >
> > > > > > We need to mark top transaction to decide whether to build tuplecid
> > > > > > hash or not.  In non-streaming mode, we are only sending during the
> > > > > > commit time, and during commit time we know whether the top
> > > > > > transaction has any catalog changes or not based on the invalidation
> > > > > > message so we are marking the top transaction there in DecodeCommit.
> > > > > > Since here we are not waiting till commit so we need to mark the top
> > > > > > transaction as soon as we mark any of its child transactions.
> > > > > >
> > > > >
> > > > > But how does it help?  We use this flag (via
> > > > > ReorderBufferXidHasCatalogChanges) in SnapBuildCommitTxn which is
> > > > > anyway done in DecodeCommit and that too after setting this flag for
> > > > > the top transaction if required.  So, how will it help in setting it
> > > > > while processing for subxid.  Also, even if we have to do it won't it
> > > > > add the xid needlessly in builder->committed.xip array?
> > > >
> > > > In ReorderBufferBuildTupleCidHash, we use this flag to decide whether
> > > > to build the tuplecid hash or not based on whether it has catalog
> > > > changes or not.
> > > >
> > >
> > > Okay, but you haven't answered the second part of the question: "won't
> > > it add the xid of top transaction needlessly in builder->committed.xip
> > > array, see function SnapBuildCommitTxn?"  IIUC, this can happen
> > > without patch as well because DecodeCommit also sets the flags just
> > > based on invalidation messages irrespective of whether the messages
> > > are generated by top transaction or not, is that right?
> >
> > Yes, with or without the patch it always adds the topxid.  I think
> > purpose for doing this with/without patch is not for the snapshot
> > instead we are marking the top itself that some of its subtxn has the
> > catalog changes so that while building the tuplecid has we can know
> > whether to build the hash or not.  But, having said that I feel in
> > ReorderBufferBuildTupleCidHash why do we need these two checks
> > if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
> > return;
> >
> > I mean it should be enough to just have the check,  because if we have
> > added something to the tuplecids then catalog changes must be there
> > because that time we are setting the catalog changes to true.
> >
> > if (dlist_is_empty(&txn->tuplecids))
> > return;
> >
> > I think in the base code there are multiple things going on
> > 1. If we get new CID we always set the catalog change in that
> > transaction but add the tuplecids in the top transaction.  So
> > basically, top transaction is so far not marked with catalog changes
> > but it has tuplecids.
> > 2. Now, in DecodeCommit the top xid will be marked that it has catalog
> > changes based on the invalidation messages.
> >
>
> I don't think it is advisable to remove that check from base code
> unless we have a strong reason for doing so.  I think here you can
> write better comments about why you are marking the flag for top
> transaction and remove TOCHECK from the comment.

Done.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 26, 2020 at 4:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, May 26, 2020 at 2:44 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 26, 2020 at 10:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > >
> > > > 2. There is a bug fix in handling the stream abort in 0008 (earlier it
> > > > was 0006).
> > > >
> > >
> > > The code changes look fine but it is not clear what was the exact
> > > issue.  Can you explain?
> >
> > Basically, in case of an empty subtransaction, we were reading the
> > subxacts info but when we could not find the subxid in the subxacts
> > info we were not releasing the memory.  So on next subxact_info_read
> > it will expect that subxacts should be freed but we did not free it in
> > that !found case.
> >
>
> Okay, on looking at it again, the same code exists in
> subxact_info_write as well.  It is better to have a function for it.
> Can we have a structure like SubXactContext for all the variables used
> for subxact?  As mentioned earlier I find the allocation/deallocation
> of subxacts a bit ad-hoc, so there will always be a chance that we can
> forget to free it.  Having it allocated in memory context which we can
> reset later might reduce that risk.  One idea could be that we have a
> special memory context for start and stop messages which can be used
> to allocate the subxacts there.  In case of commit/abort, we can allow
> subxacts information to be allocated in ApplyMessageContext which is
> reset at the end of each protocol message.

Changed as per this.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, May 26, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 25, 2020 at 8:07 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, May 22, 2020 at 11:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > 4.
> > > + * XXX Do we need to allocate it in TopMemoryContext?
> > > + */
> > > +static void
> > > +subxact_info_add(TransactionId xid)
> > > {
> > > ..
> > >
> > > For this and other places in a patch like in function
> > > stream_open_file(), instead of using TopMemoryContext, can we consider
> > > using a new memory context LogicalStreamingContext or something like
> > > that. We can create LogicalStreamingContext under TopMemoryContext.  I
> > > don't see any need of using TopMemoryContext here.
> >
> > But, when we will delete/reset the LogicalStreamingContext?
> >
>
> Why can't we reset it at each stream stop message?

Done this

>
> >  because
> > we are planning to keep this memory until the worker is alive so that
> > supposed to be the top memory context.
> >
>
> Which part of allocation do we want to keep till the worker is alive?
> Why we need memory-related to subxacts till the worker is alive?  As
> we have now, after reading subxact info (subxact_info_read), we need
> to ensure that it is freed after its usage due to which we need to
> remember and perform pfree at various places.
>
> I think we should once see the possibility that such that we could
> switch to this new context in start stream message and reset it in
> stop stream message.  That might help in avoiding
> MemoryContextSwitchTo TopMemoryContext at various places.
>
> >  If we create any other context
> > with the same life span as TopMemoryContext then what is the point?
> >
>
> It is helpful for debugging.  It is recommended that we don't use the
> top memory context unless it is really required.  Read about it in
> src/backend/utils/mmgr/README.

xids is now allocated in ApplyContext

> > > 8.
> > > + * XXX Maybe we should only include the checksum when the cluster is
> > > + * initialized with checksums?
> > > + */
> > > +static void
> > > +subxact_info_write(Oid subid, TransactionId xid)
> > >
> > > Do we really need to have the checksum for temporary files? I have
> > > checked a few other similar cases like SharedFileSet stuff for
> > > parallel hash join but didn't find them using checksums.  Can you also
> > > once see other usages of temporary files and then let us decide if we
> > > see any reason to have checksums for this?
> >
> > Yeah, even I can see other places checksum is not used.
> >
>
> So, unless someone speaks up before you are ready for the next version
> of the patch, can we remove it?

Done
> > > Another point is we don't seem to be doing this for 'changes' file,
> > > see stream_write_change.  So, not sure, there is any sense to write
> > > checksum for subxact file.
> >
> > I can see there are comment atop this function
> >
> > * XXX The subxact file includes CRC32C of the contents. Maybe we should
> > * include something like that here too, but doing so will not be as
> > * straighforward, because we write the file in chunks.
> >
>
> You can remove this comment as well.  I don't know how advantageous it
> is to checksum temporary files.  We can anyway add it later if there
> is a reason for doing so.

Done

> >
> > > 12.
> > > maybe_send_schema()
> > > {
> > > ..
> > > + if (in_streaming)
> > > + {
> > > + /*
> > > + * TOCHECK: We have to send schema after each catalog change and it may
> > > + * occur when streaming already started, so we have to track new catalog
> > > + * changes somehow.
> > > + */
> > > + schema_sent = get_schema_sent_in_streamed_txn(relentry, topxid);
> > > ..
> > > ..
> > > }
> > >
> > > I think it is good to once verify/test what this comment says but as
> > > per code we should be sending the schema after each catalog change as
> > > we invalidate the streamed_txns list in rel_sync_cache_relation_cb
> > > which must be called during relcache invalidation.  Do we see any
> > > problem with that mechanism?
> >
> > I have tested this, I think we are already sending the schema after
> > each catalog change.
> >
>
> Then remove "TOCHECK" in the above comment.

Done


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, May 28, 2020 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, May 27, 2020 at 8:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, May 26, 2020 at 7:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > >
> > > Okay, sending again.
> >
> > While reviewing/testing I have found a couple of problems in 0005 and
> > 0006 which I have fixed in the attached version.
> >
>
> I haven't reviewed the new fixes yet but I have some comments on
> 0008-Add-support-for-streaming-to-built-in-replicatio.patch.
> 1.
> I think the temporary files (and or handles) used for storing the
> information of changes and subxacts are getting leaked in the patch.
> At some places, it is taken care to close the file but cases like
> apply_handle_stream_commit where if any error occurred in
> apply_dispatch(), the file might not get closed.  The other place is
> in apply_handle_stream_abort() where if there is an error in ftruncate
> the file won't be closed.   Now, the bigger problem is with changes
> related file which is opened in apply_handle_stream_start and closed
> in apply_handle_stream_stop and if there is any error in-between, we
> won't close it.
>
> OTOH, I think the worker will exit on an error so it might not matter
> but then why we are at few other places we are closing it before the
> error?  I think on error these temporary files should be removed
> instead of relying on them to get removed next time when we receive
> changes for the same transaction which I feel is what we do in other
> cases where we use temporary files like for sorts or hashjoins.
>
> Also, what if the changes file size overflows "OS file size limit"?
> If we agree that the above are problems then do you think we should
> explore using BufFile interface (see storage/file/buffile.c) to avoid
> all such problems?

I also think that the file size is a problem.  I think we can use
BufFile with some modifications.  We can not use BufFileCreateTemp,
for a few reasons:
1) files get deleted on close, but we have to open/close on every
stream start/stop.
2) even if we try to avoid closing, we would need to keep the BufFile
pointers around (each of which carries an 8 kB buffer) because there
is no option to pass the file name.

I think for our use case BufFileCreateShared is more suitable.  I
think we need to do some modifications so that we can use these APIs
without a SharedFileSet.  Otherwise, we would unnecessarily need to
create a SharedFileSet for each transaction and also need to maintain
it in an xid array or xid hash until transaction commit/abort.  So I
suggest the following modifications to the shared file set so that we
can conveniently use it (a rough sketch follows the list):
1. ChooseTablespace(const SharedFileSet *fileset, const char *name)
   if fileset is NULL then select DEFAULTTABLESPACE_OID
2. SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace)
   If fileset is NULL then in the directory path we can use MyProcPid
   or something instead of fileset->creator_pid.
3. Pass some parameter to BufFileOpenShared, so that it can open the
   file in RW mode instead of read-only mode.
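
To make the first two items concrete, here is a rough, untested sketch
of the NULL-fileset handling (the existing bodies of these functions in
src/backend/storage/file/sharedfileset.c are reproduced from memory, so
treat them as approximate):

static Oid
ChooseTablespace(const SharedFileSet *fileset, const char *name)
{
	uint32		hash;

	/* proposed: no fileset given, just fall back to the default tablespace */
	if (fileset == NULL)
		return DEFAULTTABLESPACE_OID;

	hash = hash_any((const unsigned char *) name, strlen(name));
	return fileset->tablespaces[hash % fileset->ntablespaces];
}

static void
SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace)
{
	char		tempdirpath[MAXPGPATH];

	TempTablespacePath(tempdirpath, tablespace);

	/* proposed: without a fileset, derive the directory name from our own PID */
	snprintf(path, MAXPGPATH, "%s/%s%lu.%u.sharedfileset",
			 tempdirpath, PG_TEMP_FILE_PREFIX,
			 fileset ? (unsigned long) fileset->creator_pid : (unsigned long) MyProcPid,
			 fileset ? fileset->number : 0);
}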


> 2.
> apply_handle_stream_abort()
> {
> ..
> + /* discard the subxacts added later */
> + nsubxacts = subidx;
> +
> + /* write the updated subxact list */
> + subxact_info_write(MyLogicalRepWorker->subid, xid);
> ..
> }
>
> Here, if subxacts becomes zero, then also subxact_info_write will
> create a new file and write checksum.

How will it create a new file?  In fact, it will write nsubxacts as 0
in the existing file, and I think we need to do that so that on the
next open we will know that nsubxacts is 0.

>   I think subxact_info_write
> should have a check for nsubxacts > 0 before writing to the file.

But, even if nsubxacts becomes 0, we want to write the file so that we
can overwrite the previous info.

> 3.
> apply_handle_stream_commit(StringInfo s)
> {
> ..
> + /*
> + * send feedback to upstream
> + *
> + * XXX Probably should send a valid LSN. But which one?
> + */
> + send_feedback(InvalidXLogRecPtr, false, false);
> ..
> }
>
> Why do we need to send the feedback at this stage after applying each
> message?  If we see a non-streamed case, we never send_feedback after
> each message. So, following that, I don't see the need to send it here
> but if you see any specific reason then do let me know?  And if we
> have to send feedback, then we need to decide the appropriate values
> as well.

Let me put more thought into this and then I will get back to you.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Amit Kapila
Дата:
On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, May 28, 2020 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Also, what if the changes file size overflows "OS file size limit"?
> > If we agree that the above are problems then do you think we should
> > explore using BufFile interface (see storage/file/buffile.c) to avoid
> > all such problems?
>
> I also think that the file size is a problem.  I think we can use
> BufFile with some modifications.  We can not use the
> BufFileCreateTemp, because of few reasons
> 1) files get deleted on close, but we have to open/close on every
> stream start/stop.
> 2) even if we try to avoid closing we need to the BufFile pointers
> (which take 8192k per file) because there is no option to pass the
> file name.
>
> I thin for our use case BufFileCreateShared is more suitable.  I think
> we need to do some modifications so that we can use these apps without
> SharedFileSet. Otherwise, we need to unnecessarily need to create
> SharedFileSet for each transaction and also need to maintain it in xid
> array or xid hash until transaction commit/abort.  So I suggest
> following modifications in shared files set so that we can
> conveniently use it.
> 1. ChooseTablespace(const SharedFileSet fileset, const char name)
>   if fileset is NULL then select the DEFAULTTABLESPACEOID
> 2. SharedFileSetPath(char path, SharedFileSet fileset, Oid tablespace)
> If fileset is NULL then in directory path we can use MyProcPID or
> something instead of fileset->creator_pid.
>

Hmm, I find these modifications a bit ad-hoc.  So, not sure if it is
better than the patch maintains sharedfileset information.

> 3. Pass some parameter to BufFileOpenShared, so that it can open the
> file in RW mode instead of read-only mode.
>

This seems okay.

>
> > 2.
> > apply_handle_stream_abort()
> > {
> > ..
> > + /* discard the subxacts added later */
> > + nsubxacts = subidx;
> > +
> > + /* write the updated subxact list */
> > + subxact_info_write(MyLogicalRepWorker->subid, xid);
> > ..
> > }
> >
> > Here, if subxacts becomes zero, then also subxact_info_write will
> > create a new file and write checksum.
>
> How, will it create the new file, in fact it will write nsubxacts as 0
> in the existing file, and I think we need to do that right so that in
> next open we will know that the nsubxact is 0.
>
>   I think subxact_info_write
> > should have a check for nsubxacts > 0 before writing to the file.
>
> But, even if nsubxacts become 0 we want to write the file so that we
> can overwrite the previous info.
>

Can't we just remove the file for such a case?

apply_handle_stream_abort()
{
..
+ /* XXX optimize the search by bsearch on sorted data */
+ for (i = nsubxacts; i > 0; i--)
+ {
+ if (subxacts[i - 1].xid == subxid)
+ {
+ subidx = (i - 1);
+ found = true;
+ break;
+ }
+ }
+
+ /*
+ * If it's an empty sub-transaction then we will not find the subxid
+ * here so just free the memory and return.
+ */
+ if (!found)
+ {
+ /* Free the subxacts memory */
+ if (subxacts)
+ pfree(subxacts);
+
+ subxacts = NULL;
+ subxact_last = InvalidTransactionId;
+ nsubxacts = 0;
+ nsubxacts_max = 0;
+
+ return;
+ }
..
}

I have one question regarding the above code.  Isn't it possible that
a particular subtransaction id doesn't have any change but others do
we have?  For ex. cases like below:

postgres=# begin;
BEGIN
postgres=*# insert into t1 values(1);
INSERT 0 1
postgres=*# savepoint s1;
SAVEPOINT
postgres=*# savepoint s2;
SAVEPOINT
postgres=*# insert into t1 values(2);
INSERT 0 1
postgres=*# insert into t1 values(3);
INSERT 0 1
postgres=*# Rollback to savepoint s1;
ROLLBACK
postgres=*# commit;

Here, we have performed Rolledback to savepoint s1 which doesn't have
any change of its own.  I think this would have handled but just
wanted to confirm.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Tue, Jun 2, 2020 at 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, May 28, 2020 at 5:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Also, what if the changes file size overflows "OS file size limit"?
> > > If we agree that the above are problems then do you think we should
> > > explore using BufFile interface (see storage/file/buffile.c) to avoid
> > > all such problems?
> >
> > I also think that the file size is a problem.  I think we can use
> > BufFile with some modifications.  We can not use the
> > BufFileCreateTemp, because of few reasons
> > 1) files get deleted on close, but we have to open/close on every
> > stream start/stop.
> > 2) even if we try to avoid closing we need to the BufFile pointers
> > (which take 8192k per file) because there is no option to pass the
> > file name.
> >
> > I thin for our use case BufFileCreateShared is more suitable.  I think
> > we need to do some modifications so that we can use these apps without
> > SharedFileSet. Otherwise, we need to unnecessarily need to create
> > SharedFileSet for each transaction and also need to maintain it in xid
> > array or xid hash until transaction commit/abort.  So I suggest
> > following modifications in shared files set so that we can
> > conveniently use it.
> > 1. ChooseTablespace(const SharedFileSet fileset, const char name)
> >   if fileset is NULL then select the DEFAULTTABLESPACEOID
> > 2. SharedFileSetPath(char path, SharedFileSet fileset, Oid tablespace)
> > If fileset is NULL then in directory path we can use MyProcPID or
> > something instead of fileset->creator_pid.
> >
>
> Hmm, I find these modifications a bit ad-hoc.  So, not sure if it is
> better than the patch maintains sharedfileset information.

I think we might do something better here, maybe by supplying a
function pointer or so, but maintaining a SharedFileSet, which contains
a tablespace/mutex that we don't need at all for our purpose, also
doesn't sound very appealing.  Let me see whether I can come up with
some clean way of avoiding the need for a SharedFileSet; if not, then
maybe we can go with the shared fileset idea.

> > 3. Pass some parameter to BufFileOpenShared, so that it can open the
> > file in RW mode instead of read-only mode.
> >
>
> This seems okay.
>
> >
> > > 2.
> > > apply_handle_stream_abort()
> > > {
> > > ..
> > > + /* discard the subxacts added later */
> > > + nsubxacts = subidx;
> > > +
> > > + /* write the updated subxact list */
> > > + subxact_info_write(MyLogicalRepWorker->subid, xid);
> > > ..
> > > }
> > >
> > > Here, if subxacts becomes zero, then also subxact_info_write will
> > > create a new file and write checksum.
> >
> > How, will it create the new file, in fact it will write nsubxacts as 0
> > in the existing file, and I think we need to do that right so that in
> > next open we will know that the nsubxact is 0.
> >
> >   I think subxact_info_write
> > > should have a check for nsubxacts > 0 before writing to the file.
> >
> > But, even if nsubxacts become 0 we want to write the file so that we
> > can overwrite the previous info.
> >
>
> Can't we just remove the file for such a case?

But, as of now, we expect that if it is not a first-time stream start
then the file exists.  Actually, currently it's very simple: if it is
not the first segment, we always expect that the file must exist,
otherwise it is an error.  Now, if it is not the first segment, then we
will need to handle multiple cases:

a) subxact_info_read needs to handle the missing-file case, because the
file may not exist either because there was no subxact in the last
stream or because it was deleted when nsubxacts became 0.
b) subxact_info_write: there will be multiple cases, e.g. if nsubxacts
was already 0 then we can avoid writing the file, but if it becomes 0
now then we need to remove the file.

Let me think more on that (a rough sketch of what the two functions
would need is below).
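
For example, the two functions could end up with something like this
(very rough sketch only; subxact_filename() and the surrounding
variables are placeholders for whatever the patch actually uses):

/* in subxact_info_write(): nothing to remember, so drop the file instead */
if (nsubxacts == 0)
{
	char		path[MAXPGPATH];

	subxact_filename(path, subid, xid);	/* placeholder path builder */
	durable_unlink(path, DEBUG1);		/* an already-missing file is fine */
	return;
}

/* in subxact_info_read(): a missing file simply means "no subxacts yet" */
fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
if (fd < 0 && errno == ENOENT)
{
	nsubxacts = 0;
	return;
}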


>
> apply_handle_stream_abort()
> {
> ..
> + /* XXX optimize the search by bsearch on sorted data */
> + for (i = nsubxacts; i > 0; i--)
> + {
> + if (subxacts[i - 1].xid == subxid)
> + {
> + subidx = (i - 1);
> + found = true;
> + break;
> + }
> + }
> +
> + /*
> + * If it's an empty sub-transaction then we will not find the subxid
> + * here so just free the memory and return.
> + */
> + if (!found)
> + {
> + /* Free the subxacts memory */
> + if (subxacts)
> + pfree(subxacts);
> +
> + subxacts = NULL;
> + subxact_last = InvalidTransactionId;
> + nsubxacts = 0;
> + nsubxacts_max = 0;
> +
> + return;
> + }
> ..
> }
>
> I have one question regarding the above code.  Isn't it possible that
> a particular subtransaction id doesn't have any change but others do
> we have?  For ex. cases like below:
>
> postgres=# begin;
> BEGIN
> postgres=*# insert into t1 values(1);
> INSERT 0 1
> postgres=*# savepoint s1;
> SAVEPOINT
> postgres=*# savepoint s2;
> SAVEPOINT
> postgres=*# insert into t1 values(2);
> INSERT 0 1
> postgres=*# insert into t1 values(3);
> INSERT 0 1
> postgres=*# Rollback to savepoint s1;
> ROLLBACK
> postgres=*# commit;
>
> Here, we have performed Rolledback to savepoint s1 which doesn't have
> any change of its own.  I think this would have handled but just
> wanted to confirm.

But internally, that will send an abort for s2 first, and for that we
will find the xid and truncate; later it will send an abort for s1,
which we will not find, so we do nothing.  Anyway, I will test it and
let you know.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Amit Kapila
Дата:
On Tue, Jun 2, 2020 at 7:53 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jun 2, 2020 at 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > I thin for our use case BufFileCreateShared is more suitable.  I think
> > > we need to do some modifications so that we can use these apps without
> > > SharedFileSet. Otherwise, we need to unnecessarily need to create
> > > SharedFileSet for each transaction and also need to maintain it in xid
> > > array or xid hash until transaction commit/abort.  So I suggest
> > > following modifications in shared files set so that we can
> > > conveniently use it.
> > > 1. ChooseTablespace(const SharedFileSet fileset, const char name)
> > >   if fileset is NULL then select the DEFAULTTABLESPACEOID
> > > 2. SharedFileSetPath(char path, SharedFileSet fileset, Oid tablespace)
> > > If fileset is NULL then in directory path we can use MyProcPID or
> > > something instead of fileset->creator_pid.
> > >
> >
> > Hmm, I find these modifications a bit ad-hoc.  So, not sure if it is
> > better than the patch maintains sharedfileset information.
>
> I think we might do something better here, maybe by supplying function
> pointer or so,  but maintaining sharedfileset which contains different
> tablespace/mutext which we don't need at all for our purpose also
> doesn't sound very appealing.
>

I think we can say something similar for Relation (rel cache entry as
well) maintained in LogicalRepRelMapEntry.  I think we only need a
pointer to that information.

>  Let me see if I can not come up with
> some clean way of avoiding the need to shared-fileset then maybe we
> can go with the shared fileset idea.
>

Fair enough.
..

> > >
> > > But, even if nsubxacts become 0 we want to write the file so that we
> > > can overwrite the previous info.
> > >
> >
> > Can't we just remove the file for such a case?
>
> But, as of now, we expect if it is not a first-time stream start then
> the file exists.
>

Isn't it primarily because we do subxact_info_write at stream stop,
which will create such a file irrespective of whether we have any
subxacts?  If so, isn't that an unnecessary write?

>    Actually, currently, it's very easy that if it is
> not the first segment we always expect that the file must exist,
> otherwise an error.
>

I think we can check if the file doesn't exist then we can initialize
nsubxacts as 0.

>   Now if it is not the first segment then we will
> need to handle multiple cases.
>
> a) subxact_info_read need to handle the error case, because the file
> may not exist because there was no subxact in last stream or it was
> deleted because nsubxact become 0.
> b) subxact_info_write,  there will be multiple cases that if nsubxact
> was already 0 then we can avoid writing the file, but if it become 0
> now we need to remove the file.
>
> Let me think more on that.
>

I feel we should be able to deal with these cases, but if you find any
difficulty then let us discuss.  I understand there is some ease in
always having a subxacts file, but OTOH it sounds quite awkward that we
need so many file operations just to detect whether the transaction has
any subtransactions.

> >
> > Here, we have performed Rolledback to savepoint s1 which doesn't have
> > any change of its own.  I think this would have handled but just
> > wanted to confirm.
>
> But internally, that will send abort for the s2 first, and for that,
> we will find xid and truncate, and later we will send abort for s1 but
> that we will not find and do nothing?  Anyway, I will test it and let
> you know.
>

It would be good if we can test and confirm this behavior once.  If it
is not very inconvenient then we can even try to include a test for
the same in the patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Amit Kapila
Дата:
On Fri, May 29, 2020 at 8:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>

The fixes in the latest patchset are correct.  Few minor comments:
v26-0005-Implement-streaming-mode-in-ReorderBuffer
+ /*
+ * Mark toplevel transaction as having catalog changes too if one of its
+ * children has so that the ReorderBufferBuildTupleCidHash can conveniently
+ * check just toplevel transaction and decide whethe we need to build the
+ * hash table or not.  In non-streaming mode we mark the toplevel
+ * transaction in DecodeCommit as we only stream on commit.

Typo, /whethe/whether
missing comma, /In non-streaming mode we/In non-streaming mode, we

v26-0008-Add-support-for-streaming-to-built-in-replicatio
+ /*
+ * This memory context used for per stream data when streaming mode is
+ * enabled.  This context is reeset on each stream stop.
+ */

Can we slightly modify the above comment as "This is used in the
streaming mode for the changes between the start and stop stream
messages.  We reset this context on the stream stop message."?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Wed, Jun 3, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jun 2, 2020 at 7:53 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Jun 2, 2020 at 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > I thin for our use case BufFileCreateShared is more suitable.  I think
> > > > we need to do some modifications so that we can use these apps without
> > > > SharedFileSet. Otherwise, we need to unnecessarily need to create
> > > > SharedFileSet for each transaction and also need to maintain it in xid
> > > > array or xid hash until transaction commit/abort.  So I suggest
> > > > following modifications in shared files set so that we can
> > > > conveniently use it.
> > > > 1. ChooseTablespace(const SharedFileSet fileset, const char name)
> > > >   if fileset is NULL then select the DEFAULTTABLESPACEOID
> > > > 2. SharedFileSetPath(char path, SharedFileSet fileset, Oid tablespace)
> > > > If fileset is NULL then in directory path we can use MyProcPID or
> > > > something instead of fileset->creator_pid.
> > > >
> > >
> > > Hmm, I find these modifications a bit ad-hoc.  So, not sure if it is
> > > better than the patch maintains sharedfileset information.
> >
> > I think we might do something better here, maybe by supplying function
> > pointer or so,  but maintaining sharedfileset which contains different
> > tablespace/mutext which we don't need at all for our purpose also
> > doesn't sound very appealing.
> >
>
> I think we can say something similar for Relation (rel cache entry as
> well) maintained in LogicalRepRelMapEntry.  I think we only need a
> pointer to that information.

Yeah, I see.

> >  Let me see if I can not come up with
> > some clean way of avoiding the need to shared-fileset then maybe we
> > can go with the shared fileset idea.
> >
>
> Fair enough.

While evaluating it further I feel there are a few more problems to
solve if we are using BufFile.  The first thing is that in the subxact
file we maintain the information of each xid and its offset in the
changes file.  So now we will also have to store the 'fileno', but we
can find that using BufFileTell.  Yet another problem is that currently
we don't have a truncate option in BufFile, but we need it if a
sub-transaction gets aborted.  I think we can implement an extra
interface for BufFile, and it should not be very hard as we already
know the fileno and the offset.  I will evaluate this part further and
let you know about the same.
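
Roughly what I have in mind (untested sketch; the SubXactInfo fields
are just how I imagine the bookkeeping, and BufFileTruncateShared is
the new interface being proposed, not existing API):

/* remember where this subxact's changes start in the changes BufFile */
int		fileno;
off_t	offset;

BufFileTell(stream_fd, &fileno, &offset);	/* existing BufFile API */
subxacts[nsubxacts].fileno = fileno;
subxacts[nsubxacts].offset = offset;

/* proposed new interface: chop the changes file back on subxact abort */
extern void BufFileTruncateShared(BufFile *file, int fileno, off_t offset);

/* on rollback of subxact i, discard everything it wrote */
BufFileTruncateShared(stream_fd, subxacts[i].fileno, subxacts[i].offset);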

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Mahendra Singh Thalor
Дата:
On Fri, 29 May 2020 at 15:52, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, May 27, 2020 at 5:19 PM Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
>>
>> On Tue, 26 May 2020 at 16:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> Hi all,
>> On the top of v16 patch set [1], I did some testing for DDL's and DML's to test wal size and performance. Below is the testing summary;
>>
>> Test parameters:
>> wal_level = 'logical'
>> max_connections = '150'
>> wal_receiver_timeout = '600s'
>> max_wal_size = '2GB'
>> min_wal_size = '2GB'
>> autovacuum= 'off'
>> checkpoint_timeout= '1d'
>>
>> Test results:
>>
>> SN  Operation        Metric          CREATE index          Add col int(date)     Add col text
>>                                      LSN diff   time       LSN diff   time       LSN diff   time
>> 1   1 DDL            without patch   17728      0.89116    976        0.764393   33904      0.80044
>>                      with patch      18016      0.804868   1088       0.763602   34856      0.787108
>>                      % LSN change    1.624548              11.475409             2.80792
>> 2   2 DDL            without patch   19872      0.860348   1632       0.763199   34560      0.806086
>>                      with patch      20416      0.839065   1856       0.733147   35624      0.829281
>>                      % LSN change    2.73752               13.7254902            3.078703
>> 3   3 DDL            without patch   22016      0.894891   2288       0.776871   35216      0.803493
>>                      with patch      22816      0.828028   2624       0.737177   36392      0.800194
>>                      % LSN change    3.63372093            14.685314             3.339391186
>> 4   4 DDL            without patch   24160      0.901686   2944       0.768445   35872      0.77489
>>                      with patch      25240      0.887143   3392       0.768382   37160      0.82777
>>                      % LSN change    4.4701986             15.217391             3.590544
>> 5   5 DDL            without patch   26328      0.901686   3600       0.751879   36528      0.817928
>>                      with patch      27640      0.914078   4160       0.74709    37928      0.820621
>>                      % LSN change    4.9832877             15.555555             3.832676
>> 6   6 DDL            without patch   28472      0.936385   4256       0.745179   37184      0.797043
>>                      with patch      30040      0.958226   4928       0.725321   38696      0.814535
>>                      % LSN change    5.5071649             15.78947368           4.066265
>> 7   8 DDL            without patch   32760      1.0022203  5568       0.757468   38496      0.83207
>>                      with patch      34864      0.966777   6464       0.769072   40232      0.903604
>>                      % LSN change    6.422466              16.091954             4.509559
>> 8   11 DDL           without patch   50296      1.0022203  7536       0.748332   40464      0.822266
>>                      with patch      53144      0.966777   8792       0.750553   42560      0.797133
>>                      % LSN change    5.662478              16.666666             5.179913
>> 9   15 DDL           without patch   58896      1.267253   10184      0.776875   43112      0.821916
>>                      with patch      62768      1.27234    11864      0.746844   45632      0.812567
>>                      % LSN change    5.662478              16.496465             5.84524
>> 10  1 DDL & 3 DML    without patch   18240      0.812551   1192       0.771993   34120      0.849467
>>                      with patch      18536      0.819089   1312       0.785117   35080      0.855456
>>                      % LSN change    1.6228                10.067114             2.8113599
>> 11  3 DDL & 5 DML    without patch   23656      0.926616   2656       0.758029   35584      0.829377
>>                      with patch      24480      0.915517   3016       0.797206   36784      0.839176
>>                      % LSN change    3.4832606             13.55421687           3.372302
>> 12  10 DDL & 5 DML   without patch   52760      1.101005   7288       0.763065   40216      0.837843
>>                      with patch      55376      1.105241   8456       0.779257   42224      0.835206
>>                      % LSN change    4.958301744           16.02634468           4.993037
>> 13  10 DML           without patch   1008       0.791091   1008       0.81105    1008       0.78817
>>                      with patch      1072       0.807875   1072       0.771113   1072       0.759789
>>                      % LSN change    6.349206              6.349206              6.349206
>> (LSN diff in bytes, time in seconds)
>>
>> To see all operations, please see [2] test_results
>>
>
> Why are you seeing any additional WAL in case-13 (10 DML) where there is no DDL?  I think it is because you have used savepoints in that case which will add some additional WAL.  You seems to have 9 savepoints in that test which should ideally generate 36 bytes of additional WAL (4-byte per transaction id for each subtransaction).  Also, in other cases where you took data for DDL and DML, you have also used savepoints in those tests. I suggest for savepoints, let's do separate tests as you have done in case-13 but we can do it 3,5,7,10 savepoints and probably each transaction can update a row of 200 bytes or so.
>

Thanks Amit for reviewing results.

Yes, you are correct.  I used savepoints in the DML tests, so they were showing additional WAL.

As suggested above, I did testing for DMLs, DDLs and savepoints.  Below are the test results:

Test results:

SN  Operation        Metric          CREATE index          Add col int(date)     Add col text
                                     LSN diff   time       LSN diff   time       LSN diff   time
1   1 DDL            without patch   17728      0.89116    976        0.764393   33904      0.80044
                     with patch      18016      0.804868   1088       0.763602   34856      0.787108
                     % LSN change    1.624548              11.475409             2.80792
2   2 DDL            without patch   19872      0.860348   1632       0.763199   34560      0.806086
                     with patch      20416      0.839065   1856       0.733147   35624      0.829281
                     % LSN change    2.73752               13.7254902            3.078703
3   3 DDL            without patch   22016      0.894891   2288       0.776871   35216      0.803493
                     with patch      22816      0.828028   2624       0.737177   36392      0.800194
                     % LSN change    3.63372093            14.685314             3.339391186
4   4 DDL            without patch   24160      0.901686   2944       0.768445   35872      0.77489
                     with patch      25240      0.887143   3392       0.768382   37160      0.82777
                     % LSN change    4.4701986             15.217391             3.590544
5   5 DDL            without patch   26328      0.901686   3600       0.751879   36528      0.817928
                     with patch      27640      0.914078   4160       0.74709    37928      0.820621
                     % LSN change    4.9832877             15.555555             3.832676
6   6 DDL            without patch   28472      0.936385   4256       0.745179   37184      0.797043
                     with patch      30040      0.958226   4928       0.725321   38696      0.814535
                     % LSN change    5.5071649             15.78947368           4.066265
7   8 DDL            without patch   32760      1.0022203  5568       0.757468   38496      0.83207
                     with patch      34864      0.966777   6464       0.769072   40232      0.903604
                     % LSN change    6.422466              16.091954             4.509559
8   11 DDL           without patch   50296      1.0022203  7536       0.748332   40464      0.822266
                     with patch      53144      0.966777   8792       0.750553   42560      0.797133
                     % LSN change    5.662478              16.666666             5.179913
9   15 DDL           without patch   58896      1.267253   10184      0.776875   43112      0.821916
                     with patch      62768      1.27234    11864      0.746844   45632      0.812567
                     % LSN change    5.662478              16.496465             5.84524
10  1 DDL & 3 DML    without patch   18224      0.865753   1176       0.78074    34104      0.857664
                     with patch      18512      0.854788   1288       0.767758   35056      0.877604
                     % LSN change    1.58033362            9.523809              2.7914614
11  3 DDL & 5 DML    without patch   23632      0.954274   2632       0.785501   35560      0.87744
                     with patch      24432      0.927245   2968       0.857528   36736      0.867555
                     % LSN change    3.385203              12.765957             3.3070866
12  3 DDL & 10 DML   without patch   25088      0.941534   3040       0.812123   35968      0.877769
                     with patch      25920      0.898643   3376       0.804943   37144      0.879752
                     % LSN change    3.316326              11.052631             3.269579
13  3 DDL & 15 DML   without patch   26400      0.949599   3392       0.818491   36320      0.859353
                     with patch      27232      0.892505   3728       0.789752   37320      0.812386
                     % LSN change    3.151515              9.90566037            3.2378854
14  5 DDL & 15 DML   without patch   31904      0.994223   4704       0.838091   37632      0.867281
                     with patch      33272      0.968122   5264       0.816922   39032      0.876364
                     % LSN change    4.287863              11.904761             3.720238095
15  1 DML            without patch   328        0.817988
                     with patch      328        0.794927
                     % LSN change    0
16  3 DML            without patch   464        0.791229
                     with patch      464        0.806211
                     % LSN change    0
17  5 DML            without patch   608        0.794258
                     with patch      608        0.802001
                     % LSN change    0
18  10 DML           without patch   968        0.831733
                     with patch      968        0.852777
                     % LSN change    0
(LSN diff in bytes, time in seconds)

Results for savepoints:
1.  1 savepoint
      begin;
      insert into perftest values (1);
      savepoint s1;
      update perftest set c1 = 5 where c1 = 1;
      commit;
    without patch:  LSN diff 408 bytes, time 0.805615 sec
    with patch:     LSN diff 416 bytes, time 0.823121 sec   (% LSN change: 1.960784)
2.  2 savepoint
      begin;
      insert into perftest values (1);
      savepoint s1;
      update perftest set c1 = 5 where c1 = 1;
      savepoint s2;
      update perftest set c1 = 6 where c1 = 5;
      commit;
    without patch:  LSN diff 488 bytes, time 0.827147 sec
    with patch:     LSN diff 504 bytes, time 0.819165 sec   (% LSN change: 3.278688)
3.  3 savepoint
      begin;
      insert into perftest values (1);
      savepoint s1;
      update perftest set c1 = 2 where c1 = 1;
      savepoint s2;
      update perftest set c1 = 3 where c1 = 2;
      savepoint s3;
      update perftest set c1 = 4 where c1 = 3;
      commit;
    without patch:  LSN diff 560 bytes, time 0.806441 sec
    with patch:     LSN diff 584 bytes, time 0.821316 sec   (% LSN change: 4.28571428)
4.  5 savepoint
    without patch:  LSN diff 712 bytes, time 0.823774 sec
    with patch:     LSN diff 752 bytes, time 0.800037 sec   (% LSN change: 5.617977528)
5.  7 savepoint
    without patch:  LSN diff 864 bytes, time 0.829136 sec
    with patch:     LSN diff 920 bytes, time 0.793751 sec   (% LSN change: 6.48148148)
6.  10 savepoint
    without patch:  LSN diff 1096 bytes, time 0.77946 sec
    with patch:     LSN diff 1176 bytes, time 0.78711 sec   (% LSN change: 7.29927007)


To see all the operations (DDLs and DMLs), please see test_results

Testing summary:
Basically, we are writing a per-command invalidation message, and to test that I have tested different combinations of DDL and DML operations.  I have not observed any performance degradation with the patch.  For "create index" DDLs, the % change in WAL is 1-7% for 1-15 DDLs.  For "add col int/date" DDLs, it is 11-17% for 1-15 DDLs, and for "add col text" DDLs, it is 2-6% for 1-15 DDLs.  For mixed DDL & DML, it is 2-10%.

As to why we are seeing the 11-13% extra WAL: the absolute amount of extra WAL is not very high, but the WAL generated by an add column int/date is only ~1000 bytes, so an additional ~100 bytes works out to around 10%; for add column text it is ~35000 bytes, so the percentage is smaller.  For text, those ~35000 bytes are due to TOAST.
There is no change in WAL size for DML operations.  For savepoints, we are getting at most an 8-byte WAL increment per savepoint (for a sub-transaction we add 5 bytes to store the xid, but due to padding it becomes 8 bytes; sometimes, if the WAL is already aligned, we get a 0-byte increment).

--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Amit Kapila
Дата:
On Fri, May 29, 2020 at 8:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Apart from this one more fix in 0005,  basically, CheckLiveXid was
> never reset, so I have fixed that as well.
>

I have made a number of modifications in the 0001 patch and attached
is the result.  I have changed/added comments, done some cosmetic
cleanup, and ran pgindent.  The most notable change is to remove the
below code change:
DecodeXactOp()
{
..
- * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
+ * However, it's critical to process records with subxid assignment even
  * when the snapshot is being built: it is possible to get later records
  * that require subxids to be properly assigned.
  */
  if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
- info != XLOG_XACT_ASSIGNMENT)
+ !TransactionIdIsValid(XLogRecGetTopXid(r)))
..
}

I have not only removed the change done by the patch but the check
related to XLOG_XACT_ASSIGNMENT as well.  That check has been added by
commit bac2fae05c to ensure that we process XLOG_XACT_ASSIGNMENT even
if snapshot state is not SNAPBUILD_FULL_SNAPSHOT.  Now, with this
patch that is not required because we are making the subtransaction
and top-level transaction much earlier than this.  I have verified
that it doesn't reopen the bug by running the test provided in the
original report [1].

Let me know what you think of the changes.  If you find them okay,
then feel free to include them in the next patch-set.

[1] - https://www.postgresql.org/message-id/CAONYFtOv%2BEr1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Thu, Jun 4, 2020 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jun 3, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jun 2, 2020 at 7:53 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Tue, Jun 2, 2020 at 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > >
> > > > > I thin for our use case BufFileCreateShared is more suitable.  I think
> > > > > we need to do some modifications so that we can use these apps without
> > > > > SharedFileSet. Otherwise, we need to unnecessarily need to create
> > > > > SharedFileSet for each transaction and also need to maintain it in xid
> > > > > array or xid hash until transaction commit/abort.  So I suggest
> > > > > following modifications in shared files set so that we can
> > > > > conveniently use it.
> > > > > 1. ChooseTablespace(const SharedFileSet fileset, const char name)
> > > > >   if fileset is NULL then select the DEFAULTTABLESPACEOID
> > > > > 2. SharedFileSetPath(char path, SharedFileSet fileset, Oid tablespace)
> > > > > If fileset is NULL then in directory path we can use MyProcPID or
> > > > > something instead of fileset->creator_pid.
> > > > >
> > > >
> > > > Hmm, I find these modifications a bit ad-hoc.  So, not sure if it is
> > > > better than the patch maintains sharedfileset information.
> > >
> > > I think we might do something better here, maybe by supplying function
> > > pointer or so,  but maintaining sharedfileset which contains different
> > > tablespace/mutext which we don't need at all for our purpose also
> > > doesn't sound very appealing.
> > >
> >
> > I think we can say something similar for Relation (rel cache entry as
> > well) maintained in LogicalRepRelMapEntry.  I think we only need a
> > pointer to that information.
>
> Yeah, I see.
>
> > >  Let me see if I can not come up with
> > > some clean way of avoiding the need to shared-fileset then maybe we
> > > can go with the shared fileset idea.
> > >
> >
> > Fair enough.
>
> While evaluating it further I feel there are a few more problems to
> solve if we are using BufFile,  First thing is that in subxact file we
> maintain the information of xid and its offset in the changes file.
> So now, we will also have to store 'fileno' but that we can find using
> BufFileTell.  Yet another problem is that currently, we don't
> have the truncate option in the BufFile,  but we need it if the
> sub-transaction gets aborted.  I think we can implement an extra
> interface with the BufFile and should not be very hard as we already
> know the fileno and the offset.  I will evaluate this part further and
> let you know about the same.

I have further evaluated this and also tested the concept with a POC
patch.  I will complete and share it soon; here is a sketch of the
idea.

As discussed, we will use shared BufFiles for the changes files and
subxact files.  There will be a separate LogicalStreamingResourceOwner,
which will be used to manage the VFDs of the shared buf files.  We can
create a per-stream resource owner, i.e. on stream start we will create
the resource owner and all the shared buffiles will be opened under
that resource owner, which will be deleted on stream stop.  We need to
remember the SharedFileSet so that for a subsequent stream of the same
transaction we can open the same file again; for this we will use a
hash table with the xid as key, keeping the stream_fileset and
subxact_fileset pointers as the payload.

+typedef struct StreamXidHash
+{
+       TransactionId   xid;
+       SharedFileSet  *stream_fileset;
+       SharedFileSet  *subxact_fileset;
+} StreamXidHash;

We have to do some extension to the buffile modules, some of them are
already discussed up-thread but still listing them all down here
- A new interface BufFileTruncateShared(BufFile *file, int fileno,
off_t offset), for truncating the subtransaction changes, if changes
are spread across multiple files those files will be deleted and we
will adjust the file count and current offset accordingly in BufFile.
- In BufFileOpenShared, we will have to implement a mode so that we
can open in write mode as well; currently only read-only mode is
supported.
- In SharedFileSetInit, if dsm_segment is NULL then we will not
register the file deletion on on_dsm_detach.
- As usual, we will clean up the files on stream abort/commit, or on
the worker exit.
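
To illustrate, the lookup at stream start could look roughly like this
(untested sketch; 'xidhash' is the worker-local hash table mentioned
above, and the lazy creation of the subxact fileset is omitted):

StreamXidHash *ent;
bool		found;

/* find (or create) the entry remembering this transaction's filesets */
ent = (StreamXidHash *) hash_search(xidhash, &xid, HASH_ENTER, &found);
if (!found)
{
	/* first stream for this xid: set up the changes fileset */
	ent->stream_fileset = palloc(sizeof(SharedFileSet));
	SharedFileSetInit(ent->stream_fileset, NULL);	/* NULL dsm_segment, as proposed */
	ent->subxact_fileset = NULL;
}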

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Fri, Jun 5, 2020 at 11:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 29, 2020 at 8:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Apart from this one more fix in 0005,  basically, CheckLiveXid was
> > never reset, so I have fixed that as well.
> >
>
> I have made a number of modifications in the 0001 patch and attached
> is the result.  I have changed/added comments, done some cosmetic
> cleanup, and ran pgindent.  The most notable change is to remove the
> below code change:
> DecodeXactOp()
> {
> ..
> - * However, it's critical to process XLOG_XACT_ASSIGNMENT records even
> + * However, it's critical to process records with subxid assignment even
>   * when the snapshot is being built: it is possible to get later records
>   * that require subxids to be properly assigned.
>   */
>   if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT &&
> - info != XLOG_XACT_ASSIGNMENT)
> + !TransactionIdIsValid(XLogRecGetTopXid(r)))
> ..
> }
>
> I have not only removed the change done by the patch but the check
> related to XLOG_XACT_ASSIGNMENT as well.  That check has been added by
> commit bac2fae05c to ensure that we process XLOG_XACT_ASSIGNMENT even
> if snapshot state is not SNAPBUILD_FULL_SNAPSHOT.  Now, with this
> patch that is not required because we are making the subtransaction
> and top-level transaction much earlier than this.  I have verified
> that it doesn't reopen the bug by running the test provided in the
> original report [1].
>
> Let me know what you think of the changes?  If you find them okay,
> then feel to include them in the next patch-set.
>
> [1] - https://www.postgresql.org/message-id/CAONYFtOv%2BEr1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg%40mail.gmail.com

Thanks for the patch, I will review it and include it in my next version.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Amit Kapila
Дата:
On Sun, Jun 7, 2020 at 5:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Jun 5, 2020 at 11:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Let me know what you think of the changes?  If you find them okay,
> > then feel to include them in the next patch-set.
> >
> > [1] - https://www.postgresql.org/message-id/CAONYFtOv%2BEr1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg%40mail.gmail.com
>
> Thanks for the patch, I will review it and include it in my next version.
>

Okay, I have done review of
0002-Issue-individual-invalidations-with-wal_level-lo.patch and below
are my comments:

1. I don't think it is a good idea that logical decoding process the
new XLOG_XACT_INVALIDATIONS and existing WAL records for invalidations
like XLOG_INVALIDATIONS and what we do in DecodeCommit (see code in
the check "if (parsed->nmsgs > 0)").  I think if that is required for
some particular reason then we should write detailed comments about
the same.  I have tried some experiments to see if those are really
required:
a. After applying patch 0002, I have tried by commenting out the
processing of invalidations via DecodeCommit and found some regression
tests were failing but the reason for failure was that we are not
setting RBTXN_HAS_CATALOG_CHANGES for the toptxn when subtxn has
catalog changes and when I did that all regression tests started
passing.  See the attached diff patch
(v27-0003-Incremental-patch-for-0002-to-test-removal-of-du) atop 0002
patch.
b. The processing of invalidations for XLOG_INVALIDATIONS is added by
commit c6ff84b06a for xid-less transactions.  See
https://postgr.es/m/CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.com
to know why that has been added.  Now, after this patch we will
process the same invalidations via XLOG_XACT_INVALIDATIONS and
XLOG_INVALIDATIONS which doesn't seem warranted.  Also, the below
assertion will fail for xid-less transactions (try create index
concurrently statement):
+ case XLOG_XACT_INVALIDATIONS:
+ {
+ TransactionId xid;
+ xl_xact_invalidations *invals;
+
+ xid = XLogRecGetXid(r);
+ invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+ Assert(TransactionIdIsValid(xid));

I feel we don't need the processing of XLOG_INVALIDATIONS in logical
decoding after this patch, but to prove that we first need to write a
test case which needs XLOG_INVALIDATIONS on HEAD, as commit
c6ff84b06a doesn't add one.  I think we need two code paths for
XLOG_XACT_INVALIDATIONS: if it is for an xid-less transaction, then
execute the actions immediately, as we do when processing
XLOG_INVALIDATIONS; otherwise, do what we are doing currently in the
patch (see the sketch below).  If the above point (b) is correct, I am
not sure it is a good idea to use RM_XACT_ID as the resource manager
for this WAL in LogLogicalInvalidations; what do you think?
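
In other words, the decoding side could branch on the xid
(ReorderBufferImmediateInvalidation is the existing function used for
XLOG_INVALIDATIONS; the rest follows the patch's naming and is
untested):

case XLOG_XACT_INVALIDATIONS:
	{
		TransactionId xid = XLogRecGetXid(r);
		xl_xact_invalidations *invals = (xl_xact_invalidations *) XLogRecGetData(r);

		if (!TransactionIdIsValid(xid))
			/* xid-less transaction (e.g. create index concurrently): apply now */
			ReorderBufferImmediateInvalidation(reorder, invals->nmsgs, invals->msgs);
		else
			ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
										 invals->nmsgs, invals->msgs);
		break;
	}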

I think one of the usages we still need is in ReorderBufferForget
because it can be called when we skip processing the txn.  See the
comments in DecodeCommit where we call this function.  If I am
correct, we need to probably collect all invalidations in
ReorderBufferTxn as we are collecting tuplecids and use them here.  We
can do the same during processing of XLOG_XACT_INVALIDATIONS.

I had also thought a bit about removing logging of invalidations at
commit time altogether but it seems processing hot-standby is somewhat
tightly coupled with existing WAL logging.  See xact_redo_commit (a
comment atop call to ProcessCommittedInvalidationMessages).  It says
we need to maintain the order when we process invalidations.  If we
can later find a way to avoid that we can probably remove it but for
now maybe we can live with it.

2.
+ /* not expected, but print something anyway */
+ else if (msg->id == SHAREDINVALSMGR_ID)
+ appendStringInfoString(buf, " smgr");
+ /* not expected, but print something anyway */
+ else if (msg->id == SHAREDINVALRELMAP_ID)

I think the above comment is not valid after we started logging at CCI.

3.
+
+ xid = XLogRecGetXid(r);
+ invals = (xl_xact_invalidations *) XLogRecGetData(r);
+
+ Assert(TransactionIdIsValid(xid));
+ ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
+ invals->nmsgs, invals->msgs);

Here, it should check !ctx->forward as we do in DecodeCommit, do we
have any reason for not doing so.  We can test once by changing this.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Amit Kapila
Дата:
On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I think one of the usages we still need is in ReorderBufferForget
> because it can be called when we skip processing the txn.  See the
> comments in DecodeCommit where we call this function.  If I am
> correct, we need to probably collect all invalidations in
> ReorderBufferTxn as we are collecting tuplecids and use them here.  We
> can do the same during processing of XLOG_XACT_INVALIDATIONS.
>

One more point related to this is that after this patch series, we
need to consider executing all invalidations during transaction abort,
because it is possible that due to memory overflow we have already
processed some of the messages, which may also contain a few
XACT_INVALIDATION messages; so, to avoid cache pollution, we need to
execute all of them on abort.  We do a similar thing for
Rollback/Rollback To Savepoint, see AtEOXact_Inval and
AtEOSubXact_Inval.
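
As a sketch of what the abort side could look like
(ReorderBufferAbortInvalidations is a made-up name here;
txn->ninvalidations/invalidations are assumed to be the messages
collected while processing XLOG_XACT_INVALIDATIONS):

/* hypothetical helper, called from the abort path */
static void
ReorderBufferAbortInvalidations(ReorderBufferTXN *txn)
{
	/* execute whatever this (sub)transaction accumulated, so we don't
	 * leave polluted caches behind */
	for (uint32 i = 0; i < txn->ninvalidations; i++)
		LocalExecuteInvalidationMessage(&txn->invalidations[i]);
}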

Few other comments on
0002-Issue-individual-invalidations-with-wal_level-lo.patch
---------------------------------------------------------------------------------------------------------------
1.
+ if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
+ {
+ ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+ MakeSharedInvalidMessagesArray);
+ invalMessages = SharedInvalidMessagesArray;
+ nmsgs  = numSharedInvalidMessagesArray;
+ SharedInvalidMessagesArray = NULL;
+ numSharedInvalidMessagesArray = 0;

a. Immediately after ProcessInvalidationMessagesMulti, isn't it better
to have an Assertion like Assert(!(numSharedInvalidMessagesArray > 0
&& SharedInvalidMessagesArray == NULL));?
b. Why check "if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)" is
required?  If you see xactGetCommittedInvalidationMessages where we do
something similar, we only check for valid value of transInvalInfo and
here we check the same in the caller of LogLogicalInvalidations, isn't
that sufficient?  If that is sufficient, we can either have the same
check here or have an Assert for the same.

2.
@@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void)
  if (transInvalInfo == NULL)
  return;

+ if (XLogLogicalInfoActive())
+ LogLogicalInvalidations();
+
  ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
  LocalExecuteInvalidationMessage);
Generally, we WAL log the action after performing it but here you are
writing WAL first.  Is there any specific reason?  If so, can we write
a comment about the same?

3.
+ * When wal_level=logical, write invalidations into WAL at each command end to
+ * support the decoding of the in-progress transaction.  As of now it was
+ * enough to log invalidation only at commit because we are only decoding the
+ * transaction at the commit time.   We only need to log the catalog cache and
+ * relcache invalidation.  There can not be any active MVCC scan in logical
+ * decoding so we don't need to log the snapshot invalidation.

I think this comment doesn't hold good after we have changed the patch
to LOG invalidations at the time of CCI.

4.
+
+/*
+ * Emit WAL for invalidations.
+ */
+static void
+LogLogicalInvalidations()

Add the function name atop of this function in comments to match the
style with other nearby functions.  How about modifying it to
something like: "Emit WAL for invalidations.  This is currently only
used for logging invalidations at the command end."

5.
+ *
+ * XXX Do we need to care about relcacheInitFileInval and
+ * the other fields added to ReorderBufferChange, or just
+ * about the message itself?
+ */

I don't think we need to do anything about relcacheInitFileInval.
This is used to remove the stale files (RELCACHE_INIT_FILENAME) that
have obsolete information about relcache.  The walsender process that
is doing decoding doesn't require us to do anything about this.  Also,
if you see before this patch, we don't do anything about relcache
files during decoding of invalidation messages.  In short, I think we
can remove this comment unless you see some use of it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Amit Kapila
Дата:
On Thu, Jun 4, 2020 at 5:06 PM Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
On Fri, 29 May 2020 at 15:52, Amit Kapila <amit.kapila16@gmail.com> wrote:

To see all the operations(DDL's and DML's), please see test_results

Testing summary:
Basically, we are writing per command invalidation message and for testing that I have tested with different combinations of the DDL and DML operation.  I have not observed any performance degradation with the patch. For "create index" DDL's, %change in wal is 1-7% for 1-15 DDL's. For "add col int/date" DDL's, it is 11-17% for 1-15 DDL's and for "add col text" DDL's, it is 2-6% for 1-15 DDL's. For mix (DDL & DML), it is 2-10%.

why are we seeing 11-13 % of the extra wall, basically,  the amount of extra WAL is not very high but the amount of WAL generated with add column int/date is just ~1000 bytes so additional 100 bytes will be around 10% and for add column text it is  ~35000 bytes so % is less. For text, these ~35000 bytes are due to toast
There is no change in wal size for DML operations. For savepoints, we are getting max 8 bytes per savepoint wal increment (basically for Sub-transaction, we are adding 5 bytes to store xid but due to padding, it is 8 bytes and some times if wal is already aligned, then we are getting 0 bytes increment)

So, if I read it correctly, there is no performance penalty with either of the patches, but there is some additional WAL, which in most cases is 2-5% but for the worst cases and some specific DDLs is up to 15%.  I think as this WAL overhead applies only when wal_level is logical, we might have to live with it, as the other alternative is to blow away all caches on any DDL in the walsenders, and that will have both CPU and network overhead as explained previously [1].  I feel if the WAL overhead pinches any workload, we might want to do it under some new GUC (which would disable streaming of transactions), but I don't think we need to go there.

What do you think?


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Tue, Jun 9, 2020 at 3:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jun 4, 2020 at 5:06 PM Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
>>
>> On Fri, 29 May 2020 at 15:52, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >
>>
>>
>> To see all the operations(DDL's and DML's), please see test_results
>>
>> Testing summary:
>> Basically, we are writing per command invalidation message and for testing that I have tested with different
>> combinations of the DDL and DML operation.  I have not observed any performance degradation with the patch. For "create
>> index" DDL's, %change in wal is 1-7% for 1-15 DDL's. For "add col int/date" DDL's, it is 11-17% for 1-15 DDL's and for
>> "add col text" DDL's, it is 2-6% for 1-15 DDL's. For mix (DDL & DML), it is 2-10%.
>>
>> why are we seeing 11-13 % of the extra wall, basically,  the amount of extra WAL is not very high but the amount of
>> WAL generated with add column int/date is just ~1000 bytes so additional 100 bytes will be around 10% and for add column
>> text it is  ~35000 bytes so % is less. For text, these ~35000 bytes are due to toast
>> There is no change in wal size for DML operations. For savepoints, we are getting max 8 bytes per savepoint wal
>> increment (basically for Sub-transaction, we are adding 5 bytes to store xid but due to padding, it is 8 bytes and some
>> times if wal is already aligned, then we are getting 0 bytes increment)
>
>
> So, if I read it correctly, there is no performance penalty with either of the patches but there is some additional
> WAL which in most cases is 2-5% but in worst cases and some specific DDL's it is upto 15%.  I think as this WAL overhead
> is when wal_level is logical, we might have to live with it as the other alternative is to blew up all caches on any DDL
> in WALSenders and that will have bot CPU and Network overhead as expalined previously [1].  I feel if the WAL overhead
> pinches any workload, we might want to do it under some new guc (which will disable streaming of transactions) but I
> don't think we need to go there.
>
> What do you think?

I feel the same, because the WAL overhead applies only with
wal_level=logical and mainly to DDL, and ideally there should not be a
large amount of DDL in the system compared to other operations.  So I
think we can live with the current approach.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Dilip Kumar
Дата:
On Sun, Jun 7, 2020 at 5:06 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Jun 4, 2020 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Jun 3, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Jun 2, 2020 at 7:53 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Tue, Jun 2, 2020 at 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > > >
> > > > > > I thin for our use case BufFileCreateShared is more suitable.  I think
> > > > > > we need to do some modifications so that we can use these apps without
> > > > > > SharedFileSet. Otherwise, we need to unnecessarily need to create
> > > > > > SharedFileSet for each transaction and also need to maintain it in xid
> > > > > > array or xid hash until transaction commit/abort.  So I suggest
> > > > > > following modifications in shared files set so that we can
> > > > > > conveniently use it.
> > > > > > 1. ChooseTablespace(const SharedFileSet fileset, const char name)
> > > > > >   if fileset is NULL then select the DEFAULTTABLESPACEOID
> > > > > > 2. SharedFileSetPath(char path, SharedFileSet fileset, Oid tablespace)
> > > > > > If fileset is NULL then in directory path we can use MyProcPID or
> > > > > > something instead of fileset->creator_pid.
> > > > > >
> > > > >
> > > > > Hmm, I find these modifications a bit ad-hoc.  So, not sure if it is
> > > > > better than the patch maintains sharedfileset information.
> > > >
> > > > I think we might do something better here, maybe by supplying function
> > > > pointer or so,  but maintaining sharedfileset which contains different
> > > > tablespace/mutext which we don't need at all for our purpose also
> > > > doesn't sound very appealing.
> > > >
> > >
> > > I think we can say something similar for Relation (rel cache entry as
> > > well) maintained in LogicalRepRelMapEntry.  I think we only need a
> > > pointer to that information.
> >
> > Yeah, I see.
> >
> > > >  Let me see if I can not come up with
> > > > some clean way of avoiding the need to shared-fileset then maybe we
> > > > can go with the shared fileset idea.
> > > >
> > >
> > > Fair enough.
> >
> > While evaluating it further I feel there are a few more problems to
> > solve if we are using BufFile,  First thing is that in subxact file we
> > maintain the information of xid and its offset in the changes file.
> > So now, we will also have to store 'fileno' but that we can find using
> > BufFileTell.  Yet another problem is that currently, we don't
> > have the truncate option in the BufFile,  but we need it if the
> > sub-transaction gets aborted.  I think we can implement an extra
> > interface with the BufFile and should not be very hard as we already
> > know the fileno and the offset.  I will evaluate this part further and
> > let you know about the same.
>
> I have further evaluated this and also tested the concept with a POC
> patch.  Soon I will complete and share, here is the scatch of the
> idea.
>
> As discussed we will use SharedBufFile for changes files and subxact
> files.  There will be a separate LogicalStreamingResourceOwner, which
> will be used to manage the VFD of the shared buf files.  We can create
> a per stream resource owner i.e. on stream start we will create the
> resource owner and all the shared buffiles will be opened under that
> resource owner, which will be deleted on stream stop.   We need to
> remember the SharedFileSet so that for subsequent stream for the same
> transaction we can open the same file again, for this we will use a
> hash table with xid as a key and in that, we will keep stream_fileset
> and subxact_fileset's pointers as payload.
>
> +typedef struct StreamXidHash
> +{
> +       TransactionId   xid;
> +       SharedFileSet  *stream_fileset;
> +       SharedFileSet  *subxact_fileset;
> +} StreamXidHash;
>
> We have to do some extension to the buffile modules, some of them are
> already discussed up-thread but still listing them all down here
> - A new interface BufFileTruncateShared(BufFile *file, int fileno,
> off_t offset), for truncating the subtransaction changes, if changes
> are spread across multiple files those files will be deleted and we
> will adjust the file count and current offset accordingly in BufFile.
> - In BufFileOpenShared, we will have to implement a mode so that we
> can open in write mode as well, current only read-only mode supported.
> - In SharedFileSetInit, if dsm_segment is NULL then we will not
> register the file deletion on on_dsm_detach.
> - As usual, we will clean up the files on stream abort/commit, or on
> the worker exit.

Currently, I have a working prototype of using the BufFile
infrastructure for the temp files.  Meanwhile, I want to discuss a few
interface changes required in the BufFile infrastructure.

1. Support read-write mode for "BufFileOpenShared".  Basically, in
workers we will be opening the xid's changes and subxact files per
stream, so we need an RW mode even in the open.  I have passed a flag
for the same.

2. Files should not be closed at the end of the transaction:
currently, files opened with BufFileCreateShared/BufFileOpenShared are
registered to be closed at EOXACT.  Basically, we need to open the
changes file at stream start and keep it open until stream stop, so we
can not afford to have it closed at EOXACT.  I have added a flag for
the same.

3. As discussed above, we need to support truncate for handling
subtransaction abort, so I have added a new interface for the same.

4. Every time we open the changes file, we need to seek to the end, so
I have added support for SEEK_END.

Attached is the WIP patch describing my changes (a small usage sketch
follows).
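
For reference, this is roughly how the worker side ends up using it
under these changes (sketch only; 'ent' and 'path' are placeholders,
and the O_RDWR argument plus SEEK_END support are part of the proposed
changes, not the current API):

/* at stream start: reopen this transaction's changes file and append to it */
stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);

/* position after whatever the previous streams already wrote */
BufFileSeek(stream_fd, 0, 0, SEEK_END);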

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of largein-progress transactions

От
Amit Kapila
Дата:
On Wed, Jun 10, 2020 at 2:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>
> Currently, I am done with a working prototype of using the BufFile
> infrastructure for the tempfile.  Meanwhile, I want to discuss a few
> interface changes required for the BufFIle infrastructure.
>
> 1. Support read-write mode for "BufFileOpenShared",  Basically, in
> workers we will be opening the xid's changes and subxact files per
> stream, so we need an RW mode even in the open.  I have passed a flag
> for the same.
>

Generally, file open APIs have a mode parameter to indicate read-only
or read-write.  Using a flag here seems a bit odd to me.

> 2. Files should not be closed at the end of the transaction:
> Currently, files opened with BufFileCreateShared/BufFileOpenShared are
> registered to be closed on EOXACT.  Basically, we need to open the
> changes file on the stream start and keep it open until stream stop,
> so we can not afford to get it closed on the EOXACT.  I have added a
> flag for the same.
>

But where do we end a transaction before stream stop such that it could
lead to this file being closed?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Dilip Kumar
On Wed, Jun 10, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jun 10, 2020 at 2:30 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> >
> > Currently, I am done with a working prototype of using the BufFile
> > infrastructure for the tempfile.  Meanwhile, I want to discuss a few
> > interface changes required for the BufFIle infrastructure.
> >
> > 1. Support read-write mode for "BufFileOpenShared",  Basically, in
> > workers we will be opening the xid's changes and subxact files per
> > stream, so we need an RW mode even in the open.  I have passed a flag
> > for the same.
> >
>
> Generally file open APIs have mode as a parameter to indicate
> read_only or read_write.  Using flag here seems a bit odd to me.

Let me think about it, we can try to pass the mode.

> > 2. Files should not be closed at the end of the transaction:
> > Currently, files opened with BufFileCreateShared/BufFileOpenShared are
> > registered to be closed on EOXACT.  Basically, we need to open the
> > changes file on the stream start and keep it open until stream stop,
> > so we can not afford to get it closed on the EOXACT.  I have added a
> > flag for the same.
> >
>
> But where do we end the transaction before the stream stop which can
> lead to closure of this file?

Currently, I keep a transaction open only while creating/opening the
files and commit it immediately after that.  Maybe we can instead keep
the transaction open until stream stop; then we can avoid these changes
and also avoid creating an extra resource owner.  What are your
thoughts on this?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Amit Kapila
On Wed, Jun 10, 2020 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jun 10, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > 2. Files should not be closed at the end of the transaction:
> > > Currently, files opened with BufFileCreateShared/BufFileOpenShared are
> > > registered to be closed on EOXACT.  Basically, we need to open the
> > > changes file on the stream start and keep it open until stream stop,
> > > so we can not afford to get it closed on the EOXACT.  I have added a
> > > flag for the same.
> > >
> >
> > But where do we end the transaction before the stream stop which can
> > lead to closure of this file?
>
> Currently, I am keeping the transaction only while creating/opening
> the files and closing immediately after that,  maybe we can keep the
> transaction until stream stop, then we can avoid this changes,  and we
> can also avoid creating extra resource owner?  What is your thought on
> this?
>

I would prefer to keep the transaction until the stream stop unless
there are good reasons for not doing so.
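For clarity, here is a rough sketch of what "one transaction per stream chunk" would look like in the apply worker; the handler names exist in the patch, but the body shown is only illustrative:

/* fragment of worker.c -- illustrative only */

static void
apply_handle_stream_start(StringInfo s)
{
    /*
     * Start a transaction that stays open until the matching stream_stop.
     * This gives the BufFile/SharedFileSet machinery a resource owner and
     * lets PrepareTempTablespaces() do its work, without inventing a
     * separate per-stream resource owner.
     */
    StartTransactionCommand();

    /* ... look up/create the per-xid filesets and open the changes file ... */
}

static void
apply_handle_stream_stop(StringInfo s)
{
    /* ... write out subxact info and close the per-xid files ... */

    /* End the per-stream transaction started in stream_start. */
    Assert(IsTransactionState());
    CommitTransactionCommand();
}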

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Dilip Kumar
On Wed, Jun 10, 2020 at 5:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jun 10, 2020 at 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Jun 10, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > > 2. Files should not be closed at the end of the transaction:
> > > > Currently, files opened with BufFileCreateShared/BufFileOpenShared are
> > > > registered to be closed on EOXACT.  Basically, we need to open the
> > > > changes file on the stream start and keep it open until stream stop,
> > > > so we can not afford to get it closed on the EOXACT.  I have added a
> > > > flag for the same.
> > > >
> > >
> > > But where do we end the transaction before the stream stop which can
> > > lead to closure of this file?
> >
> > Currently, I am keeping the transaction only while creating/opening
> > the files and closing immediately after that,  maybe we can keep the
> > transaction until stream stop, then we can avoid this changes,  and we
> > can also avoid creating extra resource owner?  What is your thought on
> > this?
> >
>
> I would prefer to keep the transaction until the stream stop unless
> there are good reasons for not doing so.

I am ready with the first patch set, which replaces the temp file usage
in the worker with BufFile usage (patches v27-0013 and v27-0014).

Open items:
- As of now, I have kept the buffile changes and the worker changes that
use buffile as separate patches for review.  Later I will make the
buffile changes the base patch and merge the worker changes into the
0008 patch.

- Currently, while reading/writing the streaming/subxact files we report
a wait event, for example
'pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);', but
BufFileWrite/BufFileRead already report a read/write wait event
internally, so I think we can avoid reporting it ourselves.  I still
have to work on this part; once we reach consensus I can remove those
extra wait events from the patch.

- There are still a few open comments from your other mails that I have
to work on.  I will address those in the next version.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Amit Kapila
On Fri, Jun 12, 2020 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> - Currently, while reading/writing the streaming/subxact files we are
> reporting the wait event for example
> 'pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);',  but
> BufFileWrite/BufFileRead internally reports the read/write wait event.
> So I think we can avoid reporting that?
>

Yes, we can avoid that.  No other place using BufFileRead does any
such reporting.

>  Basically, this part is still
> I have to work upon, once we get the consensus then I can remove those
> extra wait event from the patch.
>

Okay, feel free to send an updated patch with the above change.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Dilip Kumar
On Fri, Jun 12, 2020 at 4:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jun 12, 2020 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > - Currently, while reading/writing the streaming/subxact files we are
> > reporting the wait event for example
> > 'pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);',  but
> > BufFileWrite/BufFileRead internally reports the read/write wait event.
> > So I think we can avoid reporting that?
> >
>
> Yes, we can avoid that.  No other place using BufFileRead does any
> such reporting.

I agree.

> >  Basically, this part is still
> > I have to work upon, once we get the consensus then I can remove those
> > extra wait event from the patch.
> >
>
> Okay, feel free to send an updated patch with the above change.

Sure, I will do that in the next patch set.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Amit Kapila
On Mon, Jun 15, 2020 at 9:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Jun 12, 2020 at 4:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > >  Basically, this part is still
> > > I have to work upon, once we get the consensus then I can remove those
> > > extra wait event from the patch.
> > >
> >
> > Okay, feel free to send an updated patch with the above change.
>
> Sure, I will do that in the next patch set.
>

I have few more comments on the patch
0013-Change-buffile-interface-required-for-streaming-.patch:

1.
- * temp_file_limit of the caller, are read-only and are automatically closed
- * at the end of the transaction but are not deleted on close.
+ * temp_file_limit of the caller, are read-only if the flag is set and are
+ * automatically closed at the end of the transaction but are not deleted on
+ * close.
  */
 File
-PathNameOpenTemporaryFile(const char *path)
+PathNameOpenTemporaryFile(const char *path, int mode)

No need to say "are read-only if the flag is set".  I don't see any
flag passed to function so that part of the comment doesn't seem
appropriate.

2.
@@ -68,7 +68,8 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
  }

  /* Register our cleanup callback. */
- on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+ if (seg)
+ on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
 }

Add comments atop function to explain when we don't want to register
the dsm detach stuff?

3.
+ */
+ newFile = file->numFiles - 1;
+ newOffset = FileSize(file->files[file->numFiles - 1]);
  break;

FileSize can return a negative length to indicate failure, which we
should handle; see the other places in the code where FileSize is used.
But I have another question here: why do we need to implement SEEK_END
at all?  How do other users of the BufFile interface take care of this?
I see an API BufFileTell which can give the current read/write location
in the file; isn't that sufficient for your usage?  Also, how was this
handled in the patch before the switch to BufFile?
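If SEEK_END stays, the failure handling could look something like this (sketch only):

        case SEEK_END:
            /* Position at the very end of the last physical segment. */
            newFile = file->numFiles - 1;
            newOffset = FileSize(file->files[newFile]);
            if (newOffset < 0)
                ereport(ERROR,
                        (errcode_for_file_access(),
                         errmsg("could not determine size of temporary file \"%s\": %m",
                                FilePathName(file->files[newFile]))));
            break;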

4.
+ /* Loop over all the  files upto the fileno which we want to truncate. */
+ for (i = file->numFiles - 1; i >= fileno; i--)

"the  files", extra space in the above part of the comment.

5.
+ /*
+ * Except the fileno,  we can directly delete other files.

Before 'we', there is extra space.

6.
+ else
+ {
+ FileTruncate(file->files[i], offset, WAIT_EVENT_BUFFILE_READ);
+ newOffset = offset;
+ }

The wait event passed here doesn't seem to be appropriate.  You might
want to introduce a new wait event WAIT_EVENT_BUFFILE_TRUNCATE.  Also,
the error handling for FileTruncate is missing.
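i.e. something like this (sketch; the new wait event name is only a suggestion):

        else
        {
            /* Truncate the last remaining segment to the target offset. */
            if (FileTruncate(file->files[i], offset,
                             WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
                ereport(ERROR,
                        (errcode_for_file_access(),
                         errmsg("could not truncate file \"%s\": %m",
                                FilePathName(file->files[i]))));
            newOffset = offset;
        }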

7.
+ if ((i != fileno || offset == 0) && fileno != 0)
+ {
+ SharedSegmentName(segment_name, file->name, i);
+ SharedFileSetDelete(file->fileset, segment_name, true);
+ newFile--;
+ newOffset = MAX_PHYSICAL_FILESIZE;
+ }

Similar to the previous comment, I think we should handle the failure
of SharedFileSetDelete.

8. I think the comments related to BufFile shared API usage need to be
expanded in the code to explain the new usage.  For ex., see the below
comments atop buffile.c
* BufFile supports temporary files that can be made read-only and shared with
* other backends, as infrastructure for parallel execution.  Such files need
* to be created as a member of a SharedFileSet that all participants are
* attached to.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Amit Kapila
On Mon, Jun 15, 2020 at 6:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I have few more comments on the patch
> 0013-Change-buffile-interface-required-for-streaming-.patch:
>

Review comments on 0014-Worker-tempfile-use-the-shared-buffile-infrastru:
1.
The subxact file is only create if there
+ * are any suxact info under this xid.
+ */
+typedef struct StreamXidHash

Let's slightly reword that part of the comment as "The subxact file is
created iff there is any subxact info under this xid."

2.
@@ -710,6 +740,9 @@ apply_handle_stream_stop(StringInfo s)
  subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
  stream_close_file();

+ /* Commit the per-stream transaction */
+ CommitTransactionCommand();

Before calling commit, ensure that we are in a valid transaction.  I
think we can have an Assert for IsTransactionState().
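Something as simple as this should do (sketch):

    /* We must be inside the per-stream transaction started at stream_start. */
    Assert(IsTransactionState());

    /* Commit the per-stream transaction */
    CommitTransactionCommand();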

3.
@@ -761,11 +791,13 @@ apply_handle_stream_abort(StringInfo s)

  int64 i;
  int64 subidx;
- int fd;
+ BufFile    *fd;
  bool found = false;
  char path[MAXPGPATH];
+ StreamXidHash *ent;

  subidx = -1;
+ ensure_transaction();
  subxact_info_read(MyLogicalRepWorker->subid, xid);

Why call ensure_transaction here?  Is there any reason why we would not
have a valid transaction by this point?  If not, then it is better to
have an Assert for IsTransactionState().

4.
- if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
+ if (BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
  {
- int save_errno = errno;
+ int save_errno = errno;

- CloseTransientFile(fd);
+ BufFileClose(fd);

On error, won't these files be closed automatically?  If so, why do we
need to close them explicitly here and before the other errors?

5.
if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
{
int save_errno = errno;

BufFileClose(fd);
errno = save_errno;
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not read file \"%s\": %m",

Can we change the error message to "could not read from streaming
transactions file .." or something like that and similarly we can
change the message for failure in reading changes file?

6.
if (BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
{
int save_errno = errno;

BufFileClose(fd);
errno = save_errno;
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not write to file \"%s\": %m",

Similar to previous, can we change it to "could not write to streaming
transactions file

7.
@@ -2855,17 +2844,32 @@ stream_open_file(Oid subid, TransactionId xid,
bool first_segment)
  * for writing, in append mode.
  */
  if (first_segment)
- flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
- else
- flags = (O_WRONLY | O_APPEND | PG_BINARY);
+ {
+ /*
+ * Shared fileset handle must be allocated in the persistent context.
+ */
+ SharedFileSet *fileset =
+ MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));

- stream_fd = OpenTransientFile(path, flags);
+ PrepareTempTablespaces();
+ SharedFileSetInit(fileset, NULL);

Why are we calling PrepareTempTablespaces here? It is already called
in SharedFileSetInit.

8.
+ /*
+ * Start a transaction on stream start, this transaction will be committed
+ * on the stream stop.  We need the transaction for handling the buffile,
+ * used for serializing the streaming data and subxact info.
+ */
+ ensure_transaction();

I think we need this for PrepareTempTablespaces to set the
temptablespaces.  Also, isn't it required for a cleanup of buffile
resources at the transaction end?  Are there any other reasons for it
as well?  The comment should be a bit more clear for why we need a
transaction here.

9.
* Open a file for streamed changes from a toplevel transaction identified
 * by stream_xid (global variable). If it's the first chunk of streamed
 * changes for this transaction, perform cleanup by removing existing
 * files after a possible previous crash.
..
stream_open_file(Oid subid, TransactionId xid, bool first_segment)

The above part comment atop stream_open_file needs to be changed after
new implementation.

10.
 * enabled.  This context is reeset on each stream stop.
*/
LogicalStreamingContext = AllocSetContextCreate(ApplyContext,

/reeset/reset

11.
stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
{
..
+ /* No entry created for this xid so simply return. */
+ if (ent == NULL)
+ return;
..
}

Is there any reason or scenario where this ent can be NULL?  If not,
it will be better to have an Assert for the same.

12.
subxact_info_write(Oid subid, TransactionId xid)
{
..
+ /*
+ * If there is no subtransaction then nothing to do,  but if already have
+ * subxact file then delete that.
+ */
+ if (nsubxacts == 0)
  {
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not create file \"%s\": %m",
- path)));
+ if (ent->subxact_fileset)
+ {
+ cleanup_subxact_info();
+ BufFileDeleteShared(ent->subxact_fileset, path);
+ ent->subxact_fileset = NULL;
..
}

Here don't we need to free the subxact_fileset before setting it to NULL?
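i.e. roughly (sketch):

        if (ent->subxact_fileset)
        {
            cleanup_subxact_info();
            BufFileDeleteShared(ent->subxact_fileset, path);
            /* release the fileset memory, not just the reference */
            pfree(ent->subxact_fileset);
            ent->subxact_fileset = NULL;
        }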

13.
+ /*
+ * Scan complete hash and delete the underlying files for the the xids.
+ * Also delete the memory for the shared file sets.
+ */

/the the/the.  Instead of "delete the memory", it would be better to
say "release the memory".

14.
+ /*
+ * We might not have created the suxact fileset if there is no sub
+ * transaction.
+ */

/suxact/subxact

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Dilip Kumar
On Tue, Jun 9, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I think one of the usages we still need is in ReorderBufferForget
> > because it can be called when we skip processing the txn.  See the
> > comments in DecodeCommit where we call this function.  If I am
> > correct, we need to probably collect all invalidations in
> > ReorderBufferTxn as we are collecting tuplecids and use them here.  We
> > can do the same during processing of XLOG_XACT_INVALIDATIONS.
> >
>
> One more point related to this is that after this patch series, we
> need to consider executing all invalidation during transaction abort.
> Because it is possible that due to memory overflow, we have processed
> some of the messages which also contain a few XACT_INVALIDATION
> messages, so to avoid cache pollution, we need to execute all of them
> in abort.  We also do the similar thing in Rollback/Rollback To
> Savepoint, see AtEOXact_Inval and AtEOSubXact_Inval.

I have analyzed this further and I think there is a problem with that
approach.  If, instead of keeping each invalidation as an individual
change, we combine them in ReorderBufferTxn's invalidation list, then
what happens if the (sub)transaction is aborted?  In that case we will
end up executing all those invalidations even though we never polluted
the cache, because we never tried to stream the transaction.  So this
will affect the normal case where we haven't streamed the transaction:
every time we would execute the invalidations logged by (sub)transactions
that were aborted.  One way is to build the list at the subtransaction
level and, just before sending the transaction (on commit), combine all
the (sub)transactions' invalidation lists.  But since we already have
the invalidations in the commit record, there is no point in adding this
complexity for the commit path.

My main worry is about the streaming transaction; the problems are:
- Immediately on the arrival of an individual invalidation, we cannot
directly add it to the top-level transaction's invalidation list,
because if the transaction later aborts before we stream (or we stream
directly on commit), we would end up with an unnecessarily long list of
invalidations coming from aborted subtransactions.
- If we keep collecting them in the individual subtransaction's
ReorderBufferTxn->invalidations, then the question is when to merge
them.  I think it is a good idea to merge them all as soon as we try to
stream, or on commit.  Since this solution of combining the
(sub)transactions' invalidations is required for the streaming case
anyway, we can use it as the common solution, whether we stream due to
memory overflow or due to commit.
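To make the "merge on stream/commit" idea concrete, a very rough sketch (the helper name is made up; ninvalidations/invalidations/subtxns are the existing ReorderBufferTXN fields):

/* fragment -- fold each subxact's invalidations into the toplevel txn */
static void
merge_subxact_invalidations(ReorderBufferTXN *txn)
{
    dlist_iter  iter;

    dlist_foreach(iter, &txn->subtxns)
    {
        ReorderBufferTXN *subtxn;
        Size        nmsgs;

        subtxn = dlist_container(ReorderBufferTXN, node, iter.cur);
        if (subtxn->ninvalidations == 0)
            continue;

        nmsgs = txn->ninvalidations + subtxn->ninvalidations;
        if (txn->invalidations == NULL)
            txn->invalidations = (SharedInvalidationMessage *)
                palloc(nmsgs * sizeof(SharedInvalidationMessage));
        else
            txn->invalidations = (SharedInvalidationMessage *)
                repalloc(txn->invalidations,
                         nmsgs * sizeof(SharedInvalidationMessage));

        memcpy(txn->invalidations + txn->ninvalidations,
               subtxn->invalidations,
               subtxn->ninvalidations * sizeof(SharedInvalidationMessage));
        txn->ninvalidations = nmsgs;
    }
}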

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Amit Kapila
On Tue, Jun 16, 2020 at 7:49 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jun 9, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > I think one of the usages we still need is in ReorderBufferForget
> > > because it can be called when we skip processing the txn.  See the
> > > comments in DecodeCommit where we call this function.  If I am
> > > correct, we need to probably collect all invalidations in
> > > ReorderBufferTxn as we are collecting tuplecids and use them here.  We
> > > can do the same during processing of XLOG_XACT_INVALIDATIONS.
> > >
> >
> > One more point related to this is that after this patch series, we
> > need to consider executing all invalidation during transaction abort.
> > Because it is possible that due to memory overflow, we have processed
> > some of the messages which also contain a few XACT_INVALIDATION
> > messages, so to avoid cache pollution, we need to execute all of them
> > in abort.  We also do the similar thing in Rollback/Rollback To
> > Savepoint, see AtEOXact_Inval and AtEOSubXact_Inval.
>
> I have analyzed this further and I think there is some problem with
> that. If Instead of keeping the invalidation as an individual change,
> if we try to combine them in ReorderBufferTxn's invalidation then what
> happens if the (sub)transaction is aborted.  Basically, in this case,
> we will end up executing all those invalidations for those we never
> polluted the cache if we never try to stream it.  So this will affect
> the normal case where we haven't streamed the transaction because
> every time we have executed the invalidation logged by transaction
> those are aborted.  One way is we develop the list at the
> sub-transaction level and just before sending the transaction (on
> commit) combine all the (sub) transaction's invalidation list.  But,
> I think since we already have the invalidation in the commit message
> then there is no point in adding this complexity.
> But, my main worry is about the streaming transaction, the problems are
> - Immediately on the arrival of individual invalidation, we can not
> directly add to the top-level transaction's invalidation list because
> later if the transaction aborted before we stream (or we directly
> stream on commit) then we will get an unnecessarily long list of
> invalidation which is done by aborted subtransaction.
>

Is there any correctness problem you see with this, or are you
concerned about the efficiency?  Please note that we already do
something similar in ReorderBufferForget, and if your concern is
efficiency then that applies to the existing cases as well.  If we want,
we can improve it later in many ways, one of which you have already
suggested; at this point the main thing is correctness, and aborts are
not frequent enough to worry too much about their performance.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Dilip Kumar
On Wed, Jun 17, 2020 at 9:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jun 16, 2020 at 7:49 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Jun 9, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > I think one of the usages we still need is in ReorderBufferForget
> > > > because it can be called when we skip processing the txn.  See the
> > > > comments in DecodeCommit where we call this function.  If I am
> > > > correct, we need to probably collect all invalidations in
> > > > ReorderBufferTxn as we are collecting tuplecids and use them here.  We
> > > > can do the same during processing of XLOG_XACT_INVALIDATIONS.
> > > >
> > >
> > > One more point related to this is that after this patch series, we
> > > need to consider executing all invalidation during transaction abort.
> > > Because it is possible that due to memory overflow, we have processed
> > > some of the messages which also contain a few XACT_INVALIDATION
> > > messages, so to avoid cache pollution, we need to execute all of them
> > > in abort.  We also do the similar thing in Rollback/Rollback To
> > > Savepoint, see AtEOXact_Inval and AtEOSubXact_Inval.
> >
> > I have analyzed this further and I think there is some problem with
> > that. If Instead of keeping the invalidation as an individual change,
> > if we try to combine them in ReorderBufferTxn's invalidation then what
> > happens if the (sub)transaction is aborted.  Basically, in this case,
> > we will end up executing all those invalidations for those we never
> > polluted the cache if we never try to stream it.  So this will affect
> > the normal case where we haven't streamed the transaction because
> > every time we have executed the invalidation logged by transaction
> > those are aborted.  One way is we develop the list at the
> > sub-transaction level and just before sending the transaction (on
> > commit) combine all the (sub) transaction's invalidation list.  But,
> > I think since we already have the invalidation in the commit message
> > then there is no point in adding this complexity.
> > But, my main worry is about the streaming transaction, the problems are
> > - Immediately on the arrival of individual invalidation, we can not
> > directly add to the top-level transaction's invalidation list because
> > later if the transaction aborted before we stream (or we directly
> > stream on commit) then we will get an unnecessarily long list of
> > invalidation which is done by aborted subtransaction.
> >
>
> Is there any problem you see with this or you are concerned with the
> efficiency?  Please note, we already do something similar in
> ReorderBufferForget and if your concern is efficiency then that
> applies to existing cases as well.  I think if we want we can improve
> it later in many ways and one of them you have already suggested, at
> this time, the main thing is correctness and also aborts are not
> frequent enough to worry too much about their performance.

As of now, I don't see a correctness problem; I was just concerned
about processing more invalidation messages in the aborted cases
compared to the current code, even when streaming is off or the
transaction was never streamed because the memory limit was not
crossed.  But I agree that it only affects the abort case, so I will
work on this and later maybe we can test the performance.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Amit Kapila
On Tue, Jun 16, 2020 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 15, 2020 at 6:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I have few more comments on the patch
> > 0013-Change-buffile-interface-required-for-streaming-.patch:
> >
>
> Review comments on 0014-Worker-tempfile-use-the-shared-buffile-infrastru:
>

changes_filename(char *path, Oid subid, TransactionId xid)
 {
- char tempdirpath[MAXPGPATH];
-
- TempTablespacePath(tempdirpath, DEFAULTTABLESPACE_OID);
-
- /*
- * We might need to create the tablespace's tempfile directory, if no
- * one has yet done so.
- */
- if ((MakePGDirectory(tempdirpath) < 0) && errno != EEXIST)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not create directory \"%s\": %m",
- tempdirpath)));
-
- snprintf(path, MAXPGPATH, "%s/%s%d-%u-%u.changes",
- tempdirpath, PG_TEMP_FILE_PREFIX, MyProcPid, subid, xid);
+ snprintf(path, MAXPGPATH, "%u-%u.changes", subid, xid);

Today, I was studying this change and its impact.  Initially, I thought
that because the patch has removed the pgsql_tmp prefix from the
filename, it might create problems if temporary files remain on disk
after a crash.  Now that the patch uses the BufFile interface, that
seems to be taken care of internally, because it generates names like
"base/pgsql_tmp/pgsql_tmp13774.0.sharedfileset/16393-513.changes.0",
i.e. it ensures the file is created in a directory whose name starts
with pgsql_tmp.  I tried crashing the server in a situation where the
temp files remain, and after the restart they are removed.  So it seems
okay to generate file names like that, but I still suggest testing
other paths such as backup, where we ignore files whose names start
with PG_TEMP_FILE_PREFIX.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Dilip Kumar
On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Jun 7, 2020 at 5:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, Jun 5, 2020 at 11:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Let me know what you think of the changes?  If you find them okay,
> > > then feel to include them in the next patch-set.
> > >
> > > [1] -
https://www.postgresql.org/message-id/CAONYFtOv%2BEr1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg%40mail.gmail.com
> >
> > Thanks for the patch, I will review it and include it in my next version.

I have merged your changes 0002 in this version.

> Okay, I have done review of
> 0002-Issue-individual-invalidations-with-wal_level-lo.patch and below
> are my comments:
>
> 1. I don't think it is a good idea that logical decoding process the
> new XLOG_XACT_INVALIDATIONS and existing WAL records for invalidations
> like XLOG_INVALIDATIONS and what we do in DecodeCommit (see code in
> the check "if (parsed->nmsgs > 0)").  I think if that is required for
> some particular reason then we should write detailed comments about
> the same.  I have tried some experiments to see if those are really
> required:
> a. After applying patch 0002, I have tried by commenting out the
> processing of invalidations via DecodeCommit and found some regression
> tests were failing but the reason for failure was that we are not
> setting RBTXN_HAS_CATALOG_CHANGES for the toptxn when subtxn has
> catalog changes and when I did that all regression tests started
> passing.  See the attached diff patch
> (v27-0003-Incremental-patch-for-0002-to-test-removal-of-du) atop 0002
> patch.
> b. The processing of invalidations for XLOG_INVALIDATIONS is added by
> commit c6ff84b06a for xid-less transactions.  See
> https://postgr.es/m/CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.com
> to know why that has been added.  Now, after this patch we will
> process the same invalidations via XLOG_XACT_INVALIDATIONS and
> XLOG_INVALIDATIONS which doesn't seem warranted.  Also, the below
> assertion will fail for xid-less transactions (try create index
> concurrently statement):
> + case XLOG_XACT_INVALIDATIONS:
> + {
> + TransactionId xid;
> + xl_xact_invalidations *invals;
> +
> + xid = XLogRecGetXid(r);
> + invals = (xl_xact_invalidations *) XLogRecGetData(r);
> +
> + Assert(TransactionIdIsValid(xid));
>
> I feel we don't need the processing of XLOG_INVALIDATIONS in logical
> decoding after this patch but to prove that first we need to write a
> test case which need XLOG_INVALIDATIONS in the HEAD as commit
> c6ff84b06a doesn't add one.  I think we need two code paths in
> XLOG_XACT_INVALIDATIONS where if it is for xid-less transactions, then
> execute actions immediately as we are doing in processing of
> XLOG_INVALIDATIONS, otherwise, do what we are doing currently in the
> patch.  If the above point (b) is correct, I am not sure if it is a
> good idea to use RM_XACT_ID as resource manager if for this WAL in
> LogLogicalInvalidations, what do you think?
>
> I think one of the usages we still need is in ReorderBufferForget
> because it can be called when we skip processing the txn.  See the
> comments in DecodeCommit where we call this function.  If I am
> correct, we need to probably collect all invalidations in
> ReorderBufferTxn as we are collecting tuplecids and use them here.  We
> can do the same during processing of XLOG_XACT_INVALIDATIONS.
>
> I had also thought a bit about removing logging of invalidations at
> commit time altogether but it seems processing hot-standby is somewhat
> tightly coupled with existing WAL logging.  See xact_redo_commit (a
> comment atop call to ProcessCommittedInvalidationMessages).  It says
> we need to maintain the order when we process invalidations.  If we
> can later find a way to avoid that we can probably remove it but for
> now maybe we can live with it.

Yes, I have made the changes.  Basically, I am now using only
XLOG_XACT_INVALIDATIONS for generating all the invalidation messages,
so whenever we get a new set of XLOG_XACT_INVALIDATIONS we directly
append it to txn->invalidations.  I have tested the XLOG_INVALIDATIONS
part, but while sending this mail I realized that we could write an
automated test for it.  I will work on that soon.

> 2.
> + /* not expected, but print something anyway */
> + else if (msg->id == SHAREDINVALSMGR_ID)
> + appendStringInfoString(buf, " smgr");
> + /* not expected, but print something anyway */
> + else if (msg->id == SHAREDINVALRELMAP_ID)
>
> I think the above comment is not valid after we started logging at CCI.

Yup, fixed.

> 3.
> +
> + xid = XLogRecGetXid(r);
> + invals = (xl_xact_invalidations *) XLogRecGetData(r);
> +
> + Assert(TransactionIdIsValid(xid));
> + ReorderBufferAddInvalidation(reorder, xid, buf->origptr,
> + invals->nmsgs, invals->msgs);
>
> Here, it should check !ctx->forward as we do in DecodeCommit, do we
> have any reason for not doing so.  We can test once by changing this.

Yeah, it should have this check.

This version mostly contains changes in 0002.  Apart from that, we
needed some changes in 0005 and 0006 to rebase them on 0002, and there
is one bug fix in 0005: txn->snapshot_now was not being set to NULL
after freeing, so it was getting freed twice.  I have also removed the
extra wait events from 0014, as BufFile already logs the wait event
internally, and made some changes because the BufFileWrite interface
changed in recent commits.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Dilip Kumar
On Tue, Jun 9, 2020 at 3:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 8, 2020 at 11:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I think one of the usages we still need is in ReorderBufferForget
> > because it can be called when we skip processing the txn.  See the
> > comments in DecodeCommit where we call this function.  If I am
> > correct, we need to probably collect all invalidations in
> > ReorderBufferTxn as we are collecting tuplecids and use them here.  We
> > can do the same during processing of XLOG_XACT_INVALIDATIONS.
> >
>
> One more point related to this is that after this patch series, we
> need to consider executing all invalidation during transaction abort.
> Because it is possible that due to memory overflow, we have processed
> some of the messages which also contain a few XACT_INVALIDATION
> messages, so to avoid cache pollution, we need to execute all of them
> in abort.  We also do the similar thing in Rollback/Rollback To
> Savepoint, see AtEOXact_Inval and AtEOSubXact_Inval.

Yes, we need to do that.  Now we are collecting all the invalidations
under txn->invalidations, so they get executed on abort.

>
> Few other comments on
> 0002-Issue-individual-invalidations-with-wal_level-lo.patch
> ---------------------------------------------------------------------------------------------------------------
> 1.
> + if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)
> + {
> + ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
> + MakeSharedInvalidMessagesArray);
> + invalMessages = SharedInvalidMessagesArray;
> + nmsgs  = numSharedInvalidMessagesArray;
> + SharedInvalidMessagesArray = NULL;
> + numSharedInvalidMessagesArray = 0;
>
> a. Immediately after ProcessInvalidationMessagesMulti, isn't it better
> to have an Assertion like Assert(!(numSharedInvalidMessagesArray > 0
> && SharedInvalidMessagesArray == NULL));?

Done

> b. Why check "if (transInvalInfo->CurrentCmdInvalidMsgs.cclist)" is
> required?  If you see xactGetCommittedInvalidationMessages where we do
> something similar, we only check for valid value of transInvalInfo and
> here we check the same in the caller of LogLogicalInvalidations, isn't
> that sufficient?  If that is sufficient, we can either have the same
> check here or have an Assert for the same.

I have put the same check here.

>
> 2.
> @@ -1092,6 +1101,9 @@ CommandEndInvalidationMessages(void)
>   if (transInvalInfo == NULL)
>   return;
>
> + if (XLogLogicalInfoActive())
> + LogLogicalInvalidations();
> +
>   ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
>   LocalExecuteInvalidationMessage);
> Generally, we WAL log the action after performing it but here you are
> writing WAL first.  Is there any specific reason?  If so, can we write
> a comment about the same?

Yeah, there is no reason for the same so moved it down.

>
> 3.
> + * When wal_level=logical, write invalidations into WAL at each command end to
> + * support the decoding of the in-progress transaction.  As of now it was
> + * enough to log invalidation only at commit because we are only decoding the
> + * transaction at the commit time.   We only need to log the catalog cache and
> + * relcache invalidation.  There can not be any active MVCC scan in logical
> + * decoding so we don't need to log the snapshot invalidation.
>
> I think this comment doesn't hold good after we have changed the patch
> to LOG invalidations at the time of CCI.

Right, modified.

>
> 4.
> +
> +/*
> + * Emit WAL for invalidations.
> + */
> +static void
> +LogLogicalInvalidations()
>
> Add the function name atop of this function in comments to match the
> style with other nearby functions.  How about modifying it to
> something like: "Emit WAL for invalidations.  This is currently only
> used for logging invalidations at the command end."

Done

>
> 5.
> + *
> + * XXX Do we need to care about relcacheInitFileInval and
> + * the other fields added to ReorderBufferChange, or just
> + * about the message itself?
> + */
>
> I don't think we need to do anything about relcacheInitFileInval.
> This is used to remove the stale files (RELCACHE_INIT_FILENAME) that
> have obsolete information about relcache.  The walsender process that
> is doing decoding doesn't require us to do anything about this.  Also,
> if you see before this patch, we don't do anything about relcache
> files during decoding of invalidation messages.  In short, I think we
> can remove this comment unless you see some use of it.

Now, we have removed the Invalidation change itself so this comment is gone.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Dilip Kumar
On Mon, Jun 15, 2020 at 6:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 15, 2020 at 9:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, Jun 12, 2020 at 4:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > > >  Basically, this part is still
> > > > I have to work upon, once we get the consensus then I can remove those
> > > > extra wait event from the patch.
> > > >
> > >
> > > Okay, feel free to send an updated patch with the above change.
> >
> > Sure, I will do that in the next patch set.
> >
>
> I have few more comments on the patch
> 0013-Change-buffile-interface-required-for-streaming-.patch:
>
> 1.
> - * temp_file_limit of the caller, are read-only and are automatically closed
> - * at the end of the transaction but are not deleted on close.
> + * temp_file_limit of the caller, are read-only if the flag is set and are
> + * automatically closed at the end of the transaction but are not deleted on
> + * close.
>   */
>  File
> -PathNameOpenTemporaryFile(const char *path)
> +PathNameOpenTemporaryFile(const char *path, int mode)
>
> No need to say "are read-only if the flag is set".  I don't see any
> flag passed to function so that part of the comment doesn't seem
> appropriate.

Done

> 2.
> @@ -68,7 +68,8 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
>   }
>
>   /* Register our cleanup callback. */
> - on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
> + if (seg)
> + on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
>  }
>
> Add comments atop function to explain when we don't want to register
> the dsm detach stuff?

Done.  I am also planning to work on a cleaner function for
on_proc_exit, as we discussed offlist; I will do that in the next
version.

> 3.
> + */
> + newFile = file->numFiles - 1;
> + newOffset = FileSize(file->files[file->numFiles - 1]);
>   break;
>
> FileSize can return negative lengths to indicate failure which we
> should handle.

Done

>  See other places in the code where FileSize is used?
> But I have another question here which is why we need to implement
> SEEK_END?  How other usages of BufFile interface takes care of this?
> I see an API BufFileTell which can give the current read/write
> location in the file, isn't that sufficient for your usage?  Also, how
> before BufFile usage is this thing handled in the patch?

So far we have never supported opening a BufFile in write mode; we only
create it in write mode.  So as long as we have created the file and it
is still open, we can always use BufFileTell, which gives the current
end location of the file.  But once we close it and reopen it, it is
always positioned to read from the start of the file, per the current
use cases.  We need a way to jump to the end of the last file so that
we can append to it.
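In other words, the pattern on reopening is roughly this (sketch; the O_RDWR mode for BufFileOpenShared and the SEEK_END support in BufFileSeek are the proposed extensions, not existing API):

    /* Reopen this xid's changes file and append to it. */
    stream_fd = BufFileOpenShared(ent->stream_fileset, path, O_RDWR);

    /* Jump to the end of the last physical file. */
    if (BufFileSeek(stream_fd, 0, 0, SEEK_END) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not seek to the end of streamed changes file \"%s\": %m",
                        path)));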

> 4.
> + /* Loop over all the  files upto the fileno which we want to truncate. */
> + for (i = file->numFiles - 1; i >= fileno; i--)
>
> "the  files", extra space in the above part of the comment.

Fixed

> 5.
> + /*
> + * Except the fileno,  we can directly delete other files.
>
> Before 'we', there is extra space.

Done.

> 6.
> + else
> + {
> + FileTruncate(file->files[i], offset, WAIT_EVENT_BUFFILE_READ);
> + newOffset = offset;
> + }
>
> The wait event passed here doesn't seem to be appropriate.  You might
> want to introduce a new wait event WAIT_EVENT_BUFFILE_TRUNCATE.  Also,
> the error handling for FileTruncate is missing.

Done

> 7.
> + if ((i != fileno || offset == 0) && fileno != 0)
> + {
> + SharedSegmentName(segment_name, file->name, i);
> + SharedFileSetDelete(file->fileset, segment_name, true);
> + newFile--;
> + newOffset = MAX_PHYSICAL_FILESIZE;
> + }
>
> Similar to the previous comment, I think we should handle the failure
> of SharedFileSetDelete.
>
> 8. I think the comments related to BufFile shared API usage need to be
> expanded in the code to explain the new usage.  For ex., see the below
> comments atop buffile.c
> * BufFile supports temporary files that can be made read-only and shared with
> * other backends, as infrastructure for parallel execution.  Such files need
> * to be created as a member of a SharedFileSet that all participants are
> * attached to.

Other fixes (raised offlist by my colleague Neha Sharma):
1. In BufFileTruncateShared, the files were not closed before being
deleted (in 0013).
2. In apply_handle_stream_commit, the file name in the debug message was
printed before the name was populated (0014).
3. On concurrent abort we were truncating all the changes, including
some incomplete ones, so later when the complete changes arrive we no
longer have the earlier part.  For example, if the last stream contained
a specinsert and we delete those changes on concurrent abort detection,
we will later get a spec_confirm without the spec insert.  We could have
simply avoided deleting all the changes, but I think the better fix is:
once we detect a concurrent abort for a transaction, there is no need to
collect any further changes for it, so we simply skip them.  I have put
that fix in 0006.
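The fix in (3) essentially boils down to a check like this in ReorderBufferQueueChange (sketch; 'concurrent_abort' is the flag the patch adds to ReorderBufferTXN, and the name may change):

    /*
     * While streaming earlier changes we already detected that this
     * transaction was concurrently aborted, so there is no point in
     * collecting any further changes for it.
     */
    if (txn->concurrent_abort)
    {
        ReorderBufferReturnChange(rb, change);
        return;
    }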

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Dilip Kumar
On Tue, Jun 16, 2020 at 2:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 15, 2020 at 6:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I have few more comments on the patch
> > 0013-Change-buffile-interface-required-for-streaming-.patch:
> >
>
> Review comments on 0014-Worker-tempfile-use-the-shared-buffile-infrastru:
> 1.
> The subxact file is only create if there
> + * are any suxact info under this xid.
> + */
> +typedef struct StreamXidHash
>
> Lets slightly reword the part of the comment as "The subxact file is
> created iff there is any suxact info under this xid."

Done

>
> 2.
> @@ -710,6 +740,9 @@ apply_handle_stream_stop(StringInfo s)
>   subxact_info_write(MyLogicalRepWorker->subid, stream_xid);
>   stream_close_file();
>
> + /* Commit the per-stream transaction */
> + CommitTransactionCommand();
>
> Before calling commit, ensure that we are in a valid transaction.  I
> think we can have an Assert for IsTransactionState().

Done

> 3.
> @@ -761,11 +791,13 @@ apply_handle_stream_abort(StringInfo s)
>
>   int64 i;
>   int64 subidx;
> - int fd;
> + BufFile    *fd;
>   bool found = false;
>   char path[MAXPGPATH];
> + StreamXidHash *ent;
>
>   subidx = -1;
> + ensure_transaction();
>   subxact_info_read(MyLogicalRepWorker->subid, xid);
>
> Why to call ensure_transaction here?  Is there any reason that we
> won't have a valid transaction by now?  If not, then its better to
> have an Assert for IsTransactionState().

We only start a transaction from stream_start to stream_stop, so at
stream_abort we will not have a transaction open.

> 4.
> - if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
> + if (BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
>   {
> - int save_errno = errno;
> + int save_errno = errno;
>
> - CloseTransientFile(fd);
> + BufFileClose(fd);
>
> On error, won't these files be close automatically?  If so, why at
> this place and before other errors, we need to close this?

Yes, that's correct.  I have fixed those.

> 5.
> if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len))
> {
> int save_errno = errno;
>
> BufFileClose(fd);
> errno = save_errno;
> ereport(ERROR,
> (errcode_for_file_access(),
> errmsg("could not read file \"%s\": %m",
>
> Can we change the error message to "could not read from streaming
> transactions file .." or something like that and similarly we can
> change the message for failure in reading changes file?

Done


> 6.
> if (BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts))
> {
> int save_errno = errno;
>
> BufFileClose(fd);
> errno = save_errno;
> ereport(ERROR,
> (errcode_for_file_access(),
> errmsg("could not write to file \"%s\": %m",
>
> Similar to previous, can we change it to "could not write to streaming
> transactions file

BufFileWrite is not returning failure anymore.

> 7.
> @@ -2855,17 +2844,32 @@ stream_open_file(Oid subid, TransactionId xid,
> bool first_segment)
>   * for writing, in append mode.
>   */
>   if (first_segment)
> - flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
> - else
> - flags = (O_WRONLY | O_APPEND | PG_BINARY);
> + {
> + /*
> + * Shared fileset handle must be allocated in the persistent context.
> + */
> + SharedFileSet *fileset =
> + MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
>
> - stream_fd = OpenTransientFile(path, flags);
> + PrepareTempTablespaces();
> + SharedFileSetInit(fileset, NULL);
>
> Why are we calling PrepareTempTablespaces here? It is already called
> in SharedFileSetInit.

My bad.  First I tried using SharedFileSetInit directly, but later that
got changed and I forgot to remove this part.

> 8.
> + /*
> + * Start a transaction on stream start, this transaction will be committed
> + * on the stream stop.  We need the transaction for handling the buffile,
> + * used for serializing the streaming data and subxact info.
> + */
> + ensure_transaction();
>
> I think we need this for PrepareTempTablespaces to set the
> temptablespaces.  Also, isn't it required for a cleanup of buffile
> resources at the transaction end?  Are there any other reasons for it
> as well?  The comment should be a bit more clear for why we need a
> transaction here.

I am not sure it makes sense to add a comment here about why buffile
and sharedfileset need a transaction.  Do you think we should instead
add a comment to the buffile/sharedfileset API saying that it must be
called inside a transaction?

> 9.
> * Open a file for streamed changes from a toplevel transaction identified
>  * by stream_xid (global variable). If it's the first chunk of streamed
>  * changes for this transaction, perform cleanup by removing existing
>  * files after a possible previous crash.
> ..
> stream_open_file(Oid subid, TransactionId xid, bool first_segment)
>
> The above part comment atop stream_open_file needs to be changed after
> new implementation.

Done

> 10.
>  * enabled.  This context is reeset on each stream stop.
> */
> LogicalStreamingContext = AllocSetContextCreate(ApplyContext,
>
> /reeset/reset

Done


> 11.
> stream_cleanup_files(Oid subid, TransactionId xid, bool missing_ok)
> {
> ..
> + /* No entry created for this xid so simply return. */
> + if (ent == NULL)
> + return;
> ..
> }
>
> Is there any reason or scenario where this ent can be NULL?  If not,
> it will be better to have an Assert for the same.

Right, it should be an Assert; even if all the changes for the top
transaction are ignored, we should still have sent the stream_start.
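i.e. (sketch):

    /* The stream must have been started, so an entry has to exist. */
    ent = (StreamXidHash *) hash_search(xidhash, &xid, HASH_FIND, NULL);
    Assert(ent != NULL);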

> 12.
> subxact_info_write(Oid subid, TransactionId xid)
> {
> ..
> + /*
> + * If there is no subtransaction then nothing to do,  but if already have
> + * subxact file then delete that.
> + */
> + if (nsubxacts == 0)
>   {
> - ereport(ERROR,
> - (errcode_for_file_access(),
> - errmsg("could not create file \"%s\": %m",
> - path)));
> + if (ent->subxact_fileset)
> + {
> + cleanup_subxact_info();
> + BufFileDeleteShared(ent->subxact_fileset, path);
> + ent->subxact_fileset = NULL;
> ..
> }
>
> Here don't we need to free the subxact_fileset before setting it to NULL?

Yes, done

> 13.
> + /*
> + * Scan complete hash and delete the underlying files for the the xids.
> + * Also delete the memory for the shared file sets.
> + */
>
> /the the/the.  Instead of "delete the memory", it would be better to
> say "release the memory".

Done

>
> 14.
> + /*
> + * We might not have created the suxact fileset if there is no sub
> + * transaction.
> + */
>
> /suxact/subxact
Done




--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Amit Kapila
On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Yes, I have made the changes.  Basically, now I am only using the
> XLOG_XACT_INVALIDATIONS for generating all the invalidation messages.
> So whenever we are getting the new set of XLOG_XACT_INVALIDATIONS, we
> are directly appending it to the txn->invalidations.  I have tested
> the XLOG_INVALIDATIONS part but while sending this mail I realized
> that we could write some automated test for the same.
>

Can you share how you have tested it?

>  I will work on
> that soon.
>

Cool, I think having a regression test for this will be a good idea.

@@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb,
TransactionId xid, XLogRecPtr lsn)
  if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
  ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
     txn->invalidations);
- else
- Assert(txn->ninvalidations == 0);

Why is this Assert removed?

Apart from above, I have made a number of changes in
0002-WAL-Log-invalidations-at-command-end-with-wal_le to remove some
unnecessary changes, edited comments, ran pgindent and updated the
commit message.  If you are fine with these changes, then do include
them in your next version.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Amit Kapila
On Mon, Jun 22, 2020 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Yes, I have made the changes.  Basically, now I am only using the
> > XLOG_XACT_INVALIDATIONS for generating all the invalidation messages.
> > So whenever we are getting the new set of XLOG_XACT_INVALIDATIONS, we
> > are directly appending it to the txn->invalidations.  I have tested
> > the XLOG_INVALIDATIONS part but while sending this mail I realized
> > that we could write some automated test for the same.
> >
>
> Can you share how you have tested it?
>
> >  I will work on
> > that soon.
> >
>
> Cool, I think having a regression test for this will be a good idea.
>

Other than above tests, can we somehow verify that the invalidations
generated at commit time are the same as what we do with this patch?
We have verified with individual commands but it would be great if we
can verify for the regression tests.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Dilip Kumar
On Mon, Jun 22, 2020 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Yes, I have made the changes.  Basically, now I am only using the
> > XLOG_XACT_INVALIDATIONS for generating all the invalidation messages.
> > So whenever we are getting the new set of XLOG_XACT_INVALIDATIONS, we
> > are directly appending it to the txn->invalidations.  I have tested
> > the XLOG_INVALIDATIONS part but while sending this mail I realized
> > that we could write some automated test for the same.
> >
>
> Can you share how you have tested it?

I just ran create index concurrently and decoded the changes.

> >  I will work on
> > that soon.
> >
>
> Cool, I think having a regression test for this will be a good idea.

ok

> @@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb,
> TransactionId xid, XLogRecPtr lsn)
>   if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
>   ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
>      txn->invalidations);
> - else
> - Assert(txn->ninvalidations == 0);
>
> Why this Assert is removed?

Even if base_snapshot is NULL, we now collect txn->invalidations.
However, we haven't done any activity for that transaction, so we don't
need to execute the invalidations (same as the code before), but the
Assert is no longer valid.

> Apart from above, I have made a number of changes in
> 0002-WAL-Log-invalidations-at-command-end-with-wal_le to remove some
> unnecessary changes, edited comments, ran pgindent and updated the
> commit message.  If you are fine with these changes, then do include
> them in your next version.

Thanks, I will check those.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From: Amit Kapila
On Mon, Jun 22, 2020 at 4:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jun 22, 2020 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > Yes, I have made the changes.  Basically, now I am only using the
> > > XLOG_XACT_INVALIDATIONS for generating all the invalidation messages.
> > > So whenever we are getting the new set of XLOG_XACT_INVALIDATIONS, we
> > > are directly appending it to the txn->invalidations.  I have tested
> > > the XLOG_INVALIDATIONS part but while sending this mail I realized
> > > that we could write some automated test for the same.
> > >
> >
> > Can you share how you have tested it?
>
> I just ran create index concurrently and decoded the changes.
>

Hmm, I think that won't reproduce the exact problem.  What I wanted was
to run another command after "create index concurrently" that depends on
it, and see whether decoding fails once the XLOG_INVALIDATIONS code is
removed.  Once you get a failure, you can apply the 0002 patch and see
if the test passes.

>
> > @@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb,
> > TransactionId xid, XLogRecPtr lsn)
> >   if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
> >   ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
> >      txn->invalidations);
> > - else
> > - Assert(txn->ninvalidations == 0);
> >
> > Why this Assert is removed?
>
> Even if the base_snapshot is NULL, now we are collecting the
> txn->invalidation.
>

But there doesn't seem to be any check, even before this patch, that
directly prohibits accumulating invalidations in DecodeCommit.  We
have a check for base_snapshot in ReorderBufferCommit.  Did you get
any failure with that check?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Jun 22, 2020 at 5:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 22, 2020 at 4:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jun 22, 2020 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > Yes, I have made the changes.  Basically, now I am only using the
> > > > XLOG_XACT_INVALIDATIONS for generating all the invalidation messages.
> > > > So whenever we are getting the new set of XLOG_XACT_INVALIDATIONS, we
> > > > are directly appending it to the txn->invalidations.  I have tested
> > > > the XLOG_INVALIDATIONS part but while sending this mail I realized
> > > > that we could write some automated test for the same.
> > > >
> > >
> > > Can you share how you have tested it?
> >
> > I just ran create index concurrently and decoded the changes.
> >
>
> Hmm, I think that won't reproduce the exact problem.  What I wanted
> was to run another command after "create index concurrently" which
> depends on that and see if the decoding fails by removing the
> XLOG_INVALIDATIONS code.  Once you get some failure, you can apply the
> 0002 patch and see if the test is passed?

Okay, I will test that.

> >
> > > @@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb,
> > > TransactionId xid, XLogRecPtr lsn)
> > >   if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
> > >   ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
> > >      txn->invalidations);
> > > - else
> > > - Assert(txn->ninvalidations == 0);
> > >
> > > Why this Assert is removed?
> >
> > Even if the base_snapshot is NULL, now we are collecting the
> > txn->invalidation.
> >
>
> But there doesn't seem to be any check even before this patch which
> directly prohibits accumulating invalidations in DecodeCommit.  We
> have check for base_snapshot in ReorderBufferCommit.  Did you get any
> failure with that check?

Because earlier, ReorderBufferForget for the top transaction would be
called if the top transaction was aborted, and in the abort case we
don't log any invalidations, so that count would be 0.  However, the
same is not true now.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Jun 22, 2020 at 6:38 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jun 22, 2020 at 5:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > > @@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb,
> > > > TransactionId xid, XLogRecPtr lsn)
> > > >   if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
> > > >   ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
> > > >      txn->invalidations);
> > > > - else
> > > > - Assert(txn->ninvalidations == 0);
> > > >
> > > > Why this Assert is removed?
> > >
> > > Even if the base_snapshot is NULL, now we are collecting the
> > > txn->invalidation.
> > >
> >
> > But there doesn't seem to be any check even before this patch which
> > directly prohibits accumulating invalidations in DecodeCommit.  We
> > have check for base_snapshot in ReorderBufferCommit.  Did you get any
> > failure with that check?
>
> Because earlier ReorderBufferForget for toptxn will be called if the
> top transaction is aborted and in abort case, we are not logging any
> invalidation so that will be 0.  However same is not true now.
>

AFAICS, ReorderBufferForget() is called (via DecodeCommit) only when
we need to skip the transaction.  It doesn't seem to be called from
Abort path (DecodeAbort/ReorderBufferAbort doesn't use
ReorderBufferForget).  I am not sure which code path are you referring
here, can you please share the code flow which you are referring to
here.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, Jun 23, 2020 at 8:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 22, 2020 at 6:38 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jun 22, 2020 at 5:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > > > @@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb,
> > > > > TransactionId xid, XLogRecPtr lsn)
> > > > >   if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
> > > > >   ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
> > > > >      txn->invalidations);
> > > > > - else
> > > > > - Assert(txn->ninvalidations == 0);
> > > > >
> > > > > Why this Assert is removed?
> > > >
> > > > Even if the base_snapshot is NULL, now we are collecting the
> > > > txn->invalidation.
> > > >
> > >
> > > But there doesn't seem to be any check even before this patch which
> > > directly prohibits accumulating invalidations in DecodeCommit.  We
> > > have check for base_snapshot in ReorderBufferCommit.  Did you get any
> > > failure with that check?
> >
> > Because earlier ReorderBufferForget for toptxn will be called if the
> > top transaction is aborted and in abort case, we are not logging any
> > invalidation so that will be 0.  However same is not true now.
> >
>
> AFAICS, ReorderBufferForget() is called (via DecodeCommit) only when
> we need to skip the transaction.  It doesn't seem to be called from
> Abort path (DecodeAbort/ReorderBufferAbort doesn't use
> ReorderBufferForget).  I am not sure which code path are you referring
> here, can you please share the code flow which you are referring to
> here.

I think you are right.  During some intermediate code change it
crashed on that assert (I guess I might have been adding invalidations
to the sub-transaction, but I am not sure what that state was), and I
assumed that was the reason, as I explained above; now I see my
assumption was wrong.  I will put back that assert.  In testing I
could not hit that assert even after my changes, but I will give it
more thought in case our situation is different from the base code.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, Jun 23, 2020 at 10:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jun 23, 2020 at 8:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jun 22, 2020 at 6:38 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Mon, Jun 22, 2020 at 5:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > > > @@ -2012,8 +2014,6 @@ ReorderBufferForget(ReorderBuffer *rb,
> > > > > > TransactionId xid, XLogRecPtr lsn)
> > > > > >   if (txn->base_snapshot != NULL && txn->ninvalidations > 0)
> > > > > >   ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
> > > > > >      txn->invalidations);
> > > > > > - else
> > > > > > - Assert(txn->ninvalidations == 0);
> > > > > >
> > > > > > Why this Assert is removed?
> > > > >
> > > > > Even if the base_snapshot is NULL, now we are collecting the
> > > > > txn->invalidation.
> > > > >
> > > >
> > > > But there doesn't seem to be any check even before this patch which
> > > > directly prohibits accumulating invalidations in DecodeCommit.  We
> > > > have check for base_snapshot in ReorderBufferCommit.  Did you get any
> > > > failure with that check?
> > >
> > > Because earlier ReorderBufferForget for toptxn will be called if the
> > > top transaction is aborted and in abort case, we are not logging any
> > > invalidation so that will be 0.  However same is not true now.
> > >
> >
> > AFAICS, ReorderBufferForget() is called (via DecodeCommit) only when
> > we need to skip the transaction.  It doesn't seem to be called from
> > Abort path (DecodeAbort/ReorderBufferAbort doesn't use
> > ReorderBufferForget).  I am not sure which code path are you referring
> > here, can you please share the code flow which you are referring to
> > here.
>
> I think you are right,  during some intermediate code change, it
> crashed on that assert (I guess I might be adding invalidation to the
> sub-transaction but not sure what was that state) and I assumed that
> is the reason that I explained above but, now I see my assumption was
> wrong.  I will put back that assert.  By testing, I could not hit any
> case where we hit that assert even after my changes, still I will put
> more thought if by any chance our case is different then the base
> code.

Here is a POC patch to discuss the idea of cleaning up shared filesets
on proc exit.  As discussed offlist, I am maintaining a list of shared
filesets.  The first time, when the list is NULL, I register the
cleanup function with the on_proc_exit routine; for every subsequent
fileset I just append it to filesetlist.  There is also an interface
to unregister a shared fileset from the cleanup list, and that is done
by the caller whenever it deletes the shared fileset manually.  While
explaining it here, I noticed one possible issue: if we delete all the
elements, the list becomes NULL again, and on the next
SharedFileSetInit we will register the function once more.  Maybe that
is not a problem, but we could avoid registering multiple times by
using some flag in the file.
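
To make that concrete, here is a rough sketch of the intended flow (a
sketch only; the name SharedFileSetRegister is illustrative, the POC
does this inline in SharedFileSetInit, and memory-context details are
ignored here):

#include "postgres.h"

#include "nodes/pg_list.h"
#include "storage/ipc.h"
#include "storage/sharedfileset.h"

/* Process-local list of filesets created without a DSM segment. */
static List *filesetlist = NIL;

/* on_proc_exit callback: delete files of every still-registered fileset. */
static void
SharedFileSetOnProcExit(int status, Datum arg)
{
    ListCell   *l;

    foreach(l, filesetlist)
        SharedFileSetDeleteAll((SharedFileSet *) lfirst(l));

    filesetlist = NIL;
}

/* Called while initializing a fileset that has no DSM segment. */
static void
SharedFileSetRegister(SharedFileSet *fileset)
{
    /* The first registration also installs the proc-exit callback. */
    if (filesetlist == NIL)
        on_proc_exit(SharedFileSetOnProcExit, 0);

    filesetlist = lcons(fileset, filesetlist);
}

/* Called when the caller deletes the fileset manually. */
static void
SharedFileSetUnregister(SharedFileSet *fileset)
{
    filesetlist = list_delete_ptr(filesetlist, fileset);
}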

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Here is the POC patch to discuss the idea of a cleanup of shared
> fileset on proc exit.  As discussed offlist,  here I am maintaining
> the list of shared fileset.  First time when the list is NULL I am
> registering the cleanup function with on_proc_exit routine.  After
> that for subsequent fileset, I am just appending it to filesetlist.
> There is also an interface to unregister the shared file set from the
> cleanup list and that is done by the caller whenever we are deleting
> the shared fileset manually.  While explaining it here, I think there
> could be one issue if we delete all the element from the list will
> become NULL and on next SharedFileSetInit we will again register the
> function.  Maybe that is not a problem but we can avoid registering
> multiple times by using some flag in the file
>

I don't understand what you mean by "using some flag in the file".

Review comments on various patches.

poc_shared_fileset_cleanup_on_procexit
=================================
1.
- ent->subxact_fileset =
- MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
+ MemoryContext oldctx;

+ /* Shared fileset handle must be allocated in the persistent context */
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+ ent->subxact_fileset = palloc(sizeof(SharedFileSet));
  SharedFileSetInit(ent->subxact_fileset, NULL);
+ MemoryContextSwitchTo(oldctx);
  fd = BufFileCreateShared(ent->subxact_fileset, path);

Why is this change required for this patch and why we only cover
SharedFileSetInit in the Apply context and not BufFileCreateShared?
The comment is also not very clear on this point.

2.
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
+ bool found = false;
+ ListCell *l;
+
+ Assert(filesetlist != NULL);
+
+ /* Loop over all the pending shared fileset entry */
+ foreach (l, filesetlist)
+ {
+ SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+
+ /* remove the entry from the list and delete the underlying files */
+ if (input_fileset->number == fileset->number)
+ {
+ SharedFileSetDeleteAll(fileset);
+ filesetlist = list_delete_cell(filesetlist, l);

Why are we calling SharedFileSetDeleteAll here when in the caller we
have already deleted the fileset as per below code?
BufFileDeleteShared(ent->stream_fileset, path);
+ SharedFileSetUnregister(ent->stream_fileset);

I think it will be good if somehow we can remove the fileset from
filesetlist during BufFileDeleteShared.  If that is possible, then we
don't need a separate API for SharedFileSetUnregister.

3.
+static List * filesetlist = NULL;
+
 static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
+static void SharedFileSetOnProcExit(int status, Datum arg);
 static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid
tablespace);
 static void SharedFilePath(char *path, SharedFileSet *fileset, const
char *name);
 static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name);
@@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
  /* Register our cleanup callback. */
  if (seg)
  on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+ else
+ {
+ if (filesetlist == NULL)
+ on_proc_exit(SharedFileSetOnProcExit, 0);

We use NIL for list initialization and comparison.  See lock_files usage.

4.
+SharedFileSetOnProcExit(int status, Datum arg)
+{
+ ListCell *l;
+
+ /* Loop over all the pending  shared fileset entry */
+ foreach (l, filesetlist)
+ {
+ SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
+ SharedFileSetDeleteAll(fileset);
+ }

We can initialize filesetlist as NIL after the for loop as it will
make the code look clean.

Comments on other patches:
=========================
5.
> 3. On concurrent abort we are truncating all the changes including
> some incomplete changes,  so later when we get the complete changes we
> don't have the previous changes,  e.g, if we had specinsert in the
> last stream and due to concurrent abort detection if we delete that
> changes later we will get spec_confirm without spec insert.  We could
> have simply avoided deleting all the changes, but I think the better
> fix is once we detect the concurrent abort for any transaction, then
> why do we need to collect the changes for that, we can simply avoid
> that.  So I have put that fix. (0006)
>

On similar lines, I think we need to skip processing message, see else
part of code in ReorderBufferQueueMessage.

6.
In v29-0002-Issue-individual-invalidations-with-wal_level-lo,
xact_desc_invalidations seems to be a subset of
standby_desc_invalidations, can we have a common code for them?

7.
I think we can avoid sending v29-0007-Track-statistics-for-streaming
each time.  We can do this after the main patch is complete.
Also, we might need to change how and where these stats will be
tracked.  See the related discussion [1].

8. In v29-0005-Implement-streaming-mode-in-ReorderBuffer,
  * Return oldest transaction in reorderbuffer
@@ -863,6 +909,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb,
TransactionId xid,
  /* set the reference to top-level transaction */
  subtxn->toptxn = txn;

+ /* set the reference to toplevel transaction */
+ subtxn->toptxn = txn;
+

There is a double initialization of subtxn->toptxn.  You need to
remove this line from 0005 patch as we have now added it in an earlier
patch.

9.  I think you forgot to update the patch to execute invalidations in
Abort case or I might be missing something.  I don't see any changes
in ReorderBufferAbort. You have agreed in one of the emails above [2]
about handling the same.

10. In v29-0008-Add-support-for-streaming-to-built-in-replicatio,
 apply_handle_stream_commit(StringInfo s)
 {
 ..
 + /*
 + * send feedback to upstream
 + *
 + * XXX Probably should send a valid LSN. But which one?
 + */
 + send_feedback(InvalidXLogRecPtr, false, false);
 ..
 }

I have given a comment on this code that we don't need this feedback
and you mentioned on June 02 [3] that you will think on it and let me
know your opinion but I don't see a response from you yet.  Can you
get back to me regarding this point?

11. Add some comments as to why we have used Shared BufFile interface
instead of Temp BufFile interface?

12. In v29-0013-Change-buffile-interface-required-for-streaming,
+ * Initialize a space for temporary files that can be opened other backends.

/opened other backends/opened for access by other backends

[1] - https://www.postgresql.org/message-id/CA%2Bfd4k5_pPAYRTDrO2PbtTOe0eHQpBvuqmCr8ic39uTNmR49Eg%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAFiTN-t7WZZjFrAjSYj4fu%3DFZ2JKENN8ZHCUZaw-srnrHMWMrg%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAFiTN-tHpd%2BzXVemo9WqQUJS50p9m8jD%3DAWjsugKZQ4F-K8Pbw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Jun 22, 2020 at 11:56 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jun 16, 2020 at 2:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > 8.
> > + /*
> > + * Start a transaction on stream start, this transaction will be committed
> > + * on the stream stop.  We need the transaction for handling the buffile,
> > + * used for serializing the streaming data and subxact info.
> > + */
> > + ensure_transaction();
> >
> > I think we need this for PrepareTempTablespaces to set the
> > temptablespaces.  Also, isn't it required for a cleanup of buffile
> > resources at the transaction end?  Are there any other reasons for it
> > as well?  The comment should be a bit more clear for why we need a
> > transaction here.
>
> I am not sure that will it make sense to add a comment here that why
> buffile and sharedfileset need a transaction?
>

You can say usage of BufFile interface expects us to be in the
transaction for so and so reason....

> Do you think that we
> should add comment in buffile/shared fileset API that it should be
> called under a transaction?
>

I am fine with that as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Here is the POC patch to discuss the idea of a cleanup of shared
> > fileset on proc exit.  As discussed offlist,  here I am maintaining
> > the list of shared fileset.  First time when the list is NULL I am
> > registering the cleanup function with on_proc_exit routine.  After
> > that for subsequent fileset, I am just appending it to filesetlist.
> > There is also an interface to unregister the shared file set from the
> > cleanup list and that is done by the caller whenever we are deleting
> > the shared fileset manually.  While explaining it here, I think there
> > could be one issue if we delete all the element from the list will
> > become NULL and on next SharedFileSetInit we will again register the
> > function.  Maybe that is not a problem but we can avoid registering
> > multiple times by using some flag in the file
> >
>
> I don't understand what you mean by "using some flag in the file".

Basically, in the POC, as shown in the code snippet below, we register
the on_proc_exit function only if "filesetlist" is NULL.  But, as
described above, if all the items are deleted the list becomes NULL
again.  So what I meant is that instead of checking whether
filesetlist is NULL, we could keep a boolean variable recording
whether we have already registered the callback, and if so not do it
again.

@@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
  /* Register our cleanup callback. */
  if (seg)
  on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+ else
+ {
+ if (filesetlist == NULL)
+ on_proc_exit(SharedFileSetOnProcExit, 0);
+
+ filesetlist = lcons((void *)fileset, filesetlist);
+ }
 }
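
In other words, roughly this (a sketch of the boolean-flag
alternative; registered_cleanup is just an illustrative name):

static List *filesetlist = NIL;
static bool registered_cleanup = false;

/* in SharedFileSetInit(): */

  /* Register our cleanup callback. */
  if (seg)
      on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
  else
  {
      /*
       * Register the proc-exit callback only once per backend, even if
       * filesetlist later becomes empty and is repopulated.
       */
      if (!registered_cleanup)
      {
          on_proc_exit(SharedFileSetOnProcExit, 0);
          registered_cleanup = true;
      }

      filesetlist = lcons((void *) fileset, filesetlist);
  }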

>
> Review comments on various patches.
>
> poc_shared_fileset_cleanup_on_procexit
> =================================
> 1.
> - ent->subxact_fileset =
> - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
> + MemoryContext oldctx;
>
> + /* Shared fileset handle must be allocated in the persistent context */
> + oldctx = MemoryContextSwitchTo(ApplyContext);
> + ent->subxact_fileset = palloc(sizeof(SharedFileSet));
>   SharedFileSetInit(ent->subxact_fileset, NULL);
> + MemoryContextSwitchTo(oldctx);
>   fd = BufFileCreateShared(ent->subxact_fileset, path);
>
> Why is this change required for this patch and why we only cover
> SharedFileSetInit in the Apply context and not BufFileCreateShared?
> The comment is also not very clear on this point.

Because only the sharedfileset, and the filesetlist cell allocated
under SharedFileSetInit, need to live in the permanent context.
BufFileCreateShared only creates the BufFile and the VFD, which are
needed only within the current stream, so the transaction context is
enough for those.

> 2.
> +void
> +SharedFileSetUnregister(SharedFileSet *input_fileset)
> +{
> + bool found = false;
> + ListCell *l;
> +
> + Assert(filesetlist != NULL);
> +
> + /* Loop over all the pending shared fileset entry */
> + foreach (l, filesetlist)
> + {
> + SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
> +
> + /* remove the entry from the list and delete the underlying files */
> + if (input_fileset->number == fileset->number)
> + {
> + SharedFileSetDeleteAll(fileset);
> + filesetlist = list_delete_cell(filesetlist, l);
>
> Why are we calling SharedFileSetDeleteAll here when in the caller we
> have already deleted the fileset as per below code?
> BufFileDeleteShared(ent->stream_fileset, path);
> + SharedFileSetUnregister(ent->stream_fileset);
>
> I think it will be good if somehow we can remove the fileset from
> filesetlist during BufFileDeleteShared.  If that is possible, then we
> don't need a separate API for SharedFileSetUnregister.

But the filesetlist is maintained at the sharedfileset level, so even
if we delete the files in BufFileDeleteShared, we still need to call
an API from the sharedfileset layer to unregister the fileset.  Am I
missing something?

> 3.
> +static List * filesetlist = NULL;
> +
>  static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
> +static void SharedFileSetOnProcExit(int status, Datum arg);
>  static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid
> tablespace);
>  static void SharedFilePath(char *path, SharedFileSet *fileset, const
> char *name);
>  static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name);
> @@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
>   /* Register our cleanup callback. */
>   if (seg)
>   on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
> + else
> + {
> + if (filesetlist == NULL)
> + on_proc_exit(SharedFileSetOnProcExit, 0);
>
> We use NIL for list initialization and comparison.  See lock_files usage.

Right.

> 4.
> +SharedFileSetOnProcExit(int status, Datum arg)
> +{
> + ListCell *l;
> +
> + /* Loop over all the pending  shared fileset entry */
> + foreach (l, filesetlist)
> + {
> + SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
> + SharedFileSetDeleteAll(fileset);
> + }
>
> We can initialize filesetlist as NIL after the for loop as it will
> make the code look clean.

ok

Thanks for your feedback on this.  I will reply to other comments separately.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Wed, Jun 24, 2020 at 4:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > Here is the POC patch to discuss the idea of a cleanup of shared
> > > fileset on proc exit.  As discussed offlist,  here I am maintaining
> > > the list of shared fileset.  First time when the list is NULL I am
> > > registering the cleanup function with on_proc_exit routine.  After
> > > that for subsequent fileset, I am just appending it to filesetlist.
> > > There is also an interface to unregister the shared file set from the
> > > cleanup list and that is done by the caller whenever we are deleting
> > > the shared fileset manually.  While explaining it here, I think there
> > > could be one issue if we delete all the element from the list will
> > > become NULL and on next SharedFileSetInit we will again register the
> > > function.  Maybe that is not a problem but we can avoid registering
> > > multiple times by using some flag in the file
> > >
> >
> > I don't understand what you mean by "using some flag in the file".
>
> Basically, in POC as shown in below code snippet,  We are checking
> that if the "filesetlist" is NULL then only register the on_proc_exit
> function.  But, as described above if all the items are deleted the
> list will be NULL.  So I told that instead of checking the filesetlist
> is NULL,  we can have just a boolean variable that if we have
> registered the callback then don't do it again.
>

Can you check whether there is any precedent for this in the code?

>
> >
> > Review comments on various patches.
> >
> > poc_shared_fileset_cleanup_on_procexit
> > =================================
> > 1.
> > - ent->subxact_fileset =
> > - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
> > + MemoryContext oldctx;
> >
> > + /* Shared fileset handle must be allocated in the persistent context */
> > + oldctx = MemoryContextSwitchTo(ApplyContext);
> > + ent->subxact_fileset = palloc(sizeof(SharedFileSet));
> >   SharedFileSetInit(ent->subxact_fileset, NULL);
> > + MemoryContextSwitchTo(oldctx);
> >   fd = BufFileCreateShared(ent->subxact_fileset, path);
> >
> > Why is this change required for this patch and why we only cover
> > SharedFileSetInit in the Apply context and not BufFileCreateShared?
> > The comment is also not very clear on this point.
>
> Because only the sharedfileset and the filesetlist which is allocated
> under SharedFileSetInit, are required in the permanent context.
> BufFileCreateShared, only creates the Buffile and VFD which will be
> required only within the current stream so transaction context is
> enough.
>

Okay, then add some more comments to explain it or if you have
explained it elsewhere, then add a reference for the same.

> > 2.
> > +void
> > +SharedFileSetUnregister(SharedFileSet *input_fileset)
> > +{
> > + bool found = false;
> > + ListCell *l;
> > +
> > + Assert(filesetlist != NULL);
> > +
> > + /* Loop over all the pending shared fileset entry */
> > + foreach (l, filesetlist)
> > + {
> > + SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
> > +
> > + /* remove the entry from the list and delete the underlying files */
> > + if (input_fileset->number == fileset->number)
> > + {
> > + SharedFileSetDeleteAll(fileset);
> > + filesetlist = list_delete_cell(filesetlist, l);
> >
> > Why are we calling SharedFileSetDeleteAll here when in the caller we
> > have already deleted the fileset as per below code?
> > BufFileDeleteShared(ent->stream_fileset, path);
> > + SharedFileSetUnregister(ent->stream_fileset);
> >
> > I think it will be good if somehow we can remove the fileset from
> > filesetlist during BufFileDeleteShared.  If that is possible, then we
> > don't need a separate API for SharedFileSetUnregister.
>
> But the filesetlist is maintained at the sharedfileset level, so even
> if we delete from BufFileDeleteShared, we need to call an API from the
> sharedfileset layer to unregister the fileset.
>

Sure, but isn't it better if we can call such an API from BufFileDeleteShared?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Here is the POC patch to discuss the idea of a cleanup of shared
> > fileset on proc exit.  As discussed offlist,  here I am maintaining
> > the list of shared fileset.  First time when the list is NULL I am
> > registering the cleanup function with on_proc_exit routine.  After
> > that for subsequent fileset, I am just appending it to filesetlist.
> > There is also an interface to unregister the shared file set from the
> > cleanup list and that is done by the caller whenever we are deleting
> > the shared fileset manually.  While explaining it here, I think there
> > could be one issue if we delete all the element from the list will
> > become NULL and on next SharedFileSetInit we will again register the
> > function.  Maybe that is not a problem but we can avoid registering
> > multiple times by using some flag in the file
> >
>
> I don't understand what you mean by "using some flag in the file".
>
> Review comments on various patches.
>
> poc_shared_fileset_cleanup_on_procexit
> =================================
> 1.
> - ent->subxact_fileset =
> - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
> + MemoryContext oldctx;
>
> + /* Shared fileset handle must be allocated in the persistent context */
> + oldctx = MemoryContextSwitchTo(ApplyContext);
> + ent->subxact_fileset = palloc(sizeof(SharedFileSet));
>   SharedFileSetInit(ent->subxact_fileset, NULL);
> + MemoryContextSwitchTo(oldctx);
>   fd = BufFileCreateShared(ent->subxact_fileset, path);
>
> Why is this change required for this patch and why we only cover
> SharedFileSetInit in the Apply context and not BufFileCreateShared?
> The comment is also not very clear on this point.

Added the comments for the same.

> 2.
> +void
> +SharedFileSetUnregister(SharedFileSet *input_fileset)
> +{
> + bool found = false;
> + ListCell *l;
> +
> + Assert(filesetlist != NULL);
> +
> + /* Loop over all the pending shared fileset entry */
> + foreach (l, filesetlist)
> + {
> + SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
> +
> + /* remove the entry from the list and delete the underlying files */
> + if (input_fileset->number == fileset->number)
> + {
> + SharedFileSetDeleteAll(fileset);
> + filesetlist = list_delete_cell(filesetlist, l);
>
> Why are we calling SharedFileSetDeleteAll here when in the caller we
> have already deleted the fileset as per below code?
> BufFileDeleteShared(ent->stream_fileset, path);
> + SharedFileSetUnregister(ent->stream_fileset);

That's wrong; I have removed it.


> I think it will be good if somehow we can remove the fileset from
> filesetlist during BufFileDeleteShared.  If that is possible, then we
> don't need a separate API for SharedFileSetUnregister.

I have done it as discussed in later replies; basically,
SharedFileSetUnregister is now called from BufFileDeleteShared.
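
Roughly, the tail of BufFileDeleteShared now looks like this (a
sketch, not the exact hunk; for DSM-based filesets, which never get
added to the process-local list, the unregister call is simply a
no-op):

void
BufFileDeleteShared(SharedFileSet *fileset, const char *name)
{
    /* ... delete all segment files of "name", as before ... */

    /*
     * Also drop the fileset from the process-local cleanup list, so the
     * proc-exit callback doesn't touch its files again.
     */
    SharedFileSetUnregister(fileset);
}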

> 3.
> +static List * filesetlist = NULL;
> +
>  static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
> +static void SharedFileSetOnProcExit(int status, Datum arg);
>  static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid
> tablespace);
>  static void SharedFilePath(char *path, SharedFileSet *fileset, const
> char *name);
>  static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name);
> @@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
>   /* Register our cleanup callback. */
>   if (seg)
>   on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
> + else
> + {
> + if (filesetlist == NULL)
> + on_proc_exit(SharedFileSetOnProcExit, 0);
>
> We use NIL for list initialization and comparison.  See lock_files usage.

Done

> 4.
> +SharedFileSetOnProcExit(int status, Datum arg)
> +{
> + ListCell *l;
> +
> + /* Loop over all the pending  shared fileset entry */
> + foreach (l, filesetlist)
> + {
> + SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
> + SharedFileSetDeleteAll(fileset);
> + }
>
> We can initialize filesetlist as NIL after the for loop as it will
> make the code look clean.

Right.

> Comments on other patches:
> =========================
> 5.
> > 3. On concurrent abort we are truncating all the changes including
> > some incomplete changes,  so later when we get the complete changes we
> > don't have the previous changes,  e.g, if we had specinsert in the
> > last stream and due to concurrent abort detection if we delete that
> > changes later we will get spec_confirm without spec insert.  We could
> > have simply avoided deleting all the changes, but I think the better
> > fix is once we detect the concurrent abort for any transaction, then
> > why do we need to collect the changes for that, we can simply avoid
> > that.  So I have put that fix. (0006)
> >
>
> On similar lines, I think we need to skip processing message, see else
> part of code in ReorderBufferQueueMessage.

Basically, ReorderBufferQueueMessage also calls
ReorderBufferQueueChange internally for transactional messages.  But,
having said that, I realize the idea of skipping the changes inside
ReorderBufferQueueChange is not good, because by then we have already
allocated the memory for the change and the tuple, and it is not
correct to just return the changes because that would update the
memory accounting.  So I think we could do it at a more centralized
place, before we process the change.  Maybe in
LogicalDecodingProcessRecord, before going to the switch, we could
call a function from the reorderbuffer.c layer to see whether this
transaction has already been detected as aborted.  But I have to think
more about whether we can skip all the processing of that record or
not.

Your other comments look fine to me so I will send in the next patch
set and reply on them individually.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Thu, Jun 25, 2020 at 7:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > Here is the POC patch to discuss the idea of a cleanup of shared
> > > fileset on proc exit.  As discussed offlist,  here I am maintaining
> > > the list of shared fileset.  First time when the list is NULL I am
> > > registering the cleanup function with on_proc_exit routine.  After
> > > that for subsequent fileset, I am just appending it to filesetlist.
> > > There is also an interface to unregister the shared file set from the
> > > cleanup list and that is done by the caller whenever we are deleting
> > > the shared fileset manually.  While explaining it here, I think there
> > > could be one issue if we delete all the element from the list will
> > > become NULL and on next SharedFileSetInit we will again register the
> > > function.  Maybe that is not a problem but we can avoid registering
> > > multiple times by using some flag in the file
> > >
> >
> > I don't understand what you mean by "using some flag in the file".
> >
> > Review comments on various patches.
> >
> > poc_shared_fileset_cleanup_on_procexit
> > =================================
> > 1.
> > - ent->subxact_fileset =
> > - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
> > + MemoryContext oldctx;
> >
> > + /* Shared fileset handle must be allocated in the persistent context */
> > + oldctx = MemoryContextSwitchTo(ApplyContext);
> > + ent->subxact_fileset = palloc(sizeof(SharedFileSet));
> >   SharedFileSetInit(ent->subxact_fileset, NULL);
> > + MemoryContextSwitchTo(oldctx);
> >   fd = BufFileCreateShared(ent->subxact_fileset, path);
> >
> > Why is this change required for this patch and why we only cover
> > SharedFileSetInit in the Apply context and not BufFileCreateShared?
> > The comment is also not very clear on this point.
>
> Added the comments for the same.
>
> > 2.
> > +void
> > +SharedFileSetUnregister(SharedFileSet *input_fileset)
> > +{
> > + bool found = false;
> > + ListCell *l;
> > +
> > + Assert(filesetlist != NULL);
> > +
> > + /* Loop over all the pending shared fileset entry */
> > + foreach (l, filesetlist)
> > + {
> > + SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
> > +
> > + /* remove the entry from the list and delete the underlying files */
> > + if (input_fileset->number == fileset->number)
> > + {
> > + SharedFileSetDeleteAll(fileset);
> > + filesetlist = list_delete_cell(filesetlist, l);
> >
> > Why are we calling SharedFileSetDeleteAll here when in the caller we
> > have already deleted the fileset as per below code?
> > BufFileDeleteShared(ent->stream_fileset, path);
> > + SharedFileSetUnregister(ent->stream_fileset);
>
> That's wrong I have removed this.
>
>
> > I think it will be good if somehow we can remove the fileset from
> > filesetlist during BufFileDeleteShared.  If that is possible, then we
> > don't need a separate API for SharedFileSetUnregister.
>
> I have done as discussed on later replies, basically called
> SharedFileSetUnregister from BufFileDeleteShared.
>
> > 3.
> > +static List * filesetlist = NULL;
> > +
> >  static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
> > +static void SharedFileSetOnProcExit(int status, Datum arg);
> >  static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid
> > tablespace);
> >  static void SharedFilePath(char *path, SharedFileSet *fileset, const
> > char *name);
> >  static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name);
> > @@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
> >   /* Register our cleanup callback. */
> >   if (seg)
> >   on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
> > + else
> > + {
> > + if (filesetlist == NULL)
> > + on_proc_exit(SharedFileSetOnProcExit, 0);
> >
> > We use NIL for list initialization and comparison.  See lock_files usage.
>
> Done
>
> > 4.
> > +SharedFileSetOnProcExit(int status, Datum arg)
> > +{
> > + ListCell *l;
> > +
> > + /* Loop over all the pending  shared fileset entry */
> > + foreach (l, filesetlist)
> > + {
> > + SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
> > + SharedFileSetDeleteAll(fileset);
> > + }
> >
> > We can initialize filesetlist as NIL after the for loop as it will
> > make the code look clean.
>
> Right.
>
> > Comments on other patches:
> > =========================
> > 5.
> > > 3. On concurrent abort we are truncating all the changes including
> > > some incomplete changes,  so later when we get the complete changes we
> > > don't have the previous changes,  e.g, if we had specinsert in the
> > > last stream and due to concurrent abort detection if we delete that
> > > changes later we will get spec_confirm without spec insert.  We could
> > > have simply avoided deleting all the changes, but I think the better
> > > fix is once we detect the concurrent abort for any transaction, then
> > > why do we need to collect the changes for that, we can simply avoid
> > > that.  So I have put that fix. (0006)
> > >
> >
> > On similar lines, I think we need to skip processing message, see else
> > part of code in ReorderBufferQueueMessage.
>
> Basically, ReorderBufferQueueMessage also calls the
> ReorderBufferQueueChange internally for transactional changes.  But,
> having said that, I realize the idea of skipping the changes in
> ReorderBufferQueueChange is not good,  because by then we have already
> allocated the memory for the change and the tuple and it's not a
> correct to ReturnChanges because it will update the memory accounting.
> So I think we can do it at a more centralized place and before we
> process the change,  maybe in LogicalDecodingProcessRecord, before
> going to the switch we can call a function from the reorderbuffer.c
> layer to see whether this transaction is detected as aborted or not.
> But I have to think more on this line that can we skip all the
> processing of that record or not.
>
> Your other comments look fine to me so I will send in the next patch
> set and reply on them individually.

I think we cannot put this check in the higher-level functions like
LogicalDecodingProcessRecord or DecodeXXXOp, because we still need to
process that xid at least for the abort.  So I think it is better to
keep the check inside ReorderBufferQueueChange only, and we can free
the memory of the change there if the abort has been detected.  Also,
if we just skip those changes in ReorderBufferQueueChange, the effect
is localized to that particular transaction, which is already aborted.
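
For illustration, the check I have in mind is roughly this (a sketch
only; it assumes the streaming patch marks the transaction with some
flag, here called concurrent_abort, once a concurrent abort has been
detected):

void
ReorderBufferQueueChange(ReorderBuffer *rb, TransactionId xid,
                         XLogRecPtr lsn, ReorderBufferChange *change)
{
    ReorderBufferTXN *txn;

    txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

    /*
     * If a concurrent abort was already detected while streaming this
     * transaction, collecting further changes is pointless.  Give the
     * change back to the buffer, which keeps the memory accounting
     * consistent, and bail out.
     */
    if (txn->concurrent_abort)
    {
        ReorderBufferReturnChange(rb, change);
        return;
    }

    /* ... existing queueing logic ... */
}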

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Fri, Jun 26, 2020 at 10:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Jun 25, 2020 at 7:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Comments on other patches:
> > > =========================
> > > 5.
> > > > 3. On concurrent abort we are truncating all the changes including
> > > > some incomplete changes,  so later when we get the complete changes we
> > > > don't have the previous changes,  e.g, if we had specinsert in the
> > > > last stream and due to concurrent abort detection if we delete that
> > > > changes later we will get spec_confirm without spec insert.  We could
> > > > have simply avoided deleting all the changes, but I think the better
> > > > fix is once we detect the concurrent abort for any transaction, then
> > > > why do we need to collect the changes for that, we can simply avoid
> > > > that.  So I have put that fix. (0006)
> > > >
> > >
> > > On similar lines, I think we need to skip processing message, see else
> > > part of code in ReorderBufferQueueMessage.
> >
> > Basically, ReorderBufferQueueMessage also calls the
> > ReorderBufferQueueChange internally for transactional changes.

Yes, that is correct, but I was thinking about the non-transactional
part, because of the code below.

else
{
ReorderBufferTXN *txn = NULL;
volatile Snapshot snapshot_now = snapshot;

if (xid != InvalidTransactionId)
txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

Even though we are using txn here, I think we don't need to skip it
for aborted xacts, because even without the patch such messages get
decoded irrespective of transaction status.  What do you think?

> >  But,
> > having said that, I realize the idea of skipping the changes in
> > ReorderBufferQueueChange is not good,  because by then we have already
> > allocated the memory for the change and the tuple and it's not a
> > correct to ReturnChanges because it will update the memory accounting.
> > So I think we can do it at a more centralized place and before we
> > process the change,  maybe in LogicalDecodingProcessRecord, before
> > going to the switch we can call a function from the reorderbuffer.c
> > layer to see whether this transaction is detected as aborted or not.
> > But I have to think more on this line that can we skip all the
> > processing of that record or not.
> >
> > Your other comments look fine to me so I will send in the next patch
> > set and reply on them individually.
>
> I think we can not put this check, in the higher-level functions like
> LogicalDecodingProcessRecord or DecodeXXXOp because we need to process
> that xid at least for abort,  so I think it is good to keep the check,
> inside ReorderBufferQueueChange only and we can free the memory of the
> change if the abort is detected.  Also, if just skip those changes in
> ReorderBufferQueueChange then the effect will be localized to that
> particular transaction which is already aborted.
>

Fair enough.  And for cases like the non-transactional part of
ReorderBufferQueueMessage, I think we anyway need to process the
message irrespective of transaction status.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Thu, Jun 25, 2020 at 7:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Review comments on various patches.
> >
> > poc_shared_fileset_cleanup_on_procexit
> > =================================
> > 1.
> > - ent->subxact_fileset =
> > - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
> > + MemoryContext oldctx;
> >
> > + /* Shared fileset handle must be allocated in the persistent context */
> > + oldctx = MemoryContextSwitchTo(ApplyContext);
> > + ent->subxact_fileset = palloc(sizeof(SharedFileSet));
> >   SharedFileSetInit(ent->subxact_fileset, NULL);
> > + MemoryContextSwitchTo(oldctx);
> >   fd = BufFileCreateShared(ent->subxact_fileset, path);
> >
> > Why is this change required for this patch and why we only cover
> > SharedFileSetInit in the Apply context and not BufFileCreateShared?
> > The comment is also not very clear on this point.
>
> Added the comments for the same.
>

1.
+ /*
+ * Shared fileset handle must be allocated in the persistent context.
+ * Also, SharedFileSetInit allocate the memory for sharefileset list
+ * so we need to allocate that in the long term meemory context.
+ */

How about "We need to maintain shared fileset across multiple stream
open/close calls.  So, we allocate it in a persistent context."

2.
+ /*
+ * If the caller is following the dsm based cleanup then we don't
+ * maintain the filesetlist so return.
+ */
+ if (filesetlist == NULL)
+ return;

The check here should use 'NIL' instead of 'NULL'

Other than that the changes in this particular patch looks good to me.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Comments on other patches:
> =========================

Replying to the pending comments.

> 6.
> In v29-0002-Issue-individual-invalidations-with-wal_level-lo,
> xact_desc_invalidations seems to be a subset of
> standby_desc_invalidations, can we have a common code for them?

Done

> 7.
> I think we can avoid sending v29-0007-Track-statistics-for-streaming
> this each time.  We can do this after the main patch is complete.
> Also, we might need to change how and where these stats will be
> tracked.  See the related discussion [1].

Removed

> 8. In v29-0005-Implement-streaming-mode-in-ReorderBuffer,
>   * Return oldest transaction in reorderbuffer
> @@ -863,6 +909,9 @@ ReorderBufferAssignChild(ReorderBuffer *rb,
> TransactionId xid,
>   /* set the reference to top-level transaction */
>   subtxn->toptxn = txn;
>
> + /* set the reference to toplevel transaction */
> + subtxn->toptxn = txn;
> +
>
> There is a double initialization of subtxn->toptxn.  You need to
> remove this line from 0005 patch as we have now added it in an earlier
> patch.

Done

> 9.  I think you forgot to update the patch to execute invalidations in
> Abort case or I might be missing something.  I don't see any changes
> in ReorderBufferAbort. You have agreed in one of the emails above [2]
> about handling the same.

Done, check 0005

> 10. In v29-0008-Add-support-for-streaming-to-built-in-replicatio,
>  apply_handle_stream_commit(StringInfo s)
>  {
>  ..
>  + /*
>  + * send feedback to upstream
>  + *
>  + * XXX Probably should send a valid LSN. But which one?
>  + */
>  + send_feedback(InvalidXLogRecPtr, false, false);
>  ..
>  }
>
> I have given a comment on this code that we don't need this feedback
> and you mentioned on June 02 [3] that you will think on it and let me
> know your opinion but I don't see a response from you yet.  Can you
> get back to me regarding this point?

Yeah, I have analyzed this and it seems we don't need it.  The
feedback mechanism here should be the same as in non-streaming mode,
so I don't see any reason for sending extra feedback on commit.

> 11. Add some comments as to why we have used Shared BufFile interface
> instead of Temp BufFile interface?

Done

> 12. In v29-0013-Change-buffile-interface-required-for-streaming,
> + * Initialize a space for temporary files that can be opened other backends.
>
> /opened other backends/opened for access by other backends

Done

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Fri, Jun 26, 2020 at 11:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jun 25, 2020 at 7:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Review comments on various patches.
> > >
> > > poc_shared_fileset_cleanup_on_procexit
> > > =================================
> > > 1.
> > > - ent->subxact_fileset =
> > > - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
> > > + MemoryContext oldctx;
> > >
> > > + /* Shared fileset handle must be allocated in the persistent context */
> > > + oldctx = MemoryContextSwitchTo(ApplyContext);
> > > + ent->subxact_fileset = palloc(sizeof(SharedFileSet));
> > >   SharedFileSetInit(ent->subxact_fileset, NULL);
> > > + MemoryContextSwitchTo(oldctx);
> > >   fd = BufFileCreateShared(ent->subxact_fileset, path);
> > >
> > > Why is this change required for this patch and why we only cover
> > > SharedFileSetInit in the Apply context and not BufFileCreateShared?
> > > The comment is also not very clear on this point.
> >
> > Added the comments for the same.
> >
>
> 1.
> + /*
> + * Shared fileset handle must be allocated in the persistent context.
> + * Also, SharedFileSetInit allocate the memory for sharefileset list
> + * so we need to allocate that in the long term meemory context.
> + */
>
> How about "We need to maintain shared fileset across multiple stream
> open/close calls.  So, we allocate it in a persistent context."

Done

> 2.
> + /*
> + * If the caller is following the dsm based cleanup then we don't
> + * maintain the filesetlist so return.
> + */
> + if (filesetlist == NULL)
> + return;
>
> The check here should use 'NIL' instead of 'NULL'

Done

> Other than that the changes in this particular patch looks good to me.
Added as the last patch in the series; in the next version I will
merge it into 0012 and 0013.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Jun 22, 2020 at 4:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 22, 2020 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > Yes, I have made the changes.  Basically, now I am only using the
> > > XLOG_XACT_INVALIDATIONS for generating all the invalidation messages.
> > > So whenever we are getting the new set of XLOG_XACT_INVALIDATIONS, we
> > > are directly appending it to the txn->invalidations.  I have tested
> > > the XLOG_INVALIDATIONS part but while sending this mail I realized
> > > that we could write some automated test for the same.
> > >
> >
> > Can you share how you have tested it?
> >
> > >  I will work on
> > > that soon.
> > >
> >
> > Cool, I think having a regression test for this will be a good idea.
> >
>
> Other than above tests, can we somehow verify that the invalidations
> generated at commit time are the same as what we do with this patch?
> We have verified with individual commands but it would be great if we
> can verify for the regression tests.

I have verified this using a few random test cases.  For that I made
some temporary code changes with an assert, as shown below.
Basically, on DecodeCommit we now call the
ReorderBufferAddInvalidations function only for assert checking.

-void
 ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
                               XLogRecPtr lsn, Size nmsgs,
-                              SharedInvalidationMessage *msgs)
+                              SharedInvalidationMessage *msgs, bool commit)
 {
        ReorderBufferTXN *txn;

        txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
-
+       if (commit)
+       {
+               Assert(txn->ninvalidations == nmsgs);
+               return;
+       }

The result is that a normal local test works fine.  But with the
regression suite it hits the assert in many places, because if a
rollback of a subtransaction is involved, the invalidation messages
are not logged at commit time, whereas with command-time invalidation
they are logged.

As of now, I have only put the assert on the count.  If we need to
verify the exact messages, we might have to somehow categorize the
invalidation messages, because the ordering of the messages will not
be the same.  For testing this we would have to arrange them by
category, i.e. relcache, catcache, and then compare them.
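
For example, the comparison could be done with something like this (a
debugging-only sketch that compares per-id counts rather than the
exact messages; catcache messages carry the catcache id, the other
message kinds use negative ids):

/*
 * Return true if the two invalidation arrays contain the same number of
 * messages of each id, ignoring ordering.
 */
static bool
invalidations_match(SharedInvalidationMessage *a, Size na,
                    SharedInvalidationMessage *b, Size nb)
{
    int         counts[256] = {0};
    Size        i;

    if (na != nb)
        return false;

    for (i = 0; i < na; i++)
        counts[(uint8) a[i].id]++;
    for (i = 0; i < nb; i++)
        counts[(uint8) b[i].id]--;

    for (i = 0; i < lengthof(counts); i++)
    {
        if (counts[i] != 0)
            return false;
    }

    return true;
}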


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Jun 29, 2020 at 4:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jun 22, 2020 at 4:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Other than above tests, can we somehow verify that the invalidations
> > generated at commit time are the same as what we do with this patch?
> > We have verified with individual commands but it would be great if we
> > can verify for the regression tests.
>
> I have verified this using a few random test cases.  For verifying
> this I have made some temporary code changes with an assert as shown
> below.  Basically, on DecodeCommit we call
> ReorderBufferAddInvalidations function only for an assert checking.
>
> -void
>  ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
>                                                           XLogRecPtr
> lsn, Size nmsgs,
> -
> SharedInvalidationMessage *msgs)
> +
> SharedInvalidationMessage *msgs, bool commit)
>  {
>         ReorderBufferTXN *txn;
>
>         txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> -
> +       if (commit)
> +       {
> +               Assert(txn->ninvalidations == nmsgs);
> +               return;
> +       }
>
> The result is that for a normal local test it works fine.  But with
> regression suit, it hit an assert at many places because if the
> rollback of the subtransaction is involved then at commit time
> invalidation messages those are not logged whereas with command time
> invalidation those are logged.
>

Yeah, somehow, we need to ignore rollback to savepoint tests and
verify for others.

> As of now, I have only put assert on the count,  if we need to verify
> the exact messages then we might need to somehow categories the
> invalidation messages because the ordering of the messages will not be
> the same.  For testing this we will have to arrange them by category
> i.e relcahce, catcache and then we can compare them.
>

Can't we do this by verifying that each message at commit time exists
in the list of invalidation messages we have collected via processing
XLOG_XACT_INVALIDATIONS?

One additional question on patch
v30-0003-Extend-the-output-plugin-API-with-stream-methods:
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr apply_lsn)
{
..
..
+ state.report_location = apply_lsn;
..
..
+ ctx->write_location = apply_lsn;
..
}

Can't we name the last parameter as 'commit_lsn' as that is how
documentation in the patch spells it and it sounds more appropriate?
Also, is there a reason for assigning report_location and
write_location differently than what we do in commit_cb_wrapper?
Basically, assign those as txn->final_lsn and txn->end_lsn
respectively.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, Jun 30, 2020 at 9:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 29, 2020 at 4:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jun 22, 2020 at 4:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Other than above tests, can we somehow verify that the invalidations
> > > generated at commit time are the same as what we do with this patch?
> > > We have verified with individual commands but it would be great if we
> > > can verify for the regression tests.
> >
> > I have verified this using a few random test cases.  For verifying
> > this I have made some temporary code changes with an assert as shown
> > below.  Basically, on DecodeCommit we call
> > ReorderBufferAddInvalidations function only for an assert checking.
> >
> > -void
> >  ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
> >                                                           XLogRecPtr
> > lsn, Size nmsgs,
> > -
> > SharedInvalidationMessage *msgs)
> > +
> > SharedInvalidationMessage *msgs, bool commit)
> >  {
> >         ReorderBufferTXN *txn;
> >
> >         txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> > -
> > +       if (commit)
> > +       {
> > +               Assert(txn->ninvalidations == nmsgs);
> > +               return;
> > +       }
> >
> > The result is that for a normal local test it works fine.  But with
> > regression suit, it hit an assert at many places because if the
> > rollback of the subtransaction is involved then at commit time
> > invalidation messages those are not logged whereas with command time
> > invalidation those are logged.
> >
>
> Yeah, somehow, we need to ignore rollback to savepoint tests and
> verify for others.

Yeah, I have run the regression suite and I can see a lot of failures.
Maybe we can somehow see the diff and confirm that all the failures
are due to rollback to savepoint only.  I will work on this.

>
> > As of now, I have only put assert on the count,  if we need to verify
> > the exact messages then we might need to somehow categories the
> > invalidation messages because the ordering of the messages will not be
> > the same.  For testing this we will have to arrange them by category
> > i.e relcahce, catcache and then we can compare them.
> >
>
> Can't we do this by verifying that each message at commit time exists
> in the list of invalidation messages we have collected via processing
> XLOG_XACT_INVALIDATIONS?

Let me figure out the easiest way to test this.

>
> One additional question on patch
> v30-0003-Extend-the-output-plugin-API-with-stream-methods:
> +static void
> +stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
> + XLogRecPtr apply_lsn)
> {
> ..
> ..
> + state.report_location = apply_lsn;
> ..
> ..
> + ctx->write_location = apply_lsn;
> ..
> }
>
> Can't we name the last parameter as 'commit_lsn' as that is how
> documentation in the patch spells it and it sounds more appropriate?

You are right, commit_lsn seems more appropriate here.

> Also, is there a reason for assigning report_location and
> write_location differently than what we do in commit_cb_wrapper?
> Basically, assign those as txn->final_lsn and txn->end_lsn
> respectively.

Yes, I think it should be handled in the same way as commit_cb_wrapper,
because before calling ReorderBufferStreamCommit in
ReorderBufferCommit, we are properly updating the final_lsn as well as
the end_lsn.
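
So in stream_commit_cb_wrapper it would become something like the
below, mirroring commit_cb_wrapper (sketch of the change I have in
mind):

    state.report_location = txn->final_lsn; /* final_lsn: start of the commit record */
    ...
    ctx->write_location = txn->end_lsn;     /* end_lsn: end of the commit record */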

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, Jun 30, 2020 at 10:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jun 30, 2020 at 9:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Can't we name the last parameter as 'commit_lsn' as that is how
> > documentation in the patch spells it and it sounds more appropriate?
>
> You are right commit_lsn seems more appropriate here.
>
> > Also, is there a reason for assigning report_location and
> > write_location differently than what we do in commit_cb_wrapper?
> > Basically, assign those as txn->final_lsn and txn->end_lsn
> > respectively.
>
> Yes, I think it should be handled in same way as commit_cb_wrapper.
> Because before calling ReorderBufferStreamCommit in
> ReorderBufferCommit, we are properly updating the final_lsn as well as
> the end_lsn.
>

Okay, I have made these changes in the attached patch and there are a
few more changes in
0003-Extend-the-output-plugin-API-with-stream-methods.
1. In pg_decode_stream_message, for transactional messages, we were
displaying the message contents, which is different from the other
streaming APIs.  I have changed it so that the streaming API doesn't
display message contents for transactional messages.

2.
+ /* in streaming mode, stream_change_cb is required */
+ if (ctx->callbacks.stream_change_cb == NULL)
+ ereport(ERROR,
+ (errmsg("Output plugin supports streaming, but has not registered "
+ "stream_change_cb callback.")));

The error messages seem a bit weird: (a) they don't include an error
code, and (b) they are not in PG style.  I have changed all the error
messages to fix these two issues and reworded them as well.

3. Rearranged the stream_* functions so that the optional functions
are at the end, and also arranged the other functions in a way that
looks more logical to me.

4. Updated comments, commit message, and edited docs in the patch.

I have made a few changes in
0004-Gracefully-handle-concurrent-aborts-of-transacti as well.
1. The variable bsysscan was not being reset in case of error.  I have
introduced a new function to reset both bsysscan and CheckXidAlive
during transaction abort (see the sketch at the end of this list).
Also, snapmgr.c doesn't seem the right place for these variables, so I
moved them to xact.c.  I think this will make the initialization of
CheckXidAlive in the PG_CATCH block of ReorderBufferProcessTXN
redundant.

2. Updated comments and commit message.
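
For reference, the reset function mentioned in point 1 is roughly like
the below (the function name is just what I picked for now; it is
called from the transaction abort path):

    /*
     * Reset the logical streaming state during (sub)transaction abort.
     */
    void
    ResetLogicalStreamingState(void)
    {
        /* forget the xid being decoded and the syscache-scan flag */
        CheckXidAlive = InvalidTransactionId;
        bsysscan = false;
    }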

Let me know what you think about the above changes.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, Jun 30, 2020 at 5:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Let me know what you think about the above changes.
>

I went ahead and made a few changes in
0005-Implement-streaming-mode-in-ReorderBuffer, which are explained
below.  I have a few questions and suggestions for the patch as well,
which are also covered in the points below.

1.
+ if (prev_lsn == InvalidXLogRecPtr)
+ {
+ if (streaming)
+ rb->stream_start(rb, txn, change->lsn);
+ else
+ rb->begin(rb, txn);
+ stream_started = true;
+ }

I don't think we want to move the begin callback here as that will
change the existing semantics, so it is better to keep begin at its
original position. I have made the required changes in the attached patch.

2.
ReorderBufferTruncateTXN()
{
..
+ dlist_foreach_modify(iter, &txn->changes)
+ {
+ ReorderBufferChange *change;
+
+ change = dlist_container(ReorderBufferChange, node, iter.cur);
+
+ /* remove the change from it's containing list */
+ dlist_delete(&change->node);
+
+ ReorderBufferReturnChange(rb, change);
+ }
..
}

I think here we can add an Assert that we're not mixing changes from
different transactions.  See the changes in the patch.

3.
SetupCheckXidLive()
{
..
+ /*
+ * setup CheckXidAlive if it's not committed yet. We don't check if the xid
+ * aborted. That will happen during catalog access.  Also, reset the
+ * bsysscan flag.
+ */
+ if (!TransactionIdDidCommit(xid))
+ {
+ CheckXidAlive = xid;
+ bsysscan = false;
..
}

What is the need to reset the bsysscan flag here if we are already
resetting it on error (as in the previous patch sent by me)?

4.
ReorderBufferProcessTXN()
{
..
..
+ /* Reset the CheckXidAlive */
+ if (streaming)
+ CheckXidAlive = InvalidTransactionId;
..
}

Similar to the previous point, we don't need this either, because
AbortCurrentTransaction would have taken care of it.

5.
+ * XXX Do we need to check if the transaction has some changes to stream
+ * (maybe it got streamed right before the commit, which attempts to
+ * stream it again before the commit)?
+ */
+static void
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)

The above comment doesn't make much sense to me, so I have removed it.
Basically, if there are no changes before commit, we still need to
send the commit, and anyway, if there are no more changes,
ReorderBufferProcessTXN will not do anything.

6.
+ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
..
if (txn->snapshot_now == NULL)
+ {
+ dlist_iter subxact_i;
+
+ /* make sure this transaction is streamed for the first time */
+ Assert(!rbtxn_is_streamed(txn));
+
+ /* at the beginning we should have invalid command ID */
+ Assert(txn->command_id == InvalidCommandId);
+
+ dlist_foreach(subxact_i, &txn->subtxns)
+ {
+ ReorderBufferTXN *subtxn;
+
+ subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
+ ReorderBufferTransferSnapToParent(txn, subtxn);
+ }
..
}

Here, it is possible that there is no base_snapshot for txn, so we
need a check for that similar to ReorderBufferCommit.

7.  Apart from the above, I made few changes in comments and ran pgindent.

8. We can't stream the transaction before we reach the
SNAPBUILD_CONSISTENT state because some other output plugin could
apply those changes, unlike what we do with the pgoutput plugin (which
writes to a file).  And I think applying the transactions without
reaching a consistent state would be wrong anyway.  So we should avoid
that, and if we do so, then we should have an Assert for streamed txns
in ReorderBufferForget rather than sending an abort for them.

9.
+ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
{
..
+ ReorderBufferToastReset(rb, txn);
+ if (specinsert != NULL)
+ ReorderBufferReturnChange(rb, specinsert);
..
}

Why do we need to do these here when we wouldn't have done so for
any exception other than ERRCODE_TRANSACTION_ROLLBACK?

10.  I have got the below failure once.  I have not investigated this
in detail as the patch is still in progress.  See if you have any
idea?
#   Failed test 'check extra columns contain local defaults'
#   at t/013_stream_subxact_ddl_abort.pl line 81.
#          got: '2|0'
#     expected: '1000|500'
# Looks like you failed 1 test of 2.
make[2]: *** [check] Error 1
make[1]: *** [check-subscription-recurse] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [check-world-src/test-recurse] Error 2

11. Can we test by introducing a new GUC such that all the
transactions (at least in the existing tests) start to stream?
Basically, it will allow us to disregard logical_decoding_work_mem and
ensure that all regression tests pass through the new code.  Note, I am
suggesting this just for testing purposes, not for actual integration
in the code.
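
To be clear, I am thinking of something as simple as the below
developer-only boolean (the GUC name and exact entry are just for
illustration), plus a corresponding tweak in
ReorderBufferCheckMemoryLimit to ignore the memory limit when it is
set:

    /* in guc.c, ConfigureNamesBool[] -- sketch, for testing only */
    {
        {"logical_decoding_force_stream", PGC_USERSET, DEVELOPER_OPTIONS,
            gettext_noop("Forces streaming of all changes in logical decoding."),
            NULL,
            GUC_NOT_IN_SAMPLE
        },
        &logical_decoding_force_stream,
        false,
        NULL, NULL, NULL
    },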

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, Jun 30, 2020 at 5:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jun 30, 2020 at 10:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Jun 30, 2020 at 9:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Can't we name the last parameter as 'commit_lsn' as that is how
> > > documentation in the patch spells it and it sounds more appropriate?
> >
> > You are right commit_lsn seems more appropriate here.
> >
> > > Also, is there a reason for assigning report_location and
> > > write_location differently than what we do in commit_cb_wrapper?
> > > Basically, assign those as txn->final_lsn and txn->end_lsn
> > > respectively.
> >
> > Yes, I think it should be handled in same way as commit_cb_wrapper.
> > Because before calling ReorderBufferStreamCommit in
> > ReorderBufferCommit, we are properly updating the final_lsn as well as
> > the end_lsn.
> >
>
> Okay, I have made these changes in the attached patch and there are
> few more changes in
> 0003-Extend-the-output-plugin-API-with-stream-methods.
> 1. In pg_decode_stream_message, for transactional messages, we were
> displaying message contents which is different from other streaming
> APIs.  I have changed it so that streaming API doesn't display message
> contents for transactional messages.

Ok, makes sense.

> 2.
> + /* in streaming mode, stream_change_cb is required */
> + if (ctx->callbacks.stream_change_cb == NULL)
> + ereport(ERROR,
> + (errmsg("Output plugin supports streaming, but has not registered "
> + "stream_change_cb callback.")));
>
> The error messages seem a bit weird.  (a) doesn't include error code,
> (b) not in PG style. I have changed all the error messages to fix
> these two issues and change the message as well

ok

> 3. Rearranged the functions stream_* so that the optional functions
> are at the end and also arranged other functions in a way that looks
> more logical to me.

Makes sense to me.

> 4. Updated comments, commit message, and edited docs in the patch.
>
> I have made a few changes in
> 0004-Gracefully-handle-concurrent-aborts-of-transacti as well.
> 1. The variable bsysscan was not being reset in case of error.  I have
> introduced a new function to reset both bsysscan and CheckXidAlive
> during transaction abort.  Also, snapmgr.c doesn't seem right place
> for these variables, so I moved them to xact.c.  I think this will
> make the initialization of CheckXidAlive during catch in
> ReorderBufferProcessTXN redundant.

That looks better.

> 2. Updated comments and commit message.
>
> Let me know what you think about the above changes.

All the above changes look good to me and I will include them in the next version.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jun 30, 2020 at 5:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Let me know what you think about the above changes.
> >
>
> I went ahead and made few changes in
> 0005-Implement-streaming-mode-in-ReorderBuffer which are explained
> below.  I have few questions and suggestions for the patch as well
> which are also covered in below points.
>
> 1.
> + if (prev_lsn == InvalidXLogRecPtr)
> + {
> + if (streaming)
> + rb->stream_start(rb, txn, change->lsn);
> + else
> + rb->begin(rb, txn);
> + stream_started = true;
> + }
>
> I don't think we want to move begin callback here that will change the
> existing semantics, so it is better to move begin at its original
> position. I have made the required changes in the attached patch.

Looks good to me.

> 2.
> ReorderBufferTruncateTXN()
> {
> ..
> + dlist_foreach_modify(iter, &txn->changes)
> + {
> + ReorderBufferChange *change;
> +
> + change = dlist_container(ReorderBufferChange, node, iter.cur);
> +
> + /* remove the change from it's containing list */
> + dlist_delete(&change->node);
> +
> + ReorderBufferReturnChange(rb, change);
> + }
> ..
> }
>
> I think here we can add an Assert that we're not mixing changes from
> different transactions.  See the changes in the patch.

Looks fine.

> 3.
> SetupCheckXidLive()
> {
> ..
> + /*
> + * setup CheckXidAlive if it's not committed yet. We don't check if the xid
> + * aborted. That will happen during catalog access.  Also, reset the
> + * bsysscan flag.
> + */
> + if (!TransactionIdDidCommit(xid))
> + {
> + CheckXidAlive = xid;
> + bsysscan = false;
> ..
> }
>
> What is the need to reset bsysscan flag here if we are already
> resetting on error (like in the previous patch sent by me)?

Yeah, now we don't need this.

> 4.
> ReorderBufferProcessTXN()
> {
> ..
> ..
> + /* Reset the CheckXidAlive */
> + if (streaming)
> + CheckXidAlive = InvalidTransactionId;
> ..
> }
>
> Similar to the previous point, we don't need this as well because
> AbortCurrentTransaction would have taken care of this.

Right

> 5.
> + * XXX Do we need to check if the transaction has some changes to stream
> + * (maybe it got streamed right before the commit, which attempts to
> + * stream it again before the commit)?
> + */
> +static void
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
>
> The above comment doesn't make much sense to me, so I have removed it.
> Basically, if there are no changes before commit, we still need to
> send commit and anyway if there are no more changes
> ReorderBufferProcessTXN will not do anything.

ok

> 6.
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> if (txn->snapshot_now == NULL)
> + {
> + dlist_iter subxact_i;
> +
> + /* make sure this transaction is streamed for the first time */
> + Assert(!rbtxn_is_streamed(txn));
> +
> + /* at the beginning we should have invalid command ID */
> + Assert(txn->command_id == InvalidCommandId);
> +
> + dlist_foreach(subxact_i, &txn->subtxns)
> + {
> + ReorderBufferTXN *subtxn;
> +
> + subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
> + ReorderBufferTransferSnapToParent(txn, subtxn);
> + }
> ..
> }
>
> Here, it is possible that there is no base_snapshot for txn, so we
> need a check for that similar to ReorderBufferCommit.
>
> 7.  Apart from the above, I made few changes in comments and ran pgindent.

Ok

> 8. We can't stream the transaction before we reach the
> SNAPBUILD_CONSISTENT state because some other output plugin can apply
> those changes unlike what we do with pgoutput plugin (which writes to
> file). And, I think applying the transactions without reaching a
> consistent state would be anyway wrong.  So, we should avoid that and
> if do that then we should have an Assert for streamed txns rather than
> sending abort for them in ReorderBufferForget.

I will work on this point.

> 9.
> +ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
> {
> ..
> + ReorderBufferToastReset(rb, txn);
> + if (specinsert != NULL)
> + ReorderBufferReturnChange(rb, specinsert);
> ..
> }
>
> Why do we need to do these here when we wouldn't have been done for
> any exception other than ERRCODE_TRANSACTION_ROLLBACK?

Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK"
gracefully and are continuing with further decoding, we need to return
this change.

> 10.  I have got the below failure once.  I have not investigated this
> in detail as the patch is still under progress.  See, if you have any
> idea?
> #   Failed test 'check extra columns contain local defaults'
> #   at t/013_stream_subxact_ddl_abort.pl line 81.
> #          got: '2|0'
> #     expected: '1000|500'
> # Looks like you failed 1 test of 2.
> make[2]: *** [check] Error 1
> make[1]: *** [check-subscription-recurse] Error 2
> make[1]: *** Waiting for unfinished jobs....
> make: *** [check-world-src/test-recurse] Error 2

Even I got the failure once and after that, it did not reproduce.  I
have executed it multiple times but it did not reproduce again.  Are
you able to reproduce it consistently?

> 11. Can we test by introducing a new GUC such that all the
> transactions (at least in existing tests) start to stream?  Basically,
> it will allow us to disregard logical_decoding_work_mem and ensure
> that all regression tests pass through new-code.  Note, I am
> suggesting this just for testing purposes, not for actual integration
> in the code.

Yeah,  that's a good suggestion.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, Jun 30, 2020 at 10:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jun 30, 2020 at 9:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jun 29, 2020 at 4:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Mon, Jun 22, 2020 at 4:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > >
> > > > Other than above tests, can we somehow verify that the invalidations
> > > > generated at commit time are the same as what we do with this patch?
> > > > We have verified with individual commands but it would be great if we
> > > > can verify for the regression tests.
> > >
> > > I have verified this using a few random test cases.  For verifying
> > > this I have made some temporary code changes with an assert as shown
> > > below.  Basically, on DecodeCommit we call
> > > ReorderBufferAddInvalidations function only for an assert checking.
> > >
> > > -void
> > >  ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
> > >                                                           XLogRecPtr
> > > lsn, Size nmsgs,
> > > -
> > > SharedInvalidationMessage *msgs)
> > > +
> > > SharedInvalidationMessage *msgs, bool commit)
> > >  {
> > >         ReorderBufferTXN *txn;
> > >
> > >         txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> > > -
> > > +       if (commit)
> > > +       {
> > > +               Assert(txn->ninvalidations == nmsgs);
> > > +               return;
> > > +       }
> > >
> > > The result is that for a normal local test it works fine.  But with
> > > regression suit, it hit an assert at many places because if the
> > > rollback of the subtransaction is involved then at commit time
> > > invalidation messages those are not logged whereas with command time
> > > invalidation those are logged.
> > >
> >
> > Yeah, somehow, we need to ignore rollback to savepoint tests and
> > verify for others.
>
> Yeah, I have run the regression suite,  I can see a lot of failure
> maybe we can somehow see the diff and confirm that all the failures
> are due to rollback to savepoint only.  I will work on this.

I have compared the changes logged at command end vs. those logged at
commit time.  I have ignored the invalidations for any transaction
which has an aborted subtransaction in it.  While testing this I found
one issue: if there are some invalidations generated between the last
command counter increment and the commit, then those were not logged.
I have fixed the issue by logging the pending invalidations in
RecordTransactionCommit.  I will include the changes in the next patch
set.
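
To be specific, the fix logs the pending invalidations in
RecordTransactionCommit, roughly like the below (this is the relevant
part of the change in the attached patch):

    /*
     * Log any pending invalidations which are added between the last
     * command counter increment and the commit.
     */
    if (XLogLogicalInfoActive())
        LogLogicalInvalidations();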

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > 9.
> > +ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > {
> > ..
> > + ReorderBufferToastReset(rb, txn);
> > + if (specinsert != NULL)
> > + ReorderBufferReturnChange(rb, specinsert);
> > ..
> > }
> >
> > Why do we need to do these here when we wouldn't have been done for
> > any exception other than ERRCODE_TRANSACTION_ROLLBACK?
>
> Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK"
> gracefully and we are continuing with further decoding so we need to
> return this change back.
>

Okay, then I suggest we should do these before calling stream_stop and
also move ReorderBufferResetTXN after calling stream_stop to follow a
pattern similar to the try block, unless there is a reason for not
doing so.  Also, it would be good if we can initialize specinsert to
NULL after returning the change, as we are doing at other places.

> > 10.  I have got the below failure once.  I have not investigated this
> > in detail as the patch is still under progress.  See, if you have any
> > idea?
> > #   Failed test 'check extra columns contain local defaults'
> > #   at t/013_stream_subxact_ddl_abort.pl line 81.
> > #          got: '2|0'
> > #     expected: '1000|500'
> > # Looks like you failed 1 test of 2.
> > make[2]: *** [check] Error 1
> > make[1]: *** [check-subscription-recurse] Error 2
> > make[1]: *** Waiting for unfinished jobs....
> > make: *** [check-world-src/test-recurse] Error 2
>
> Even I got the failure once and after that, it did not reproduce.  I
> have executed it multiple time but it did not reproduce again.  Are
> you able to reproduce it consistently?
>

No, I am also not able to reproduce it consistently, but I think this
can fail if a subscriber sends the replay_location before actually
replaying the changes.  First, I thought that the extra send_feedback we
have in apply_handle_stream_commit might have caused this, but I guess
that can't happen because we need the commit-time location for that
and we store it at the end of apply_handle_stream_commit after
applying all messages.  I am not sure what is going on here.  I think
we somehow need to reproduce this, or some variant of this test,
consistently to find the root cause.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > > 9.
> > > +ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > > {
> > > ..
> > > + ReorderBufferToastReset(rb, txn);
> > > + if (specinsert != NULL)
> > > + ReorderBufferReturnChange(rb, specinsert);
> > > ..
> > > }
> > >
> > > Why do we need to do these here when we wouldn't have been done for
> > > any exception other than ERRCODE_TRANSACTION_ROLLBACK?
> >
> > Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK"
> > gracefully and we are continuing with further decoding so we need to
> > return this change back.
> >
>
> Okay, then I suggest we should do these before calling stream_stop and
> also move ReorderBufferResetTXN after calling stream_stop  to follow a
> pattern similar to try block unless there is a reason for not doing
> so.  Also, it would be good if we can initialize specinsert with NULL
> after returning the change as we are doing at other places.

Okay

> > > 10.  I have got the below failure once.  I have not investigated this
> > > in detail as the patch is still under progress.  See, if you have any
> > > idea?
> > > #   Failed test 'check extra columns contain local defaults'
> > > #   at t/013_stream_subxact_ddl_abort.pl line 81.
> > > #          got: '2|0'
> > > #     expected: '1000|500'
> > > # Looks like you failed 1 test of 2.
> > > make[2]: *** [check] Error 1
> > > make[1]: *** [check-subscription-recurse] Error 2
> > > make[1]: *** Waiting for unfinished jobs....
> > > make: *** [check-world-src/test-recurse] Error 2
> >
> > Even I got the failure once and after that, it did not reproduce.  I
> > have executed it multiple time but it did not reproduce again.  Are
> > you able to reproduce it consistently?
> >
>
> No, I am also not able to reproduce it consistently but I think this
> can fail if a subscriber sends the replay_location before actually
> replaying the changes.  First, I thought that extra send_feedback we
> have in apply_handle_stream_commit might have caused this but I guess
> that can't happen because we need the commit time location for that
> and we are storing the same at the end of apply_handle_stream_commit
> after applying all messages.  I am not sure what is going on here.  I
> think we somehow need to reproduce this or some variant of this test
> consistently to find the root cause.

And I think it appeared for the first time for me, so maybe some
changes in the last few versions have exposed it.  I have noticed that
almost 50% of the time I am able to reproduce it after a clean build,
so I can trace back the version from which it started appearing; that
way it will be easy to narrow down.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Jul 6, 2020 at 11:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > > > 10.  I have got the below failure once.  I have not investigated this
> > > > in detail as the patch is still under progress.  See, if you have any
> > > > idea?
> > > > #   Failed test 'check extra columns contain local defaults'
> > > > #   at t/013_stream_subxact_ddl_abort.pl line 81.
> > > > #          got: '2|0'
> > > > #     expected: '1000|500'
> > > > # Looks like you failed 1 test of 2.
> > > > make[2]: *** [check] Error 1
> > > > make[1]: *** [check-subscription-recurse] Error 2
> > > > make[1]: *** Waiting for unfinished jobs....
> > > > make: *** [check-world-src/test-recurse] Error 2
> > >
> > > Even I got the failure once and after that, it did not reproduce.  I
> > > have executed it multiple time but it did not reproduce again.  Are
> > > you able to reproduce it consistently?
> > >
> >
> > No, I am also not able to reproduce it consistently but I think this
> > can fail if a subscriber sends the replay_location before actually
> > replaying the changes.  First, I thought that extra send_feedback we
> > have in apply_handle_stream_commit might have caused this but I guess
> > that can't happen because we need the commit time location for that
> > and we are storing the same at the end of apply_handle_stream_commit
> > after applying all messages.  I am not sure what is going on here.  I
> > think we somehow need to reproduce this or some variant of this test
> > consistently to find the root cause.
>
> And I think it appeared first time for me,  so maybe either induced
> from past few versions so some changes in the last few versions might
> have exposed it.  I have noticed that almost 50% of the time I am able
> to reproduce after the clean build so I can trace back from which
> version it started appearing that way it will be easy to narrow down.
>

One more comment
ReorderBufferLargestTopTXN
{
..
dlist_foreach(iter, &rb->toplevel_by_lsn)
  {
  ReorderBufferTXN *txn;
+ Size size = 0;
+ Size largest_size = 0;

  txn = dlist_container(ReorderBufferTXN, node, iter.cur);

- /* if the current transaction is larger, remember it */
- if ((!largest) || (txn->size > largest->size))
+ /*
+ * If this transaction have some incomplete changes then only consider
+ * the size upto last complete lsn.
+ */
+ if (rbtxn_has_incomplete_tuple(txn))
+ size = txn->complete_size;
+ else
+ size = txn->total_size;
+
+ /* If the current transaction is larger then remember it. */
+ if ((largest != NULL || size > largest_size) && size > 0)

Here, largest_size is a local variable inside the loop which is
initialized to 0 in each iteration, and that will lead to picking each
subsequent txn as the largest.  This seems wrong to me.
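
I think the size tracking needs to be hoisted out of the loop,
something like the below (untested sketch, keeping the names used in
the patch):

static ReorderBufferTXN *
ReorderBufferLargestTopTXN(ReorderBuffer *rb)
{
    dlist_iter  iter;
    Size        largest_size = 0;
    ReorderBufferTXN *largest = NULL;

    dlist_foreach(iter, &rb->toplevel_by_lsn)
    {
        ReorderBufferTXN *txn;
        Size        size;

        txn = dlist_container(ReorderBufferTXN, node, iter.cur);

        /*
         * If this transaction has some incomplete changes, only consider
         * the size up to the last complete lsn.
         */
        if (rbtxn_has_incomplete_tuple(txn))
            size = txn->complete_size;
        else
            size = txn->total_size;

        /* if the current transaction is larger, remember it */
        if (size > largest_size)
        {
            largest = txn;
            largest_size = size;
        }
    }

    return largest;
}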

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Jul 6, 2020 at 3:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jul 6, 2020 at 11:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > > > > 10.  I have got the below failure once.  I have not investigated this
> > > > > in detail as the patch is still under progress.  See, if you have any
> > > > > idea?
> > > > > #   Failed test 'check extra columns contain local defaults'
> > > > > #   at t/013_stream_subxact_ddl_abort.pl line 81.
> > > > > #          got: '2|0'
> > > > > #     expected: '1000|500'
> > > > > # Looks like you failed 1 test of 2.
> > > > > make[2]: *** [check] Error 1
> > > > > make[1]: *** [check-subscription-recurse] Error 2
> > > > > make[1]: *** Waiting for unfinished jobs....
> > > > > make: *** [check-world-src/test-recurse] Error 2
> > > >
> > > > Even I got the failure once and after that, it did not reproduce.  I
> > > > have executed it multiple time but it did not reproduce again.  Are
> > > > you able to reproduce it consistently?
> > > >
> > >
> > > No, I am also not able to reproduce it consistently but I think this
> > > can fail if a subscriber sends the replay_location before actually
> > > replaying the changes.  First, I thought that extra send_feedback we
> > > have in apply_handle_stream_commit might have caused this but I guess
> > > that can't happen because we need the commit time location for that
> > > and we are storing the same at the end of apply_handle_stream_commit
> > > after applying all messages.  I am not sure what is going on here.  I
> > > think we somehow need to reproduce this or some variant of this test
> > > consistently to find the root cause.
> >
> > And I think it appeared first time for me,  so maybe either induced
> > from past few versions so some changes in the last few versions might
> > have exposed it.  I have noticed that almost 50% of the time I am able
> > to reproduce after the clean build so I can trace back from which
> > version it started appearing that way it will be easy to narrow down.
> >
>
> One more comment
> ReorderBufferLargestTopTXN
> {
> ..
> dlist_foreach(iter, &rb->toplevel_by_lsn)
>   {
>   ReorderBufferTXN *txn;
> + Size size = 0;
> + Size largest_size = 0;
>
>   txn = dlist_container(ReorderBufferTXN, node, iter.cur);
>
> - /* if the current transaction is larger, remember it */
> - if ((!largest) || (txn->size > largest->size))
> + /*
> + * If this transaction have some incomplete changes then only consider
> + * the size upto last complete lsn.
> + */
> + if (rbtxn_has_incomplete_tuple(txn))
> + size = txn->complete_size;
> + else
> + size = txn->total_size;
> +
> + /* If the current transaction is larger then remember it. */
> + if ((largest != NULL || size > largest_size) && size > 0)
>
> Here largest_size is a local variable inside the loop which is
> initialized to 0 in each iteration and that will lead to picking each
> next txn as largest.  This seems wrong to me.

You are right, will fix.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Sun, Jul 5, 2020 at 8:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jun 30, 2020 at 10:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> >
> > Yeah, I have run the regression suite,  I can see a lot of failure
> > maybe we can somehow see the diff and confirm that all the failures
> > are due to rollback to savepoint only.  I will work on this.
>
> I have compared the changes logged at command end vs logged at commit
> time.  I have ignored the invalidation for the transaction which has
> any aborted subtransaction in it.  While testing this I found one
> issue, the issue is that if there are some invalidation generated
> between last command counter increment and the commit transaction then
> those were not logged.  I have fixed the issue by logging the pending
> invalidation in RecordTransactionCommit.  I will include the changes
> in the next patch set.
>

I think it would have been better if you could have given examples of
the cases where you need this extra logging.  Anyway, below are a few
minor comments on this patch:

1.
+ /*
+ * Log any pending invalidations which are adding between the last
+ * command counter increment and the commit.
+ */
+ if (XLogLogicalInfoActive())
+ LogLogicalInvalidations();

I think we can change this comment slightly and extend a bit to say
for which kind of special cases we are adding this. "Log any pending
invalidations which are added between the last CommandCounterIncrement
and the commit.  Normally for DDLs, we log this at each command end,
however for certain cases where we directly update the system table
the invalidations were not logged at command end."

Something like above based on cases that are not covered by command
end WAL logging.

2.
+ * Emit WAL for invalidations.  This is currently only used for logging
+ * invalidations at the command end.
+ */
+void
+LogLogicalInvalidations()

After this is getting used at a new place, it is better to modify the
above comment to something like: "Emit WAL for invalidations.  This is
currently only used for logging invalidations at the command end or at
commit time if any invalidations are pending."

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Wed, Jul 8, 2020 at 9:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Jul 5, 2020 at 8:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have compared the changes logged at command end vs logged at commit
> > time.  I have ignored the invalidation for the transaction which has
> > any aborted subtransaction in it.  While testing this I found one
> > issue, the issue is that if there are some invalidation generated
> > between last command counter increment and the commit transaction then
> > those were not logged.  I have fixed the issue by logging the pending
> > invalidation in RecordTransactionCommit.  I will include the changes
> > in the next patch set.
> >
>
> I think it would have been better if you could have given examples for
> such cases where you need this extra logging.  Anyway, below are few
> minor comments on this patch:
>
> 1.
> + /*
> + * Log any pending invalidations which are adding between the last
> + * command counter increment and the commit.
> + */
> + if (XLogLogicalInfoActive())
> + LogLogicalInvalidations();
>
> I think we can change this comment slightly and extend a bit to say
> for which kind of special cases we are adding this. "Log any pending
> invalidations which are added between the last CommandCounterIncrement
> and the commit.  Normally for DDLs, we log this at each command end,
> however for certain cases where we directly update the system table
> the invalidations were not logged at command end."
>
> Something like above based on cases that are not covered by command
> end WAL logging.
>
> 2.
> + * Emit WAL for invalidations.  This is currently only used for logging
> + * invalidations at the command end.
> + */
> +void
> +LogLogicalInvalidations()
>
> After this is getting used at a new place, it is better to modify the
> above comment to something like: "Emit WAL for invalidations.  This is
> currently only used for logging invalidations at the command end or at
> commit time if any invalidations are pending."
>

I have done some more review and below are my comments:

Review-v31-0010-Provide-new-api-to-get-the-streaming-changes
----------------------------------------------------------------------------------------------
1.
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1240,6 +1240,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';

+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int,
VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';

If we are going to add a new streaming API for get_changes, don't we
need one for pg_logical_slot_get_binary_changes,
pg_logical_slot_peek_changes and pg_logical_slot_peek_binary_changes
as well?  I was thinking, why not add a new parameter (streaming
boolean) instead of adding the new APIs?  This could be an optional
parameter which, if the user doesn't specify it, will be considered
false.  We already have optional parameters for APIs like
pg_create_logical_replication_slot.

2. You forgot to update sgml/func.sgml.  This will be required even if
we decide to add a new parameter instead of a new API.

3.
+ /* If called has not asked for streaming changes then disable it. */
+ ctx->streaming &= streaming;

/If called/If the caller

4.
diff --git a/.gitignore b/.gitignore
index 794e35b..6083744 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/

Why the patch contains this change?

5. If I apply the first six patches and run the regression tests, they
fail primarily because streaming got enabled by default.  And then when
I applied this patch, the tests passed because it disables streaming by
default.  I think this should be patch 0007.

Replication Origins
------------------------------
I think we also need to conclude on the origins-related discussion [1].
As far as I can see, the origin_id can be sent with the first startup
message.  The origin_lsn and origin_commit can be sent with the last
start of streaming commit if we want, but I am not sure if that is of
use.  If we need to send it earlier then we need to record it with other
WAL records.  The point is that those are set with
pg_replication_origin_xact_setup, but I am not sure how and when that
function is called.  The other alternative is that we can ignore that
for now and once the usage is clear we can enhance it.  What do you
think?

[1] - https://www.postgresql.org/message-id/CAA4eK1JwXaCezFw%2BkZwoxbLKYD0nWpC2rPgx7vUsaDAc0AZaow%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Ajin Cherian
Дата:
I was going through this thread, testing and reviewing the patches, and I think this is a great feature to have and one which customers would appreciate. I wanted to help out, and I saw a request for a test patch for a GUC to always enable streaming on logical replication. Here's one on top of patchset v31, just in case you still need it. By default the GUC is turned on; I ran the regression tests with it and didn't see any errors.

thanks,
Ajin
Fujitsu Australia

On Wed, Jul 8, 2020 at 8:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Jul 8, 2020 at 9:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Jul 5, 2020 at 8:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have compared the changes logged at command end vs logged at commit
> > time.  I have ignored the invalidation for the transaction which has
> > any aborted subtransaction in it.  While testing this I found one
> > issue, the issue is that if there are some invalidation generated
> > between last command counter increment and the commit transaction then
> > those were not logged.  I have fixed the issue by logging the pending
> > invalidation in RecordTransactionCommit.  I will include the changes
> > in the next patch set.
> >
>
> I think it would have been better if you could have given examples for
> such cases where you need this extra logging.  Anyway, below are few
> minor comments on this patch:
>
> 1.
> + /*
> + * Log any pending invalidations which are adding between the last
> + * command counter increment and the commit.
> + */
> + if (XLogLogicalInfoActive())
> + LogLogicalInvalidations();
>
> I think we can change this comment slightly and extend a bit to say
> for which kind of special cases we are adding this. "Log any pending
> invalidations which are added between the last CommandCounterIncrement
> and the commit.  Normally for DDLs, we log this at each command end,
> however for certain cases where we directly update the system table
> the invalidations were not logged at command end."
>
> Something like above based on cases that are not covered by command
> end WAL logging.
>
> 2.
> + * Emit WAL for invalidations.  This is currently only used for logging
> + * invalidations at the command end.
> + */
> +void
> +LogLogicalInvalidations()
>
> After this is getting used at a new place, it is better to modify the
> above comment to something like: "Emit WAL for invalidations.  This is
> currently only used for logging invalidations at the command end or at
> commit time if any invalidations are pending."
>

I have done some more review and below are my comments:

Review-v31-0010-Provide-new-api-to-get-the-streaming-changes
----------------------------------------------------------------------------------------------
1.
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1240,6 +1240,14 @@ LANGUAGE INTERNAL
 VOLATILE ROWS 1000 COST 1000
 AS 'pg_logical_slot_get_changes';

+CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
+    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int,
VARIADIC options text[] DEFAULT '{}',
+    OUT lsn pg_lsn, OUT xid xid, OUT data text)
+RETURNS SETOF RECORD
+LANGUAGE INTERNAL
+VOLATILE ROWS 1000 COST 1000
+AS 'pg_logical_slot_get_streaming_changes';

If we are going to add a new streaming API for get_changes, don't we
need for pg_logical_slot_get_binary_changes,
pg_logical_slot_peek_changes and pg_logical_slot_peek_binary_changes
as well?  I was thinking why not add a new parameter (streaming
boolean) instead of adding the new APIs.  This could be an optional
parameter which if user doesn't specify will be considered as false.
We already have optional parameters for APIs like
pg_create_logical_replication_slot.

2. You forgot to update sgml/func.sgml.  This will be required even if
we decide to add a new parameter instead of a new API.

3.
+ /* If called has not asked for streaming changes then disable it. */
+ ctx->streaming &= streaming;

/If called/If the caller

4.
diff --git a/.gitignore b/.gitignore
index 794e35b..6083744 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,4 @@ lib*.pc
 /Debug/
 /Release/
 /tmp_install/
+/build/

Why the patch contains this change?

5. If I apply the first six patches and run the regressions, it fails
primarily because streaming got enabled by default.  And then when I
applied this patch, the tests passed because it disables streaming by
default.  I think this should be patch 0007.

Replication Origins
------------------------------
I think we also need to conclude on origins related discussion [1].
As far as I can see, the origin_id can be sent with the first startup
message. The origin_lsn and origin_commit can be sent with the last
start of streaming commit if we want but not sure if that is of use.
If we need to send it earlier then we need to record it with other WAL
records.  The point is that those are set with
pg_replication_origin_xact_setup but not sure how and when that
function is called.  The other alternative is that we can ignore that
for now and once the usage is clear we can enhance it.  What do you
think?

[1] - https://www.postgresql.org/message-id/CAA4eK1JwXaCezFw%2BkZwoxbLKYD0nWpC2rPgx7vUsaDAc0AZaow%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> I was going through this thread and testing and reviewing the patches, I think this is a great feature to have and
> one which customers would appreciate. I wanted to help out, and I saw a request for a test patch for a GUC to always
> enable streaming on logical replication. Here's one on top of patchset v31, just in case you still need it. By default
> the GUC is turned on, I ran the regression tests with it and didn't see any errors.
>

Thanks for showing interest in the patch.  How have you ensured that
streaming is happening?  I don't think the proposed patch can ensure
it for every case, because we also rely on logical_decoding_work_mem to
decide whether to stream/spill; see ReorderBufferCheckMemoryLimit.  I
think with your patch it will allow streaming for cases where we have a
large amount of WAL to decode.

I feel you need to add some DEBUG messages (or some other way) to
ensure that all existing and new test cases related to logical
decoding will perform the streaming.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Ajin Cherian
Дата:


On Thu, Jul 9, 2020 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian <itsajin@gmail.com> wrote:

Thanks for showing the interest in patch.  How have you ensured that
streaming is happening?  I don't think the proposed patch can ensure
it for every case because we also rely on logical_decoding_work_mem to
decide whether to stream/spill, see ReorderBufferCheckMemoryLimit.  I
think with your patch it will allow streaming for cases where we have
large amount of WAL to decode.


Maybe I missed something, but I looked at ReorderBufferCheckMemoryLimit; even there it checks the same function ReorderBufferCanStream() and decides whether to stream or spill. Did I miss something?

    while (rb->size >= logical_decoding_work_mem * 1024L)
    {
        /*
         * Pick the largest transaction (or subtransaction) and evict it from
         * memory by streaming, if supported. Otherwise, spill to disk.
         */
        if (ReorderBufferCanStream(rb) &&
            (txn = ReorderBufferLargestTopTXN(rb)) != NULL)
        {
            /* we know there has to be one, because the size is not zero */
            Assert(txn && !txn->toptxn);
            Assert(txn->total_size > 0);
            Assert(rb->size >= txn->total_size);

            ReorderBufferStreamTXN(rb, txn);
        }
        else
        { 

I will also add the debug messages and tests as you suggested.

regards,
Ajin Cherian
Fujitsu Australia

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Thu, Jul 9, 2020 at 8:18 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Thu, Jul 9, 2020 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian <itsajin@gmail.com> wrote:
>>
>> Thanks for showing the interest in patch.  How have you ensured that
>> streaming is happening?  I don't think the proposed patch can ensure
>> it for every case because we also rely on logical_decoding_work_mem to
>> decide whether to stream/spill, see ReorderBufferCheckMemoryLimit.  I
>> think with your patch it will allow streaming for cases where we have
>> large amount of WAL to decode.
>>
>
> Maybe I missed something but I looked at ReorderBufferCheckMemoryLimit, even there it checks the same function
> ReorderBufferCanStream() and decides whether to stream or spill. Did I miss something?
>
>
>     while (rb->size >= logical_decoding_work_mem * 1024L)
>     {

There is a check before above loop:

ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
{
ReorderBufferTXN *txn;

/* bail out if we haven't exceeded the memory limit */
if (rb->size < logical_decoding_work_mem * 1024L)
return;

This will prevent the streaming/spill from occurring.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Thu, Jul 9, 2020 at 8:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 9, 2020 at 8:18 AM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > On Thu, Jul 9, 2020 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>
> >> On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian <itsajin@gmail.com> wrote:
> >>
> >> Thanks for showing the interest in patch.  How have you ensured that
> >> streaming is happening?  I don't think the proposed patch can ensure
> >> it for every case because we also rely on logical_decoding_work_mem to
> >> decide whether to stream/spill, see ReorderBufferCheckMemoryLimit.  I
> >> think with your patch it will allow streaming for cases where we have
> >> large amount of WAL to decode.
> >>
> >
> > Maybe I missed something but I looked at ReorderBufferCheckMemoryLimit, even there it checks the same function
> > ReorderBufferCanStream() and decides whether to stream or spill. Did I miss something?
> >
> >
> >     while (rb->size >= logical_decoding_work_mem * 1024L)
> >     {
>
> There is a check before above loop:
>
> ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
> {
> ReorderBufferTXN *txn;
>
> /* bail out if we haven't exceeded the memory limit */
> if (rb->size < logical_decoding_work_mem * 1024L)
> return;
>
> This will prevent the streaming/spill to occur.

I think if the GUC is set then maybe we can bypass this check so that
it can try to stream every single change?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Thu, Jul 9, 2020 at 8:55 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Jul 9, 2020 at 8:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Jul 9, 2020 at 8:18 AM Ajin Cherian <itsajin@gmail.com> wrote:
> > >
> > > On Thu, Jul 9, 2020 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >>
> > >> On Wed, Jul 8, 2020 at 7:31 PM Ajin Cherian <itsajin@gmail.com> wrote:
> > >>
> > >> Thanks for showing the interest in patch.  How have you ensured that
> > >> streaming is happening?  I don't think the proposed patch can ensure
> > >> it for every case because we also rely on logical_decoding_work_mem to
> > >> decide whether to stream/spill, see ReorderBufferCheckMemoryLimit.  I
> > >> think with your patch it will allow streaming for cases where we have
> > >> large amount of WAL to decode.
> > >>
> > >
> > > Maybe I missed something but I looked at ReorderBufferCheckMemoryLimit, even there it checks the same function
> > > ReorderBufferCanStream() and decides whether to stream or spill. Did I miss something?
> > >
> > >
> > >     while (rb->size >= logical_decoding_work_mem * 1024L)
> > >     {
> >
> > There is a check before above loop:
> >
> > ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
> > {
> > ReorderBufferTXN *txn;
> >
> > /* bail out if we haven't exceeded the memory limit */
> > if (rb->size < logical_decoding_work_mem * 1024L)
> > return;
> >
> > This will prevent the streaming/spill to occur.
>
> I think if the GUC is set then maybe we can bypass this check so that
> it can try to stream every single change?
>

Yeah, and probably we need to do something for the check "while
(rb->size >= logical_decoding_work_mem * 1024L)" as well.
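
i.e., something along these lines in ReorderBufferCheckMemoryLimit
(just a sketch, assuming a boolean GUC along the lines of Ajin's test
patch; I am calling it force_stream here only for illustration):

    /* bail out if we haven't exceeded the memory limit and are not forced */
    if (rb->size < logical_decoding_work_mem * 1024L && !force_stream)
        return;

    while (rb->size > 0 &&
           (rb->size >= logical_decoding_work_mem * 1024L || force_stream))
    {
        /* unchanged: pick the largest (sub)transaction and stream or spill it */
        ...
    }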

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Wed, Jul 8, 2020 at 3:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jul 8, 2020 at 9:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sun, Jul 5, 2020 at 8:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > I have compared the changes logged at command end vs logged at commit
> > > time.  I have ignored the invalidation for the transaction which has
> > > any aborted subtransaction in it.  While testing this I found one
> > > issue, the issue is that if there are some invalidation generated
> > > between last command counter increment and the commit transaction then
> > > those were not logged.  I have fixed the issue by logging the pending
> > > invalidation in RecordTransactionCommit.  I will include the changes
> > > in the next patch set.
> > >
> >
> > I think it would have been better if you could have given examples for
> > such cases where you need this extra logging.  Anyway, below are few
> > minor comments on this patch:
> >
> > 1.
> > + /*
> > + * Log any pending invalidations which are adding between the last
> > + * command counter increment and the commit.
> > + */
> > + if (XLogLogicalInfoActive())
> > + LogLogicalInvalidations();
> >
> > I think we can change this comment slightly and extend a bit to say
> > for which kind of special cases we are adding this. "Log any pending
> > invalidations which are added between the last CommandCounterIncrement
> > and the commit.  Normally for DDLs, we log this at each command end,
> > however for certain cases where we directly update the system table
> > the invalidations were not logged at command end."
> >
> > Something like above based on cases that are not covered by command
> > end WAL logging.
> >
> > 2.
> > + * Emit WAL for invalidations.  This is currently only used for logging
> > + * invalidations at the command end.
> > + */
> > +void
> > +LogLogicalInvalidations()
> >
> > After this is getting used at a new place, it is better to modify the
> > above comment to something like: "Emit WAL for invalidations.  This is
> > currently only used for logging invalidations at the command end or at
> > commit time if any invalidations are pending."
> >
>
> I have done some more review and below are my comments:
>
> Review-v31-0010-Provide-new-api-to-get-the-streaming-changes
> ----------------------------------------------------------------------------------------------
> 1.
> --- a/src/backend/catalog/system_views.sql
> +++ b/src/backend/catalog/system_views.sql
> @@ -1240,6 +1240,14 @@ LANGUAGE INTERNAL
>  VOLATILE ROWS 1000 COST 1000
>  AS 'pg_logical_slot_get_changes';
>
> +CREATE OR REPLACE FUNCTION pg_logical_slot_get_streaming_changes(
> +    IN slot_name name, IN upto_lsn pg_lsn, IN upto_nchanges int,
> VARIADIC options text[] DEFAULT '{}',
> +    OUT lsn pg_lsn, OUT xid xid, OUT data text)
> +RETURNS SETOF RECORD
> +LANGUAGE INTERNAL
> +VOLATILE ROWS 1000 COST 1000
> +AS 'pg_logical_slot_get_streaming_changes';
>
> If we are going to add a new streaming API for get_changes, don't we
> need for pg_logical_slot_get_binary_changes,
> pg_logical_slot_peek_changes and pg_logical_slot_peek_binary_changes
> as well?  I was thinking why not add a new parameter (streaming
> boolean) instead of adding the new APIs.  This could be an optional
> parameter which if user doesn't specify will be considered as false.
> We already have optional parameters for APIs like
> pg_create_logical_replication_slot.
>
> 2. You forgot to update sgml/func.sgml.  This will be required even if
> we decide to add a new parameter instead of a new API.
>
> 3.
> + /* If called has not asked for streaming changes then disable it. */
> + ctx->streaming &= streaming;
>
> /If called/If the caller
>
> 4.
> diff --git a/.gitignore b/.gitignore
> index 794e35b..6083744 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -42,3 +42,4 @@ lib*.pc
>  /Debug/
>  /Release/
>  /tmp_install/
> +/build/
>
> Why the patch contains this change?
>
> 5. If I apply the first six patches and run the regressions, it fails
> primarily because streaming got enabled by default.  And then when I
> applied this patch, the tests passed because it disables streaming by
> default.  I think this should be patch 0007.

I am only replying to the replication origin point here; the other
comments look fine to me, so I will work on those.

> Replication Origins
> ------------------------------
> I think we also need to conclude on origins related discussion [1].
> As far as I can see, the origin_id can be sent with the first startup
> message. The origin_lsn and origin_commit can be sent with the last
> start of streaming commit if we want but not sure if that is of use.
> If we need to send it earlier then we need to record it with other WAL
> records.  The point is that those are set with
> pg_replication_origin_xact_setup but not sure how and when that
> function is called.

pg_replication_origin_xact_setup is an exposed function, so it allows a
user to set an origin for their session so that all the operations done
from that session will be marked with that origin id.  And the clear use
case for this is to avoid sending such transactions by using
FilterByOrigin.  But I am not sure about the point that we discussed at
[1], i.e., what is the use of the origin and origin_lsn we send at
pgoutput_begin_txn.
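For reference, the filtering itself happens through the filter_by_origin
callback; a minimal sketch of such a callback (modelled on
pgoutput_origin_filter, where returning true means the change is skipped;
the actual behaviour in the patch may differ) is:

static bool
pgoutput_origin_filter(LogicalDecodingContext *ctx,
					   RepOriginId origin_id)
{
	/* skip everything that was not generated locally */
	return (origin_id != InvalidRepOriginId);
}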

> The other alternative is that we can ignore that
> for now and once the usage is clear we can enhance it.  What do you
> think?

That seems like a sensible option to me.

> [1] - https://www.postgresql.org/message-id/CAA4eK1JwXaCezFw%2BkZwoxbLKYD0nWpC2rPgx7vUsaDAc0AZaow%40mail.gmail.com


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Thu, Jul 9, 2020 at 2:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jul 8, 2020 at 3:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> Only replying to the replication origin point, other comment looks
> fine to me so I will work on those.
>
> > Replication Origins
> > ------------------------------
> > I think we also need to conclude on origins related discussion [1].
> > As far as I can see, the origin_id can be sent with the first startup
> > message. The origin_lsn and origin_commit can be sent with the last
> > start of streaming commit if we want but not sure if that is of use.
> > If we need to send it earlier then we need to record it with other WAL
> > records.  The point is that those are set with
> > pg_replication_origin_xact_setup but not sure how and when that
> > function is called.
>
> pg_replication_origin_xact_setup is exposed function so this will
> allow a user to set an origin for their session so that all the
> operation done from that session will be marked by that origin id.
>

Hmm, I think that can be done by pg_replication_origin_session_setup.

> And the clear use case for this is to avoid sending such transactions
> by using FilterByOrigin.  But I am not sure about the point that we
> discussed at [1] that what is the use of the origin and origin_lsn we
> send at pgoutput_begin_txn.
>

I could see the use of 'origin' with FilterByOrigin but not sure how
origin_lsn can be used?

> > The other alternative is that we can ignore that
> > for now and once the usage is clear we can enhance it.  What do you
> > think?
>
> That seems like a sensible option to me.
>

I have responded to that another thread.  Let us see if someone
responds to it.  Feel free to add if you have some points related to
that thread.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Ajin Cherian
Date:


On Thu, Jul 9, 2020 at 1:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> I think if the GUC is set then maybe we can bypass this check so that
> it can try to stream every single change?
>

Yeah and probably we need to do something for the check "while
(rb->size >= logical_decoding_work_mem * 1024L)" as well.


I have made this change as discussed, and the regression tests seem to run fine. I have added a debug message that records the streaming for each transaction number. I also had to bypass certain asserts in ReorderBufferLargestTopTXN(), as now we are going through the entire list of transactions and not just picking the biggest transaction.

regards,
Ajin 
Fujitsu Australia
Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Fri, Jul 10, 2020 at 9:21 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
>
>
> On Thu, Jul 9, 2020 at 1:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>>
>> > I think if the GUC is set then maybe we can bypass this check so that
>> > it can try to stream every single change?
>> >
>>
>> Yeah and probably we need to do something for the check "while
>> (rb->size >= logical_decoding_work_mem * 1024L)" as well.
>>
>>
> I have made this change, as discussed, the regression tests seem to run fine. I have added a debug that records the
> streaming for each transaction number. I also had to bypass certain asserts in ReorderBufferLargestTopTXN() as now we
> are going through the entire list of transactions and not just picking the biggest transaction.

So if always_stream_logical is true then we always go for streaming
even if the size limit is not reached, and that is good.  And if
always_stream_logical is set then we are setting ctx->streaming=true,
which is also good.  So now I don't think we need to change this part
of the code, because when we bypass the memory limit and set
ctx->streaming=true it will always select the streaming option unless
that is impossible.  With your changes, due to incomplete toast
changes, it sometimes cannot pick the largest top txn for streaming and
will then hang forever in the while loop; in that case, it should go
for spilling.

while (rb->size >= logical_decoding_work_mem * 1024L)
{
/*
* Pick the largest transaction (or subtransaction) and evict it from
* memory by streaming, if supported. Otherwise, spill to disk.
*/
if (ReorderBufferCanStream(rb) &&
(txn = ReorderBufferLargestTopTXN(rb)) != NULL)


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Ajin Cherian
Date:


On Fri, Jul 10, 2020 at 3:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
With your changes sometimes due to incomplete toast
changes, if it can not pick the largest top txn for streaming it will
hang forever in the while loop, in that case, it should go for
spilling.

while (rb->size >= logical_decoding_work_mem * 1024L)
{
/*
* Pick the largest transaction (or subtransaction) and evict it from
* memory by streaming, if supported. Otherwise, spill to disk.
*/
if (ReorderBufferCanStream(rb) &&
(txn = ReorderBufferLargestTopTXN(rb)) != NULL)



Which condition is this (not picking the largest top txn)? Wouldn't ReorderBufferLargestTopTXN then return NULL? If not, is there a way to know that a transaction cannot be streamed, so there can be an exit condition for the while loop?

regards,
Ajin Cherian
Fujitsu Australia

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Fri, Jul 10, 2020 at 11:01 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
>
>
> On Fri, Jul 10, 2020 at 3:11 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> With your changes sometimes due to incomplete toast
>> changes, if it can not pick the largest top txn for streaming it will
>> hang forever in the while loop, in that case, it should go for
>> spilling.
>>
>> while (rb->size >= logical_decoding_work_mem * 1024L)
>> {
>> /*
>> * Pick the largest transaction (or subtransaction) and evict it from
>> * memory by streaming, if supported. Otherwise, spill to disk.
>> */
>> if (ReorderBufferCanStream(rb) &&
>> (txn = ReorderBufferLargestTopTXN(rb)) != NULL)
>>
>>
>
> Which is this condition (of not picking largest top txn)? Wouldn't ReorderBufferLargestTopTXN then return a NULL? If
> not, is there a way to know that a transaction cannot be streamed, so there can be an exit condition for the while
> loop?


Okay, I see, so if ReorderBufferLargestTopTXN returns NULL you are
breaking the loop.  I did not see the other part of the patch, but I
agree that it will not go into an infinite loop.
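So, to summarize, the eviction loop would roughly fall back to spilling
whenever no streamable transaction can be picked (a sketch only; function
names as in the patch under discussion):

while (rb->size >= logical_decoding_work_mem * 1024L)
{
	if (ReorderBufferCanStream(rb) &&
		(txn = ReorderBufferLargestTopTXN(rb)) != NULL)
	{
		/* stream the largest streamable top-level transaction */
		ReorderBufferStreamTXN(rb, txn);
	}
	else
	{
		/*
		 * Either streaming is not supported, or no top-level transaction
		 * can currently be streamed (e.g. because of incomplete toast
		 * changes), so spill the largest transaction to disk instead.
		 */
		txn = ReorderBufferLargestTXN(rb);
		ReorderBufferSerializeTXN(rb, txn);
	}
}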


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jun 30, 2020 at 5:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Let me know what you think about the above changes.
> >
>
> I went ahead and made few changes in
> 0005-Implement-streaming-mode-in-ReorderBuffer which are explained
> below.  I have few questions and suggestions for the patch as well
> which are also covered in below points.
>
> 1.
> + if (prev_lsn == InvalidXLogRecPtr)
> + {
> + if (streaming)
> + rb->stream_start(rb, txn, change->lsn);
> + else
> + rb->begin(rb, txn);
> + stream_started = true;
> + }
>
> I don't think we want to move begin callback here that will change the
> existing semantics, so it is better to move begin at its original
> position. I have made the required changes in the attached patch.
>
> 2.
> ReorderBufferTruncateTXN()
> {
> ..
> + dlist_foreach_modify(iter, &txn->changes)
> + {
> + ReorderBufferChange *change;
> +
> + change = dlist_container(ReorderBufferChange, node, iter.cur);
> +
> + /* remove the change from it's containing list */
> + dlist_delete(&change->node);
> +
> + ReorderBufferReturnChange(rb, change);
> + }
> ..
> }
>
> I think here we can add an Assert that we're not mixing changes from
> different transactions.  See the changes in the patch.
>
> 3.
> SetupCheckXidLive()
> {
> ..
> + /*
> + * setup CheckXidAlive if it's not committed yet. We don't check if the xid
> + * aborted. That will happen during catalog access.  Also, reset the
> + * bsysscan flag.
> + */
> + if (!TransactionIdDidCommit(xid))
> + {
> + CheckXidAlive = xid;
> + bsysscan = false;
> ..
> }
>
> What is the need to reset bsysscan flag here if we are already
> resetting on error (like in the previous patch sent by me)?
>
> 4.
> ReorderBufferProcessTXN()
> {
> ..
> ..
> + /* Reset the CheckXidAlive */
> + if (streaming)
> + CheckXidAlive = InvalidTransactionId;
> ..
> }
>
> Similar to the previous point, we don't need this as well because
> AbortCurrentTransaction would have taken care of this.
>
> 5.
> + * XXX Do we need to check if the transaction has some changes to stream
> + * (maybe it got streamed right before the commit, which attempts to
> + * stream it again before the commit)?
> + */
> +static void
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
>
> The above comment doesn't make much sense to me, so I have removed it.
> Basically, if there are no changes before commit, we still need to
> send commit and anyway if there are no more changes
> ReorderBufferProcessTXN will not do anything.
>
> 6.
> +ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
> {
> ..
> if (txn->snapshot_now == NULL)
> + {
> + dlist_iter subxact_i;
> +
> + /* make sure this transaction is streamed for the first time */
> + Assert(!rbtxn_is_streamed(txn));
> +
> + /* at the beginning we should have invalid command ID */
> + Assert(txn->command_id == InvalidCommandId);
> +
> + dlist_foreach(subxact_i, &txn->subtxns)
> + {
> + ReorderBufferTXN *subtxn;
> +
> + subtxn = dlist_container(ReorderBufferTXN, node, subxact_i.cur);
> + ReorderBufferTransferSnapToParent(txn, subtxn);
> + }
> ..
> }
>
> Here, it is possible that there is no base_snapshot for txn, so we
> need a check for that similar to ReorderBufferCommit.
>
> 7.  Apart from the above, I made few changes in comments and ran pgindent.
>
> 8. We can't stream the transaction before we reach the
> SNAPBUILD_CONSISTENT state because some other output plugin can apply
> those changes unlike what we do with pgoutput plugin (which writes to
> file). And, I think applying the transactions without reaching a
> consistent state would be anyway wrong.  So, we should avoid that and
> if do that then we should have an Assert for streamed txns rather than
> sending abort for them in ReorderBufferForget.

I was analyzing this point so currently, we only enable streaming in
StartReplicationSlot so basically in CreateReplicationSlot the
streaming will be always off because by that time plugins are not yet
startup that will happen only on StartReplicationSlot.  See below
snippet from patch 0007.  However, I agree that during the start of
replication on the slot we might decode some extra WAL for transactions
for which we have already got the commit confirmation, and we must have
a way to avoid that.  But I think we don't need to do anything for the
CONSISTENT snapshot point.  What's your thought on this?

@@ -1016,6 +1016,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
  WalSndPrepareWrite, WalSndWriteData,
  WalSndUpdateProgress);

+ /*
+ * Make sure streaming is disabled here - we may have the methods,
+ * but we don't have anywhere to send the data yet.
+ */
+ ctx->streaming = false;
+

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Mon, Jul 6, 2020 at 11:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > >
> > > > 9.
> > > > +ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > > > {
> > > > ..
> > > > + ReorderBufferToastReset(rb, txn);
> > > > + if (specinsert != NULL)
> > > > + ReorderBufferReturnChange(rb, specinsert);
> > > > ..
> > > > }
> > > >
> > > > Why do we need to do these here when we wouldn't have been done for
> > > > any exception other than ERRCODE_TRANSACTION_ROLLBACK?
> > >
> > > Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK"
> > > gracefully and we are continuing with further decoding so we need to
> > > return this change back.
> > >
> >
> > Okay, then I suggest we should do these before calling stream_stop and
> > also move ReorderBufferResetTXN after calling stream_stop  to follow a
> > pattern similar to try block unless there is a reason for not doing
> > so.  Also, it would be good if we can initialize specinsert with NULL
> > after returning the change as we are doing at other places.
>
> Okay
>
> > > > 10.  I have got the below failure once.  I have not investigated this
> > > > in detail as the patch is still under progress.  See, if you have any
> > > > idea?
> > > > #   Failed test 'check extra columns contain local defaults'
> > > > #   at t/013_stream_subxact_ddl_abort.pl line 81.
> > > > #          got: '2|0'
> > > > #     expected: '1000|500'
> > > > # Looks like you failed 1 test of 2.
> > > > make[2]: *** [check] Error 1
> > > > make[1]: *** [check-subscription-recurse] Error 2
> > > > make[1]: *** Waiting for unfinished jobs....
> > > > make: *** [check-world-src/test-recurse] Error 2
> > >
> > > Even I got the failure once and after that, it did not reproduce.  I
> > > have executed it multiple time but it did not reproduce again.  Are
> > > you able to reproduce it consistently?
> > >
> >
> > No, I am also not able to reproduce it consistently but I think this
> > can fail if a subscriber sends the replay_location before actually
> > replaying the changes.  First, I thought that extra send_feedback we
> > have in apply_handle_stream_commit might have caused this but I guess
> > that can't happen because we need the commit time location for that
> > and we are storing the same at the end of apply_handle_stream_commit
> > after applying all messages.  I am not sure what is going on here.  I
> > think we somehow need to reproduce this or some variant of this test
> > consistently to find the root cause.
>
> And I think it appeared first time for me,  so maybe either induced
> from past few versions so some changes in the last few versions might
> have exposed it.  I have noticed that almost 50% of the time I am able
> to reproduce after the clean build so I can trace back from which
> version it started appearing that way it will be easy to narrow down.

I think the reason for the failure is that we are not setting
remote_final_lsn in the streaming mode.  I have put in multiple logs and
executed the test, and from the logs it appeared that some of the
logical WAL did not get replayed due to the below check in
should_apply_changes_for_rel.
return (rel->state == SUBREL_STATE_READY || (rel->state ==
SUBREL_STATE_SYNCDONE && rel->statelsn <= remote_final_lsn));

I still need to do a detailed analysis of why this fails only in some
cases.  Basically, most of the time rel->state is SUBREL_STATE_READY,
so this check passes, but whenever the state is SUBREL_STATE_SYNCDONE
it failed because we never update remote_final_lsn.  I will try to set
this value in apply_handle_stream_commit and see whether it ever fails
or not.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Fri, Jul 10, 2020 at 3:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > 8. We can't stream the transaction before we reach the
> > SNAPBUILD_CONSISTENT state because some other output plugin can apply
> > those changes unlike what we do with pgoutput plugin (which writes to
> > file). And, I think applying the transactions without reaching a
> > consistent state would be anyway wrong.  So, we should avoid that and
> > if do that then we should have an Assert for streamed txns rather than
> > sending abort for them in ReorderBufferForget.
>
> I was analyzing this point so currently, we only enable streaming in
> StartReplicationSlot so basically in CreateReplicationSlot the
> streaming will be always off because by that time plugins are not yet
> startup that will happen only on StartReplicationSlot.
>

What do you mean by 'startup' in the above sentence?  AFAICS, we do
call startup_cb_wrapper in CreateInitDecodingContext which is called
from both CreateReplicationSlot and create_logical_replication_slot
before the start of decoding.  In CreateInitDecodingContext, we call
StartupDecodingContext which should load the plugin.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Mon, Jul 13, 2020 at 10:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jul 10, 2020 at 3:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > 8. We can't stream the transaction before we reach the
> > > SNAPBUILD_CONSISTENT state because some other output plugin can apply
> > > those changes unlike what we do with pgoutput plugin (which writes to
> > > file). And, I think applying the transactions without reaching a
> > > consistent state would be anyway wrong.  So, we should avoid that and
> > > if do that then we should have an Assert for streamed txns rather than
> > > sending abort for them in ReorderBufferForget.
> >
> > I was analyzing this point so currently, we only enable streaming in
> > StartReplicationSlot so basically in CreateReplicationSlot the
> > streaming will be always off because by that time plugins are not yet
> > startup that will happen only on StartReplicationSlot.
> >
>
> What do you mean by 'startup' in the above sentence?  AFAICS, we do
> call startup_cb_wrapper in CreateInitDecodingContext which is called
> from both CreateReplicationSlot and create_logical_replication_slot
> before the start of decoding.  In CreateInitDecodingContext, we call
> StartupDecodingContext which should load the plugin.

Yeah, you are right that we do call startup_cb_wrapper from
CreateInitDecodingContext as well.  I think I got confused by below
comment in patch 0007

@@ -1016,6 +1016,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
WalSndPrepareWrite, WalSndWriteData,
WalSndUpdateProgress);
+ /*
+ * Make sure streaming is disabled here - we may have the methods,
+ * but we don't have anywhere to send the data yet.
+ */
+ ctx->streaming = false;
+

Basically, during CreateReplicationSlot we forcefully disable the
streaming with the comment "we don't have anywhere to send the data
yet".  So during CreateReplicationSlot the streaming will always be
off, and once we are done with creating the slot we will have a
consistent snapshot.  So my point is: while decoding, can we just check
that we do not start streaming unless the current LSN has reached the
start_decoding_at point, and start only after that?  At that time we
can have an assert that the snapshot should be CONSISTENT.  However,
before doing that I need to check why we are setting ctx->streaming to
false after creating the slot.
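As a rough sketch of that idea (a hypothetical helper; the SnapBuild
routines are the existing ones from snapbuild.c, and the exact check in
the patch may differ), streaming could be gated on something like:

static bool
ReorderBufferCanStartStreaming(ReorderBuffer *rb)
{
	LogicalDecodingContext *ctx = rb->private_data;
	SnapBuild  *builder = ctx->snapshot_builder;

	/*
	 * Don't stream before we have a consistent snapshot and before we have
	 * passed the point (start_decoding_at) from which the client asked us
	 * to send changes.
	 */
	return ReorderBufferCanStream(rb) &&
		SnapBuildCurrentState(builder) == SNAPBUILD_CONSISTENT &&
		!SnapBuildXactNeedsSkip(builder, ctx->reader->EndRecPtr);
}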

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Sun, Jul 12, 2020 at 9:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jul 6, 2020 at 11:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > >
> > > > > 9.
> > > > > +ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > > > > {
> > > > > ..
> > > > > + ReorderBufferToastReset(rb, txn);
> > > > > + if (specinsert != NULL)
> > > > > + ReorderBufferReturnChange(rb, specinsert);
> > > > > ..
> > > > > }
> > > > >
> > > > > Why do we need to do these here when we wouldn't have been done for
> > > > > any exception other than ERRCODE_TRANSACTION_ROLLBACK?
> > > >
> > > > Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK"
> > > > gracefully and we are continuing with further decoding so we need to
> > > > return this change back.
> > > >
> > >
> > > Okay, then I suggest we should do these before calling stream_stop and
> > > also move ReorderBufferResetTXN after calling stream_stop  to follow a
> > > pattern similar to try block unless there is a reason for not doing
> > > so.  Also, it would be good if we can initialize specinsert with NULL
> > > after returning the change as we are doing at other places.
> >
> > Okay
> >
> > > > > 10.  I have got the below failure once.  I have not investigated this
> > > > > in detail as the patch is still under progress.  See, if you have any
> > > > > idea?
> > > > > #   Failed test 'check extra columns contain local defaults'
> > > > > #   at t/013_stream_subxact_ddl_abort.pl line 81.
> > > > > #          got: '2|0'
> > > > > #     expected: '1000|500'
> > > > > # Looks like you failed 1 test of 2.
> > > > > make[2]: *** [check] Error 1
> > > > > make[1]: *** [check-subscription-recurse] Error 2
> > > > > make[1]: *** Waiting for unfinished jobs....
> > > > > make: *** [check-world-src/test-recurse] Error 2
> > > >
> > > > Even I got the failure once and after that, it did not reproduce.  I
> > > > have executed it multiple time but it did not reproduce again.  Are
> > > > you able to reproduce it consistently?
> > > >
> > >
> > > No, I am also not able to reproduce it consistently but I think this
> > > can fail if a subscriber sends the replay_location before actually
> > > replaying the changes.  First, I thought that extra send_feedback we
> > > have in apply_handle_stream_commit might have caused this but I guess
> > > that can't happen because we need the commit time location for that
> > > and we are storing the same at the end of apply_handle_stream_commit
> > > after applying all messages.  I am not sure what is going on here.  I
> > > think we somehow need to reproduce this or some variant of this test
> > > consistently to find the root cause.
> >
> > And I think it appeared first time for me,  so maybe either induced
> > from past few versions so some changes in the last few versions might
> > have exposed it.  I have noticed that almost 50% of the time I am able
> > to reproduce after the clean build so I can trace back from which
> > version it started appearing that way it will be easy to narrow down.
>
> I think the reason for the failure is that we are not setting
> remote_final_lsn, in the streaming mode.  I have put multiple logs and
> executed in log and from logs it appeared that some of the logical wal
> did not get replayed due to below check in
> should_apply_changes_for_rel.
> return (rel->state == SUBREL_STATE_READY || (rel->state ==
> SUBREL_STATE_SYNCDONE && rel->statelsn <= remote_final_lsn));
>
> I still need to do the detailed analysis that why does this fail in
> some cases,  basically, most of the time the rel->state is
> SUBREL_STATE_READY so this check passes but whenever the state is
> SUBREL_STATE_SYNCDONE it failed because we never update
> remote_final_lsn.  I will try to set this value in
> apply_handle_stream_commit and see whether it ever fails or not.

I have verified that after setting the remote_final_lsn in the
apply_handle_stream_commit, I don't see that regression failure in
over 70 runs whereas without that change it failed 6 times in 50 runs.
Apart from this, I have noticed one more thing related to the same
point.  Basically, in the apply_handle_commit, we are calling
process_syncing_tables whereas we are not calling the same in
apply_handle_stream_commit.
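For reference, the fix I am describing would look roughly like this (a
sketch only; logicalrep_read_stream_commit and LogicalRepCommitData follow
the patch and the existing protocol code, and the last line adds the
process_syncing_tables call mentioned above):

static void
apply_handle_stream_commit(StringInfo s)
{
	TransactionId xid;
	LogicalRepCommitData commit_data;

	xid = logicalrep_read_stream_commit(s, &commit_data);

	/*
	 * Set remote_final_lsn before replaying the spooled changes, so that
	 * should_apply_changes_for_rel() sees the commit LSN for relations
	 * that are in SUBREL_STATE_SYNCDONE state.
	 */
	remote_final_lsn = commit_data.commit_lsn;

	/* ... replay the changes spooled to file for this transaction ... */

	/* process any tables that are being synchronized in parallel */
	process_syncing_tables(commit_data.end_lsn);
}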

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Mon, Jul 13, 2020 at 10:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jul 13, 2020 at 10:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Jul 10, 2020 at 3:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > >
> > > > 8. We can't stream the transaction before we reach the
> > > > SNAPBUILD_CONSISTENT state because some other output plugin can apply
> > > > those changes unlike what we do with pgoutput plugin (which writes to
> > > > file). And, I think applying the transactions without reaching a
> > > > consistent state would be anyway wrong.  So, we should avoid that and
> > > > if do that then we should have an Assert for streamed txns rather than
> > > > sending abort for them in ReorderBufferForget.
> > >
> > > I was analyzing this point so currently, we only enable streaming in
> > > StartReplicationSlot so basically in CreateReplicationSlot the
> > > streaming will be always off because by that time plugins are not yet
> > > startup that will happen only on StartReplicationSlot.
> > >
> >
> > What do you mean by 'startup' in the above sentence?  AFAICS, we do
> > call startup_cb_wrapper in CreateInitDecodingContext which is called
> > from both CreateReplicationSlot and create_logical_replication_slot
> > before the start of decoding.  In CreateInitDecodingContext, we call
> > StartupDecodingContext which should load the plugin.
>
> Yeah, you are right that we do call startup_cb_wrapper from
> CreateInitDecodingContext as well.  I think I got confused by below
> comment in patch 0007
>
> @@ -1016,6 +1016,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
> WalSndPrepareWrite, WalSndWriteData,
> WalSndUpdateProgress);
> + /*
> + * Make sure streaming is disabled here - we may have the methods,
> + * but we don't have anywhere to send the data yet.
> + */
> + ctx->streaming = false;
> +
>
> Basically, during CreateReplicationSlot we forcefully disable the
> streaming with the comment "we don't have anywhere to send the data
> yet".  So my point is during CreateReplicationSlot time the streaming
> will always be off and once we are done with creating the slot we will
> be having consistent snapshot.  So my point is can we just check that
> while decoding unless the current LSN reaches the start_decoding_at
> point we should not start streaming and after that we can start.  At
> that time we can have an assert that the snapshot should be
> CONSISTENT.  However, before doing that I need to check on this point
> that why after creating slot we are setting ctx->streaming to false.
>

I think you can refer to commit message as well for that "We however
must explicitly disable streaming replication during replication slot
creation, even if the plugin supports it. We don't need to replicate
the changes accumulated during this phase, and moreover, we don't have
a replication connection open so we don't have where to send the data
anyway.".  I don't think this is a good way to hack the streaming flag
because for SQL API's, we don't have a good reason to disable the
streaming in this way.  I guess if we had a condition related to
reaching CONSISTENT snapshot during streaming then we won't need to
hack the streaming flag in this way.  Once we reach the CONSISTENT
snapshot state, we come out of the creation of a replication slot (see
how we use DecodingContextReady to achieve that) phase.  So, I feel we
should remove the ctx->streaming setting to false and add a CONSISTENT
snapshot check during streaming unless you have a reason for not doing
so.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Mon, Jul 13, 2020 at 10:47 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sun, Jul 12, 2020 at 9:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jul 6, 2020 at 11:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > >
> > > > > > 10.  I have got the below failure once.  I have not investigated this
> > > > > > in detail as the patch is still under progress.  See, if you have any
> > > > > > idea?
> > > > > > #   Failed test 'check extra columns contain local defaults'
> > > > > > #   at t/013_stream_subxact_ddl_abort.pl line 81.
> > > > > > #          got: '2|0'
> > > > > > #     expected: '1000|500'
> > > > > > # Looks like you failed 1 test of 2.
> > > > > > make[2]: *** [check] Error 1
> > > > > > make[1]: *** [check-subscription-recurse] Error 2
> > > > > > make[1]: *** Waiting for unfinished jobs....
> > > > > > make: *** [check-world-src/test-recurse] Error 2
> > > > >
> > > > > Even I got the failure once and after that, it did not reproduce.  I
> > > > > have executed it multiple time but it did not reproduce again.  Are
> > > > > you able to reproduce it consistently?
> > > > >
> > > >
...
..
> >
> > I think the reason for the failure is that we are not setting
> > remote_final_lsn, in the streaming mode.  I have put multiple logs and
> > executed in log and from logs it appeared that some of the logical wal
> > did not get replayed due to below check in
> > should_apply_changes_for_rel.
> > return (rel->state == SUBREL_STATE_READY || (rel->state ==
> > SUBREL_STATE_SYNCDONE && rel->statelsn <= remote_final_lsn));
> >
> > I still need to do the detailed analysis that why does this fail in
> > some cases,  basically, most of the time the rel->state is
> > SUBREL_STATE_READY so this check passes but whenever the state is
> > SUBREL_STATE_SYNCDONE it failed because we never update
> > remote_final_lsn.  I will try to set this value in
> > apply_handle_stream_commit and see whether it ever fails or not.
>
> I have verified that after setting the remote_final_lsn in the
> apply_handle_stream_commit, I don't see that regression failure in
> over 70 runs whereas without that change it failed 6 times in 50 runs.
>

Your analysis and fix seem correct to me.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jul 13, 2020 at 10:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jul 13, 2020 at 10:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Jul 10, 2020 at 3:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > >
> > > > > 8. We can't stream the transaction before we reach the
> > > > > SNAPBUILD_CONSISTENT state because some other output plugin can apply
> > > > > those changes unlike what we do with pgoutput plugin (which writes to
> > > > > file). And, I think applying the transactions without reaching a
> > > > > consistent state would be anyway wrong.  So, we should avoid that and
> > > > > if do that then we should have an Assert for streamed txns rather than
> > > > > sending abort for them in ReorderBufferForget.
> > > >
> > > > I was analyzing this point so currently, we only enable streaming in
> > > > StartReplicationSlot so basically in CreateReplicationSlot the
> > > > streaming will be always off because by that time plugins are not yet
> > > > startup that will happen only on StartReplicationSlot.
> > > >
> > >
> > > What do you mean by 'startup' in the above sentence?  AFAICS, we do
> > > call startup_cb_wrapper in CreateInitDecodingContext which is called
> > > from both CreateReplicationSlot and create_logical_replication_slot
> > > before the start of decoding.  In CreateInitDecodingContext, we call
> > > StartupDecodingContext which should load the plugin.
> >
> > Yeah, you are right that we do call startup_cb_wrapper from
> > CreateInitDecodingContext as well.  I think I got confused by below
> > comment in patch 0007
> >
> > @@ -1016,6 +1016,12 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
> > WalSndPrepareWrite, WalSndWriteData,
> > WalSndUpdateProgress);
> > + /*
> > + * Make sure streaming is disabled here - we may have the methods,
> > + * but we don't have anywhere to send the data yet.
> > + */
> > + ctx->streaming = false;
> > +
> >
> > Basically, during CreateReplicationSlot we forcefully disable the
> > streaming with the comment "we don't have anywhere to send the data
> > yet".  So my point is during CreateReplicationSlot time the streaming
> > will always be off and once we are done with creating the slot we will
> > be having consistent snapshot.  So my point is can we just check that
> > while decoding unless the current LSN reaches the start_decoding_at
> > point we should not start streaming and after that we can start.  At
> > that time we can have an assert that the snapshot should be
> > CONSISTENT.  However, before doing that I need to check on this point
> > that why after creating slot we are setting ctx->streaming to false.
> >
>
> I think you can refer to commit message as well for that "We however
> must explicitly disable streaming replication during replication slot
> creation, even if the plugin supports it. We don't need to replicate
> the changes accumulated during this phase, and moreover, we don't have
> a replication connection open so we don't have where to send the data
> anyway.".  I don't think this is a good way to hack the streaming flag
> because for SQL API's, we don't have a good reason to disable the
> streaming in this way.  I guess if we had a condition related to
> reaching CONSISTENT snapshot during streaming then we won't need to
> hack the streaming flag in this way.  Once we reach the CONSISTENT
> snapshot state, we come out of the creation of a replication slot (see
> how we use DecodingContextReady to achieve that) phase.  So, I feel we
> should remove the ctx->streaming setting to false and add a CONSISTENT
> snapshot check during streaming unless you have a reason for not doing
> so.

I was worried about the point that streaming on/off is sent by the
subscriber on START REPLICATION, not on CREATE REPLICATION SLOT, so if
we keep streaming on during create then it may not be right.  But I
agree with your point that it is better to avoid streaming during slot
creation via a CONSISTENT snapshot check instead of disabling it this
way.  And anyway, as soon as we reach the consistent snapshot we will
stop processing further records, so we will not attempt to stream
during slot creation.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Mon, Jul 13, 2020 at 2:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > I think you can refer to commit message as well for that "We however
> > must explicitly disable streaming replication during replication slot
> > creation, even if the plugin supports it. We don't need to replicate
> > the changes accumulated during this phase, and moreover, we don't have
> > a replication connection open so we don't have where to send the data
> > anyway.".  I don't think this is a good way to hack the streaming flag
> > because for SQL API's, we don't have a good reason to disable the
> > streaming in this way.  I guess if we had a condition related to
> > reaching CONSISTENT snapshot during streaming then we won't need to
> > hack the streaming flag in this way.  Once we reach the CONSISTENT
> > snapshot state, we come out of the creation of a replication slot (see
> > how we use DecodingContextReady to achieve that) phase.  So, I feel we
> > should remove the ctx->streaming setting to false and add a CONSISTENT
> > snapshot check during streaming unless you have a reason for not doing
> > so.
>
> I was worried about the point that streaming on/off is sent by the
> subscriber on START REPLICATION, not on CREATE REPLICATION SLOT, so if
> we keep streaming on during create then it may not be right.
>

Then, how is that used on the publisher-side?  AFAICS, the streaming
is enabled based on whether streaming callbacks are provided and we do
that in 0003-Extend-the-logical-decoding-output-plugin-API-wi patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Mon, Jul 13, 2020 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jul 13, 2020 at 2:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > I think you can refer to commit message as well for that "We however
> > > must explicitly disable streaming replication during replication slot
> > > creation, even if the plugin supports it. We don't need to replicate
> > > the changes accumulated during this phase, and moreover, we don't have
> > > a replication connection open so we don't have where to send the data
> > > anyway.".  I don't think this is a good way to hack the streaming flag
> > > because for SQL API's, we don't have a good reason to disable the
> > > streaming in this way.  I guess if we had a condition related to
> > > reaching CONSISTENT snapshot during streaming then we won't need to
> > > hack the streaming flag in this way.  Once we reach the CONSISTENT
> > > snapshot state, we come out of the creation of a replication slot (see
> > > how we use DecodingContextReady to achieve that) phase.  So, I feel we
> > > should remove the ctx->streaming setting to false and add a CONSISTENT
> > > snapshot check during streaming unless you have a reason for not doing
> > > so.
> >
> > I was worried about the point that streaming on/off is sent by the
> > subscriber on START REPLICATION, not on CREATE REPLICATION SLOT, so if
> > we keep streaming on during create then it may not be right.
> >
>
> Then, how is that used on the publisher-side?  AFAICS, the streaming
> is enabled based on whether streaming callbacks are provided and we do
> that in 0003-Extend-the-logical-decoding-output-plugin-API-wi patch.

Basically, first, we enable based on whether we have the callbacks or
not but later once we get the START REPLICATION command from the
subscriber then we set it to false if the streaming is not enabled
from the subscriber side.  You can refer below code in patch 0007.

pgoutput_startup
{
parse_output_parameters(ctx->output_plugin_options,
&data->protocol_version,
- &data->publication_names);
+ &data->publication_names,
+ &enable_streaming);
/* Check if we support requested protocol */
if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
@@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx,
OutputPluginOptions *opt,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("publication_names parameter missing")));
+ /*
+ * Decide whether to enable streaming. It is disabled by default, in
+ * which case we just update the flag in decoding context. Otherwise
+ * we only allow it with sufficient version of the protocol, and when
+ * the output plugin supports it.
+ */
+ if (!enable_streaming)
+ ctx->streaming = false;
}
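For completeness, the matching option handling in parse_output_parameters
could look roughly like this (the option name "streaming" and the local
variable names here are illustrative, not taken from the posted patch):

	else if (strcmp(defel->defname, "streaming") == 0)
	{
		if (streaming_given)
			ereport(ERROR,
					(errcode(ERRCODE_SYNTAX_ERROR),
					 errmsg("conflicting or redundant options")));
		streaming_given = true;

		/* the subscriber requests streaming of in-progress transactions */
		*enable_streaming = defGetBoolean(defel);
	}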

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Mon, Jul 13, 2020 at 3:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jul 13, 2020 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jul 13, 2020 at 2:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > >
> > > > I think you can refer to commit message as well for that "We however
> > > > must explicitly disable streaming replication during replication slot
> > > > creation, even if the plugin supports it. We don't need to replicate
> > > > the changes accumulated during this phase, and moreover, we don't have
> > > > a replication connection open so we don't have where to send the data
> > > > anyway.".  I don't think this is a good way to hack the streaming flag
> > > > because for SQL API's, we don't have a good reason to disable the
> > > > streaming in this way.  I guess if we had a condition related to
> > > > reaching CONSISTENT snapshot during streaming then we won't need to
> > > > hack the streaming flag in this way.  Once we reach the CONSISTENT
> > > > snapshot state, we come out of the creation of a replication slot (see
> > > > how we use DecodingContextReady to achieve that) phase.  So, I feel we
> > > > should remove the ctx->streaming setting to false and add a CONSISTENT
> > > > snapshot check during streaming unless you have a reason for not doing
> > > > so.
> > >
> > > I was worried about the point that streaming on/off is sent by the
> > > subscriber on START REPLICATION, not on CREATE REPLICATION SLOT, so if
> > > we keep streaming on during create then it may not be right.
> > >
> >
> > Then, how is that used on the publisher-side?  AFAICS, the streaming
> > is enabled based on whether streaming callbacks are provided and we do
> > that in 0003-Extend-the-logical-decoding-output-plugin-API-wi patch.
>
> Basically, first, we enable based on whether we have the callbacks or
> not but later once we get the START REPLICATION command from the
> subscriber then we set it to false if the streaming is not enabled
> from the subscriber side.  You can refer below code in patch 0007.
>
> pgoutput_startup
> {
> parse_output_parameters(ctx->output_plugin_options,
> &data->protocol_version,
> - &data->publication_names);
> + &data->publication_names,
> + &enable_streaming);
> /* Check if we support requested protocol */
> if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
> @@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx,
> OutputPluginOptions *opt,
> (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> errmsg("publication_names parameter missing")));
> + /*
> + * Decide whether to enable streaming. It is disabled by default, in
> + * which case we just update the flag in decoding context. Otherwise
> + * we only allow it with sufficient version of the protocol, and when
> + * the output plugin supports it.
> + */
> + if (!enable_streaming)
> + ctx->streaming = false;
> }
>

Okay, in that case, we can do both enable and disable streaming in
this function itself rather than allow the caller to later modify it.
I suggest similarly we can enable/disable it for SQL API in
pg_decode_startup via output_plugin_options.  This way it will look
consistent for both SQL APIs and for command-based replication.  If we
can do so, then probably adding an Assert for Consistent Snapshot
while performing streaming should be okay.
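A minimal sketch of the SQL-API side of this, for test_decoding's
pg_decode_startup (the option name "stream-changes" is purely
illustrative):

	bool		enable_streaming = false;
	ListCell   *option;

	foreach(option, ctx->output_plugin_options)
	{
		DefElem    *elem = (DefElem *) lfirst(option);

		/* a hypothetical option to request streaming from the SQL API */
		if (strcmp(elem->defname, "stream-changes") == 0)
			enable_streaming = defGetBoolean(elem);
	}

	/* stream only if both the plugin and the caller asked for it */
	ctx->streaming &= enable_streaming;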

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Mon, Jul 13, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jul 13, 2020 at 3:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jul 13, 2020 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Jul 13, 2020 at 2:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > >
> > > > > I think you can refer to commit message as well for that "We however
> > > > > must explicitly disable streaming replication during replication slot
> > > > > creation, even if the plugin supports it. We don't need to replicate
> > > > > the changes accumulated during this phase, and moreover, we don't have
> > > > > a replication connection open so we don't have where to send the data
> > > > > anyway.".  I don't think this is a good way to hack the streaming flag
> > > > > because for SQL API's, we don't have a good reason to disable the
> > > > > streaming in this way.  I guess if we had a condition related to
> > > > > reaching CONSISTENT snapshot during streaming then we won't need to
> > > > > hack the streaming flag in this way.  Once we reach the CONSISTENT
> > > > > snapshot state, we come out of the creation of a replication slot (see
> > > > > how we use DecodingContextReady to achieve that) phase.  So, I feel we
> > > > > should remove the ctx->streaming setting to false and add a CONSISTENT
> > > > > snapshot check during streaming unless you have a reason for not doing
> > > > > so.
> > > >
> > > > I was worried about the point that streaming on/off is sent by the
> > > > subscriber on START REPLICATION, not on CREATE REPLICATION SLOT, so if
> > > > we keep streaming on during create then it may not be right.
> > > >
> > >
> > > Then, how is that used on the publisher-side?  AFAICS, the streaming
> > > is enabled based on whether streaming callbacks are provided and we do
> > > that in 0003-Extend-the-logical-decoding-output-plugin-API-wi patch.
> >
> > Basically, first, we enable based on whether we have the callbacks or
> > not but later once we get the START REPLICATION command from the
> > subscriber then we set it to false if the streaming is not enabled
> > from the subscriber side.  You can refer below code in patch 0007.
> >
> > pgoutput_startup
> > {
> > parse_output_parameters(ctx->output_plugin_options,
> > &data->protocol_version,
> > - &data->publication_names);
> > + &data->publication_names,
> > + &enable_streaming);
> > /* Check if we support requested protocol */
> > if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
> > @@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx,
> > OutputPluginOptions *opt,
> > (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > errmsg("publication_names parameter missing")));
> > + /*
> > + * Decide whether to enable streaming. It is disabled by default, in
> > + * which case we just update the flag in decoding context. Otherwise
> > + * we only allow it with sufficient version of the protocol, and when
> > + * the output plugin supports it.
> > + */
> > + if (!enable_streaming)
> > + ctx->streaming = false;
> > }
> >
>
> Okay, in that case, we can do both enable and disable streaming in
> this function itself rather than allow the caller to later modify it.
> I suggest similarly we can enable/disable it for SQL API in
> pg_decode_startup via output_plugin_options.  This way it will look
> consistent for both SQL APIs and for command-based replication.  If we
> can do so, then probably adding an Assert for Consistent Snapshot
> while performing streaming should be okay.

Sounds good to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Mon, Jul 13, 2020 at 4:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jul 13, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Okay, in that case, we can do both enable and disable streaming in
> > this function itself rather than allow the caller to later modify it.
> > I suggest similarly we can enable/disable it for SQL API in
> > pg_decode_startup via output_plugin_options.  This way it will look
> > consistent for both SQL APIs and for command-based replication.  If we
> > can do so, then probably adding an Assert for Consistent Snapshot
> > while performing streaming should be okay.
>
> Sounds good to me.
>

Please find the latest patches.  I have made changes only in the
subscriber-side patches (0007 and 0008 as per the current patch-set).
The main changes are:
1. As discussed above, removed the SendFeedback call from apply_handle_stream_commit.
2. In SharedFilesetInit, ensured the callback is registered only once.
3. In stream_open_file, slightly changed the handling around MemoryContexts.
4. Merged the subscriber-side patches.
5. Added/Edited comments in 0007 and 0008.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Tue, Jul 14, 2020 at 5:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jul 13, 2020 at 4:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jul 13, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Okay, in that case, we can do both enable and disable streaming in
> > > this function itself rather than allow the caller to later modify it.
> > > I suggest similarly we can enable/disable it for SQL API in
> > > pg_decode_startup via output_plugin_options.  This way it will look
> > > consistent for both SQL APIs and for command-based replication.  If we
> > > can do so, then probably adding an Assert for Consistent Snapshot
> > > while performing streaming should be okay.
> >
> > Sounds good to me.
> >
>
> Please find the latest patches.  I have made changes only in the
> subscriber-side patches (0007 and 0008 as per the current patch-set).
> The main changes are:
> 1. As discussed above, remove SendFeedback call from apply_handle_stream_commit
> 2. In SharedFilesetInit, ensure to register callback once
> 3. In stream_open_file, change slight handling around MemoryContexts
> 4. Merged the subscriber-side patches.
> 5. Added/Edited comments in 0007 and 0008.

I have reviewed your changes and they look good to me.  Please find
the latest version of the patch set.  The major changes:
- Fixed a couple of review comments suggested upthread in 0003 and 0005.
- Handled the case of not streaming until we reach the
start_decoding_at LSN in 0005.
- Simplified 0006 by avoiding sending transactions with incomplete
changes and added a comment atop ReorderBufferLargestTopTXN.
- Moved 0010 to 0007 and handled the pending comments in the same.
- In 0009, fixed a couple of the defects mentioned above, plus one
additional defect: if we do ALTER SUBSCRIPTION to turn streaming
off/on, it was not working.
- In 0009, sending the origin id.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Jul 13, 2020 at 4:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jul 13, 2020 at 3:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jul 13, 2020 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Jul 13, 2020 at 2:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > >
> > > > > I think you can refer to commit message as well for that "We however
> > > > > must explicitly disable streaming replication during replication slot
> > > > > creation, even if the plugin supports it. We don't need to replicate
> > > > > the changes accumulated during this phase, and moreover, we don't have
> > > > > a replication connection open so we don't have where to send the data
> > > > > anyway.".  I don't think this is a good way to hack the streaming flag
> > > > > because for SQL API's, we don't have a good reason to disable the
> > > > > streaming in this way.  I guess if we had a condition related to
> > > > > reaching CONSISTENT snapshot during streaming then we won't need to
> > > > > hack the streaming flag in this way.  Once we reach the CONSISTENT
> > > > > snapshot state, we come out of the creation of a replication slot (see
> > > > > how we use DecodingContextReady to achieve that) phase.  So, I feel we
> > > > > should remove the ctx->streaming setting to false and add a CONSISTENT
> > > > > snapshot check during streaming unless you have a reason for not doing
> > > > > so.
> > > >
> > > > I was worried about the point that streaming on/off is sent by the
> > > > subscriber on START REPLICATION, not on CREATE REPLICATION SLOT, so if
> > > > we keep streaming on during create then it may not be right.
> > > >
> > >
> > > Then, how is that used on the publisher-side?  AFAICS, the streaming
> > > is enabled based on whether streaming callbacks are provided and we do
> > > that in 0003-Extend-the-logical-decoding-output-plugin-API-wi patch.
> >
> > Basically, first, we enable based on whether we have the callbacks or
> > not but later once we get the START REPLICATION command from the
> > subscriber then we set it to false if the streaming is not enabled
> > from the subscriber side.  You can refer below code in patch 0007.
> >
> > pgoutput_startup
> > {
> > parse_output_parameters(ctx->output_plugin_options,
> > &data->protocol_version,
> > - &data->publication_names);
> > + &data->publication_names,
> > + &enable_streaming);
> > /* Check if we support requested protocol */
> > if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
> > @@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx,
> > OutputPluginOptions *opt,
> > (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > errmsg("publication_names parameter missing")));
> > + /*
> > + * Decide whether to enable streaming. It is disabled by default, in
> > + * which case we just update the flag in decoding context. Otherwise
> > + * we only allow it with sufficient version of the protocol, and when
> > + * the output plugin supports it.
> > + */
> > + if (!enable_streaming)
> > + ctx->streaming = false;
> > }
> >
>
> Okay, in that case, we can do both enable and disable streaming in
> this function itself rather than allow the caller to later modify it.
> I suggest similarly we can enable/disable it for SQL API in
> pg_decode_startup via output_plugin_options.  This way it will look
> consistent for both SQL APIs and for command-based replication.  If we
> can do so, then probably adding an Assert for Consistent Snapshot
> while performing streaming should be okay.

Done this way in the latest patch set.
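For reference, a minimal sketch of what that looks like on the SQL API side
(the option name and local variable are illustrative; the actual patch may
spell them differently):

/* in test_decoding's pg_decode_startup(), inside the loop over
 * ctx->output_plugin_options */
else if (strcmp(elem->defname, "stream-changes") == 0)
{
    if (elem->arg == NULL)
        enable_streaming = true;
    else if (!parse_bool(strVal(elem->arg), &enable_streaming))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("could not parse value \"%s\" for parameter \"%s\"",
                        strVal(elem->arg), elem->defname)));
}

/* after the loop: streaming is allowed only if the plugin has the stream
 * callbacks (ctx->streaming was preset based on that) AND the consumer
 * asked for it */
ctx->streaming &= enable_streaming;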

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Mon, Jul 13, 2020 at 10:47 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sun, Jul 12, 2020 at 9:56 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jul 6, 2020 at 11:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > >
> > > > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > >
> > > > >
> > > > > > 9.
> > > > > > +ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > > > > > {
> > > > > > ..
> > > > > > + ReorderBufferToastReset(rb, txn);
> > > > > > + if (specinsert != NULL)
> > > > > > + ReorderBufferReturnChange(rb, specinsert);
> > > > > > ..
> > > > > > }
> > > > > >
> > > > > > Why do we need to do these here when we wouldn't have been done for
> > > > > > any exception other than ERRCODE_TRANSACTION_ROLLBACK?
> > > > >
> > > > > Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK"
> > > > > gracefully and we are continuing with further decoding so we need to
> > > > > return this change back.
> > > > >
> > > >
> > > > Okay, then I suggest we should do these before calling stream_stop and
> > > > also move ReorderBufferResetTXN after calling stream_stop  to follow a
> > > > pattern similar to try block unless there is a reason for not doing
> > > > so.  Also, it would be good if we can initialize specinsert with NULL
> > > > after returning the change as we are doing at other places.
> > >
> > > Okay
> > >
> > > > > > 10.  I have got the below failure once.  I have not investigated this
> > > > > > in detail as the patch is still under progress.  See, if you have any
> > > > > > idea?
> > > > > > #   Failed test 'check extra columns contain local defaults'
> > > > > > #   at t/013_stream_subxact_ddl_abort.pl line 81.
> > > > > > #          got: '2|0'
> > > > > > #     expected: '1000|500'
> > > > > > # Looks like you failed 1 test of 2.
> > > > > > make[2]: *** [check] Error 1
> > > > > > make[1]: *** [check-subscription-recurse] Error 2
> > > > > > make[1]: *** Waiting for unfinished jobs....
> > > > > > make: *** [check-world-src/test-recurse] Error 2
> > > > >
> > > > > Even I got the failure once and after that, it did not reproduce.  I
> > > > > have executed it multiple time but it did not reproduce again.  Are
> > > > > you able to reproduce it consistently?
> > > > >
> > > >
> > > > No, I am also not able to reproduce it consistently but I think this
> > > > can fail if a subscriber sends the replay_location before actually
> > > > replaying the changes.  First, I thought that extra send_feedback we
> > > > have in apply_handle_stream_commit might have caused this but I guess
> > > > that can't happen because we need the commit time location for that
> > > > and we are storing the same at the end of apply_handle_stream_commit
> > > > after applying all messages.  I am not sure what is going on here.  I
> > > > think we somehow need to reproduce this or some variant of this test
> > > > consistently to find the root cause.
> > >
> > > And I think it appeared for the first time for me, so some change in
> > > the last few versions might have exposed it.  I have noticed that
> > > almost 50% of the time I am able to reproduce it after a clean build,
> > > so I can trace back from which version it started appearing; that way
> > > it will be easy to narrow it down.
> >
> > I think the reason for the failure is that we are not setting
> > remote_final_lsn in the streaming mode.  I added multiple logs and ran
> > the test, and from the logs it appeared that some of the logical WAL
> > did not get replayed due to the below check in
> > should_apply_changes_for_rel:
> > return (rel->state == SUBREL_STATE_READY || (rel->state ==
> > SUBREL_STATE_SYNCDONE && rel->statelsn <= remote_final_lsn));
> >
> > I still need to do a detailed analysis of why this fails only in some
> > cases.  Basically, most of the time rel->state is SUBREL_STATE_READY
> > so this check passes, but whenever the state is SUBREL_STATE_SYNCDONE
> > it fails because we never update remote_final_lsn.  I will try to set
> > this value in apply_handle_stream_commit and see whether it ever fails
> > or not.
>
> I have verified that after setting the remote_final_lsn in the
> apply_handle_stream_commit, I don't see that regression failure in
> over 70 runs whereas without that change it failed 6 times in 50 runs.
> Apart from this, I have noticed one more thing related to the same
> point.  Basically, in the apply_handle_commit, we are calling
> process_syncing_tables whereas we are not calling the same in
> apply_handle_stream_commit.

I have set the remote_final_lsn as well as called
process_syncing_tables, in apply_handle_stream_commit.  Please see the
latest patch set v33.
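For readers following along, a rough sketch of the shape of that fix in the
apply worker; the reader/spool helper names are assumptions for illustration,
not necessarily what the patch uses:

static void
apply_handle_stream_commit(StringInfo s)
{
    TransactionId xid;
    LogicalRepCommitData commit_data;

    xid = logicalrep_read_stream_commit(s, &commit_data); /* assumed reader */

    /*
     * Record the commit LSN before replaying the spooled changes so that
     * should_apply_changes_for_rel() can compare rel->statelsn against it,
     * exactly as apply_handle_commit() does.
     */
    remote_final_lsn = commit_data.commit_lsn;

    /* ... replay the changes spooled for this streamed transaction ... */

    /* ... commit the local transaction, advance the replication origin ... */

    /* Let table synchronization workers make progress, as in apply_handle_commit(). */
    process_syncing_tables(commit_data.end_lsn);
}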

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Ajin Cherian
Date:


On Wed, Jul 15, 2020 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
  Please see the
latest patch set v33.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



I have a minor comment. You've defined a new function ReorderBufferStartStreaming(), but the function doesn't actually start streaming; it is used to find out whether you can start streaming, and it returns a boolean. Can't you name it accordingly?
Probably ReorderBufferCanStartStreaming(). I understand that it internally calls ReorderBufferCanStream(), which is similar sounding, but I think that should not matter.

regards,
Ajin Cherian
Fujitsu Australia

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Wed, Jul 15, 2020 at 4:51 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Wed, Jul 15, 2020 at 2:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>>   Please see the
>> latest patch set v33.
>>
>>
>>
>
> I have a minor comment. You've defined a new function ReorderBufferStartStreaming(), but the function doesn't
> actually start streaming; it is used to find out whether you can start streaming, and it returns a boolean.
> Can't you name it accordingly?
> Probably ReorderBufferCanStartStreaming(). I understand that it internally calls ReorderBufferCanStream(), which
> is similar sounding, but I think that should not matter.
>

+1.  I am actually editing some of the patches and I have already
named it as you are suggesting.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Wed, Jul 15, 2020 at 9:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>
> I have reviewed your changes and those look good to me,  please find
> the latest version of the patch set.
>

I have done an additional round of review and below are the changes I
made in the attached patch-set.
1. Changed comments in 0002.
2. In 0005, apart from changing a few comments and function name, I
have changed below code:
+ if (ReorderBufferCanStream(rb) &&
+ !SnapBuildXactNeedsSkip(builder, ctx->reader->ReadRecPtr))
Here, I think it is better to compare it with EndRecPtr.  I feel that in
a boundary case the next record could be the same as start_decoding_at,
so why avoid streaming in that case?
3. In 0006, made below changes:
    a. Removed function ReorderBufferFreeChange and added a new
parameter in ReorderBufferReturnChange to achieve the same purpose.
    b. Changed quite a few comments, function names, added additional
Asserts, and few other cosmetic changes.
4. In 0007, made below changes:
    a. Removed the unnecessary change in .gitignore
    b. Changed the newly added option name to "stream-change".

Apart from the above, I have merged patches 0004, 0005, 0006 and 0007 as
those seem like one piece of functionality to me.  For the sake of review, the
patch-set that contains merged patches is attached separately as
v34-combined.

Let me know what you think of the changes?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jul 15, 2020 at 9:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> >
> > I have reviewed your changes and those look good to me,  please find
> > the latest version of the patch set.
> >
>
> I have done an additional round of review and below are the changes I
> made in the attached patch-set.
> 1. Changed comments in 0002.
> 2. In 0005, apart from changing a few comments and function name, I
> have changed below code:
> + if (ReorderBufferCanStream(rb) &&
> + !SnapBuildXactNeedsSkip(builder, ctx->reader->ReadRecPtr))
> Here, I think it is better to compare it with EndRecPtr.  I feel in
> boundary case the next record could be the same as start_decoding_at,
> so why to avoid streaming in that case?

Makes sense to me.

> 3. In 0006, made below changes:
>     a. Removed function ReorderBufferFreeChange and added a new
> parameter in ReorderBufferReturnChange to achieve the same purpose.
>     b. Changed quite a few comments, function names, added additional
> Asserts, and few other cosmetic changes.
> 4. In 0007, made below changes:
>     a. Removed the unnecessary change in .gitignore
>     b. Changed the newly added option name to "stream-change".
>
> Apart from above, I have merged patches 0004, 0005, 0006 and 0007 as
> those seems one functionality to me.  For the sake of review, the
> patch-set that contains merged patches is attached separately as
> v34-combined.
>
> Let me know what you think of the changes?

I have reviewed the changes and looks fine to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Thu, Jul 16, 2020 at 12:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Let me know what you think of the changes?
>
> I have reviewed the changes and looks fine to me.
>

Thanks, I am planning to start committing a few of the infrastructure
patches (especially first two) by early next week as we have resolved
all the open issues and done an extensive review of the entire
patch-set.  In the attached version, there is a slight change in one
of the commit messages as compared to the previous version.  I would
like to describe in brief the first two patches for the sake of
convenience.  Let me know if you or anyone else sees any problems with
these.

The first patch in the series allows us to WAL-log subtransaction and
top-level XID association.  The logical decoding infrastructure needs
to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.  So we also write the assignment info into WAL
immediately, as part of the next WAL record (to minimize overhead)
only when *wal_level=logical*.  We can not remove the existing
XLOG_XACT_ASSIGNMENT WAL as that is required for avoiding overflow in
the hot standby snapshot.

The second patch writes WAL for invalidations at command end with
wal_level=logical.  When wal_level=logical, write invalidations at
command end into WAL so that decoding can use this information.  This
patch is required to allow the streaming of in-progress transactions
in logical decoding.  We still add the invalidations to the cache and
write them to WAL at commit time in RecordTransactionCommit(). This
uses the existing XLOG_INVALIDATIONS xlog record type, from the
RM_STANDBY_ID resource manager (see LogStandbyInvalidations for
details).  So existing code relying on those invalidations (e.g. redo)
does not need to be changed. The invalidations written at command end
uses a new xlog record type XLOG_XACT_INVALIDATIONS, from RM_XACT_ID
resource manager. See LogLogicalInvalidations for details.  These new
xlog records are ignored by existing redo procedures, which still rely
on the invalidations written to commit records.  The invalidations are
decoded and accumulated in top-transaction, and then executed during
replay.  This obviates the need to decode the invalidations as part of
a commit record.

The performance testing has shown that there is no performance penalty
with either of the patches but there is some additional WAL which in
most cases is 2-5% but in worst cases and for some specific DDL's it
is up to 15% with the second patch, however, that happens at
wal_level=logical only.  We have considered an alternative to blow up
all caches on any DDL in WALSenders and that will have both CPU and
network overhead.  For detailed results and analysis see [1][2].

[1] - https://www.postgresql.org/message-id/CAKYtNAqWkPpPFrdEbpPrCan3G_QAcankZarRKKd7cj6vQigM7w%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1L3PoiBw6uogB7jD5rmdT-GmEF4kOEccS1AWKuBhSkQkQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Thu, Jul 16, 2020 at 4:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 16, 2020 at 12:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Let me know what you think of the changes?
> >
> > I have reviewed the changes and looks fine to me.
> >
>
> Thanks, I am planning to start committing a few of the infrastructure
> patches (especially first two) by early next week as we have resolved
> all the open issues and done an extensive review of the entire
> patch-set.  In the attached version, there is a slight change in one
> of the commit messages as compared to the previous version.  I would
> like to describe in brief the first two patches for the sake of
> convenience.  Let me know if you or anyone else sees any problems with
> these.
>
> The first patch in the series allows us to WAL-log subtransaction and
> top-level XID association.  The logical decoding infrastructure needs
> to know which top-level
> transaction the subxact belongs to, in order to decode all the
> changes. Until now that might be delayed until commit, due to the
> caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring
> incremental decoding.  So we also write the assignment info into WAL
> immediately, as part of the next WAL record (to minimize overhead)
> only when *wal_level=logical*.  We can not remove the existing
> XLOG_XACT_ASSIGNMENT WAL as that is required for avoiding overflow in
> the hot standby snapshot.
>
> The second patch writes WAL for invalidations at command end with
> wal_level=logical.  When wal_level=logical, write invalidations at
> command end into WAL so that decoding can use this information.  This
> patch is required to allow the streaming of in-progress transactions
> in logical decoding.  We still add the invalidations to the cache and
> write them to WAL at commit time in RecordTransactionCommit(). This
> uses the existing XLOG_INVALIDATIONS xlog record type, from the
> RM_STANDBY_ID resource manager (see LogStandbyInvalidations for
> details).  So existing code relying on those invalidations (e.g. redo)
> does not need to be changed. The invalidations written at command end
> uses a new xlog record type XLOG_XACT_INVALIDATIONS, from RM_XACT_ID
> resource manager. See LogLogicalInvalidations for details.  These new
> xlog records are ignored by existing redo procedures, which still rely
> on the invalidations written to commit records.  The invalidations are
> decoded and accumulated in top-transaction, and then executed during
> replay.  This obviates the need to decode the invalidations as part of
> a commit record.
>
> The performance testing has shown that there is no performance penalty
> with either of the patches but there is some additional WAL which in
> most cases is 2-5% but in worst cases and for some specific DDL's it
> is up to 15% with the second patch, however, that happens at
> wal_level=logical only.  We have considered an alternative to blow up
> all caches on any DDL in WALSenders and that will have both CPU and
> network overhead.  For detailed results and analysis see [1][2].
>
> [1] - https://www.postgresql.org/message-id/CAKYtNAqWkPpPFrdEbpPrCan3G_QAcankZarRKKd7cj6vQigM7w%40mail.gmail.com
> [2] - https://www.postgresql.org/message-id/CAA4eK1L3PoiBw6uogB7jD5rmdT-GmEF4kOEccS1AWKuBhSkQkQ%40mail.gmail.com
>

The patch set required to rebase after committing the binary format
option support in the create subscription command.  I have rebased the
patch set on the latest head and also added a test case to test
streaming in binary format.
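For illustration, a subscription exercising both options together could be
created roughly like this (connection string and names made up):

CREATE SUBSCRIPTION tap_sub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION tap_pub
    WITH (binary = on, streaming = on);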

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Mon, Jul 20, 2020 at 12:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Jul 16, 2020 at 4:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Jul 16, 2020 at 12:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > >
> > > > Let me know what you think of the changes?
> > >
> > > I have reviewed the changes and looks fine to me.
> > >
> >
> > Thanks, I am planning to start committing a few of the infrastructure
> > patches (especially first two) by early next week as we have resolved
> > all the open issues and done an extensive review of the entire
> > patch-set.  In the attached version, there is a slight change in one
> > of the commit messages as compared to the previous version.  I would
> > like to describe in brief the first two patches for the sake of
> > convenience.  Let me know if you or anyone else sees any problems with
> > these.
> >
> > The first patch in the series allows us to WAL-log subtransaction and
> > top-level XID association.  The logical decoding infrastructure needs
> > to know which top-level
> > transaction the subxact belongs to, in order to decode all the
> > changes. Until now that might be delayed until commit, due to the
> > caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring
> > incremental decoding.  So we also write the assignment info into WAL
> > immediately, as part of the next WAL record (to minimize overhead)
> > only when *wal_level=logical*.  We can not remove the existing
> > XLOG_XACT_ASSIGNMENT WAL as that is required for avoiding overflow in
> > the hot standby snapshot.
> >

Pushed, this patch.

> >
>
> The patch set required to rebase after committing the binary format
> option support in the create subscription command.  I have rebased the
> patch set on the latest head and also added a test case to test
> streaming in binary format.
>

While going through commit 9de77b5453, I noticed below change:

@@ -424,6 +424,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
        PQfreemem(pubnames_literal);
        pfree(pubnames_str);

+       if (options->proto.logical.binary &&
+           PQserverVersion(conn->streamConn) >= 140000)
+           appendStringInfoString(&cmd, ", binary 'true'");
+

Now, the similar change in this patch series is as below:

@@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
  appendStringInfo(&cmd, "proto_version '%u'",
  options->proto.logical.proto_version);

+ if (options->proto.logical.streaming)
+ appendStringInfo(&cmd, ", streaming 'on'");
+

I think we also need a version check similar to commit 9de77b5453 to
ensure that we send the new option only when connected to a newer
version (>=14) primary server.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Mon, Jul 20, 2020 at 2:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jul 20, 2020 at 12:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Jul 16, 2020 at 4:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Jul 16, 2020 at 12:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > >
> > > > > Let me know what you think of the changes?
> > > >
> > > > I have reviewed the changes and looks fine to me.
> > > >
> > >
> > > Thanks, I am planning to start committing a few of the infrastructure
> > > patches (especially first two) by early next week as we have resolved
> > > all the open issues and done an extensive review of the entire
> > > patch-set.  In the attached version, there is a slight change in one
> > > of the commit messages as compared to the previous version.  I would
> > > like to describe in brief the first two patches for the sake of
> > > convenience.  Let me know if you or anyone else sees any problems with
> > > these.
> > >
> > > The first patch in the series allows us to WAL-log subtransaction and
> > > top-level XID association.  The logical decoding infrastructure needs
> > > to know which top-level
> > > transaction the subxact belongs to, in order to decode all the
> > > changes. Until now that might be delayed until commit, due to the
> > > caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring
> > > incremental decoding.  So we also write the assignment info into WAL
> > > immediately, as part of the next WAL record (to minimize overhead)
> > > only when *wal_level=logical*.  We can not remove the existing
> > > XLOG_XACT_ASSIGNMENT WAL as that is required for avoiding overflow in
> > > the hot standby snapshot.
> > >
>
> Pushed, this patch.
>
> > >
> >
> > The patch set required to rebase after committing the binary format
> > option support in the create subscription command.  I have rebased the
> > patch set on the latest head and also added a test case to test
> > streaming in binary format.
> >
>
> While going through commit 9de77b5453, I noticed below change:
>
> @@ -424,6 +424,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
>         PQfreemem(pubnames_literal);
>         pfree(pubnames_str);
>
> +       if (options->proto.logical.binary &&
> +           PQserverVersion(conn->streamConn) >= 140000)
> +           appendStringInfoString(&cmd, ", binary 'true'");
> +
>
> Now, the similar change in this patch series is as below:
>
> @@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
>   appendStringInfo(&cmd, "proto_version '%u'",
>   options->proto.logical.proto_version);
>
> + if (options->proto.logical.streaming)
> + appendStringInfo(&cmd, ", streaming 'on'");
> +
>
> I think we also need a version check similar to commit 9de77b5453 to
> ensure that we send the new option only when connected to a newer
> version (>=14) primary server.

I have changed that in the attached patch.
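So the guarded form ends up looking much like the binary-option precedent,
roughly:

if (options->proto.logical.streaming &&
    PQserverVersion(conn->streamConn) >= 140000)
    appendStringInfoString(&cmd, ", streaming 'on'");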

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Mon, Jul 20, 2020 at 4:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jul 20, 2020 at 2:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jul 20, 2020 at 12:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Thu, Jul 16, 2020 at 4:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Thu, Jul 16, 2020 at 12:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > >
> > > > > On Wed, Jul 15, 2020 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > >
> > > > > >
> > > > > > Let me know what you think of the changes?
> > > > >
> > > > > I have reviewed the changes and looks fine to me.
> > > > >
> > > >
> > > > Thanks, I am planning to start committing a few of the infrastructure
> > > > patches (especially first two) by early next week as we have resolved
> > > > all the open issues and done an extensive review of the entire
> > > > patch-set.  In the attached version, there is a slight change in one
> > > > of the commit messages as compared to the previous version.  I would
> > > > like to describe in brief the first two patches for the sake of
> > > > convenience.  Let me know if you or anyone else sees any problems with
> > > > these.
> > > >
> > > > The first patch in the series allows us to WAL-log subtransaction and
> > > > top-level XID association.  The logical decoding infrastructure needs
> > > > to know which top-level
> > > > transaction the subxact belongs to, in order to decode all the
> > > > changes. Until now that might be delayed until commit, due to the
> > > > caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring
> > > > incremental decoding.  So we also write the assignment info into WAL
> > > > immediately, as part of the next WAL record (to minimize overhead)
> > > > only when *wal_level=logical*.  We can not remove the existing
> > > > XLOG_XACT_ASSIGNMENT WAL as that is required for avoiding overflow in
> > > > the hot standby snapshot.
> > > >
> >
> > Pushed, this patch.
> >
> > > >
> > >
> > > The patch set required to rebase after committing the binary format
> > > option support in the create subscription command.  I have rebased the
> > > patch set on the latest head and also added a test case to test
> > > streaming in binary format.
> > >
> >
> > While going through commit 9de77b5453, I noticed below change:
> >
> > @@ -424,6 +424,10 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
> >         PQfreemem(pubnames_literal);
> >         pfree(pubnames_str);
> >
> > +       if (options->proto.logical.binary &&
> > +           PQserverVersion(conn->streamConn) >= 140000)
> > +           appendStringInfoString(&cmd, ", binary 'true'");
> > +
> >
> > Now, the similar change in this patch series is as below:
> >
> > @@ -408,6 +408,9 @@ libpqrcv_startstreaming(WalReceiverConn *conn,
> >   appendStringInfo(&cmd, "proto_version '%u'",
> >   options->proto.logical.proto_version);
> >
> > + if (options->proto.logical.streaming)
> > + appendStringInfo(&cmd, ", streaming 'on'");
> > +
> >
> > I think we also need a version check similar to commit 9de77b5453 to
> > ensure that we send the new option only when connected to a newer
> > version (>=14) primary server.
>
> I have changed that in the attached patch.

There was one warning in release mode in the last version in 0004 so
attaching a new version.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Ajin Cherian
Date:


On Mon, Jul 20, 2020 at 11:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:


There was one warning in release mode in the last version in 0004 so
attaching a new version.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Hello,

I have tried to rework the patch that adds stats for the streaming of logical replication, but now based on the new logical replication stats framework developed by Masahiko-san and rebased by Amit in [1]. This uses v38 of the streaming logical replication patch as well as v1 of the stats framework patch as its base. I will rebase this as the stats framework is updated. Let me know if you have any comments.

regards,
Ajin Cherian
Fujitsu Australia

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Mon, Jul 20, 2020 at 6:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> There was one warning in release mode in the last version in 0004 so
> attaching a new version.
>

Today, I was reviewing patch
v38-0001-WAL-Log-invalidations-at-command-end-with-wal_le and found a
small problem with it.

+ /*
+ * Execute the invalidations for xid-less transactions,
+ * otherwise, accumulate them so that they can be processed at
+ * the commit time.
+ */
+ if (!ctx->fast_forward)
+ {
+ if (TransactionIdIsValid(xid))
+ {
+ ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
+   invals->nmsgs, invals->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
+   buf->origptr);
+ }

I think we need to call ReorderBufferXidSetCatalogChanges even when
ctx->fast_forward is true, because we are dependent on that flag for
the snapshot build (see SnapBuildCommitTxn).  We already do it that way
in DecodeCommit: even though we skip adding invalidations for
fast-forward cases, we do set the flag to indicate that this txn has
catalog changes.  Is there any reason to do things differently here?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Wed, Jul 22, 2020 at 9:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jul 20, 2020 at 6:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > There was one warning in release mode in the last version in 0004 so
> > attaching a new version.
> >
>
> Today, I was reviewing patch
> v38-0001-WAL-Log-invalidations-at-command-end-with-wal_le and found a
> small problem with it.
>
> + /*
> + * Execute the invalidations for xid-less transactions,
> + * otherwise, accumulate them so that they can be processed at
> + * the commit time.
> + */
> + if (!ctx->fast_forward)
> + {
> + if (TransactionIdIsValid(xid))
> + {
> + ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
> +   invals->nmsgs, invals->msgs);
> + ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
> +   buf->origptr);
> + }
>
> I think we need to set ReorderBufferXidSetCatalogChanges even when
> ctx->fast-forward is true because we are dependent on that flag for
> snapshot build (see SnapBuildCommitTxn).  We are already doing the
> same way in DecodeCommit where even though we skip adding
> invalidations for fast-forward cases but we do set the flag to
> indicate that this txn has catalog changes.  Is there any reason to do
> things differently here?

I think it is wrong; we should call ReorderBufferXidSetCatalogChanges
even if we are in fast-forward mode.
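Concretely, the corrected shape discussed here is roughly the following
(a sketch using the variable names from the quoted hunk):

if (TransactionIdIsValid(xid))
{
    /*
     * Mark the transaction as containing catalog changes even in
     * fast-forward mode; the snapshot builder depends on this flag
     * (see SnapBuildCommitTxn), just as DecodeCommit already does.
     */
    ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);

    /* Accumulate the invalidation messages only when actually decoding. */
    if (!ctx->fast_forward)
        ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
                                      invals->nmsgs, invals->msgs);
}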

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Wed, Jul 22, 2020 at 10:20 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jul 22, 2020 at 9:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jul 20, 2020 at 6:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > There was one warning in release mode in the last version in 0004 so
> > > attaching a new version.
> > >
> >
> > Today, I was reviewing patch
> > v38-0001-WAL-Log-invalidations-at-command-end-with-wal_le and found a
> > small problem with it.
> >
> > + /*
> > + * Execute the invalidations for xid-less transactions,
> > + * otherwise, accumulate them so that they can be processed at
> > + * the commit time.
> > + */
> > + if (!ctx->fast_forward)
> > + {
> > + if (TransactionIdIsValid(xid))
> > + {
> > + ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
> > +   invals->nmsgs, invals->msgs);
> > + ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
> > +   buf->origptr);
> > + }
> >
> > I think we need to set ReorderBufferXidSetCatalogChanges even when
> > ctx->fast-forward is true because we are dependent on that flag for
> > snapshot build (see SnapBuildCommitTxn).  We are already doing the
> > same way in DecodeCommit where even though we skip adding
> > invalidations for fast-forward cases but we do set the flag to
> > indicate that this txn has catalog changes.  Is there any reason to do
> > things differently here?
>
> I think it is wrong,  we should set the
> ReorderBufferXidSetCatalogChanges, even if it is in fast-forward mode.
>

Thanks for the change.  I have one more minor comment in the patch
0001-WAL-Log-invalidations-at-command-end-with-wal_le.

 /*
+ * Invalidations logged with wal_level=logical.
+ */
+typedef struct xl_xact_invalidations
+{
+ int nmsgs; /* number of shared inval msgs */
+ SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
+} xl_xact_invalidations;

I see that we already have a structure xl_xact_invals in the code
which has the same members, so I think it is better to use that
instead of defining a new one.
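i.e., the logging routine can simply reuse the commit-record layout, roughly
(a sketch assuming the existing MinSizeOfXactInvals macro; the actual patch
may differ):

xl_xact_invals xlrec;

xlrec.nmsgs = nmsgs;

XLogBeginInsert();
XLogRegisterData((char *) &xlrec, MinSizeOfXactInvals);
XLogRegisterData((char *) msgs, nmsgs * sizeof(SharedInvalidationMessage));
XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);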

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Wed, Jul 22, 2020 at 4:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jul 22, 2020 at 10:20 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Jul 22, 2020 at 9:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Jul 20, 2020 at 6:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > There was one warning in release mode in the last version in 0004 so
> > > > attaching a new version.
> > > >
> > >
> > > Today, I was reviewing patch
> > > v38-0001-WAL-Log-invalidations-at-command-end-with-wal_le and found a
> > > small problem with it.
> > >
> > > + /*
> > > + * Execute the invalidations for xid-less transactions,
> > > + * otherwise, accumulate them so that they can be processed at
> > > + * the commit time.
> > > + */
> > > + if (!ctx->fast_forward)
> > > + {
> > > + if (TransactionIdIsValid(xid))
> > > + {
> > > + ReorderBufferAddInvalidations(reorder, xid, buf->origptr,
> > > +   invals->nmsgs, invals->msgs);
> > > + ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
> > > +   buf->origptr);
> > > + }
> > >
> > > I think we need to set ReorderBufferXidSetCatalogChanges even when
> > > ctx->fast-forward is true because we are dependent on that flag for
> > > snapshot build (see SnapBuildCommitTxn).  We are already doing the
> > > same way in DecodeCommit where even though we skip adding
> > > invalidations for fast-forward cases but we do set the flag to
> > > indicate that this txn has catalog changes.  Is there any reason to do
> > > things differently here?
> >
> > I think it is wrong,  we should set the
> > ReorderBufferXidSetCatalogChanges, even if it is in fast-forward mode.
> >
>
> Thanks for the change.  I have one more minor comment in the patch
> 0001-WAL-Log-invalidations-at-command-end-with-wal_le.
>
>  /*
> + * Invalidations logged with wal_level=logical.
> + */
> +typedef struct xl_xact_invalidations
> +{
> + int nmsgs; /* number of shared inval msgs */
> + SharedInvalidationMessage msgs[FLEXIBLE_ARRAY_MEMBER];
> +} xl_xact_invalidations;
>
> I see that we already have a structure xl_xact_invals in the code
> which has the same members, so I think it is better to use that
> instead of defining a new one.

You are right.  I have changed it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Wed, Jul 22, 2020 at 4:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> You are right.  I have changed it.
>

Thanks, I have pushed the second patch in this series which is
0001-WAL-Log-invalidations-at-command-end-with-wal_le in your latest
patch.  I will continue working on remaining patches.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Thu, Jul 23, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jul 22, 2020 at 4:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > You are right.  I have changed it.
> >
>
> Thanks, I have pushed the second patch in this series which is
> 0001-WAL-Log-invalidations-at-command-end-with-wal_le in your latest
> patch.  I will continue working on remaining patches.
>

I have reviewed and made a number of changes in the next patch which
extends the logical decoding output plugin API with stream methods.
(v41-0001-Extend-the-logical-decoding-output-plugin-API-wi).

1. I think we need handling of include_xids and include_timestamp, but
not skip_empty_xacts, in the new APIs; as of now, none of these options
was respected.  We need 'include_xids' handling because we need to
include the xid with stream messages, and similarly 'include_timestamp'
for stream commit messages.  OTOH, I think we never use streaming mode
for empty xacts, so we don't need to bother about skip_empty_xacts in
the streaming APIs.
2. Then I made a number of changes in documentation, comments, and
other cosmetic changes.

Kindly review/test and let me know if you see any problems with the
above changes.
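For point 1, as an illustration of the kind of handling meant, a stream
callback in test_decoding can honour include_xids roughly like this (a
sketch, not the patch itself):

static void
pg_decode_stream_change(LogicalDecodingContext *ctx,
                        ReorderBufferTXN *txn,
                        Relation relation,
                        ReorderBufferChange *change)
{
    TestDecodingData *data = ctx->output_plugin_private;

    OutputPluginPrepareWrite(ctx, true);
    if (data->include_xids)
        appendStringInfo(ctx->out, "streaming change for TXN %u", txn->xid);
    else
        appendStringInfoString(ctx->out, "streaming change for transaction");
    OutputPluginWrite(ctx, true);
}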

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Fri, Jul 24, 2020 at 5:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 23, 2020 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Jul 22, 2020 at 4:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > You are right.  I have changed it.
> > >
> >
> > Thanks, I have pushed the second patch in this series which is
> > 0001-WAL-Log-invalidations-at-command-end-with-wal_le in your latest
> > patch.  I will continue working on remaining patches.
> >
>
> I have reviewed and made a number of changes in the next patch which
> extends the logical decoding output plugin API with stream methods.
> (v41-0001-Extend-the-logical-decoding-output-plugin-API-wi).
>
> 1. I think we need handling of include_xids and include_timestamp but
> not skip_empty_xacts in the new APIs, as of now, none of the options
> were respected.  We need 'include_xids' handling because we need to
> include xid with stream messages and similarly 'include_timestamp' for
> stream commit messages.  OTOH, I think we never use streaming mode for
> empty xacts, so we don't need to bother about skip_empty_xacts in
> streaming APIs.
> 2. Then I made a number of changes in documentation, comments, and
> other cosmetic changes.
>
> Kindly review/test and let me know if you see any problems with the
> above changes.

Your changes look fine to me.  Additionally, I have changed a test
case of getting the streaming changes in 0002.  Instead of just
showing the count, I am showing that the transaction is actually
streaming.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Fri, Jul 24, 2020 at 7:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Your changes look fine to me.  Additionally, I have changed a test
> case of getting the streaming changes in 0002.  Instead of just
> showing the count, I am showing that the transaction is actually
> streaming.
>

If you want to show the changes then there is no need to display 157
rows; probably a few (10-15) should be sufficient.  If we can do that
by increasing the size of the rows then good; otherwise, I think it is
better to retain the test that displays the count.

Today, I have again looked at the first patch
(v42-0001-Extend-the-logical-decoding-output-plugin-API-wi) and didn't
find any more problems with it, so I am planning to commit it unless
you or someone else wants to add more to it.   Just for ease of others,
"the next patch extends the logical decoding output plugin API with
stream methods".   It adds seven methods to the output plugin API,
adding support for streaming changes for large in-progress
transactions. The methods are stream_start, stream_stop, stream_abort,
stream_commit, stream_change, stream_message, and stream_truncate.
Most of this is a simple extension of the existing methods, with the
semantic difference that the transaction (or subtransaction) is
incomplete and may be aborted later (which is something the regular
API does not really need to deal with).

This also extends the 'test_decoding' plugin, implementing these new
stream methods.  The stream_start/stream_stop callbacks are used to demarcate a
chunk of changes streamed for a particular toplevel transaction.

This commit simply adds these new APIs and the upcoming patch to
"allow the streaming mode in ReorderBuffer" will use these APIs.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Sat, Jul 25, 2020 at 5:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jul 24, 2020 at 7:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Your changes look fine to me.  Additionally, I have changed a test
> > case of getting the streaming changes in 0002.  Instead of just
> > showing the count, I am showing that the transaction is actually
> > streaming.
> >
>
> If you want to show the changes then there is no need to display 157
> rows probably a few (10-15) should be sufficient.  If we can do that
> by increasing the size of the row then good, otherwise, I think it is
> better to retain the test to display the count.

I think in existing test cases also we are displaying multiple lines,
e.g. toast.out is showing 235 rows.  But maybe I will try to reduce it
to a smaller number of rows.

> Today, I have again looked at the first patch
> (v42-0001-Extend-the-logical-decoding-output-plugin-API-wi) and didn't
> find any more problems with it so planning to commit the same unless
> you or someone else want to add more to it.   Just for ease of others,
> "the next patch extends the logical decoding output plugin API with
> stream methods".   It adds seven methods to the output plugin API,
> adding support for streaming changes for large in-progress
> transactions. The methods are stream_start, stream_stop, stream_abort,
> stream_commit, stream_change, stream_message, and stream_truncate.
> Most of this is a simple extension of the existing methods, with the
> semantic difference that the transaction (or subtransaction) is
> incomplete and may be aborted later (which is something the regular
> API does not really need to deal with).
>
> This also extends the 'test_decoding' plugin, implementing these new
> stream methods.  The stream_start/start_stop are used to demarcate a
> chunk of changes streamed for a particular toplevel transaction.
>
> This commit simply adds these new APIs and the upcoming patch to
> "allow the streaming mode in ReorderBuffer" will use these APIs.

LGTM

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Sun, Jul 26, 2020 at 11:04 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Jul 25, 2020 at 5:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Jul 24, 2020 at 7:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > Your changes look fine to me.  Additionally, I have changed a test
> > > case of getting the streaming changes in 0002.  Instead of just
> > > showing the count, I am showing that the transaction is actually
> > > streaming.
> > >
> >
> > If you want to show the changes then there is no need to display 157
> > rows probably a few (10-15) should be sufficient.  If we can do that
> > by increasing the size of the row then good, otherwise, I think it is
> > better to retain the test to display the count.
>
> I think in existing test cases also we are displaying multiple lines
> e.g. toast.out is showing 235 rows.  But maybe I will try to reduce it
> to the less number of rows.

Changed, now only 27 rows.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Sun, Jul 26, 2020 at 11:04 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > Today, I have again looked at the first patch
> > (v42-0001-Extend-the-logical-decoding-output-plugin-API-wi) and didn't
> > find any more problems with it so planning to commit the same unless
> > you or someone else want to add more to it.   Just for ease of others,
> > "the next patch extends the logical decoding output plugin API with
> > stream methods".   It adds seven methods to the output plugin API,
> > adding support for streaming changes for large in-progress
> > transactions. The methods are stream_start, stream_stop, stream_abort,
> > stream_commit, stream_change, stream_message, and stream_truncate.
> > Most of this is a simple extension of the existing methods, with the
> > semantic difference that the transaction (or subtransaction) is
> > incomplete and may be aborted later (which is something the regular
> > API does not really need to deal with).
> >
> > This also extends the 'test_decoding' plugin, implementing these new
> > stream methods.  The stream_start/start_stop are used to demarcate a
> > chunk of changes streamed for a particular toplevel transaction.
> >
> > This commit simply adds these new APIs and the upcoming patch to
> > "allow the streaming mode in ReorderBuffer" will use these APIs.
>
> LGTM
>

Pushed.  Feel free to submit the remaining patches.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Tue, Jul 28, 2020 at 9:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Jul 26, 2020 at 11:04 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > > Today, I have again looked at the first patch
> > > (v42-0001-Extend-the-logical-decoding-output-plugin-API-wi) and didn't
> > > find any more problems with it so planning to commit the same unless
> > > you or someone else want to add more to it.   Just for ease of others,
> > > "the next patch extends the logical decoding output plugin API with
> > > stream methods".   It adds seven methods to the output plugin API,
> > > adding support for streaming changes for large in-progress
> > > transactions. The methods are stream_start, stream_stop, stream_abort,
> > > stream_commit, stream_change, stream_message, and stream_truncate.
> > > Most of this is a simple extension of the existing methods, with the
> > > semantic difference that the transaction (or subtransaction) is
> > > incomplete and may be aborted later (which is something the regular
> > > API does not really need to deal with).
> > >
> > > This also extends the 'test_decoding' plugin, implementing these new
> > > stream methods.  The stream_start/start_stop are used to demarcate a
> > > chunk of changes streamed for a particular toplevel transaction.
> > >
> > > This commit simply adds these new APIs and the upcoming patch to
> > > "allow the streaming mode in ReorderBuffer" will use these APIs.
> >
> > LGTM
> >
>
> Pushed.  Feel free to submit the remaining patches.

Thanks, please find the rebased patch set.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Ajin Cherian
Date:


On Wed, Jul 29, 2020 at 3:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:


Thanks, please find the rebased patch set.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

I was running some tests on this patch, generally trying to see how the patch affects logical replication when doing bulk inserts. This issue has been raised in the past; see, for example, [1].
My test setup is:
1. Two postgres servers running - A and B
2. Create a pgbench setup on A (pgbench -i -s 5 postgres).
3. Replicate the 3 tables (schema only) on B.
4. Three publications on A for the 3 pgbench tables: pgbench_accounts, pgbench_branches and pgbench_tellers.
5. Three subscriptions on B for the same tables (streaming on and off based on the scenarios described below).

Run pgbench with: pgbench -c 4 -T 100 postgres
While pgbench is running, do a bulk insert on some other table not in the publication list (say t1): INSERT INTO t1 (select i FROM generate_series(1,10000000) i);

Four scenarios:
1. Pgbench with logical replication enabled without bulk insert
Avg TPS (out of 10 runs): 641 TPS
2. Pgbench without logical replication enabled with bulk insert (no pub/sub)
Avg TPS (out of 10 runs): 665 TPS
3. Pgbench with logical replication enabled with bulk insert
Avg TPS (out of 10 runs): 278 TPS
4. Pgbench with logical replication streaming on with bulk insert
Avg TPS (out of 10 runs): 440 TPS

As you can see, the bulk inserts, although on a totally unaffected table, do impact the TPS. But what is good is that enabling streaming improves the TPS (about a 58% improvement).
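(For scenario 4, the only subscriber-side difference is toggling the new
option on the three subscriptions, e.g. with made-up subscription names:)

ALTER SUBSCRIPTION sub_pgbench_accounts SET (streaming = on);
ALTER SUBSCRIPTION sub_pgbench_branches SET (streaming = on);
ALTER SUBSCRIPTION sub_pgbench_tellers  SET (streaming = on);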


regards,
Ajin Cherian
Fujitsu Australia

 

 

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Thu, Jul 30, 2020 at 12:28 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> I was running some tests on this patch. I was generally trying to see how the patch affects logical replication when
> doing bulk inserts. This issue has been raised in the past, for eg: this [1].
> My test setup is:
> 1. Two postgres servers running - A and B
> 2. Create a pgbench setup on A. (pgbench -i -s 5 postgres)
> 3. replicate the 3 tables (schema only) on B.
> 4. Three publishers on A for the 3 tables of pgbench; pgbench_accounts, pgbench_branches and pgbench_tellers;
> 5. Three subscribers on B for the same tables. (streaming on and off based on the scenarios described below)
>
> run pgbench with : pgbench -c 4 -T 100 postgres
> While pgbench is running, Do a bulk insert on some other table not in the publication list (say t1); INSERT INTO t1
> (select i FROM generate_series(1,10000000) i);
>
> Four scenarios:
> 1. Pgbench with logical replication enabled without bulk insert
> Avg TPS (out of 10 runs): 641 TPS
> 2.Pgbench without logical replication enabled with bulk insert (no pub/sub)
> Avg TPS (out of 10 runs): 665 TPS
> 3, Pgbench with logical replication enabled with bulk insert
> Avg TPS (out of 10 runs): 278 TPS
> 4. Pgbench with logical replication streaming on with bulk insert
> Avg TPS (out of 10 runs): 440 TPS
>
> As you can see, the bulk inserts, although on a totally unaffected table, does impact the TPS. But what is good is
> that, enabling streaming improves the TPS (about 58% improvement)
>

Thanks for doing these tests.  It is a good win, and probably the reason
is that after the patch we won't serialize such big transactions (as
shown in Konstantin's email [1]); they will simply be skipped.
Basically, it will try to stream such transactions and will skip them
as they are not required to be sent.

[1] - https://www.postgresql.org/message-id/5f5143cc-9f73-3909-3ef7-d3895cc6cc90%40postgrespro.ru

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Ajin Cherian
Date:

Attaching an updated patch for the streaming stats, based on v2 of Sawada-san's replication slot stats framework and v44 of this patch series. This is one patch that has both the stats framework from Sawada-san (1) as well as my update for streaming, so it can be applied easily on top of v44.

regards,
Ajin Cherian
Fujitsu Australia 
Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Amit Kapila
Date:
On Wed, Jul 29, 2020 at 10:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Thanks, please find the rebased patch set.
>

Few comments on v44-0001-Implement-streaming-mode-in-ReorderBuffer:
============================================================
1.
+-- streaming with subxact, nothing in main
+BEGIN;
+savepoint s1;
+SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
+INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM
generate_series(1, 35) g(i);
+TRUNCATE table stream_test;
+rollback to s1;
+INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM
generate_series(1, 20) g(i);
+COMMIT;

Is the above comment true?  Because it seems to me that Insert is
getting streamed in the main transaction.

2.
+<programlisting>
+postgres[33712]=#* SELECT * FROM
pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes',
'1');
+    lsn    | xid |                       data
+-----------+-----+--------------------------------------------------
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | streaming change for TXN 503
+ 0/16B2300 | 503 | streaming change for TXN 503
+ 0/16B2408 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+ 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
+ 0/16BECA8 | 503 | streaming change for TXN 503
+ 0/16BEDB0 | 503 | streaming change for TXN 503
+ 0/16BEEB8 | 503 | streaming change for TXN 503
+ 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
+(10 rows)
+</programlisting>
+ </para>
+

Is the above example correct?  Because we should include XID in the
stream message only when include_xids option is specified.

3.
 /*
- * Queue a change into a transaction so it can be replayed upon commit.
+ * Record the partial change for the streaming of in-progress transactions.  We
+ * can stream only complete changes so if we have a partial change like toast
+ * table insert or speculative then we mark such a 'txn' so that it can't be
+ * streamed.

/speculative then/speculative insert then

4.  I think we can explain the problems (like we can see the wrong
tuple or see two versions of the same tuple or whatever else wrong can
happen, if possible with some example) related to concurrent aborts
somewhere in comments.



-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From:
Dilip Kumar
Date:
On Tue, Aug 4, 2020 at 10:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jul 29, 2020 at 10:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Thanks, please find the rebased patch set.
> >
>
> Few comments on v44-0001-Implement-streaming-mode-in-ReorderBuffer:
> ============================================================
> 1.
> +-- streaming with subxact, nothing in main
> +BEGIN;
> +savepoint s1;
> +SELECT 'msg5' FROM pg_logical_emit_message(true, 'test', repeat('a', 50));
> +INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM
> generate_series(1, 35) g(i);
> +TRUNCATE table stream_test;
> +rollback to s1;
> +INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM
> generate_series(1, 20) g(i);
> +COMMIT;
>
> Is the above comment true?  Because it seems to me that Insert is
> getting streamed in the main transaction.

Changed the comments.

> 2.
> +<programlisting>
> +postgres[33712]=#* SELECT * FROM
> pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes',
> '1');
> +    lsn    | xid |                       data
> +-----------+-----+--------------------------------------------------
> + 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
> + 0/16B21F8 | 503 | streaming change for TXN 503
> + 0/16B2300 | 503 | streaming change for TXN 503
> + 0/16B2408 | 503 | streaming change for TXN 503
> + 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
> + 0/16B21F8 | 503 | opening a streamed block for transaction TXN 503
> + 0/16BECA8 | 503 | streaming change for TXN 503
> + 0/16BEDB0 | 503 | streaming change for TXN 503
> + 0/16BEEB8 | 503 | streaming change for TXN 503
> + 0/16BEBA0 | 503 | closing a streamed block for transaction TXN 503
> +(10 rows)
> +</programlisting>
> + </para>
> +
>
> Is the above example correct?  Because we should include XID in the
> stream message only when include_xids option is specified.

include_xids is true unless we explicitly set it to false.
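For example (a rough sketch only, reusing the slot name and test_decoding options from the example above), turning the option off drops the XID from the decoded text, while the xid output column of the function is still populated:

-- Sketch: with include-xids disabled the rows no longer carry "TXN <xid>".
SELECT data FROM pg_logical_slot_get_changes('test_slot', NULL, NULL,
       'include-xids', '0', 'stream-changes', '1');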

> 3.
>  /*
> - * Queue a change into a transaction so it can be replayed upon commit.
> + * Record the partial change for the streaming of in-progress transactions.  We
> + * can stream only complete changes so if we have a partial change like toast
> + * table insert or speculative then we mark such a 'txn' so that it can't be
> + * streamed.
>
> /speculative then/speculative insert then

Done

> 4.  I think we can explain the problems (like we can see the wrong
> tuple or see two versions of the same tuple or whatever else wrong can
> happen, if possible with some example) related to concurrent aborts
> somewhere in comments.

Done

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Aug 4, 2020 at 12:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Aug 4, 2020 at 10:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > 4.  I think we can explain the problems (like we can see the wrong
> > tuple or see two versions of the same tuple or whatever else wrong can
> > happen, if possible with some example) related to concurrent aborts
> > somewhere in comments.
>
> Done
>

I have slightly modified the comment added for the above point and
apart from that added/modified a few comments at other places.  I have
also slightly edited the commit message.

@@ -2196,6 +2778,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb,
TransactionId xid,
  change->lsn = lsn;
  change->txn = txn;
  change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
+ change->txn = txn;

This change is not required as the same information is assigned a few
lines before.  So, I have removed this change as well.  Let me know
what you think of the above changes?

Can we add a test for incomplete changes (probably with toast
insertion but we can do it for spec_insert case as well) in
ReorderBuffer such that it needs to first serialize the changes and
then stream it?  I have manually verified such scenarios but it is
good to have the test for the same.
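A rough sketch of such a test could look like the following (an illustration only, not the committed test; the table, slot, data sizes, and option names are assumptions here):

-- Force serialization first: while the toast chunks of a large,
-- non-compressible value are being decoded the change is incomplete,
-- so the transaction spills to disk instead of being streamed.
SET logical_decoding_work_mem = '64kB';
CREATE TABLE stream_test(data text);
SELECT 'init' FROM pg_create_logical_replication_slot('test_slot', 'test_decoding');

BEGIN;
INSERT INTO stream_test
SELECT (SELECT string_agg(md5(i::text || '-' || g.i::text), '')
        FROM generate_series(1, 3000) i)
FROM generate_series(1, 5) g(i);
COMMIT;

-- Once the complete tuple is available, the transaction is streamed,
-- including the previously serialized changes.
SELECT count(*) > 0 AS streamed
FROM pg_logical_slot_get_changes('test_slot', NULL, NULL, 'stream-changes', '1')
WHERE data LIKE 'streaming change%';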

-- 
With Regards,
Amit Kapila.

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Aug 5, 2020 at 6:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Aug 4, 2020 at 12:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Aug 4, 2020 at 10:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > > 4.  I think we can explain the problems (like we can see the wrong
> > > tuple or see two versions of the same tuple or whatever else wrong can
> > > happen, if possible with some example) related to concurrent aborts
> > > somewhere in comments.
> >
> > Done
> >
>
> I have slightly modified the comment added for the above point and
> apart from that added/modified a few comments at other places.  I have
> also slightly edited the commit message.
>
> @@ -2196,6 +2778,7 @@ ReorderBufferAddNewTupleCids(ReorderBuffer *rb,
> TransactionId xid,
>   change->lsn = lsn;
>   change->txn = txn;
>   change->action = REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID;
> + change->txn = txn;
>
> This change is not required as the same information is assigned a few
> lines before.  So, I have removed this change as well.  Let me know
> what you think of the above changes?

Changes look fine to me.

> Can we add a test for incomplete changes (probably with toast
> insertion but we can do it for spec_insert case as well) in
> ReorderBuffer such that it needs to first serialize the changes and
> then stream it?  I have manually verified such scenarios but it is
> good to have the test for the same.

I have added a new test for the same in the stream.sql file.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Aug 5, 2020 at 7:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Aug 5, 2020 at 6:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > Can we add a test for incomplete changes (probably with toast
> > insertion but we can do it for spec_insert case as well) in
> > ReorderBuffer such that it needs to first serialize the changes and
> > then stream it?  I have manually verified such scenarios but it is
> > good to have the test for the same.
>
> I have added a new test for the same in the stream.sql file.
>

Thanks, I have slightly changed the test so that we can consume DDL
changes separately.  I have made a number of other adjustments, like
changing a few more comments (to make them consistent with nearby
comments), removing an unnecessary header file inclusion, and running
pgindent.
The next patch (v47-0001-Implement-streaming-mode-in-ReorderBuffer) in
this series looks good to me.  I am planning to push it after one more
read-through unless you or anyone else has any comments on the same.
The patch I am talking about has the following functionality:

Implement streaming mode in ReorderBuffer. Instead of serializing the
transaction to disk after reaching the logical_decoding_work_mem limit
in memory, we consume the changes we have in memory and invoke stream
API methods added by commit 45fdc9738b. However, sometimes if we have
incomplete toast or speculative insert we spill to the disk because we
can't stream till we have the complete tuple.  And, as soon as we get
the complete tuple we stream the transaction including the serialized
changes. Now that we can stream in-progress transactions, the
concurrent aborts may cause failures when the output plugin consults
catalogs (both system and user-defined). We handle such failures by
returning ERRCODE_TRANSACTION_ROLLBACK sqlerrcode from system table
scan APIs to the backend or WALSender decoding a specific uncommitted
transaction. The decoding logic on the receipt of such a sqlerrcode
aborts the decoding of the current transaction and continues with the
decoding of other transactions. We also provide a new option via SQL
APIs to fetch the changes being streamed.

This patch's functionality can be independently verified by SQL APIs
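For instance, a minimal sketch of such a check (modelled on the test quoted earlier in this thread; the slot and table names are assumptions):

SET logical_decoding_work_mem = '64kB';
CREATE TABLE stream_test(data text);
SELECT 'init' FROM pg_create_logical_replication_slot('test_slot', 'test_decoding');

BEGIN;
INSERT INTO stream_test SELECT repeat('a', 2000) || g.i FROM
generate_series(1, 35) g(i);
COMMIT;

-- With the new option the large transaction shows up as
-- "opening/closing a streamed block" plus "streaming change" rows.
SELECT data FROM pg_logical_slot_get_changes('test_slot', NULL, NULL,
       'stream-changes', '1');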

-- 
With Regards,
Amit Kapila.

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Aug 5, 2020 at 7:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Aug 5, 2020 at 6:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > > Can we add a test for incomplete changes (probably with toast
> > > insertion but we can do it for spec_insert case as well) in
> > > ReorderBuffer such that it needs to first serialize the changes and
> > > then stream it?  I have manually verified such scenarios but it is
> > > good to have the test for the same.
> >
> > I have added a new test for the same in the stream.sql file.
> >
>
> Thanks, I have slightly changed the test so that we can consume DDL
> changes separately.  I have made a number of other adjustments like
> changing few more comments (to make them consistent with nearby
> comments), removed unnecessary inclusion of header file, ran pgindent.
> The next patch (v47-0001-Implement-streaming-mode-in-ReorderBuffer) in
> this series looks good to me.  I am planning to push it after one more
> read-through unless you or anyone else has any comments on the same.
> The patch I am talking about has the following functionality:
>
> Implement streaming mode in ReorderBuffer. Instead of serializing the
> transaction to disk after reaching the logical_decoding_work_mem limit
> in memory, we consume the changes we have in memory and invoke stream
> API methods added by commit 45fdc9738b. However, sometimes if we have
> incomplete toast or speculative insert we spill to the disk because we
> can't stream till we have the complete tuple.  And, as soon as we get
> the complete tuple we stream the transaction including the serialized
> changes. Now that we can stream in-progress transactions, the
> concurrent aborts may cause failures when the output plugin consults
> catalogs (both system and user-defined). We handle such failures by
> returning ERRCODE_TRANSACTION_ROLLBACK sqlerrcode from system table
> scan APIs to the backend or WALSender decoding a specific uncommitted
> transaction. The decoding logic on the receipt of such a sqlerrcode
> aborts the decoding of the current transaction and continues with the
> decoding of other transactions. We also provide a new option via SQL
> APIs to fetch the changes being streamed.
>
> This patch's functionality can be independently verified by SQL APIs

Your changes look fine to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Aug 7, 2020 at 2:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
..
> > This patch's functionality can be independently verified by SQL APIs
>
> Your changes look fine to me.
>

I have pushed that patch last week and attached are the remaining
patches. I have made a few changes in the next patch
0001-Extend-the-BufFile-interface.patch and have some comments on it
which are as below:

1.
  case SEEK_END:
- /* could be implemented, not needed currently */
+
+ /*
+ * Get the file size of the last file to get the last offset of
+ * that file.
+ */
+ newFile = file->numFiles - 1;
+ newOffset = FileSize(file->files[file->numFiles - 1]);
+ if (newOffset < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not determine size of temporary file \"%s\" from
BufFile \"%s\": %m",
+ FilePathName(file->files[file->numFiles - 1]),
+ file->name)));
+ break;
  break;

There is no need for multiple breaks in the above code. I have fixed
this one in the attached patch.

2.
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
+{
+ int newFile = file->numFiles;
+ off_t newOffset = file->curOffset;
+ char segment_name[MAXPGPATH];
+ int i;
+
+ /* Loop over all the files upto the fileno which we want to truncate. */
+ for (i = file->numFiles - 1; i >= fileno; i--)
+ {
+ /*
+ * Except the fileno, we can directly delete other files.  If the
+ * offset is 0 then we can delete the fileno file as well unless it is
+ * the first file.
+ */
+ if ((i != fileno || offset == 0) && fileno != 0)
+ {
+ SharedSegmentName(segment_name, file->name, i);
+ FileClose(file->files[i]);
+ if (!SharedFileSetDelete(file->fileset, segment_name, true))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not delete shared fileset \"%s\": %m",
+ segment_name)));
+ newFile--;
+ newOffset = MAX_PHYSICAL_FILESIZE;
+ }
+ else
+ {
+ if (FileTruncate(file->files[i], offset,
+ WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not truncate file \"%s\": %m",
+ FilePathName(file->files[i]))));
+
+ newOffset = offset;
+ }
+ }
+
+ file->numFiles = newFile;
+ file->curOffset = newOffset;
+}

In the end, you have only set the 'numFiles' and 'curOffset' members of
BufFile and left the others. I think other members like 'curFile' also
need to be set, especially for the case where we have deleted segments
at the end. Also, shouldn't we set 'pos' and 'nbytes' as we do in
BufFileSeek? If there is some reason not to set these other members
then maybe it is better to add a comment to make it clear.

Another thing we need to think about here is whether we need to flush
a dirty buffer's data. Consider a case where we truncate
the file up to a position that falls within the buffer. Now we might
truncate the file and part of the buffer contents will become invalid;
if we later flush such a buffer then the file can contain garbage.
Maybe this will be handled if we update the position in the buffer
appropriately, but all of this should be explained in comments.
If what I said is correct, then we still can skip buffer flush in some
cases as we do in BufFileSeek. Also, consider if we need to do other
handling (convert seek to "start of next seg" to "end of last seg") as
we do after changing the seek position in BufFileSeek.

3.
/*
 * Initialize a space for temporary files that can be opened by other backends.
 * Other backends must attach to it before accessing it.  Associate this
 * SharedFileSet with 'seg'.  Any contained files will be deleted when the
 * last backend detaches.
 *
 * We can also use this interface if the temporary files are used only by
 * single backend but the files need to be opened and closed multiple times
 * and also the underlying files need to survive across transactions.  For
 * such cases, dsm segment 'seg' should be passed as NULL.  We remove such
 * files on proc exit.
 *
 * Files will be distributed over the tablespaces configured in
 * temp_tablespaces.
 *
 * Under the covers the set is one or more directories which will eventually
 * be deleted when there are no backends attached.
 */
void
SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
{
..

I think we can remove the part of the above comment after 'eventually
be deleted' (see last sentence in comment) because now the files can
be removed in more than one way and we have explained that in the
comments before this last sentence of the comment. If you can rephrase
it differently to cover the other case as well, then that is fine too.

-- 
With Regards,
Amit Kapila.

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Aug 13, 2020 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Aug 7, 2020 at 2:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> ..
> > > This patch's functionality can be independently verified by SQL APIs
> >
> > Your changes look fine to me.
> >
>
> I have pushed that patch last week and attached are the remaining
> patches. I have made a few changes in the next patch
> 0001-Extend-the-BufFile-interface.patch and have some comments on it
> which are as below:
>

Few more comments on the latest patches:
v48-0002-Add-support-for-streaming-to-built-in-replicatio
1. It appears to me that we don't remove the temporary folders created
by the apply worker. So, we have folders like
pgsql_tmp15324.0.sharedfileset in base/pgsql_tmp directory even when
the apply worker exits. I think we can remove these by calling
PathNameDeleteTemporaryDir in SharedFileSetUnregister while removing
the fileset from registered filesetlist.

2.
+typedef struct SubXactInfo
+{
+ TransactionId xid; /* XID of the subxact */
+ int fileno; /* file number in the buffile */
+ off_t offset; /* offset in the file */
+} SubXactInfo;
+
+static uint32 nsubxacts = 0;
+static uint32 nsubxacts_max = 0;
+static SubXactInfo *subxacts = NULL;
+static TransactionId subxact_last = InvalidTransactionId;

Will it be better if we move all the subxact related variables (like
nsubxacts, nsubxacts_max and subxact_last) inside SubXactInfo struct
as all the information anyway is related to sub-transactions?
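One possible shape for that (just a sketch; the struct name and the exact field set here are assumptions, not necessarily what the patch ends up with):

/* Sketch: keep all bookkeeping for streamed sub-transactions together. */
typedef struct ApplySubXactData
{
	uint32		nsubxacts;		/* number of sub-transactions recorded */
	uint32		nsubxacts_max;	/* allocated length of the subxacts array */
	TransactionId subxact_last; /* xid of the last sub-transaction seen */
	SubXactInfo *subxacts;		/* per-subxact file number and offset */
} ApplySubXactData;

static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};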

3.
+ /*
+ * If there is no subtransaction then nothing to do,  but if already have
+ * subxact file then delete that.
+ */

extra space before 'but' in the above sentence is not required.

v48-0001-Extend-the-BufFile-interface
4.
- * SharedFileSets can also be used by backends when the temporary files need
- * to be opened/closed multiple times and the underlying files need to survive
+ * SharedFileSets can be used by backends when the temporary files need to be
+ * opened/closed multiple times and the underlying files need to survive
  * across transactions.
  *

No need of 'also' in the above sentence.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Thomas Munro
Date:
On Thu, Aug 13, 2020 at 6:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have pushed that patch last week and attached are the remaining
> patches. I have made a few changes in the next patch
> 0001-Extend-the-BufFile-interface.patch and have some comments on it
> which are as below:

Hi Amit,

I noticed that Konstantin Knizhnik's CF entry 2386 calls
table_scan_XXX() functions from an extension, namely
contrib/auto_explain, and started failing to build on Windows after
commit 7259736a.  This seems to be due to the new global variables
CheckXidAlive and bsysscan, which probably need PGDLLIMPORT if they
are accessed from inline functions that are part of the API that we
expect extensions to be allowed to call.
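For reference, a sketch of the kind of change that seems to be needed (assuming the declarations live in a backend header such as src/include/access/xact.h; the exact location may differ):

/* Sketch: export the variables so extensions can reference them on Windows. */
extern PGDLLIMPORT TransactionId CheckXidAlive;
extern PGDLLIMPORT bool bsysscan;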



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Aug 14, 2020 at 10:11 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Thu, Aug 13, 2020 at 6:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I have pushed that patch last week and attached are the remaining
> > patches. I have made a few changes in the next patch
> > 0001-Extend-the-BufFile-interface.patch and have some comments on it
> > which are as below:
>
> Hi Amit,
>
> I noticed that Konstantin Knizhnik's CF entry 2386 calls
> table_scan_XXX() functions from an extension, namely
> contrib/auto_explain, and started failing to build on Windows after
> commit 7259736a.  This seems to be due to the new global variables
> CheckXidAlive and bsysscan, which probably need PGDLLIMPORT if they
> are accessed from inline functions that are part of the API that we
> expect extensions to be allowed to call.
>

Yeah, that makes sense. I will take care of that later today or
tomorrow. We have not noticed that because currently none of the
extensions is using those functions. BTW, I noticed that after
failure, the next run is green, why so? Is the next run not on
windows?

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Thomas Munro
Date:
On Fri, Aug 14, 2020 at 6:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yeah, that makes sense. I will take care of that later today or
> tomorrow. We have not noticed that because currently none of the
> extensions is using those functions. BTW, I noticed that after
> failure, the next run is green, why so? Is the next run not on
> windows?

The three cfbot results are for applying the patch, testing on Windows
and testing on Ubuntu in that order.  It's not at all clear and I'll
probably find a better way to display it when I get around to adding
some more operating systems, maybe with some OS icons or something
like that...



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Sat, Aug 15, 2020 at 4:14 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, Aug 14, 2020 at 6:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Yeah, that makes sense. I will take care of that later today or
> > tomorrow. We have not noticed that because currently none of the
> > extensions is using those functions. BTW, I noticed that after
> > failure, the next run is green, why so? Is the next run not on
> > windows?
>
> The three cfbot results are for applying the patch, testing on Windows
> and testing on Ubuntu in that order.  It's not at all clear and I'll
> probably find a better way to display it when I get around to adding
> some more operating systems, maybe with some OS icons or something
> like that...
>

Good to know, anyway, I have pushed a patch to mark those variables
with PGDLLIMPORT.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Aug 13, 2020 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Aug 7, 2020 at 2:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> ..
> > > This patch's functionality can be independently verified by SQL APIs
> >
> > Your changes look fine to me.
> >
>
> I have pushed that patch last week and attached are the remaining
> patches. I have made a few changes in the next patch
> 0001-Extend-the-BufFile-interface.patch and have some comments on it
> which are as below:
>
> 1.
>   case SEEK_END:
> - /* could be implemented, not needed currently */
> +
> + /*
> + * Get the file size of the last file to get the last offset of
> + * that file.
> + */
> + newFile = file->numFiles - 1;
> + newOffset = FileSize(file->files[file->numFiles - 1]);
> + if (newOffset < 0)
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not determine size of temporary file \"%s\" from
> BufFile \"%s\": %m",
> + FilePathName(file->files[file->numFiles - 1]),
> + file->name)));
> + break;
>   break;
>
> There is no need for multiple breaks in the above code. I have fixed
> this one in the attached patch.

Ok.

> 2.
> +void
> +BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
> +{
> + int newFile = file->numFiles;
> + off_t newOffset = file->curOffset;
> + char segment_name[MAXPGPATH];
> + int i;
> +
> + /* Loop over all the files upto the fileno which we want to truncate. */
> + for (i = file->numFiles - 1; i >= fileno; i--)
> + {
> + /*
> + * Except the fileno, we can directly delete other files.  If the
> + * offset is 0 then we can delete the fileno file as well unless it is
> + * the first file.
> + */
> + if ((i != fileno || offset == 0) && fileno != 0)
> + {
> + SharedSegmentName(segment_name, file->name, i);
> + FileClose(file->files[i]);
> + if (!SharedFileSetDelete(file->fileset, segment_name, true))
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not delete shared fileset \"%s\": %m",
> + segment_name)));
> + newFile--;
> + newOffset = MAX_PHYSICAL_FILESIZE;
> + }
> + else
> + {
> + if (FileTruncate(file->files[i], offset,
> + WAIT_EVENT_BUFFILE_TRUNCATE) < 0)
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not truncate file \"%s\": %m",
> + FilePathName(file->files[i]))));
> +
> + newOffset = offset;
> + }
> + }
> +
> + file->numFiles = newFile;
> + file->curOffset = newOffset;
> +}
>
> In the end, you have only set the 'numFiles' and 'curOffset' members of
> BufFile and left the others. I think other members like 'curFile' also
> need to be set, especially for the case where we have deleted segments
> at the end.

Yes, this must be set.

> Also, shouldn't we set 'pos' and 'nbytes' as we do in BufFileSeek?
> If there is some reason not to set these other members then maybe it
> is better to add a comment to make it clear.

IMHO, we can directly call BufFileFlush; this will reset pos and
nbytes, and we can directly set curOffset to the absolute location.
The next BufFileRead/BufFileWrite will re-read the buffer, so
everything will be fine.

> Another thing we need to think here whether we need to flush the
> buffer data for the dirty buffer? Consider a case where we truncate
> the file up to a position that falls in the buffer. Now we might
> truncate the file and part of buffer contents will become invalid,
> next time if we flush such a buffer then the file can contain the
> garbage or maybe this will be handled if we update the position in
> buffer appropriately but all of this should be explained in comments.
> If what I said is correct, then we still can skip buffer flush in some
> cases as we do in BufFileSeek.

I think in all the cases we can flush the buffer and reset the pos and nbytes.

> Also, consider if we need to do other
> handling (convert seek to "start of next seg" to "end of last seg") as
> we do after changing the seek position in BufFileSeek.

We also do this when we truncate the complete file; see this:
+ if ((i != fileno || offset == 0) && fileno != 0)
+ {
+ SharedSegmentName(segment_name, file->name, i);
+ FileClose(file->files[i]);
+ if (!SharedFileSetDelete(file->fileset, segment_name, true))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not delete shared fileset \"%s\": %m",
+ segment_name)));
+ newFile--;
+ newOffset = MAX_PHYSICAL_FILESIZE;
+ }

> 3.
> /*
>  * Initialize a space for temporary files that can be opened by other backends.
>  * Other backends must attach to it before accessing it.  Associate this
>  * SharedFileSet with 'seg'.  Any contained files will be deleted when the
>  * last backend detaches.
>  *
>  * We can also use this interface if the temporary files are used only by
>  * single backend but the files need to be opened and closed multiple times
>  * and also the underlying files need to survive across transactions.  For
>  * such cases, dsm segment 'seg' should be passed as NULL.  We remove such
>  * files on proc exit.
>  *
>  * Files will be distributed over the tablespaces configured in
>  * temp_tablespaces.
>  *
>  * Under the covers the set is one or more directories which will eventually
>  * be deleted when there are no backends attached.
>  */
> void
> SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
> {
> ..
>
> I think we can remove the part of the above comment after 'eventually
> be deleted' (see last sentence in comment) because now the files can
> be removed in more than one way and we have explained that in the
> comments before this last sentence of the comment. If you can rephrase
> it differently to cover the other case as well, then that is fine too.

I think it makes sense to remove, so I have removed it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Aug 13, 2020 at 6:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 13, 2020 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Aug 7, 2020 at 2:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > ..
> > > > This patch's functionality can be independently verified by SQL APIs
> > >
> > > Your changes look fine to me.
> > >
> >
> > I have pushed that patch last week and attached are the remaining
> > patches. I have made a few changes in the next patch
> > 0001-Extend-the-BufFile-interface.patch and have some comments on it
> > which are as below:
> >
>
> Few more comments on the latest patches:
> v48-0002-Add-support-for-streaming-to-built-in-replicatio
> 1. It appears to me that we don't remove the temporary folders created
> by the apply worker. So, we have folders like
> pgsql_tmp15324.0.sharedfileset in base/pgsql_tmp directory even when
> the apply worker exits. I think we can remove these by calling
> PathNameDeleteTemporaryDir in SharedFileSetUnregister while removing
> the fileset from registered filesetlist.

I think we need to call SharedFileSetDeleteAll(input_fileset) from
SharedFileSetUnregister, so that all the directories created for this
fileset are removed.

> 2.
> +typedef struct SubXactInfo
> +{
> + TransactionId xid; /* XID of the subxact */
> + int fileno; /* file number in the buffile */
> + off_t offset; /* offset in the file */
> +} SubXactInfo;
> +
> +static uint32 nsubxacts = 0;
> +static uint32 nsubxacts_max = 0;
> +static SubXactInfo *subxacts = NULL;
> +static TransactionId subxact_last = InvalidTransactionId;
>
> Will it be better if we move all the subxact related variables (like
> nsubxacts, nsubxacts_max and subxact_last) inside SubXactInfo struct
> as all the information anyway is related to sub-transactions?

I have moved them all to a structure.

> 3.
> + /*
> + * If there is no subtransaction then nothing to do,  but if already have
> + * subxact file then delete that.
> + */
>
> extra space before 'but' in the above sentence is not required.

Fixed

> v48-0001-Extend-the-BufFile-interface
> 4.
> - * SharedFileSets can also be used by backends when the temporary files need
> - * to be opened/closed multiple times and the underlying files need to survive
> + * SharedFileSets can be used by backends when the temporary files need to be
> + * opened/closed multiple times and the underlying files need to survive
>   * across transactions.
>   *
>
> No need of 'also' in the above sentence.

Fixed


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Sat, Aug 15, 2020 at 3:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Aug 13, 2020 at 6:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Aug 13, 2020 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Aug 7, 2020 at 2:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > ..
> > > > > This patch's functionality can be independently verified by SQL APIs
> > > >
> > > > Your changes look fine to me.
> > > >
> > >
> > > I have pushed that patch last week and attached are the remaining
> > > patches. I have made a few changes in the next patch
> > > 0001-Extend-the-BufFile-interface.patch and have some comments on it
> > > which are as below:
> > >
> >
> > Few more comments on the latest patches:
> > v48-0002-Add-support-for-streaming-to-built-in-replicatio
> > 1. It appears to me that we don't remove the temporary folders created
> > by the apply worker. So, we have folders like
> > pgsql_tmp15324.0.sharedfileset in base/pgsql_tmp directory even when
> > the apply worker exits. I think we can remove these by calling
> > PathNameDeleteTemporaryDir in SharedFileSetUnregister while removing
> > the fileset from registered filesetlist.
>
> I think we need to call SharedFileSetDeleteAll(input_fileset), from
> SharedFileSetUnregister, so that all the directories created for this
> fileset are removed
>
> > 2.
> > +typedef struct SubXactInfo
> > +{
> > + TransactionId xid; /* XID of the subxact */
> > + int fileno; /* file number in the buffile */
> > + off_t offset; /* offset in the file */
> > +} SubXactInfo;
> > +
> > +static uint32 nsubxacts = 0;
> > +static uint32 nsubxacts_max = 0;
> > +static SubXactInfo *subxacts = NULL;
> > +static TransactionId subxact_last = InvalidTransactionId;
> >
> > Will it be better if we move all the subxact related variables (like
> > nsubxacts, nsubxacts_max and subxact_last) inside SubXactInfo struct
> > as all the information anyway is related to sub-transactions?
>
> I have moved them all to a structure.
>
> > 3.
> > + /*
> > + * If there is no subtransaction then nothing to do,  but if already have
> > + * subxact file then delete that.
> > + */
> >
> > extra space before 'but' in the above sentence is not required.
>
> Fixed
>
> > v48-0001-Extend-the-BufFile-interface
> > 4.
> > - * SharedFileSets can also be used by backends when the temporary files need
> > - * to be opened/closed multiple times and the underlying files need to survive
> > + * SharedFileSets can be used by backends when the temporary files need to be
> > + * opened/closed multiple times and the underlying files need to survive
> >   * across transactions.
> >   *
> >
> > No need of 'also' in the above sentence.
>
> Fixed
>

In the last patch, v49-0001, there is one issue.  Basically, I have called
BufFileFlush in all the cases.  But, ideally, we cannot call this if
the underlying files are deleted/truncated because those files/blocks
might not exist now.  So I think if the truncate position is within
the same buffer we just need to adjust the buffer; otherwise we just
need to set curFile and curOffset to the absolute values and set
pos and nbytes to 0.  The attached patch fixes this issue.

+ errmsg("could not truncate file \"%s\": %m",
+ FilePathName(file->files[i]))));
+ curOffset = offset;
+ }
+ }
+
+ /* Otherwise, must reposition buffer, so flush any dirty data */
+ BufFileFlush(file);
+

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>
> In last patch v49-0001, there is one issue,  Basically, I have called
> BufFileFlush in all the cases.  But, ideally, we can not call this if
> the underlying files are deleted/truncated because those files/blocks
> might not exist now.  So I think if the truncate position is within
> the same buffer we just need to adjust the buffer,  otherwise we just
> need to set the currFile and currOffset to the absolute number and set
> the pos and nbytes 0.  Attached patch fixes this issue.
>

Few comments on the latest patch v50-0001-Extend-the-BufFile-interface
1.
+
+ /*
+ * If the truncate point is within existing buffer then we can just
+ * adjust pos-within-buffer, without flushing buffer.  Otherwise,
+ * we don't need to do anything because we have already deleted/truncated
+ * the underlying files.
+ */
+ if (curFile == file->curFile &&
+ curOffset >= file->curOffset &&
+ curOffset <= file->curOffset + file->nbytes)
+ {
+ file->pos = (int) (curOffset - file->curOffset);
+ return;
+ }

I think in this case you have set the position correctly but what
about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes'
because the contents of the buffer are still valid but I don't think
the same is true here.

2.
+ int curFile = file->curFile;
+ off_t curOffset = file->curOffset;

I find the previous naming (newFile, newOffset) was better as it
distinguishes them from BufFile variables.

3.
+void
+SharedFileSetUnregister(SharedFileSet *input_fileset)
+{
..
+ /* Delete all files in the set */
+ SharedFileSetDeleteAll(input_fileset);
..
}

I am not sure if this is completely correct because we call this
function (SharedFileSetUnregister) from BufFileDeleteShared which
would have already removed all the required files. This raises the
question in my mind whether it is correct to call
SharedFileSetUnregister from BufFileDeleteShared from the API
perspective as one might not want to remove the entire fileset at that
point of time. It will work for your use case (where while removing
buffile you also want to remove the entire fileset) but not sure if it
is generic enough. For your case, I wonder if we can directly call
SharedFileSetDeleteAll and we can have a call like
SharedFileSetUnregister which will be called from it.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> >
> > In last patch v49-0001, there is one issue,  Basically, I have called
> > BufFileFlush in all the cases.  But, ideally, we can not call this if
> > the underlying files are deleted/truncated because those files/blocks
> > might not exist now.  So I think if the truncate position is within
> > the same buffer we just need to adjust the buffer,  otherwise we just
> > need to set the currFile and currOffset to the absolute number and set
> > the pos and nbytes 0.  Attached patch fixes this issue.
> >
>
> Few comments on the latest patch v50-0001-Extend-the-BufFile-interface
> 1.
> +
> + /*
> + * If the truncate point is within existing buffer then we can just
> + * adjust pos-within-buffer, without flushing buffer.  Otherwise,
> + * we don't need to do anything because we have already deleted/truncated
> + * the underlying files.
> + */
> + if (curFile == file->curFile &&
> + curOffset >= file->curOffset &&
> + curOffset <= file->curOffset + file->nbytes)
> + {
> + file->pos = (int) (curOffset - file->curOffset);
> + return;
> + }
>
> I think in this case you have set the position correctly but what
> about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes'
> because the contents of the buffer are still valid but I don't think
> the same is true here.
>

I think you need to set 'nbytes' to curOffset as per your current
patch as that is the new size of the file.
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno,
off_t offset)
                curOffset <= file->curOffset + file->nbytes)
        {
                file->pos = (int) (curOffset - file->curOffset);
+               file->nbytes = (int) curOffset;
                return;
        }

Also, what about file 'numFiles', that can also change due to the
removal of certain files, shouldn't that be also set in this case?

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Aug 19, 2020 at 10:11 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> >
> > In last patch v49-0001, there is one issue,  Basically, I have called
> > BufFileFlush in all the cases.  But, ideally, we can not call this if
> > the underlying files are deleted/truncated because those files/blocks
> > might not exist now.  So I think if the truncate position is within
> > the same buffer we just need to adjust the buffer,  otherwise we just
> > need to set the currFile and currOffset to the absolute number and set
> > the pos and nbytes 0.  Attached patch fixes this issue.
> >
>
> Few comments on the latest patch v50-0001-Extend-the-BufFile-interface
> 1.
> +
> + /*
> + * If the truncate point is within existing buffer then we can just
> + * adjust pos-within-buffer, without flushing buffer.  Otherwise,
> + * we don't need to do anything because we have already deleted/truncated
> + * the underlying files.
> + */
> + if (curFile == file->curFile &&
> + curOffset >= file->curOffset &&
> + curOffset <= file->curOffset + file->nbytes)
> + {
> + file->pos = (int) (curOffset - file->curOffset);
> + return;
> + }
>
> I think in this case you have set the position correctly but what
> about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes'
> because the contents of the buffer are still valid but I don't think
> the same is true here.

Right, I think we need to set nbytes to the new file->pos, as shown below:

> + file->pos = (int) (curOffset - file->curOffset);
>  file->nbytes = file->pos


> 2.
> + int curFile = file->curFile;
> + off_t curOffset = file->curOffset;
>
> I find the previous naming (newFile, newOffset) was better as it
> distinguishes them from BufFile variables.

Ok

> 3.
> +void
> +SharedFileSetUnregister(SharedFileSet *input_fileset)
> +{
> ..
> + /* Delete all files in the set */
> + SharedFileSetDeleteAll(input_fileset);
> ..
> }
>
> I am not sure if this is completely correct because we call this
> function (SharedFileSetUnregister) from BufFileDeleteShared which
> would have already removed all the required files. This raises the
> question in my mind whether it is correct to call
> SharedFileSetUnregister from BufFileDeleteShared from the API
> perspective as one might not want to remove the entire fileset at that
> point of time. It will work for your use case (where while removing
> buffile you also want to remove the entire fileset) but not sure if it
> is generic enough. For your case, I wonder if we can directly call
> SharedFileSetDeleteAll and we can have a call like
> SharedFileSetUnregister which will be called from it.

Yeah, it makes more sense to me that we directly call
SharedFileSetDeleteAll instead of calling BufFileDeleteShared, and that
we call SharedFileSetUnregister from SharedFileSetDeleteAll.
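A rough sketch of the resulting call order (based only on the function names discussed in this thread, not on the final patch):

/*
 * Sketch: SharedFileSetDeleteAll() removes the directories created for the
 * fileset and, for a backend-local fileset, also unregisters it so that it
 * is not cleaned up a second time at proc exit.
 */
void
SharedFileSetDeleteAll(SharedFileSet *fileset)
{
	char		dirpath[MAXPGPATH];
	int			i;

	for (i = 0; i < fileset->ntablespaces; i++)
	{
		SharedFileSetPath(dirpath, fileset, fileset->tablespaces[i]);
		PathNameDeleteTemporaryDir(dirpath);
	}

	/* Proposed addition: drop it from the registered fileset list here. */
	SharedFileSetUnregister(fileset);
}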

I will make these changes and send the patch after some testing.




--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Aug 19, 2020 at 12:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > >
> > > In last patch v49-0001, there is one issue,  Basically, I have called
> > > BufFileFlush in all the cases.  But, ideally, we can not call this if
> > > the underlying files are deleted/truncated because those files/blocks
> > > might not exist now.  So I think if the truncate position is within
> > > the same buffer we just need to adjust the buffer,  otherwise we just
> > > need to set the currFile and currOffset to the absolute number and set
> > > the pos and nbytes 0.  Attached patch fixes this issue.
> > >
> >
> > Few comments on the latest patch v50-0001-Extend-the-BufFile-interface
> > 1.
> > +
> > + /*
> > + * If the truncate point is within existing buffer then we can just
> > + * adjust pos-within-buffer, without flushing buffer.  Otherwise,
> > + * we don't need to do anything because we have already deleted/truncated
> > + * the underlying files.
> > + */
> > + if (curFile == file->curFile &&
> > + curOffset >= file->curOffset &&
> > + curOffset <= file->curOffset + file->nbytes)
> > + {
> > + file->pos = (int) (curOffset - file->curOffset);
> > + return;
> > + }
> >
> > I think in this case you have set the position correctly but what
> > about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes'
> > because the contents of the buffer are still valid but I don't think
> > the same is true here.
> >
>
> I think you need to set 'nbytes' to curOffset as per your current
> patch as that is the new size of the file.
> --- a/src/backend/storage/file/buffile.c
> +++ b/src/backend/storage/file/buffile.c
> @@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno,
> off_t offset)
>                 curOffset <= file->curOffset + file->nbytes)
>         {
>                 file->pos = (int) (curOffset - file->curOffset);
> +               file->nbytes = (int) curOffset;
>                 return;
>         }
>
> Also, what about file 'numFiles', that can also change due to the
> removal of certain files, shouldn't that be also set in this case

Right, we need to set the numFile.  I will fix this as well.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Aug 19, 2020 at 1:35 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Aug 19, 2020 at 12:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > >
> > > > In last patch v49-0001, there is one issue,  Basically, I have called
> > > > BufFileFlush in all the cases.  But, ideally, we can not call this if
> > > > the underlying files are deleted/truncated because those files/blocks
> > > > might not exist now.  So I think if the truncate position is within
> > > > the same buffer we just need to adjust the buffer,  otherwise we just
> > > > need to set the currFile and currOffset to the absolute number and set
> > > > the pos and nbytes 0.  Attached patch fixes this issue.
> > > >
> > >
> > > Few comments on the latest patch v50-0001-Extend-the-BufFile-interface
> > > 1.
> > > +
> > > + /*
> > > + * If the truncate point is within existing buffer then we can just
> > > + * adjust pos-within-buffer, without flushing buffer.  Otherwise,
> > > + * we don't need to do anything because we have already deleted/truncated
> > > + * the underlying files.
> > > + */
> > > + if (curFile == file->curFile &&
> > > + curOffset >= file->curOffset &&
> > > + curOffset <= file->curOffset + file->nbytes)
> > > + {
> > > + file->pos = (int) (curOffset - file->curOffset);
> > > + return;
> > > + }
> > >
> > > I think in this case you have set the position correctly but what
> > > about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes'
> > > because the contents of the buffer are still valid but I don't think
> > > the same is true here.
> > >
> >
> > I think you need to set 'nbytes' to curOffset as per your current
> > patch as that is the new size of the file.
> > --- a/src/backend/storage/file/buffile.c
> > +++ b/src/backend/storage/file/buffile.c
> > @@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno,
> > off_t offset)
> >                 curOffset <= file->curOffset + file->nbytes)
> >         {
> >                 file->pos = (int) (curOffset - file->curOffset);
> > +               file->nbytes = (int) curOffset;
> >                 return;
> >         }
> >
> > Also, what about file 'numFiles', that can also change due to the
> > removal of certain files, shouldn't that be also set in this case
>
> Right, we need to set the numFile.  I will fix this as well.

I think there are a couple more problems in the truncate APIs.
Basically, if the curFile and curOffset are already smaller than the
truncate location, the truncate should not change them.  So the
truncate should only change the curFile and curOffset if it is
truncating the part of the file that the curFile or curOffset is
pointing into.  I will work on those along with your other comments and
submit the updated patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Aug 20, 2020 at 1:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Aug 19, 2020 at 1:35 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Aug 19, 2020 at 12:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > >
> > > > >
> > > > > In last patch v49-0001, there is one issue,  Basically, I have called
> > > > > BufFileFlush in all the cases.  But, ideally, we can not call this if
> > > > > the underlying files are deleted/truncated because those files/blocks
> > > > > might not exist now.  So I think if the truncate position is within
> > > > > the same buffer we just need to adjust the buffer,  otherwise we just
> > > > > need to set the currFile and currOffset to the absolute number and set
> > > > > the pos and nbytes 0.  Attached patch fixes this issue.
> > > > >
> > > >
> > > > Few comments on the latest patch v50-0001-Extend-the-BufFile-interface
> > > > 1.
> > > > +
> > > > + /*
> > > > + * If the truncate point is within existing buffer then we can just
> > > > + * adjust pos-within-buffer, without flushing buffer.  Otherwise,
> > > > + * we don't need to do anything because we have already deleted/truncated
> > > > + * the underlying files.
> > > > + */
> > > > + if (curFile == file->curFile &&
> > > > + curOffset >= file->curOffset &&
> > > > + curOffset <= file->curOffset + file->nbytes)
> > > > + {
> > > > + file->pos = (int) (curOffset - file->curOffset);
> > > > + return;
> > > > + }
> > > >
> > > > I think in this case you have set the position correctly but what
> > > > about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes'
> > > > because the contents of the buffer are still valid but I don't think
> > > > the same is true here.
> > > >
> > >
> > > I think you need to set 'nbytes' to curOffset as per your current
> > > patch as that is the new size of the file.
> > > --- a/src/backend/storage/file/buffile.c
> > > +++ b/src/backend/storage/file/buffile.c
> > > @@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno,
> > > off_t offset)
> > >                 curOffset <= file->curOffset + file->nbytes)
> > >         {
> > >                 file->pos = (int) (curOffset - file->curOffset);
> > > +               file->nbytes = (int) curOffset;
> > >                 return;
> > >         }
> > >
> > > Also, what about file 'numFiles', that can also change due to the
> > > removal of certain files, shouldn't that be also set in this case
> >
> > Right, we need to set the numFile.  I will fix this as well.
>
> I think there are a couple of more problems in the truncate APIs,
> basically, if the curFile and curOffset are already smaller than the
> truncate location the truncate should not change that.  So the
> truncate should only change the curFile and curOffset if it is
> truncating the part of the file where the curFile or curOffset is
> pointing.
>

Right, I think this can happen if one has changed those by BufFileSeek
before doing truncate. We should fix that case as well.

>  I will work on those along with your other comments and
> submit the updated patch.
>

Thanks.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 20, 2020 at 1:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Aug 19, 2020 at 1:35 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Wed, Aug 19, 2020 at 12:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Wed, Aug 19, 2020 at 10:10 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Mon, Aug 17, 2020 at 6:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > > >
> > > > > >
> > > > > > In last patch v49-0001, there is one issue,  Basically, I have called
> > > > > > BufFileFlush in all the cases.  But, ideally, we can not call this if
> > > > > > the underlying files are deleted/truncated because those files/blocks
> > > > > > might not exist now.  So I think if the truncate position is within
> > > > > > the same buffer we just need to adjust the buffer,  otherwise we just
> > > > > > need to set the currFile and currOffset to the absolute number and set
> > > > > > the pos and nbytes 0.  Attached patch fixes this issue.
> > > > > >
> > > > >
> > > > > Few comments on the latest patch v50-0001-Extend-the-BufFile-interface
> > > > > 1.
> > > > > +
> > > > > + /*
> > > > > + * If the truncate point is within existing buffer then we can just
> > > > > + * adjust pos-within-buffer, without flushing buffer.  Otherwise,
> > > > > + * we don't need to do anything because we have already deleted/truncated
> > > > > + * the underlying files.
> > > > > + */
> > > > > + if (curFile == file->curFile &&
> > > > > + curOffset >= file->curOffset &&
> > > > > + curOffset <= file->curOffset + file->nbytes)
> > > > > + {
> > > > > + file->pos = (int) (curOffset - file->curOffset);
> > > > > + return;
> > > > > + }
> > > > >
> > > > > I think in this case you have set the position correctly but what
> > > > > about file->nbytes? In BufFileSeek, it was okay not to update 'nbytes'
> > > > > because the contents of the buffer are still valid but I don't think
> > > > > the same is true here.
> > > > >
> > > >
> > > > I think you need to set 'nbytes' to curOffset as per your current
> > > > patch as that is the new size of the file.
> > > > --- a/src/backend/storage/file/buffile.c
> > > > +++ b/src/backend/storage/file/buffile.c
> > > > @@ -912,6 +912,7 @@ BufFileTruncateShared(BufFile *file, int fileno,
> > > > off_t offset)
> > > >                 curOffset <= file->curOffset + file->nbytes)
> > > >         {
> > > >                 file->pos = (int) (curOffset - file->curOffset);
> > > > +               file->nbytes = (int) curOffset;
> > > >                 return;
> > > >         }
> > > >
> > > > Also, what about file 'numFiles', that can also change due to the
> > > > removal of certain files, shouldn't that be also set in this case
> > >
> > > Right, we need to set the numFile.  I will fix this as well.
> >
> > I think there are a couple of more problems in the truncate APIs,
> > basically, if the curFile and curOffset are already smaller than the
> > truncate location the truncate should not change that.  So the
> > truncate should only change the curFile and curOffset if it is
> > truncating the part of the file where the curFile or curOffset is
> > pointing.
> >
>
> Right, I think this can happen if one has changed those by BufFileSeek
> before doing truncate. We should fix that case as well.

Right.

> >  I will work on those along with your other comments and
> > submit the updated patch.

I have fixed this in the attached patch along with your other
comments.  I have also attached a contrib module that is just used for
testing the truncate API.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Right, I think this can happen if one has changed those by BufFileSeek
> > before doing truncate. We should fix that case as well.
>
> Right.
>
> > >  I will work on those along with your other comments and
> > > submit the updated patch.
>
> I have fixed this in the attached patch along with your other
> comments.  I have also attached a contrib module that is just used for
> testing the truncate API.
>

Few comments:
==============
+void
+BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
{
..
+ if ((i != fileno || offset == 0) && i != 0)
+ {
+ SharedSegmentName(segment_name, file->name, i);
+ FileClose(file->files[i]);
+ if (!SharedFileSetDelete(file->fileset, segment_name, true))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not delete shared fileset \"%s\": %m",
+ segment_name)));
+ numFiles--;
+ newOffset = MAX_PHYSICAL_FILESIZE;
+
+ if (i == fileno)
+ newFile--;
+ }

Here, shouldn't it be i <= fileno? Because we need to move back the
curFile up to newFile whenever curFile is greater than newFile

2.
+ /*
+ * If the new location is smaller then the current location in file then
+ * we need to set the curFile and the curOffset to the new values and also
+ * reset the pos and nbytes.  Otherwise nothing to do.
+ */
+ else if ((newFile < file->curFile) ||
+ newOffset < file->curOffset + file->pos)
+ {
+ file->curFile = newFile;
+ file->curOffset = newOffset;
+ file->pos = 0;
+ file->nbytes = 0;
+ }

Shouldn't there be && instead of || because if newFile is greater than
curFile then there is no meaning to update it?

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Right, I think this can happen if one has changed those by BufFileSeek
> > > before doing truncate. We should fix that case as well.
> >
> > Right.
> >
> > > >  I will work on those along with your other comments and
> > > > submit the updated patch.
> >
> > I have fixed this in the attached patch along with your other
> > comments.  I have also attached a contrib module that is just used for
> > testing the truncate API.
> >
>
> Few comments:
> ==============
> +void
> +BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
> {
> ..
> + if ((i != fileno || offset == 0) && i != 0)
> + {
> + SharedSegmentName(segment_name, file->name, i);
> + FileClose(file->files[i]);
> + if (!SharedFileSetDelete(file->fileset, segment_name, true))
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not delete shared fileset \"%s\": %m",
> + segment_name)));
> + numFiles--;
> + newOffset = MAX_PHYSICAL_FILESIZE;
> +
> + if (i == fileno)
> + newFile--;
> + }
>
> Here, shouldn't it be i <= fileno? Because we need to move back the
> curFile up to newFile whenever curFile is greater than newFile
>
> 2.
> + /*
> + * If the new location is smaller then the current location in file then
> + * we need to set the curFile and the curOffset to the new values and also
> + * reset the pos and nbytes.  Otherwise nothing to do.
> + */
> + else if ((newFile < file->curFile) ||
> + newOffset < file->curOffset + file->pos)
> + {
> + file->curFile = newFile;
> + file->curOffset = newOffset;
> + file->pos = 0;
> + file->nbytes = 0;
> + }
>
> Shouldn't there be && instead of || because if newFile is greater than
> curFile then there is no meaning to update it?
>

Wait, actually, it is not clear to me which case the second condition
(newOffset < file->curOffset + file->pos) is trying to cover, so I
can't recommend anything for this. Can you please explain why you have
added the second condition to the above check?


-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Right, I think this can happen if one has changed those by BufFileSeek
> > > before doing truncate. We should fix that case as well.
> >
> > Right.
> >
> > > >  I will work on those along with your other comments and
> > > > submit the updated patch.
> >
> > I have fixed this in the attached patch along with your other
> > comments.  I have also attached a contrib module that is just used for
> > testing the truncate API.
> >
>
> Few comments:
> ==============
> +void
> +BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
> {
> ..
> + if ((i != fileno || offset == 0) && i != 0)
> + {
> + SharedSegmentName(segment_name, file->name, i);
> + FileClose(file->files[i]);
> + if (!SharedFileSetDelete(file->fileset, segment_name, true))
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not delete shared fileset \"%s\": %m",
> + segment_name)));
> + numFiles--;
> + newOffset = MAX_PHYSICAL_FILESIZE;
> +
> + if (i == fileno)
> + newFile--;
> + }
>
> Here, shouldn't it be i <= fileno? Because we need to move back the
> curFile up to newFile whenever curFile is greater than newFile
>

I think now I have understood why you have added this condition, but
a comment along the lines of "This is required to indicate that we
have removed the given fileno" would be better for future readers.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Right, I think this can happen if one has changed those by BufFileSeek
> > > before doing truncate. We should fix that case as well.
> >
> > Right.
> >
> > > >  I will work on those along with your other comments and
> > > > submit the updated patch.
> >
> > I have fixed this in the attached patch along with your other
> > comments.  I have also attached a contrib module that is just used for
> > testing the truncate API.
> >
>
> Few comments:
> ==============
> +void
> +BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
> {
> ..
> + if ((i != fileno || offset == 0) && i != 0)
> + {
> + SharedSegmentName(segment_name, file->name, i);
> + FileClose(file->files[i]);
> + if (!SharedFileSetDelete(file->fileset, segment_name, true))
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not delete shared fileset \"%s\": %m",
> + segment_name)));
> + numFiles--;
> + newOffset = MAX_PHYSICAL_FILESIZE;
> +
> + if (i == fileno)
> + newFile--;
> + }
>
> Here, shouldn't it be i <= fileno? Because we need to move back the
> curFile up to newFile whenever curFile is greater than newFile

+/* Loop over all the files upto the fileno which we want to truncate. */
+for (i = file->numFiles - 1; i >= fileno; i--)

Because the above loop already stops at fileno, I feel there is no
point in that check or in any assert.

> 2.
> + /*
> + * If the new location is smaller then the current location in file then
> + * we need to set the curFile and the curOffset to the new values and also
> + * reset the pos and nbytes.  Otherwise nothing to do.
> + */
> + else if ((newFile < file->curFile) ||
> + newOffset < file->curOffset + file->pos)
> + {
> + file->curFile = newFile;
> + file->curOffset = newOffset;
> + file->pos = 0;
> + file->nbytes = 0;
> + }
>
> Shouldn't there be && instead of || because if newFile is greater than
> curFile then there is no meaning to update it?

I think this condition is wrong; it should be:

else if ((newFile < file->curFile) ||
         ((newFile == file->curFile) &&
          (newOffset < file->curOffset + file->pos)))

Basically, either the new file is smaller, or, if it is the same file,
then the new offset should be smaller.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Fri, Aug 21, 2020 at 10:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Aug 20, 2020 at 5:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Thu, Aug 20, 2020 at 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > >
> > > > Right, I think this can happen if one has changed those by BufFileSeek
> > > > before doing truncate. We should fix that case as well.
> > >
> > > Right.
> > >
> > > > >  I will work on those along with your other comments and
> > > > > submit the updated patch.
> > >
> > > I have fixed this in the attached patch along with your other
> > > comments.  I have also attached a contrib module that is just used for
> > > testing the truncate API.
> > >
> >
> > Few comments:
> > ==============
> > +void
> > +BufFileTruncateShared(BufFile *file, int fileno, off_t offset)
> > {
> > ..
> > + if ((i != fileno || offset == 0) && i != 0)
> > + {
> > + SharedSegmentName(segment_name, file->name, i);
> > + FileClose(file->files[i]);
> > + if (!SharedFileSetDelete(file->fileset, segment_name, true))
> > + ereport(ERROR,
> > + (errcode_for_file_access(),
> > + errmsg("could not delete shared fileset \"%s\": %m",
> > + segment_name)));
> > + numFiles--;
> > + newOffset = MAX_PHYSICAL_FILESIZE;
> > +
> > + if (i == fileno)
> > + newFile--;
> > + }
> >
> > Here, shouldn't it be i <= fileno? Because we need to move back the
> > curFile up to newFile whenever curFile is greater than newFile
> >
>
> I think now I have understood why you have added this condition but
> probably a comment on the lines "This is required to indicate that we
> have removed the given fileno" would be better for future readers.

Okay.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Fri, Aug 21, 2020 at 10:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > 2.
> > + /*
> > + * If the new location is smaller then the current location in file then
> > + * we need to set the curFile and the curOffset to the new values and also
> > + * reset the pos and nbytes.  Otherwise nothing to do.
> > + */
> > + else if ((newFile < file->curFile) ||
> > + newOffset < file->curOffset + file->pos)
> > + {
> > + file->curFile = newFile;
> > + file->curOffset = newOffset;
> > + file->pos = 0;
> > + file->nbytes = 0;
> > + }
> >
> > Shouldn't there be && instead of || because if newFile is greater than
> > curFile then there is no meaning to update it?
>
> I think this condition is wrong it should be,
>
> else if ((newFile < file->curFile) || ((newFile == file->curFile) &&
> (newOffset < file->curOffset + file->pos)
>
> Basically, either new file is smaller otherwise if it is the same
> then-new offset should be smaller.
>

I think we don't need to use file->pos for that, as that is relevant
only for the current buffer; otherwise, such a condition should
suffice. However, I was not happy with the way the code and
conditions were arranged in BufFileTruncateShared, so I have
re-arranged them and changed quite a few comments in that API. Apart
from that, I have updated the docs and run pgindent on the first
patch. Do let me know if you have any more comments on the first
patch.
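
For reference, the position adjustment now looks roughly like the
following (a sketch only; the attached patch is authoritative):

/*
 * After physically truncating the segment files, pull the current
 * read/write position back if it now points past the truncation point,
 * and invalidate the buffer contents.
 */
if (file->curFile > newFile ||
    (file->curFile == newFile && file->curOffset > newOffset))
{
    file->curFile = newFile;
    file->curOffset = newOffset;
    file->pos = 0;
    file->nbytes = 0;
}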

-- 
With Regards,
Amit Kapila.

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Fri, Aug 21, 2020 at 3:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Aug 21, 2020 at 10:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, Aug 21, 2020 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > 2.
> > > + /*
> > > + * If the new location is smaller then the current location in file then
> > > + * we need to set the curFile and the curOffset to the new values and also
> > > + * reset the pos and nbytes.  Otherwise nothing to do.
> > > + */
> > > + else if ((newFile < file->curFile) ||
> > > + newOffset < file->curOffset + file->pos)
> > > + {
> > > + file->curFile = newFile;
> > > + file->curOffset = newOffset;
> > > + file->pos = 0;
> > > + file->nbytes = 0;
> > > + }
> > >
> > > Shouldn't there be && instead of || because if newFile is greater than
> > > curFile then there is no meaning to update it?
> >
> > I think this condition is wrong it should be,
> >
> > else if ((newFile < file->curFile) || ((newFile == file->curFile) &&
> > (newOffset < file->curOffset + file->pos)
> >
> > Basically, either new file is smaller otherwise if it is the same
> > then-new offset should be smaller.
> >
>
> I think we don't need to use file->pos for that as that is required
> only for the current buffer, otherwise, such a condition should
> suffice the need. However, I was not happy with the way code and
> conditions were arranged in BufFileTruncateShared, so I have
> re-arranged them and change quite a few comments in that API. Apart
> from that I have updated the docs and ran pgindent for the first
> patch. Do let me know if you have any more comments on the first
> patch?

I have reviewed and tested the patch and the changes look fine to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Fri, Aug 21, 2020 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have reviewed and tested the patch and the changes look fine to me.
>

Thanks, I will push the next patch early next week (by Tuesday) unless
you or someone else has any more comments on it. The summary of the
patch (v52-0001-Extend-the-BufFile-interface, attached with my
previous email) I am planning to push is: "It extends the BufFile
interface to support temporary files that can be used by a single
backend when the corresponding files need to survive across the
transaction and need to be opened and closed multiple times. Such
files need to be created as a member of a SharedFileSet. We have
implemented the interface for BufFileTruncate to allow files to be
truncated up to a particular offset and extended the BufFileSeek API
to support the SEEK_END case. We have also added an option to provide
a mode while opening the shared BufFiles instead of always opening in
read-only mode. These enhancements in the BufFile interface are
required for the upcoming patch to allow the replication apply worker
to properly handle streamed in-progress transactions."
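
To make the intended usage concrete, here is a rough sketch of how
these pieces fit together for a backend-local file (xid, change,
fileno and offset stand in for caller state; details may differ
slightly from the final commit):

SharedFileSet fileset;
BufFile    *file;
char        name[MAXPGPATH];

/* backend-local fileset: no DSM segment involved */
SharedFileSetInit(&fileset, NULL);

/* create, write and close; the file survives past this transaction */
snprintf(name, sizeof(name), "xid-%u-changes", xid);
file = BufFileCreateShared(&fileset, name);
BufFileWrite(file, &change, sizeof(change));
BufFileClose(file);

/* later (possibly in another transaction): reopen for writing */
file = BufFileOpenShared(&fileset, name, O_RDWR);

/* either append at the end ... */
BufFileSeek(file, 0, 0, SEEK_END);

/* ... or throw away everything after a previously remembered point */
BufFileTruncateShared(file, fileno, offset);
BufFileClose(file);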

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Sat, Aug 22, 2020 at 8:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Aug 21, 2020 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have reviewed and tested the patch and the changes look fine to me.
> >
>
> Thanks, I will push the next patch early next week (by Tuesday) unless
> you or someone else has any more comments on it. The summary of the
> patch (v52-0001-Extend-the-BufFile-interface, attached with my
> previous email) I am planning to push is: "It extends the BufFile
> interface to support temporary files that can be used by the single
> backend when the corresponding files need to be survived across the
> transaction and need to be opened and closed multiple times. Such
> files need to be created as a member of a SharedFileSet. We have
> implemented the interface for BufFileTruncate to allow files to be
> truncated up to a particular offset and extended the BufFileSeek API
> to support SEEK_END case. We have also added an option to provide a
> mode while opening the shared BufFiles instead of always opening in
> read-only mode. These enhancements in BufFile interface are required
> for the upcoming patch to allow the replication apply worker, to
> properly handle streamed in-progress transactions."

While reviewing 0002, I realized that instead of using an individual
shared fileset for each transaction, we can use just one common shared
fileset.  We can create an individual buffile per transaction under
that one shared fileset, and whenever a transaction commits/aborts we
can just delete its buffile while the shared fileset stays.

I have attached a POC patch for this idea (see the sketch below), and
if we agree with this approach then I will prepare a final patch in a
couple of days.
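
Roughly, the idea is (a sketch with made-up names; the POC patch is
authoritative):

/* one fileset for the whole apply worker, set up only once */
static SharedFileSet *worker_fileset = NULL;

/* when a streamed transaction first needs to be serialized */
char        path[MAXPGPATH];

snprintf(path, sizeof(path), "%u-changes", xid);
stream_fd = BufFileCreateShared(worker_fileset, path);

/*
 * On commit/abort of that transaction, drop only its buffile; the
 * shared fileset itself stays around for later transactions.
 */
BufFileDeleteShared(worker_fileset, path);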

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Aug 24, 2020 at 9:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Aug 22, 2020 at 8:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Aug 21, 2020 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > I have reviewed and tested the patch and the changes look fine to me.
> > >
> >
> > Thanks, I will push the next patch early next week (by Tuesday) unless
> > you or someone else has any more comments on it. The summary of the
> > patch (v52-0001-Extend-the-BufFile-interface, attached with my
> > previous email) I am planning to push is: "It extends the BufFile
> > interface to support temporary files that can be used by the single
> > backend when the corresponding files need to be survived across the
> > transaction and need to be opened and closed multiple times. Such
> > files need to be created as a member of a SharedFileSet. We have
> > implemented the interface for BufFileTruncate to allow files to be
> > truncated up to a particular offset and extended the BufFileSeek API
> > to support SEEK_END case. We have also added an option to provide a
> > mode while opening the shared BufFiles instead of always opening in
> > read-only mode. These enhancements in BufFile interface are required
> > for the upcoming patch to allow the replication apply worker, to
> > properly handle streamed in-progress transactions."
>
> While reviewing 0002, I realized that instead of using individual
> shared fileset for each transaction, we can use just one common shared
> file set.  We can create individual buffile under one shared fileset
> and whenever a transaction commits/aborts we can just delete its
> buffile and the shared fileset can stay.
>

I think the existing design is superior as it allows the flexibility
to create transaction files in different temp_tablespaces, which is
quite important to consider given that the files will be created only
for large transactions. Once we fix the sharedfileset for a worker,
all the files will be created in the temp_tablespaces chosen at the
time the apply worker first creates it, even if the setting is changed
at some later point of time (the user can change its value and then
reload the config, which I think will affect the worker settings as
well). This all happens because we set the tablespaces at the time of
SharedFileSetInit.

The other, relatively smaller, thing which I don't like is that we
would always need to create a buffile for the subxacts even though we
may not need it. We might be able to find some solution for this, but
I guess the previous point is what bothers me more.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, Aug 25, 2020 at 9:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Aug 24, 2020 at 9:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Sat, Aug 22, 2020 at 8:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Aug 21, 2020 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > I have reviewed and tested the patch and the changes look fine to me.
> > > >
> > >
> > > Thanks, I will push the next patch early next week (by Tuesday) unless
> > > you or someone else has any more comments on it. The summary of the
> > > patch (v52-0001-Extend-the-BufFile-interface, attached with my
> > > previous email) I am planning to push is: "It extends the BufFile
> > > interface to support temporary files that can be used by the single
> > > backend when the corresponding files need to be survived across the
> > > transaction and need to be opened and closed multiple times. Such
> > > files need to be created as a member of a SharedFileSet. We have
> > > implemented the interface for BufFileTruncate to allow files to be
> > > truncated up to a particular offset and extended the BufFileSeek API
> > > to support SEEK_END case. We have also added an option to provide a
> > > mode while opening the shared BufFiles instead of always opening in
> > > read-only mode. These enhancements in BufFile interface are required
> > > for the upcoming patch to allow the replication apply worker, to
> > > properly handle streamed in-progress transactions."
> >
> > While reviewing 0002, I realized that instead of using individual
> > shared fileset for each transaction, we can use just one common shared
> > file set.  We can create individual buffile under one shared fileset
> > and whenever a transaction commits/aborts we can just delete its
> > buffile and the shared fileset can stay.
> >
>
> I think the existing design is superior as it allows the flexibility
> to create transaction files in different temp_tablespaces which is
> quite important to consider as we know the files will be created only
> for large transactions. Once we fix the sharedfileset for a worker all
> the files will be created in the temp_tablespaces chosen for the first
> time apply worker creates it even if it got changed at some later
> point of time (user can change its value and then do reload config
> which I think will impact the worker settings as well). This all can
> happen because we set the tablespaces at the time of
> SharedFileSetInit.

Yeah, I agree with this point,  that if we use the single shared
fileset then it will always use the same tablespace for all the
streaming transactions.  And, we might get the benefit of concurrent
I/O if we use different tablespaces as we are not immediately flushing
the files to the disk.

> The other relatively smaller thing which I don't like is that we
> always need to create a buffile for subxact even though we don't need
> it. We might be able to find some solution for this but I guess the
> previous point is what bothers me more.

Yeah, if we go this way we might need to find some solution to this.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, Aug 25, 2020 at 10:41 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Aug 25, 2020 at 9:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > I think the existing design is superior as it allows the flexibility
> > to create transaction files in different temp_tablespaces which is
> > quite important to consider as we know the files will be created only
> > for large transactions. Once we fix the sharedfileset for a worker all
> > the files will be created in the temp_tablespaces chosen for the first
> > time apply worker creates it even if it got changed at some later
> > point of time (user can change its value and then do reload config
> > which I think will impact the worker settings as well). This all can
> > happen because we set the tablespaces at the time of
> > SharedFileSetInit.
>
> Yeah, I agree with this point,  that if we use the single shared
> fileset then it will always use the same tablespace for all the
> streaming transactions.  And, we might get the benefit of concurrent
> I/O if we use different tablespaces as we are not immediately flushing
> the files to the disk.
>

Okay, so let's retain the original approach then. I have made a few
cosmetic modifications in the first two patches, which include updating
docs and comments, slightly modifying the commit message, and changing
the code to match the nearby code. One change on which you might have a
different opinion is below:

+ case WAIT_EVENT_LOGICAL_CHANGES_READ:
+ event_name = "ReorderLogicalChangesRead";
+ break;
+ case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
+ event_name = "ReorderLogicalChangesWrite";
+ break;
+ case WAIT_EVENT_LOGICAL_SUBXACT_READ:
+ event_name = "ReorderLogicalSubxactRead";
+ break;
+ case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
+ event_name = "ReorderLogicalSubxactWrite";
+ break;

Why do we want these event names to start with Reorder*? I think these
are used on the subscriber side, so there is no need to use the word
Reorder, and I have removed it in the attached patch. I am planning
to push the first patch (v53-0001-Extend-the-BufFile-interface) in
this series tomorrow unless you have any comments on the same.
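
Concretely, with the Reorder prefix dropped, the arms above would read
along these lines (sketch; the attached patch is authoritative):

case WAIT_EVENT_LOGICAL_CHANGES_READ:
    event_name = "LogicalChangesRead";
    break;
case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
    event_name = "LogicalChangesWrite";
    break;
case WAIT_EVENT_LOGICAL_SUBXACT_READ:
    event_name = "LogicalSubxactRead";
    break;
case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
    event_name = "LogicalSubxactWrite";
    break;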

-- 
With Regards,
Amit Kapila.

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, Aug 25, 2020 at 6:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Aug 25, 2020 at 10:41 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Aug 25, 2020 at 9:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > I think the existing design is superior as it allows the flexibility
> > > to create transaction files in different temp_tablespaces which is
> > > quite important to consider as we know the files will be created only
> > > for large transactions. Once we fix the sharedfileset for a worker all
> > > the files will be created in the temp_tablespaces chosen for the first
> > > time apply worker creates it even if it got changed at some later
> > > point of time (user can change its value and then do reload config
> > > which I think will impact the worker settings as well). This all can
> > > happen because we set the tablespaces at the time of
> > > SharedFileSetInit.
> >
> > Yeah, I agree with this point,  that if we use the single shared
> > fileset then it will always use the same tablespace for all the
> > streaming transactions.  And, we might get the benefit of concurrent
> > I/O if we use different tablespaces as we are not immediately flushing
> > the files to the disk.
> >
>
> Okay, so let's retain the original approach then. I have made a few
> cosmetic modifications in the first two patches which include updating
> docs, comments, slightly modify the commit message, and change the
> code to match the nearby code. One change which you might have a
> different opinion is below:
>
> + case WAIT_EVENT_LOGICAL_CHANGES_READ:
> + event_name = "ReorderLogicalChangesRead";
> + break;
> + case WAIT_EVENT_LOGICAL_CHANGES_WRITE:
> + event_name = "ReorderLogicalChangesWrite";
> + break;
> + case WAIT_EVENT_LOGICAL_SUBXACT_READ:
> + event_name = "ReorderLogicalSubxactRead";
> + break;
> + case WAIT_EVENT_LOGICAL_SUBXACT_WRITE:
> + event_name = "ReorderLogicalSubxactWrite";
> + break;
>
> Why do we want to name these events starting with name as Reorder*? I
> think these are used in subscriber-side, so no need to use the word
> Reorder, so I have removed it from the attached patch. I am planning
> to push the first patch (v53-0001-Extend-the-BufFile-interface) in
> this series tomorrow unless you have any comments on the same.

Your changes in 0001 and 0002 look fine to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Jeff Janes
Дата:

On Tue, Aug 25, 2020 at 8:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
 
> I am planning
> to push the first patch (v53-0001-Extend-the-BufFile-interface) in
> this series tomorrow unless you have any comments on the same.


I'm getting compiler warnings now, src/backend/storage/file/sharedfileset.c line 288 needs to be:

bool        found PG_USED_FOR_ASSERTS_ONLY = false;

Cheers,

Jeff

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Wed, Aug 26, 2020 at 11:22 PM Jeff Janes <jeff.janes@gmail.com> wrote:
>
>
> On Tue, Aug 25, 2020 at 8:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>>
>>  I am planning
>> to push the first patch (v53-0001-Extend-the-BufFile-interface) in
>> this series tomorrow unless you have any comments on the same.
>
>
>
> I'm getting compiler warnings now, src/backend/storage/file/sharedfileset.c line 288 needs to be:
>
> bool        found PG_USED_FOR_ASSERTS_ONLY = false;
>

Thanks for the report. Tom Lane has already fixed this [1].

[1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e942af7b8261cd8070d0eeaf518dbc1a664859fd

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Thu, Aug 27, 2020 at 11:16 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Aug 26, 2020 at 11:22 PM Jeff Janes <jeff.janes@gmail.com> wrote:
> >
> >
> > On Tue, Aug 25, 2020 at 8:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >>
> >>  I am planning
> >> to push the first patch (v53-0001-Extend-the-BufFile-interface) in
> >> this series tomorrow unless you have any comments on the same.
> >
> >
> >
> > I'm getting compiler warnings now, src/backend/storage/file/sharedfileset.c line 288 needs to be:
> >
> > bool        found PG_USED_FOR_ASSERTS_ONLY = false;
> >
>
> Thanks for the report. Tom Lane has already fixed this [1].
>
> [1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e942af7b8261cd8070d0eeaf518dbc1a664859fd

As discussed, I have added another test case covering the out-of-order
subtransaction rollback scenario.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Neha Sharma
Дата:
Hi,

I have done a code coverage analysis on the latest patches (v53), and below is the report.
The files where the coverage changed across the three builds are the ones to note.

OS: Ubuntu 18.04
Patch applied on commit : 77c7267c37f7fa8e5e48abda4798afdbecb2b95a

Coverage is shown as %Line / %Function for three builds:
  (A) without the logical decoding patch
  (B) with the v53 (2,3,4,5) patches applied
  (C) with the v53 patches applied, excluding 0003

File Name | (A) %Line / %Function | (B) %Line / %Function | (C) %Line / %Function
src/backend/access/transam/xact.c | 86.2 / 92.9 | 86.2 / 92.9 | 86.2 / 92.9
src/backend/access/transam/xloginsert.c | 90.2 / 94.1 | 90.2 / 94.1 | 90.2 / 94.1
src/backend/access/transam/xlogreader.c | 73.3 / 93.3 | 73.8 / 93.3 | 73.8 / 93.3
src/backend/replication/logical/decode.c | 93.4 / 100 | 93.4 / 100 | 93.4 / 100
src/backend/access/rmgrdesc/xactdesc.c | 54.4 / 63.6 | 54.4 / 63.6 | 54.4 / 63.6
src/backend/replication/logical/reorderbuffer.c | 93.4 / 96.7 | 93.4 / 96.7 | 93.4 / 96.7
src/backend/utils/cache/inval.c | 98.1 / 100 | 98.1 / 100 | 98.1 / 100
contrib/test_decoding/test_decoding.c | 86.8 / 95.2 | 86.8 / 95.2 | 86.8 / 95.2
src/backend/replication/logical/logical.c | 90.9 / 93.5 | 90.9 / 93.5 | 91.8 / 93.5
src/backend/access/heap/heapam.c | 86.1 / 94.5 | 86.1 / 94.5 | 86.1 / 94.5
src/backend/access/index/genam.c | 90.7 / 91.7 | 91.2 / 91.7 | 91.2 / 91.7
src/backend/access/table/tableam.c | 90.6 / 100 | 90.6 / 100 | 90.6 / 100
src/backend/utils/time/snapmgr.c | 81.1 / 98.1 | 80.2 / 98.1 | 81.1 / 98.1
src/include/access/tableam.h | 92.5 / 100 | 92.5 / 100 | 92.5 / 100
src/backend/access/heap/heapam_visibility.c | 77.8 / 100 | 77.8 / 100 | 77.8 / 100
src/backend/replication/walsender.c | 90.5 / 97.8 | 90.5 / 97.8 | 90.9 / 100
src/backend/catalog/pg_subscription.c | 96 / 100 | 96 / 100 | 96 / 100
src/backend/commands/subscriptioncmds.c | 93.2 / 90 | 92.7 / 90 | 92.7 / 90
src/backend/postmaster/pgstat.c | 64.2 / 85.1 | 63.9 / 85.1 | 64.6 / 86.1
src/backend/replication/libpqwalreceiver/libpqwalreceiver.c | 82.4 / 95 | 82.5 / 95 | 83.6 / 95
src/backend/replication/logical/proto.c | 93.5 / 91.3 | 93.7 / 93.3 | 93.7 / 93.3
src/backend/replication/logical/worker.c | 91.6 / 96 | 91.5 / 97.4 | 91.9 / 97.4
src/backend/replication/pgoutput/pgoutput.c | 81.9 / 100 | 85.5 / 100 | 86.2 / 100
src/backend/replication/slotfuncs.c | 93 / 93.8 | 93 / 93.8 | 93 / 93.8
src/include/pgstat.h | 100 / - | 100 / - | 100 / -
src/backend/replication/logical/logicalfuncs.c | 87.1 / 90 | 87.1 / 90 | 87.1 / 90
src/backend/storage/file/buffile.c | 68.3 / 85 | 69.6 / 85 | 69.6 / 85
src/backend/storage/file/fd.c | 81.1 / 93 | 81.1 / 93 | 81.1 / 93
src/backend/storage/file/sharedfileset.c | 77.7 / 90.9 | 93.2 / 100 | 93.2 / 100
src/backend/utils/sort/logtape.c | 94.4 / 100 | 94.4 / 100 | 94.4 / 100
src/backend/utils/sort/sharedtuplestore.c | 90.1 / 90.9 | 90.1 / 90.9 | 90.1 / 90.9

Thanks.
--
Regards,
Neha Sharma


On Thu, Aug 27, 2020 at 11:16 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 26, 2020 at 11:22 PM Jeff Janes <jeff.janes@gmail.com> wrote:
>
>
> On Tue, Aug 25, 2020 at 8:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>>
>>  I am planning
>> to push the first patch (v53-0001-Extend-the-BufFile-interface) in
>> this series tomorrow unless you have any comments on the same.
>
>
>
> I'm getting compiler warnings now, src/backend/storage/file/sharedfileset.c line 288 needs to be:
>
> bool        found PG_USED_FOR_ASSERTS_ONLY = false;
>

Thanks for the report. Tom Lane has already fixed this [1].

[1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e942af7b8261cd8070d0eeaf518dbc1a664859fd

--
With Regards,
Amit Kapila.


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Fri, Aug 28, 2020 at 2:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> As discussed, I have added a another test case for covering the out of
> order subtransaction rollback scenario.
>

+# large (streamed) transaction with out of order subtransaction ROLLBACKs
+$node_publisher->safe_psql('postgres', q{

How about writing a comment as: "large (streamed) transaction with
subscriber receiving out of order subtransaction ROLLBACKs"?

I have reviewed and modified a number of things in the attached patch:
1. In apply_handle_origin, improved the check for streamed xacts.
2. In apply_handle_stream_commit() while applying changes in the loop,
added CHECK_FOR_INTERRUPTS.
3. In DEBUG messages, print the path with double-quotes as we are
doing in all other places.
4.
+ /*
+ * Exit if streaming option is changed. The launcher will start new
+ * worker.
+ */
+ if (newsub->stream != MySubscription->stream)
+ {
+ ereport(LOG,
+ (errmsg("logical replication apply worker for subscription \"%s\" will "
+ "restart because subscription's streaming option were changed",
+ MySubscription->name)));
+
+ proc_exit(0);
+ }
+
We don't need a separate check like this. I have merged this into one
of the existing checks.
5.
subxact_info_write()
{
..
+ if (subxact_data.nsubxacts == 0)
+ {
+ if (ent->subxact_fileset)
+ {
+ cleanup_subxact_info();
+ BufFileDeleteShared(ent->subxact_fileset, path);
+ pfree(ent->subxact_fileset);
+ ent->subxact_fileset = NULL;
+ }

I don't think it is right to use the BufFileDeleteShared interface here,
because it won't perform SharedFileSetUnregister. That means if the
server exits after the above code is executed, it will crash in
SharedFileSetDeleteOnProcExit, which will try to access the already
deleted fileset entry. Fixed this by calling SharedFileSetDeleteAll()
instead. Another related problem is that SharedFileSetDeleteOnProcExit
tries to delete the list element while traversing the list with the
'foreach' construct, which makes the behavior of the list traversal
unpredictable (see the sketch after these numbered comments). I have
fixed this in a separate patch, v54-0001-Fix-the-SharedFileSetUnregister-API;
if you are fine with this, I would like to commit it as it fixes a
problem in the existing commit 808e13b282.
6. Function stream_cleanup_files() contains a missing_ok argument
which is not used, so I removed it.
7. In pgoutput.c, change the ordering of functions to make them
consistent with their declaration.
8.
typedef struct RelationSyncEntry
 {
  Oid relid; /* relation oid */
+ TransactionId xid; /* transaction that created the record */

Removed above parameter as this doesn't seem to be required as per the
new design in the patch.

Apart from the above, I have added/changed quite a few comments and
made a few other cosmetic changes. Kindly review and let me know what
you think about the changes.
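
To be concrete about the SharedFileSetDeleteOnProcExit issue in comment
5 above, the shape of the fix I have in mind is roughly the following
sketch (v54-0001 is authoritative):

static void
SharedFileSetDeleteOnProcExit(int status, Datum arg)
{
    /*
     * SharedFileSetDeleteAll() ends up unregistering the fileset, which
     * removes it from filesetlist, so instead of walking the list with
     * foreach, keep consuming the head until the list is empty.
     */
    while (filesetlist != NIL)
    {
        SharedFileSet *fileset = (SharedFileSet *) linitial(filesetlist);

        SharedFileSetDeleteAll(fileset);
    }
}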

One more comment for which I haven't done anything yet.
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+ MemoryContext oldctx;
+
+ oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+ entry->streamed_txns = lappend_int(entry->streamed_txns, xid);

Is it a good idea to append an xid with lappend_int? Won't we need
something equivalent for uint32? If so, I think we have a couple of
options: (a) use the lcons method and append a pointer to the xid (I
think we would need to allocate memory for the xid if we want to use
this idea), or (b) use an array instead. What do you think?

-- 
With Regards,
Amit Kapila.

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Sat, Aug 29, 2020 at 5:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Aug 28, 2020 at 2:18 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > As discussed, I have added a another test case for covering the out of
> > order subtransaction rollback scenario.
> >
>
> +# large (streamed) transaction with out of order subtransaction ROLLBACKs
> +$node_publisher->safe_psql('postgres', q{
>
> How about writing a comment as: "large (streamed) transaction with
> subscriber receiving out of order subtransaction ROLLBACKs"?

I have fixed this and merged it with 0002.

> I have reviewed and modified the number of things in the attached patch:
> 1. In apply_handle_origin, improved the check streamed xacts.
> 2. In apply_handle_stream_commit() while applying changes in the loop,
> added CHECK_FOR_INTERRUPTS.
> 3. In DEBUG messages, print the path with double-quotes as we are
> doing in all other places.
> 4.
> + /*
> + * Exit if streaming option is changed. The launcher will start new
> + * worker.
> + */
> + if (newsub->stream != MySubscription->stream)
> + {
> + ereport(LOG,
> + (errmsg("logical replication apply worker for subscription \"%s\" will "
> + "restart because subscription's streaming option were changed",
> + MySubscription->name)));
> +
> + proc_exit(0);
> + }
> +
> We don't need a separate check like this. I have merged this into one
> of the existing checks.
> 5.
> subxact_info_write()
> {
> ..
> + if (subxact_data.nsubxacts == 0)
> + {
> + if (ent->subxact_fileset)
> + {
> + cleanup_subxact_info();
> + BufFileDeleteShared(ent->subxact_fileset, path);
> + pfree(ent->subxact_fileset);
> + ent->subxact_fileset = NULL;
> + }
>
> I don't think it is right to use BufFileDeleteShared interface here
> because it won't perform SharedFileSetUnregister which means if after
> above code execution is the server exits it will crash in
> SharedFileSetDeleteOnProcExit which will try to access already deleted
> fileset entry. Fixed this by calling SharedFileSetDeleteAll() instead.
> The another related problem is that in function
> SharedFileSetDeleteOnProcExit, it tries to delete the list element
> while traversing the list with 'foreach' construct which makes the
> behavior of list traversal unpredictable. I have fixed this in a
> separate patch v54-0001-Fix-the-SharedFileSetUnregister-API, if you
> are fine with this, I would like to commit this as this fixes a
> problem in the existing commit 808e13b282.
> 6. Function stream_cleanup_files() contains a missing_ok argument
> which is not used so removed it.
> 7. In pgoutput.c, change the ordering of functions to make them
> consistent with their declaration.
> 8.
> typedef struct RelationSyncEntry
>  {
>   Oid relid; /* relation oid */
> + TransactionId xid; /* transaction that created the record */
>
> Removed above parameter as this doesn't seem to be required as per the
> new design in the patch.
>
> Apart from above, I have added/changed quite a few comments and a few
> other cosmetic changes. Kindly review and let me know what do you
> think about the changes?

I have reviewed your changes and they look fine to me. And the bug fix
in 0001 also looks fine.

> One more comment for which I haven't done anything yet.
> +static void
> +set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
> +{
> + MemoryContext oldctx;
> +
> + oldctx = MemoryContextSwitchTo(CacheMemoryContext);
> +
> + entry->streamed_txns = lappend_int(entry->streamed_txns, xid);

> Is it a good idea to append xid with lappend_int? Won't we need
> something equivalent for uint32? If so, I think we have a couple of
> options (a) use lcons method and accordingly append the pointer to
> xid, I think we need to allocate memory for xid if we want to use this
> idea or (b) use an array instead. What do you think?

BTW, OID is internally mapped to uint32, but using lappend_oid might
not look good. So maybe we can provide a lappend_uint32 option?
Using an array is also not a bad idea, but providing a lappend_uint32
option looks more appealing to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Sun, Aug 30, 2020 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Aug 29, 2020 at 5:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > One more comment for which I haven't done anything yet.
> > +static void
> > +set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
> > +{
> > + MemoryContext oldctx;
> > +
> > + oldctx = MemoryContextSwitchTo(CacheMemoryContext);
> > +
> > + entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
>
> > Is it a good idea to append xid with lappend_int? Won't we need
> > something equivalent for uint32? If so, I think we have a couple of
> > options (a) use lcons method and accordingly append the pointer to
> > xid, I think we need to allocate memory for xid if we want to use this
> > idea or (b) use an array instead. What do you think?
>
> BTW, OID is internally mapped to uint32,  but using lappend_oid might
> not look good.  So maybe we can provide an option for lappend_uint32?
> Using an array is also not a bad idea.  Providing lappend_uint32
> option looks more appealing to me.
>

I thought about this again and I feel it might be okay to use it for
our case: after storing it in a T_IntList, we primarily fetch it for
comparison with TransactionId (uint32), so this shouldn't create any
problem. I feel we can just discuss this in a separate thread and
check the opinion of others; what do you think?
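
For example, the lookup counterpart (get_schema_sent_in_streamed_txn in
the patch) can simply cast back when comparing; a sketch, assuming
streamed_txns stays a plain integer list:

static bool
get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
{
    ListCell   *lc;

    foreach(lc, entry->streamed_txns)
    {
        /* stored via lappend_int(), so cast back to the xid type */
        if (xid == (uint32) lfirst_int(lc))
            return true;
    }

    return false;
}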

Another comment:

+cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
+{
+ HASH_SEQ_STATUS hash_seq;
+ RelationSyncEntry *entry;
+
+ Assert(RelationSyncCache != NULL);
+
+ hash_seq_init(&hash_seq, RelationSyncCache);
+ while ((entry = hash_seq_search(&hash_seq)) != NULL)
+ {
+ if (is_commit)
+ entry->schema_sent = true;

How is it correct to set 'entry->schema_sent' for all the entries in
RelationSyncCache? Consider a case where due to invalidation in an
unrelated transaction we have set the flag schema_sent for a
particular relation 'r1' as 'false' and that transaction is executed
before the current streamed transaction for which we are performing
commit and called this function. It will set the flag for unrelated
entry in this case 'r1' which doesn't seem correct to me. Or, if this
is correct, it would be a good idea to write some comments about it.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Aug 31, 2020 at 10:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Aug 30, 2020 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
>
> Another comment:
>
> +cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
> +{
> + HASH_SEQ_STATUS hash_seq;
> + RelationSyncEntry *entry;
> +
> + Assert(RelationSyncCache != NULL);
> +
> + hash_seq_init(&hash_seq, RelationSyncCache);
> + while ((entry = hash_seq_search(&hash_seq)) != NULL)
> + {
> + if (is_commit)
> + entry->schema_sent = true;
>
> How is it correct to set 'entry->schema_sent' for all the entries in
> RelationSyncCache? Consider a case where due to invalidation in an
> unrelated transaction we have set the flag schema_sent for a
> particular relation 'r1' as 'false' and that transaction is executed
> before the current streamed transaction for which we are performing
> commit and called this function. It will set the flag for unrelated
> entry in this case 'r1' which doesn't seem correct to me. Or, if this
> is correct, it would be a good idea to write some comments about it.
>

Few more comments:
1.
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr
application_name=$appname' PUBLICATION tap_pub"
+);

In most of the tests, we are using the above statement to create a
subscription. Don't we need (streaming = 'on') parameter while
creating a subscription? Is there a reason for not doing so in this
patch itself?

2.
009_stream_simple.pl
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});

How much above this data is 64kB limit? I just wanted to see that it
should not be on borderline and then due to some alignment issues the
streaming doesn't happen on some machines? Also, how such a test
ensures that the streaming has happened because the way we are
checking results, won't it be the same for the non-streaming case as
well?

3.
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*),
count(extract(epoch from c) = 987654321), count(d = 999) FROM
test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally
changed data');

Again, how this test is relevant to streaming mode?

4. I have checked that in one of the previous patches, we have a test
v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case
quite similar to what we have in
v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort.
If there is any difference that can cover more scenarios then can we
consider merging them into one test?

Apart from the above, I have made a few changes in the attached patch
which are mainly to simplify the code at one place, added/edited few
comments, some other cosmetic changes, and renamed the test case files
as the initials of their name were matching other tests in the similar
directory.

-- 
With Regards,
Amit Kapila.

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Aug 31, 2020 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>
> 2.
> 009_stream_simple.pl
> +# Insert, update and delete enough rows to exceed the 64kB limit.
> +$node_publisher->safe_psql('postgres', q{
> +BEGIN;
> +INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
> +UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
> +DELETE FROM test_tab WHERE mod(a,3) = 0;
> +COMMIT;
> +});
>
> How much above this data is 64kB limit? I just wanted to see that it
> should not be on borderline and then due to some alignment issues the
> streaming doesn't happen on some machines?
>

I think we should find similar information for other tests added by
the patch as well.

Few other comments:
===================
+sub wait_for_caught_up
+{
+ my ($node, $appname) = @_;
+
+ $node->poll_query_until('postgres',
+"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication
WHERE application_name = '$appname';"
+ ) or die "Timed ou

The patch has added this in all the test files. If it is used in so
many tests then we need to add it in some generic place
(PostgresNode.pm), but actually, I am not sure if we need it at all.
Why can't the existing wait_for_catchup in PostgresNode.pm serve the
same purpose?

2.
In system_views.sql,

-- All columns of pg_subscription except subconninfo are readable.
REVOKE ALL ON pg_subscription FROM public;
GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary,
subslotname, subpublications)
    ON pg_subscription TO public;

Here, we need to update for substream column as well.

3. Update describeSubscriptions() to show the 'substream' value in \dRs.

4. Also, let's add a few tests in subscription.sql, as we did when we
added the 'binary' option in commit 9de77b5453.

5. I think we can merge the pg_dump related changes (the last version
posted in the mail thread is v53-0005-Add-streaming-option-in-pg_dump)
into the main patch. One minor comment on the pg_dump related changes:
@@ -4358,6 +4369,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
  if (strcmp(subinfo->subbinary, "t") == 0)
  appendPQExpBuffer(query, ", binary = true");

+ if (strcmp(subinfo->substream, "f") != 0)
+ appendPQExpBuffer(query, ", streaming = on");
  if (strcmp(subinfo->subsynccommit, "off") != 0)
  appendPQExpBuffer(query, ", synchronous_commit = %s",
fmtId(subinfo->subsynccommit));

Keep one blank line between the substream and subsynccommit option code
to keep it consistent with the nearby code.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Aug 31, 2020 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Aug 31, 2020 at 10:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sun, Aug 30, 2020 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> >
> > Another comment:
> >
> > +cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
> > +{
> > + HASH_SEQ_STATUS hash_seq;
> > + RelationSyncEntry *entry;
> > +
> > + Assert(RelationSyncCache != NULL);
> > +
> > + hash_seq_init(&hash_seq, RelationSyncCache);
> > + while ((entry = hash_seq_search(&hash_seq)) != NULL)
> > + {
> > + if (is_commit)
> > + entry->schema_sent = true;
> >
> > How is it correct to set 'entry->schema_sent' for all the entries in
> > RelationSyncCache? Consider a case where due to invalidation in an
> > unrelated transaction we have set the flag schema_sent for a
> > particular relation 'r1' as 'false' and that transaction is executed
> > before the current streamed transaction for which we are performing
> > commit and called this function. It will set the flag for unrelated
> > entry in this case 'r1' which doesn't seem correct to me. Or, if this
> > is correct, it would be a good idea to write some comments about it.

Yeah, this is wrong. I have fixed this issue in the attached patch
and also added a new test for the same.
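
The shape of the fix is to touch only those entries whose streamed_txns
list actually contains the given xid, roughly as below (a sketch; the
attached patch is authoritative):

static void
cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
{
    HASH_SEQ_STATUS hash_seq;
    RelationSyncEntry *entry;
    ListCell   *lc;

    Assert(RelationSyncCache != NULL);

    hash_seq_init(&hash_seq, RelationSyncCache);
    while ((entry = hash_seq_search(&hash_seq)) != NULL)
    {
        foreach(lc, entry->streamed_txns)
        {
            if (xid == (uint32) lfirst_int(lc))
            {
                /* mark the schema as sent only if that xact committed */
                if (is_commit)
                    entry->schema_sent = true;

                /* either way, forget this xid for the relation entry */
                entry->streamed_txns =
                    foreach_delete_current(entry->streamed_txns, lc);
                break;
            }
        }
    }
}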

> Few more comments:
> 1.
> +my $appname = 'tap_sub';
> +$node_subscriber->safe_psql('postgres',
> +"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr
> application_name=$appname' PUBLICATION tap_pub"
> +);
>
> In most of the tests, we are using the above statement to create a
> subscription. Don't we need (streaming = 'on') parameter while
> creating a subscription? Is there a reason for not doing so in this
> patch itself?

I have changed this.

> 2.
> 009_stream_simple.pl
> +# Insert, update and delete enough rows to exceed the 64kB limit.
> +$node_publisher->safe_psql('postgres', q{
> +BEGIN;
> +INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
> +UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
> +DELETE FROM test_tab WHERE mod(a,3) = 0;
> +COMMIT;
> +});
>
> How much above this data is 64kB limit? I just wanted to see that it
> should not be on borderline and then due to some alignment issues the
> streaming doesn't happen on some machines? Also, how such a test
> ensures that the streaming has happened because the way we are
> checking results, won't it be the same for the non-streaming case as
> well?

Only for this case, or do you mean for all the tests?

> 3.
> +# Change the local values of the extra columns on the subscriber,
> +# update publisher, and check that subscriber retains the expected
> +# values
> +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
> 'epoch'::timestamptz + 987654321 * interval '1s'");
> +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
> +
> +wait_for_caught_up($node_publisher, $appname);
> +
> +$result =
> +  $node_subscriber->safe_psql('postgres', "SELECT count(*),
> count(extract(epoch from c) = 987654321), count(d = 999) FROM
> test_tab");
> +is($result, qq(3334|3334|3334), 'check extra columns contain locally
> changed data');
>
> Again, how this test is relevant to streaming mode?

I agree, it is not specific to the streaming.


> 4. I have checked that in one of the previous patches, we have a test
> v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case
> quite similar to what we have in
> v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort.
> If there is any difference that can cover more scenarios then can we
> consider merging them into one test?

I will have a look.

> Apart from the above, I have made a few changes in the attached patch
> which are mainly to simplify the code at one place, added/edited few
> comments, some other cosmetic changes, and renamed the test case files
> as the initials of their name were matching other tests in the similar
> directory.

The changes look fine to me except this:

+

+ /* the value must be on/off */
+ if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid streaming value")));
+
+ /* enable streaming if it's 'on' */
+ *enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);

I mean, for streaming, why do we need to handle it differently from the
other surrounding code, for example the "binary" option?
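
I would expect it to mirror the "binary" handling and just use
defGetBoolean(), something like the sketch below (streaming_given here
is a hypothetical local flag for duplicate detection):

else if (strcmp(defel->defname, "streaming") == 0)
{
    if (streaming_given)
        ereport(ERROR,
                (errcode(ERRCODE_SYNTAX_ERROR),
                 errmsg("conflicting or redundant options")));
    streaming_given = true;

    /* defGetBoolean() accepts on/off, true/false, etc. */
    *enable_streaming = defGetBoolean(defel);
}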

Apart from that, for testing 0001, I have added a new test in the
attached contrib module.



--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Neha Sharma
Дата:
Hi Amit/Dilip,

I have tested a few scenarios on top of the v56 patches, where the replication worker still had a few subtransactions in an uncommitted state and we restarted the publisher server.
No crash or data discrepancies were observed; the test scenarios verified are attached.

Data Setup:
Publication Server postgresql.conf:
wal_level = logical
max_wal_senders = 10
max_replication_slots = 15
wal_log_hints = on
hot_standby_feedback = on
wal_receiver_status_interval = 1
listen_addresses='*'
log_min_messages=debug1
wal_sender_timeout = 0
logical_decoding_work_mem=64kB

Subscription Server postgresql.conf:
wal_level = logical
max_wal_senders = 10
max_replication_slots = 15
wal_log_hints = on
hot_standby_feedback = on
wal_receiver_status_interval = 1
listen_addresses='*'
log_min_messages=debug1
wal_sender_timeout = 0
logical_decoding_work_mem=64kB
port=5433

Initial setup:
Publication Server:
create table t(a int PRIMARY KEY ,b text);
CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
create publication test_pub for table t with(PUBLISH='insert,delete,update,truncate');
alter table t replica identity FULL ;
insert into t values (generate_series(1,20),large_val()) ON CONFLICT (a) DO UPDATE SET a=EXCLUDED.a*300;

Subscription server:
 create table t(a int,b text);
 create subscription test_sub CONNECTION 'host=localhost port=5432 dbname=postgres user=edb' PUBLICATION test_pub WITH ( slot_name = test_slot_sub1,streaming=on);

Thanks.
--
Regards,
Neha Sharma


On Mon, Aug 31, 2020 at 1:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Aug 31, 2020 at 10:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Aug 30, 2020 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
>
> Another comment:
>
> +cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
> +{
> + HASH_SEQ_STATUS hash_seq;
> + RelationSyncEntry *entry;
> +
> + Assert(RelationSyncCache != NULL);
> +
> + hash_seq_init(&hash_seq, RelationSyncCache);
> + while ((entry = hash_seq_search(&hash_seq)) != NULL)
> + {
> + if (is_commit)
> + entry->schema_sent = true;
>
> How is it correct to set 'entry->schema_sent' for all the entries in
> RelationSyncCache? Consider a case where due to invalidation in an
> unrelated transaction we have set the flag schema_sent for a
> particular relation 'r1' as 'false' and that transaction is executed
> before the current streamed transaction for which we are performing
> commit and called this function. It will set the flag for unrelated
> entry in this case 'r1' which doesn't seem correct to me. Or, if this
> is correct, it would be a good idea to write some comments about it.
>

Few more comments:
1.
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr
application_name=$appname' PUBLICATION tap_pub"
+);

In most of the tests, we are using the above statement to create a
subscription. Don't we need (streaming = 'on') parameter while
creating a subscription? Is there a reason for not doing so in this
patch itself?

2.
009_stream_simple.pl
+# Insert, update and delete enough rows to exceed the 64kB limit.
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
+UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
+DELETE FROM test_tab WHERE mod(a,3) = 0;
+COMMIT;
+});

How much above this data is 64kB limit? I just wanted to see that it
should not be on borderline and then due to some alignment issues the
streaming doesn't happen on some machines? Also, how such a test
ensures that the streaming has happened because the way we are
checking results, won't it be the same for the non-streaming case as
well?

3.
+# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*),
count(extract(epoch from c) = 987654321), count(d = 999) FROM
test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally
changed data');

Again, how this test is relevant to streaming mode?

4. I have checked that in one of the previous patches, we have a test
v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case
quite similar to what we have in
v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort.
If there is any difference that can cover more scenarios then can we
consider merging them into one test?

Apart from the above, I have made a few changes in the attached patch
which are mainly to simplify the code at one place, added/edited few
comments, some other cosmetic changes, and renamed the test case files
as the initials of their name were matching other tests in the similar
directory.

--
With Regards,
Amit Kapila.
Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Aug 31, 2020 at 7:28 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Aug 31, 2020 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Aug 31, 2020 at 10:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Sun, Aug 30, 2020 at 2:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > >
> > > Another comment:
> > >
> > > +cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
> > > +{
> > > + HASH_SEQ_STATUS hash_seq;
> > > + RelationSyncEntry *entry;
> > > +
> > > + Assert(RelationSyncCache != NULL);
> > > +
> > > + hash_seq_init(&hash_seq, RelationSyncCache);
> > > + while ((entry = hash_seq_search(&hash_seq)) != NULL)
> > > + {
> > > + if (is_commit)
> > > + entry->schema_sent = true;
> > >
> > > How is it correct to set 'entry->schema_sent' for all the entries in
> > > RelationSyncCache? Consider a case where due to invalidation in an
> > > unrelated transaction we have set the flag schema_sent for a
> > > particular relation 'r1' as 'false' and that transaction is executed
> > > before the current streamed transaction for which we are performing
> > > commit and called this function. It will set the flag for unrelated
> > > entry in this case 'r1' which doesn't seem correct to me. Or, if this
> > > is correct, it would be a good idea to write some comments about it.
>
> Yeah, this is wrong,  I have fixed this issue in the attached patch
> and also added a new test for the same.
>

In functions cleanup_rel_sync_cache and
get_schema_sent_in_streamed_txn, let's cast the result of lfirst_int to
uint32 as suggested by Tom [1]. Also, let's keep the way we compare
xids consistent in both functions, i.e., if (xid == lfirst_int(lc)).
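
To make this concrete, here is a rough sketch (not the exact patch code) of
how the two functions can track per-transaction schema state with that cast
and a consistent comparison, so that only entries whose schema was sent as
part of the given streamed transaction are touched; the entry fields and
function names are the ones discussed above, and the list helpers are assumed
to be the usual pg_list.h ones:

static bool
get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
{
    ListCell   *lc;

    foreach(lc, entry->streamed_txns)
    {
        /* cast to uint32 so the comparison matches TransactionId */
        if (xid == (uint32) lfirst_int(lc))
            return true;
    }

    return false;
}

static void
cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
{
    HASH_SEQ_STATUS hash_seq;
    RelationSyncEntry *entry;
    ListCell   *lc;

    Assert(RelationSyncCache != NULL);

    hash_seq_init(&hash_seq, RelationSyncCache);
    while ((entry = hash_seq_search(&hash_seq)) != NULL)
    {
        /* only entries whose schema was sent in this streamed transaction */
        foreach(lc, entry->streamed_txns)
        {
            if (xid == (uint32) lfirst_int(lc))
            {
                if (is_commit)
                    entry->schema_sent = true;

                entry->streamed_txns =
                    foreach_delete_current(entry->streamed_txns, lc);
                break;
            }
        }
    }
}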

The behavior tested by the test case added for this is not clear,
primarily because the comments don't explain it.

+++ b/src/test/subscription/t/021_stream_schema.pl
@@ -0,0 +1,80 @@
+# Test behavior with streaming transaction exceeding logical_decoding_work_mem
...
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+ALTER TABLE test_tab ADD COLUMN c INT;
+INSERT INTO test_tab SELECT i, md5(i::text), i FROM
generate_series(3,3000) s(i);
+ALTER TABLE test_tab ADD COLUMN d INT;
+COMMIT;
+});
+
+# large (streamed) transaction with DDL, DML and ROLLBACKs
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM
generate_series(3001,3005) s(i);
+COMMIT;
+});
+wait_for_caught_up($node_publisher, $appname);

I understand how this test exercises the functionality related to the
schema_sent stuff, but neither the comments atop the file nor those
atop the test case explain it clearly.

> > Few more comments:

>
> > 2.
> > 009_stream_simple.pl
> > +# Insert, update and delete enough rows to exceed the 64kB limit.
> > +$node_publisher->safe_psql('postgres', q{
> > +BEGIN;
> > +INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
> > +UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
> > +DELETE FROM test_tab WHERE mod(a,3) = 0;
> > +COMMIT;
> > +});
> >
> > How far above the 64kB limit is this data? I just wanted to make sure
> > it is not borderline, such that due to some alignment issues the
> > streaming doesn't happen on some machines. Also, how does such a test
> > ensure that the streaming has happened? Given the way we are checking
> > results, won't they be the same for the non-streaming case as well?
>
> Only for this case, or you mean for all the tests?
>

It is better to do it for all tests and I have clarified this in my
next email sent yesterday [2] where I have raised a few more comments
as well. I hope you have not missed that email.

> > 3.
> > +# Change the local values of the extra columns on the subscriber,
> > +# update publisher, and check that subscriber retains the expected
> > +# values
> > +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
> > 'epoch'::timestamptz + 987654321 * interval '1s'");
> > +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
> > +
> > +wait_for_caught_up($node_publisher, $appname);
> > +
> > +$result =
> > +  $node_subscriber->safe_psql('postgres', "SELECT count(*),
> > count(extract(epoch from c) = 987654321), count(d = 999) FROM
> > test_tab");
> > +is($result, qq(3334|3334|3334), 'check extra columns contain locally
> > changed data');
> >
> > Again, how is this test relevant to streaming mode?
>
> I agree, it is not specific to the streaming.
>

> > Apart from the above, I have made a few changes in the attached patch,
> > mainly to simplify the code in one place; I have also added/edited a few
> > comments, made some other cosmetic changes, and renamed the test case
> > files because the initials of their names matched other tests in the
> > same directory.
>
> Changes look fine to me except this
>
> +
>
> + /* the value must be on/off */
> + if (strcmp(strVal(defel->arg), "on") && strcmp(strVal(defel->arg), "off"))
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("invalid streaming value")));
> +
> + /* enable streaming if it's 'on' */
> + *enable_streaming = (strcmp(strVal(defel->arg), "on") == 0);
>
> I mean for streaming why we need to handle differently than the other
> surrounding code for example "binary" option.
>

Hmm, I think the code changed by me is to make it look similar to the
binary option. The code you have quoted above is from the patch
version prior to what I have sent. See the code snippet after my
changes:
@@ -182,6 +222,16 @@ parse_output_parameters(List *options, uint32
*protocol_version,

  *binary = defGetBoolean(defel);
  }
+ else if (strcmp(defel->defname, "streaming") == 0)
+ {
+ if (streaming_given)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("conflicting or redundant options")));
+ streaming_given = true;
+
+ *enable_streaming = defGetBoolean(defel);
+ }

This looks exactly similar to the binary option. Can you please check
it once again and confirm back?

[1] - https://www.postgresql.org/message-id/3955127.1598880523%40sss.pgh.pa.us
[2] - https://www.postgresql.org/message-id/CAA4eK1JjrcK6bk%2Bur3J%2BkLsfz4%2BipJFN7VcRd3cXr4gG5ZWWig%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Aug 31, 2020 at 10:27 PM Neha Sharma
<neha.sharma@enterprisedb.com> wrote:
>
> Hi Amit/Dilip,
>
> I have tested a few scenarios on top of the v56 patches, where the replication worker still had a few
> subtransactions in an uncommitted state and we restarted the publisher server.
>
> No crash or data discrepancies were observed; attached are the test scenarios verified.
>

Thanks, I have pushed the fix
(https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=4ab77697f67aa5b90b032b9175b46901859da6d7).

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Sep 1, 2020 at 9:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Aug 31, 2020 at 7:28 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Aug 31, 2020 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> In functions cleanup_rel_sync_cache and
> get_schema_sent_in_streamed_txn, let's cast the result of lfirst_int to
> uint32 as suggested by Tom [1]. Also, let's keep the way we compare
> xids consistent in both functions, i.e., if (xid == lfirst_int(lc)).
>

Fixed this in the attached patch.

> The behavior tested by the test case added for this is not clear,
> primarily because the comments don't explain it.
>
> +++ b/src/test/subscription/t/021_stream_schema.pl
> @@ -0,0 +1,80 @@
> +# Test behavior with streaming transaction exceeding logical_decoding_work_mem
> ...
> +# large (streamed) transaction with DDL, DML and ROLLBACKs
> +$node_publisher->safe_psql('postgres', q{
> +BEGIN;
> +ALTER TABLE test_tab ADD COLUMN c INT;
> +INSERT INTO test_tab SELECT i, md5(i::text), i FROM
> generate_series(3,3000) s(i);
> +ALTER TABLE test_tab ADD COLUMN d INT;
> +COMMIT;
> +});
> +
> +# large (streamed) transaction with DDL, DML and ROLLBACKs
> +$node_publisher->safe_psql('postgres', q{
> +BEGIN;
> +INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM
> generate_series(3001,3005) s(i);
> +COMMIT;
> +});
> +wait_for_caught_up($node_publisher, $appname);
>
> I understand how this test exercises the functionality related to the
> schema_sent stuff, but neither the comments atop the file nor those
> atop the test case explain it clearly.
>

Added comments for this test.

> > > Few more comments:
>
> >
> > > 2.
> > > 009_stream_simple.pl
> > > +# Insert, update and delete enough rows to exceed the 64kB limit.
> > > +$node_publisher->safe_psql('postgres', q{
> > > +BEGIN;
> > > +INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
> > > +UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
> > > +DELETE FROM test_tab WHERE mod(a,3) = 0;
> > > +COMMIT;
> > > +});
> > >
> > > How far above the 64kB limit is this data? I just wanted to make sure
> > > it is not borderline, such that due to some alignment issues the
> > > streaming doesn't happen on some machines. Also, how does such a test
> > > ensure that the streaming has happened? Given the way we are checking
> > > results, won't they be the same for the non-streaming case as well?
> >
> > Only for this case, or you mean for all the tests?
> >
>

I have not done this yet.

> It is better to do it for all tests and I have clarified this in my
> next email sent yesterday [2] where I have raised a few more comments
> as well. I hope you have not missed that email.
>
> > > 3.
> > > +# Change the local values of the extra columns on the subscriber,
> > > +# update publisher, and check that subscriber retains the expected
> > > +# values
> > > +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
> > > 'epoch'::timestamptz + 987654321 * interval '1s'");
> > > +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
> > > +
> > > +wait_for_caught_up($node_publisher, $appname);
> > > +
> > > +$result =
> > > +  $node_subscriber->safe_psql('postgres', "SELECT count(*),
> > > count(extract(epoch from c) = 987654321), count(d = 999) FROM
> > > test_tab");
> > > +is($result, qq(3334|3334|3334), 'check extra columns contain locally
> > > changed data');
> > >
> > > Again, how is this test relevant to streaming mode?
> >
> > I agree, it is not specific to the streaming.
> >

I think we can leave this as of now. After committing the stats
patches by Sawada-San and Ajin, we might be able to improve this test.

> +sub wait_for_caught_up
> +{
> + my ($node, $appname) = @_;
> +
> + $node->poll_query_until('postgres',
> +"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication
> WHERE application_name = '$appname';"
> + ) or die "Timed ou
>
> The patch has added this in all the test files. If it is used in so
> many tests then we need to add it in some generic place
> (PostgresNode.pm), but actually, I am not sure if we need this at all.
> Why can't the existing wait_for_catchup in PostgresNode.pm serve the
> same purpose?
>

Changed as per this suggestion.

> 2.
> In system_views.sql,
>
> -- All columns of pg_subscription except subconninfo are readable.
> REVOKE ALL ON pg_subscription FROM public;
> GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary,
> subslotname, subpublications)
>     ON pg_subscription TO public;
>
> Here, we need to update for substream column as well.
>

Fixed.

> 3. Update describeSubscriptions() to show the 'substream' value in \dRs.
>
> 4. Also, let's add a few tests in subscription.sql, as we have added
> the 'binary' option in commit 9de77b5453.
>

Fixed both the above comments.

> 5. I think we can merge pg_dump related changes (the last version
> posted in mail thread is v53-0005-Add-streaming-option-in-pg_dump) in
> the main patch, one minor comment on pg_dump related changes
> @@ -4358,6 +4369,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
>   if (strcmp(subinfo->subbinary, "t") == 0)
>   appendPQExpBuffer(query, ", binary = true");
>
> + if (strcmp(subinfo->substream, "f") != 0)
> + appendPQExpBuffer(query, ", streaming = on");
>   if (strcmp(subinfo->subsynccommit, "off") != 0)
>   appendPQExpBuffer(query, ", synchronous_commit = %s",
> fmtId(subinfo->subsynccommit));
>
> Keep one line space between substream and subsynccommit option code to
> keep it consistent with nearby code.
>

Changed as per this suggestion.

I have fixed all the comments except the below comments.
1. verify the size of various tests to ensure that it is above
logical_decoding_work_mem.
2. I have checked that in one of the previous patches, we have a test
v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case
quite similar to what we have in
v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort.
If there is any difference that can cover more scenarios then can we
consider merging them into one test?
3. +# Change the local values of the extra columns on the subscriber,
+# update publisher, and check that subscriber retains the expected
+# values
+$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
'epoch'::timestamptz + 987654321 * interval '1s'");
+$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
+
+wait_for_caught_up($node_publisher, $appname);
+
+$result =
+  $node_subscriber->safe_psql('postgres', "SELECT count(*),
count(extract(epoch from c) = 987654321), count(d = 999) FROM
test_tab");
+is($result, qq(3334|3334|3334), 'check extra columns contain locally
changed data');

Again, how is this test relevant to streaming mode?
4. Apart from the above, I think we should think of minimizing the
test cases which can be committed with the base patch. We can later
add more tests.

Kindly verify the changes.

-- 
With Regards,
Amit Kapila.

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I have fixed all the comments except
..
> 3. +# Change the local values of the extra columns on the subscriber,
> +# update publisher, and check that subscriber retains the expected
> +# values
> +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
> 'epoch'::timestamptz + 987654321 * interval '1s'");
> +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
> +
> +wait_for_caught_up($node_publisher, $appname);
> +
> +$result =
> +  $node_subscriber->safe_psql('postgres', "SELECT count(*),
> count(extract(epoch from c) = 987654321), count(d = 999) FROM
> test_tab");
> +is($result, qq(3334|3334|3334), 'check extra columns contain locally
> changed data');
>
> Again, how is this test relevant to streaming mode?
>

I think we can keep this test in one of the newly added tests, say in
015_stream_simple.pl, to ensure that after a streaming transaction the
non-streaming one behaves as expected. So we can change the comment to
"Change the local values of the extra columns on the subscriber,
update publisher, and check that subscriber retains the expected
values. This is to ensure that non-streaming transactions behave
properly after a streaming transaction."

We can remove this test from the other two places
016_stream_subxact.pl and 020_stream_binary.pl.

> 4. Apart from the above, I think we should think of minimizing the
> test cases which can be committed with the base patch. We can later
> add more tests.
>

We can combine the tests in 015_stream_simple.pl and
020_stream_binary.pl as I can't see a good reason to keep them
separate. Then, I think we can keep only this part with the main patch
and extract other tests into a separate patch. Basically, we can
commit the basic tests with the main patch and then keep the advanced
tests separately. I am afraid that there are some tests that don't add
much value so we can review them separately.

One minor comment for option 'streaming = on', spacing-wise it should
be consistent in all the tests.

Similarly, we can combine 017_stream_ddl.pl and 021_stream_schema.pl
as both contain similar tests. As per the above suggestion, this will
be in a separate patch though.

If you agree with the above suggestions then kindly make these
adjustments and send the updated patch.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Sep 2, 2020 at 10:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I have fixed all the comments except
> ..
> > 3. +# Change the local values of the extra columns on the subscriber,
> > +# update publisher, and check that subscriber retains the expected
> > +# values
> > +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
> > 'epoch'::timestamptz + 987654321 * interval '1s'");
> > +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
> > +
> > +wait_for_caught_up($node_publisher, $appname);
> > +
> > +$result =
> > +  $node_subscriber->safe_psql('postgres', "SELECT count(*),
> > count(extract(epoch from c) = 987654321), count(d = 999) FROM
> > test_tab");
> > +is($result, qq(3334|3334|3334), 'check extra columns contain locally
> > changed data');
> >
> > Again, how is this test relevant to streaming mode?
> >
>
> I think we can keep this test in one of the newly added tests, say in
> 015_stream_simple.pl, to ensure that after a streaming transaction the
> non-streaming one behaves as expected. So we can change the comment to
> "Change the local values of the extra columns on the subscriber,
> update publisher, and check that subscriber retains the expected
> values. This is to ensure that non-streaming transactions behave
> properly after a streaming transaction."
>
> We can remove this test from the other two places
> 016_stream_subxact.pl and 020_stream_binary.pl.
>
> > 4. Apart from the above, I think we should think of minimizing the
> > test cases which can be committed with the base patch. We can later
> > add more tests.
> >
>
> We can combine the tests in 015_stream_simple.pl and
> 020_stream_binary.pl as I can't see a good reason to keep them
> separate. Then, I think we can keep only this part with the main patch
> and extract other tests into a separate patch. Basically, we can
> commit the basic tests with the main patch and then keep the advanced
> tests separately. I am afraid that there are some tests that don't add
> much value so we can review them separately.

Fixed

> One minor comment for option 'streaming = on', spacing-wise it should
> be consistent in all the tests.
>
> Similarly, we can combine 017_stream_ddl.pl and 021_stream_schema.pl
> as both contain similar tests. As per the above suggestion, this will
> be in a separate patch though.
>
> If you agree with the above suggestions then kindly make these
> adjustments and send the updated patch.

Done that way.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 1, 2020 at 9:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Aug 31, 2020 at 7:28 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Mon, Aug 31, 2020 at 1:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > In functions cleanup_rel_sync_cache and
> > get_schema_sent_in_streamed_txn, let's cast the result of lfirst_int to
> > uint32 as suggested by Tom [1]. Also, let's keep the way we compare
> > xids consistent in both functions, i.e., if (xid == lfirst_int(lc)).
> >
>
> Fixed this in the attached patch.
>
> > The behavior tested by the test case added for this is not clear,
> > primarily because the comments don't explain it.
> >
> > +++ b/src/test/subscription/t/021_stream_schema.pl
> > @@ -0,0 +1,80 @@
> > +# Test behavior with streaming transaction exceeding logical_decoding_work_mem
> > ...
> > +# large (streamed) transaction with DDL, DML and ROLLBACKs
> > +$node_publisher->safe_psql('postgres', q{
> > +BEGIN;
> > +ALTER TABLE test_tab ADD COLUMN c INT;
> > +INSERT INTO test_tab SELECT i, md5(i::text), i FROM
> > generate_series(3,3000) s(i);
> > +ALTER TABLE test_tab ADD COLUMN d INT;
> > +COMMIT;
> > +});
> > +
> > +# large (streamed) transaction with DDL, DML and ROLLBACKs
> > +$node_publisher->safe_psql('postgres', q{
> > +BEGIN;
> > +INSERT INTO test_tab SELECT i, md5(i::text), i, i FROM
> > generate_series(3001,3005) s(i);
> > +COMMIT;
> > +});
> > +wait_for_caught_up($node_publisher, $appname);
> >
> > I understand how this test exercises the functionality related to the
> > schema_sent stuff, but neither the comments atop the file nor those
> > atop the test case explain it clearly.
> >
>
> Added comments for this test.
>
> > > > Few more comments:
> >
> > >
> > > > 2.
> > > > 009_stream_simple.pl
> > > > +# Insert, update and delete enough rows to exceed the 64kB limit.
> > > > +$node_publisher->safe_psql('postgres', q{
> > > > +BEGIN;
> > > > +INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
> > > > +UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
> > > > +DELETE FROM test_tab WHERE mod(a,3) = 0;
> > > > +COMMIT;
> > > > +});
> > > >
> > > > How far above the 64kB limit is this data? I just wanted to make sure
> > > > it is not borderline, such that due to some alignment issues the
> > > > streaming doesn't happen on some machines. Also, how does such a test
> > > > ensure that the streaming has happened? Given the way we are checking
> > > > results, won't they be the same for the non-streaming case as well?
> > >
> > > Only for this case, or you mean for all the tests?
> > >
> >
>
> I have not done this yet.
Most of the test cases generate above 100kB of data and a few are
around 72kB. The per-test-case sizes are:

015 - 200kB
016 - 150kB
017 - 72kB
018 - 72kB before the first rollback to savepoint, ~100kB in total
019 - 76kB before the first rollback to savepoint, ~100kB in total
020 - 150kB
021 - 100kB

> > It is better to do it for all tests and I have clarified this in my
> > next email sent yesterday [2] where I have raised a few more comments
> > as well. I hope you have not missed that email.

I saw that; I think I replied to this before seeing that email.

> > > > 3.
> > > > +# Change the local values of the extra columns on the subscriber,
> > > > +# update publisher, and check that subscriber retains the expected
> > > > +# values
> > > > +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
> > > > 'epoch'::timestamptz + 987654321 * interval '1s'");
> > > > +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
> > > > +
> > > > +wait_for_caught_up($node_publisher, $appname);
> > > > +
> > > > +$result =
> > > > +  $node_subscriber->safe_psql('postgres', "SELECT count(*),
> > > > count(extract(epoch from c) = 987654321), count(d = 999) FROM
> > > > test_tab");
> > > > +is($result, qq(3334|3334|3334), 'check extra columns contain locally
> > > > changed data');
> > > >
> > > > Again, how is this test relevant to streaming mode?
> > >
> > > I agree, it is not specific to the streaming.
> > >
>
> I think we can leave this as of now. After committing the stats
> patches by Sawada-San and Ajin, we might be able to improve this test.

Makes sense to me.

> > +sub wait_for_caught_up
> > +{
> > + my ($node, $appname) = @_;
> > +
> > + $node->poll_query_until('postgres',
> > +"SELECT pg_current_wal_lsn() <= replay_lsn FROM pg_stat_replication
> > WHERE application_name = '$appname';"
> > + ) or die "Timed ou
> >
> > The patch has added this in all the test files. If it is used in so
> > many tests then we need to add it in some generic place
> > (PostgresNode.pm), but actually, I am not sure if we need this at all.
> > Why can't the existing wait_for_catchup in PostgresNode.pm serve the
> > same purpose?
> >
>
> Changed as per this suggestion.

Okay.

> > 2.
> > In system_views.sql,
> >
> > -- All columns of pg_subscription except subconninfo are readable.
> > REVOKE ALL ON pg_subscription FROM public;
> > GRANT SELECT (subdbid, subname, subowner, subenabled, subbinary,
> > subslotname, subpublications)
> >     ON pg_subscription TO public;
> >
> > Here, we need to update for substream column as well.
> >
>
> Fixed.

 LGTM

> > 3. Update describeSubscriptions() to show the 'substream' value in \dRs.
> >
> > 4. Also, let's add a few tests in subscription.sql, as we have added
> > the 'binary' option in commit 9de77b5453.
> >
>
> Fixed both the above comments.

Ok

> > 5. I think we can merge pg_dump related changes (the last version
> > posted in mail thread is v53-0005-Add-streaming-option-in-pg_dump) in
> > the main patch, one minor comment on pg_dump related changes
> > @@ -4358,6 +4369,8 @@ dumpSubscription(Archive *fout, SubscriptionInfo *subinfo)
> >   if (strcmp(subinfo->subbinary, "t") == 0)
> >   appendPQExpBuffer(query, ", binary = true");
> >
> > + if (strcmp(subinfo->substream, "f") != 0)
> > + appendPQExpBuffer(query, ", streaming = on");
> >   if (strcmp(subinfo->subsynccommit, "off") != 0)
> >   appendPQExpBuffer(query, ", synchronous_commit = %s",
> > fmtId(subinfo->subsynccommit));
> >
> > Keep one line space between substream and subsynccommit option code to
> > keep it consistent with nearby code.
> >
>
> Changed as per this suggestion.

Ok


> I have fixed all the comments except the below comments.
> 1. verify the size of various tests to ensure that it is above
> logical_decoding_work_mem.
> 2. I have checked that in one of the previous patches, we have a test
> v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case
> quite similar to what we have in
> v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort.
> If there is any difference that can cover more scenarios then can we
> consider merging them into one test?
> 3. +# Change the local values of the extra columns on the subscriber,
> +# update publisher, and check that subscriber retains the expected
> +# values
> +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
> 'epoch'::timestamptz + 987654321 * interval '1s'");
> +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = md5(a::text)");
> +
> +wait_for_caught_up($node_publisher, $appname);
> +
> +$result =
> +  $node_subscriber->safe_psql('postgres', "SELECT count(*),
> count(extract(epoch from c) = 987654321), count(d = 999) FROM
> test_tab");
> +is($result, qq(3334|3334|3334), 'check extra columns contain locally
> changed data');
>
> Again, how is this test relevant to streaming mode?
> 4. Apart from the above, I think we should think of minimizing the
> test cases which can be committed with the base patch. We can later
> add more tests.



--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Sep 2, 2020 at 3:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Sep 2, 2020 at 10:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > >
> >
> > We can combine the tests in 015_stream_simple.pl and
> > 020_stream_binary.pl as I can't see a good reason to keep them
> > separate. Then, I think we can keep only this part with the main patch
> > and extract other tests into a separate patch. Basically, we can
> > commit the basic tests with the main patch and then keep the advanced
> > tests separately. I am afraid that there are some tests that don't add
> > much value so we can review them separately.
>
> Fixed
>

I have slightly adjusted this test and ran pgindent on the patch. I am
planning to push this tomorrow unless you have more comments.

-- 
With Regards,
Amit Kapila.

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Sep 2, 2020 at 7:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Sep 2, 2020 at 3:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Sep 2, 2020 at 10:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > >
> >
> > We can combine the tests in 015_stream_simple.pl and
> > 020_stream_binary.pl as I can't see a good reason to keep them
> > separate. Then, I think we can keep only this part with the main patch
> > and extract other tests into a separate patch. Basically, we can
> > commit the basic tests with the main patch and then keep the advanced
> > tests separately. I am afraid that there are some tests that don't add
> > much value so we can review them separately.
>
> Fixed
>

I have slightly adjusted this test and ran pgindent on the patch. I am
planning to push this tomorrow unless you have more comments.

Looks good to me. 

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
"Bossart, Nathan"
Date:
I noticed a small compiler warning for this.

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 812aca8011..88d3444c39 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -199,7 +199,7 @@ typedef struct ApplySubXactData
 static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};

 static void subxact_filename(char *path, Oid subid, TransactionId xid);
-static void changes_filename(char *path, Oid subid, TransactionId xid);
+static inline void changes_filename(char *path, Oid subid, TransactionId xid);

 /*
  * Information about subtransactions of a given toplevel transaction.

Nathan


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Sep 4, 2020 at 3:10 AM Bossart, Nathan <bossartn@amazon.com> wrote:
>
> I noticed a small compiler warning for this.
>
> diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
> index 812aca8011..88d3444c39 100644
> --- a/src/backend/replication/logical/worker.c
> +++ b/src/backend/replication/logical/worker.c
> @@ -199,7 +199,7 @@ typedef struct ApplySubXactData
>  static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
>
>  static void subxact_filename(char *path, Oid subid, TransactionId xid);
> -static void changes_filename(char *path, Oid subid, TransactionId xid);
> +static inline void changes_filename(char *path, Oid subid, TransactionId xid);
>

Thanks for the report, I'll take care of this. I think the nearby
similar function subxact_filename() should also be inline.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 1, 2020 at 9:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I have fixed all the comments except the below comments.
> 1. verify the size of various tests to ensure that it is above
> logical_decoding_work_mem.
> 2. I have checked that in one of the previous patches, we have a test
> v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case
> quite similar to what we have in
> v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort.
> If there is any difference that can cover more scenarios then can we
> consider merging them into one test?
>

I have compared these two tests and found that the only additional
thing in the test case present in
v53-0004-Add-TAP-test-for-streaming-vs.-DDL was that it performed a
few savepoints and DMLs after doing the first rollback to savepoint,
and I have included that in one of the existing tests,
018_stream_subxact_abort.pl. I have added one test for Rollback,
changed a few messages, and removed one test case which was not making
any sense in the patch. See the attached and let me know what you
think about it.

-- 
With Regards,
Amit Kapila.

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:


On Sat, 5 Sep 2020 at 4:02 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Sep 1, 2020 at 9:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I have fixed all the comments except the below comments.
> > 1. verify the size of various tests to ensure that it is above
> > logical_decoding_work_mem.
> > 2. I have checked that in one of the previous patches, we have a test
> > v53-0004-Add-TAP-test-for-streaming-vs.-DDL which contains a test case
> > quite similar to what we have in
> > v55-0002-Add-support-for-streaming-to-built-in-logical-re/013_stream_subxact_ddl_abort.
> > If there is any difference that can cover more scenarios then can we
> > consider merging them into one test?
> >
>
> I have compared these two tests and found that the only additional
> thing in the test case present in
> v53-0004-Add-TAP-test-for-streaming-vs.-DDL was that it performed a
> few savepoints and DMLs after doing the first rollback to savepoint,
> and I have included that in one of the existing tests,
> 018_stream_subxact_abort.pl. I have added one test for Rollback,
> changed a few messages, and removed one test case which was not making
> any sense in the patch. See the attached and let me know what you
> think about it.

I have reviewed the changes and they look fine to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Sat, Sep 5, 2020 at 8:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>
> I have reviewed the changes and they look fine to me.
>

Thanks, I have pushed the last patch. Let's wait for a day or so to
see the buildfarm reports and then we can probably close this CF
entry. I am aware that we have one patch related to stats still
pending but I think we can tackle it along with the spill stats patch
which is being discussed in a different thread [1]. Do let me know if
I have missed anything?

[1] - https://www.postgresql.org/message-id/CAA4eK1JBqQh9cBKjO-nKOOE%3D7f6ONDCZp0TJZfn4VsQqRZ%2BuYA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, Sep 7, 2020 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Sep 5, 2020 at 8:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>
> I have reviewed the changes and they look fine to me.
>

Thanks, I have pushed the last patch. Let's wait for a day or so to
see the buildfarm reports and then we can probably close this CF
entry.

Thanks.
 
I am aware that we have one patch related to stats still
pending but I think we can tackle it along with the spill stats patch
which is being discussed in a different thread [1]. Do let me know if
I have missed anything?

[1] - https://www.postgresql.org/message-id/CAA4eK1JBqQh9cBKjO-nKOOE%3D7f6ONDCZp0TJZfn4VsQqRZ%2BuYA%40mail.gmail.com

Sounds good to me.
 
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Sep 7, 2020 at 12:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Sep 7, 2020 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Sat, Sep 5, 2020 at 8:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> >
>> >
>> > I have reviewed the changes and they look fine to me.
>> >
>>
>> Thanks, I have pushed the last patch. Let's wait for a day or so to
>> see the buildfarm reports and then we can probably close this CF
>> entry.
>
>
> Thanks.
>

I have updated the status of the CF entry to committed now.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tomas Vondra
Date:
Hi,

while looking at the streaming code I noticed two minor issues:

1) logicalrep_read_stream_stop is never defined/called, so the prototype
in logicalproto.h is unnecessary

2) minor typo in one of the comments

Patch attached.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Sep 9, 2020 at 2:13 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi,

while looking at the streaming code I noticed two minor issues:

1) logicalrep_read_stream_stop is never defined/called, so the prototype
in logicalproto.h is unnecessary


Yeah, right.
 
2) minor typo in one of the comments

Patch attached.

 Looks good to me.
 
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Sep 9, 2020 at 2:13 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> Hi,
>
> while looking at the streaming code I noticed two minor issues:
>
> 1) logicalrep_read_stream_stop is never defined/called, so the prototype
> in logicalproto.h is unnecessary
>
> 2) minor typo in one of the comments
>
> Patch attached.
>

LGTM.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Sep 9, 2020 at 2:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Sep 9, 2020 at 2:13 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> > Hi,
> >
> > while looking at the streaming code I noticed two minor issues:
> >
> > 1) logicalrep_read_stream_stop is never defined/called, so the prototype
> > in logicalproto.h is unnecessary
> >
> > 2) minor typo in one of the comments
> >
> > Patch attached.
> >
>
> LGTM.
>

Pushed.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:
> Pushed.

Observe the following reports:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2020-09-13%2016%3A54%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2020-09-10%2009%3A08%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2020-09-05%2020%3A22%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-04%2001%3A52%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-03%2020%3A54%3A04

These are all on HEAD, and all within the last ten days, and I see
nothing comparable in any branch before that.  So it's hard to avoid
the conclusion that somebody broke something about ten days ago.

None of these animals provided gdb backtraces; but we do have a built-in
trace from several, and they all look like pgoutput.so is trying to
list_free() garbage, somewhere inside a relcache invalidation/rebuild
scenario:

TRAP: FailedAssertion("list->length > 0", File:
"/home/bf/build/buildfarm-idiacanthus/HEAD/pgsql.build/../pgsql/src/backend/nodes/list.c",Line: 68) 
postgres: publisher: walsender bf [local] idle(ExceptionalCondition+0x57)[0x9081f7]
postgres: publisher: walsender bf [local] idle[0x6bcc70]
postgres: publisher: walsender bf [local] idle(list_free+0x11)[0x6bdc01]

/home/bf/build/buildfarm-idiacanthus/HEAD/pgsql.build/tmp_install/home/bf/build/buildfarm-idiacanthus/HEAD/inst/lib/postgresql/pgoutput.so(+0x35d8)[0x7fa4c5a6f5d8]
postgres: publisher: walsender bf [local] idle(LocalExecuteInvalidationMessage+0x15b)[0x8f0cdb]
postgres: publisher: walsender bf [local] idle(ReceiveSharedInvalidMessages+0x4b)[0x7bca0b]
postgres: publisher: walsender bf [local] idle(LockRelationOid+0x56)[0x7c19e6]
postgres: publisher: walsender bf [local] idle(relation_open+0x1c)[0x4a2d0c]
postgres: publisher: walsender bf [local] idle(table_open+0x6)[0x524486]
postgres: publisher: walsender bf [local] idle[0x9017f2]
postgres: publisher: walsender bf [local] idle[0x8fabd4]
postgres: publisher: walsender bf [local] idle[0x8fa58a]
postgres: publisher: walsender bf [local] idle(RelationCacheInvalidateEntry+0xaf)[0x8fbdbf]
postgres: publisher: walsender bf [local] idle(LocalExecuteInvalidationMessage+0xec)[0x8f0c6c]
postgres: publisher: walsender bf [local] idle(ReceiveSharedInvalidMessages+0xcb)[0x7bca8b]
postgres: publisher: walsender bf [local] idle(LockRelationOid+0x56)[0x7c19e6]
postgres: publisher: walsender bf [local] idle(relation_open+0x1c)[0x4a2d0c]
postgres: publisher: walsender bf [local] idle(table_open+0x6)[0x524486]
postgres: publisher: walsender bf [local] idle[0x8ee8b0]

010_truncate.pl itself hasn't changed meaningfully in a good long time.
However, I see that 464824323 added a whole boatload of code to
pgoutput.c, and the timing is right for that commit to be the culprit,
so that's what I'm betting on.

Probably this requires a relcache inval at the wrong time;
although we have recent passes from CLOBBER_CACHE_ALWAYS animals,
so that can't be the whole triggering condition.  I wonder whether
it is relevant that all of the complaining animals are JIT-enabled.

            regards, tom lane



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tom Lane
Date:
I wrote:
> Probably this requires a relcache inval at the wrong time;
> although we have recent passes from CLOBBER_CACHE_ALWAYS animals,
> so that can't be the whole triggering condition.  I wonder whether
> it is relevant that all of the complaining animals are JIT-enabled.

Hmmm ... I take that back.  hyrax has indeed passed since this went
in, but *it doesn't run any TAP tests*.  So the buildfarm offers no
information about whether the replication tests work under
CLOBBER_CACHE_ALWAYS.

Realizing that, I built an installation that way and tried to run
the subscription tests.  Results so far:

* Running 010_truncate.pl by itself passed for me.  So there's still
some unexplained factor needed to trigger the buildfarm failures.
(I'm wondering about concurrent autovacuum activity now...)

* Starting over, it appears that 001_rep_changes.pl almost immediately
gets into an infinite loop.  It does not complete the third test step,
rather infinitely waiting for progress to be made.  The publisher log
shows a repeating loop like

2020-09-13 21:16:05.734 EDT [928529] tap_sub LOG:  could not send data to client: Broken pipe
2020-09-13 21:16:05.734 EDT [928529] tap_sub CONTEXT:  slot "tap_sub", output plugin "pgoutput", in the commit
callback, associated LSN 0/1660628
2020-09-13 21:16:05.843 EDT [928581] 001_rep_changes.pl LOG:  statement: SELECT pg_current_wal_lsn() <= replay_lsn AND
state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-09-13 21:16:05.861 EDT [928582] tap_sub LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:16:05.929 EDT [928582] tap_sub LOG:  received replication command: IDENTIFY_SYSTEM
2020-09-13 21:16:05.930 EDT [928582] tap_sub LOG:  received replication command: START_REPLICATION SLOT "tap_sub"
LOGICAL 0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:16:05.930 EDT [928582] tap_sub LOG:  starting logical decoding for slot "tap_sub"
2020-09-13 21:16:05.930 EDT [928582] tap_sub DETAIL:  Streaming transactions committing after 0/1652820, reading WAL
from 0/1651B20.
2020-09-13 21:16:05.930 EDT [928582] tap_sub LOG:  logical decoding found consistent point at 0/1651B20
2020-09-13 21:16:05.930 EDT [928582] tap_sub DETAIL:  There are no running transactions.
2020-09-13 21:16:21.560 EDT [928600] 001_rep_changes.pl LOG:  statement: SELECT pg_current_wal_lsn() <= replay_lsn AND
state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-09-13 21:16:37.291 EDT [928610] 001_rep_changes.pl LOG:  statement: SELECT pg_current_wal_lsn() <= replay_lsn AND
state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-09-13 21:16:52.959 EDT [928627] 001_rep_changes.pl LOG:  statement: SELECT pg_current_wal_lsn() <= replay_lsn AND
state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub';
2020-09-13 21:17:06.866 EDT [928636] tap_sub LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:17:06.934 EDT [928636] tap_sub LOG:  received replication command: IDENTIFY_SYSTEM
2020-09-13 21:17:06.934 EDT [928636] tap_sub LOG:  received replication command: START_REPLICATION SLOT "tap_sub"
LOGICAL 0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:17:06.934 EDT [928636] tap_sub ERROR:  replication slot "tap_sub" is active for PID 928582
2020-09-13 21:17:07.811 EDT [928638] tap_sub LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:17:07.880 EDT [928638] tap_sub LOG:  received replication command: IDENTIFY_SYSTEM
2020-09-13 21:17:07.881 EDT [928638] tap_sub LOG:  received replication command: START_REPLICATION SLOT "tap_sub"
LOGICAL 0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:17:07.881 EDT [928638] tap_sub ERROR:  replication slot "tap_sub" is active for PID 928582
2020-09-13 21:17:08.618 EDT [928641] 001_rep_changes.pl LOG:  statement: SELECT pg_current_wal_lsn() <= replay_lsn AND
state= 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'tap_sub'; 
2020-09-13 21:17:08.753 EDT [928642] tap_sub LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:17:08.821 EDT [928642] tap_sub LOG:  received replication command: IDENTIFY_SYSTEM
2020-09-13 21:17:08.821 EDT [928642] tap_sub LOG:  received replication command: START_REPLICATION SLOT "tap_sub"
LOGICAL 0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:17:08.821 EDT [928642] tap_sub ERROR:  replication slot "tap_sub" is active for PID 928582
2020-09-13 21:17:09.689 EDT [928645] tap_sub LOG:  statement: SELECT pg_catalog.set_config('search_path', '', false);
2020-09-13 21:17:09.756 EDT [928645] tap_sub LOG:  received replication command: IDENTIFY_SYSTEM
2020-09-13 21:17:09.757 EDT [928645] tap_sub LOG:  received replication command: START_REPLICATION SLOT "tap_sub"
LOGICAL 0/1652820 (proto_version '2', publication_names '"tap_pub","tap_pub_ins_only"')
2020-09-13 21:17:09.757 EDT [928645] tap_sub ERROR:  replication slot "tap_sub" is active for PID 928582
2020-09-13 21:17:09.841 EDT [928582] tap_sub LOG:  could not send data to client: Broken pipe
2020-09-13 21:17:09.841 EDT [928582] tap_sub CONTEXT:  slot "tap_sub", output plugin "pgoutput", in the commit
callback, associated LSN 0/1660628

while the subscriber is repeating

2020-09-13 21:15:01.598 EDT [928528] LOG:  logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:16:02.178 EDT [928528] ERROR:  terminating logical replication worker due to timeout
2020-09-13 21:16:02.179 EDT [920797] LOG:  background worker "logical replication worker" (PID 928528) exited with exit
code 1
2020-09-13 21:16:02.606 EDT [928571] LOG:  logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:16:03.117 EDT [928571] ERROR:  could not start WAL streaming: ERROR:  replication slot "tap_sub" is
active for PID 928529
2020-09-13 21:16:03.118 EDT [920797] LOG:  background worker "logical replication worker" (PID 928571) exited with exit
code 1
2020-09-13 21:16:03.544 EDT [928574] LOG:  logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:16:04.053 EDT [928574] ERROR:  could not start WAL streaming: ERROR:  replication slot "tap_sub" is
active for PID 928529
2020-09-13 21:16:04.054 EDT [920797] LOG:  background worker "logical replication worker" (PID 928574) exited with exit
code 1
2020-09-13 21:16:04.479 EDT [928576] LOG:  logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:16:04.990 EDT [928576] ERROR:  could not start WAL streaming: ERROR:  replication slot "tap_sub" is
active for PID 928529
2020-09-13 21:16:04.990 EDT [920797] LOG:  background worker "logical replication worker" (PID 928576) exited with exit
code 1
2020-09-13 21:16:05.415 EDT [928579] LOG:  logical replication apply worker for subscription "tap_sub" has started
2020-09-13 21:17:05.994 EDT [928579] ERROR:  terminating logical replication worker due to timeout

I'm out of patience to investigate this for tonight, but there is
something extremely broken here; maybe more than one something.

            regards, tom lane



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Sep 14, 2020 at 3:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > Pushed.
>
> Observe the following reports:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2020-09-13%2016%3A54%3A03
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2020-09-10%2009%3A08%3A03
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2020-09-05%2020%3A22%3A02
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-04%2001%3A52%3A03
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-03%2020%3A54%3A04
>
> These are all on HEAD, and all within the last ten days, and I see
> nothing comparable in any branch before that.  So it's hard to avoid
> the conclusion that somebody broke something about ten days ago.
>

I'll analyze these reports.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Tom Lane
Date:
I wrote:
> * Starting over, it appears that 001_rep_changes.pl almost immediately
> gets into an infinite loop.  It does not complete the third test step,
> rather infinitely waiting for progress to be made.

Ah, looking closer, the problem is that wal_receiver_timeout = 60s
is too short when the sender is using CCA.  It times out before we
can get through the needed data transmission.

            regards, tom lane



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Sep 14, 2020 at 3:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > Pushed.
>
> Observe the following reports:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2020-09-13%2016%3A54%3A03
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2020-09-10%2009%3A08%3A03
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2020-09-05%2020%3A22%3A02
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-04%2001%3A52%3A03
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-03%2020%3A54%3A04
>
> These are all on HEAD, and all within the last ten days, and I see
> nothing comparable in any branch before that.  So it's hard to avoid
> the conclusion that somebody broke something about ten days ago.
>
> None of these animals provided gdb backtraces; but we do have a built-in
> trace from several, and they all look like pgoutput.so is trying to
> list_free() garbage, somewhere inside a relcache invalidation/rebuild
> scenario:
>

Yeah, this is right, and here is some initial analysis. It seems to be
failing in below code:
rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..}

This list can have elements only in 'streaming' mode (need to enable
'streaming' with Create Subscription command) whereas none of the
tests in 010_truncate.pl is using 'streaming', so this list should be
empty (NULL). The two different assertion failures shown in BF reports
in list_free code are as below:
Assert(list->length > 0);
Assert(list->length <= list->max_length);

It seems to me that this list is not initialized properly when it is
not used or maybe that is true in some special circumstances because
we initialize it in get_rel_sync_entry(). I am not sure if CCI build
is impacting this in some way.
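
For illustration, a simplified sketch of the callback in question (details of
the real callback omitted) shows why an uninitialized field is fatal here:
list_free() returns early only for a genuine NIL pointer, so whatever garbage
happens to be in streamed_txns gets walked and trips the List assertions
quoted above.

static void
rel_sync_cache_relation_cb(Datum arg, Oid relid)
{
    RelationSyncEntry *entry;

    /* the cache may not have been created yet */
    if (RelationSyncCache == NULL)
        return;

    entry = (RelationSyncEntry *) hash_search(RelationSyncCache,
                                              (void *) &relid,
                                              HASH_FIND, NULL);

    if (entry != NULL)
    {
        /* force the schema to be resent and forget streamed transactions */
        entry->schema_sent = false;
        list_free(entry->streamed_txns);    /* asserts if the field is garbage */
        entry->streamed_txns = NIL;
    }
}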

--
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Mon, Sep 14, 2020 at 8:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Sep 14, 2020 at 3:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Amit Kapila <amit.kapila16@gmail.com> writes:
> > > Pushed.
> >
> > Observe the following reports:
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2020-09-13%2016%3A54%3A03
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2020-09-10%2009%3A08%3A03
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2020-09-05%2020%3A22%3A02
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-04%2001%3A52%3A03
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2020-09-03%2020%3A54%3A04
> >
> > These are all on HEAD, and all within the last ten days, and I see
> > nothing comparable in any branch before that.  So it's hard to avoid
> > the conclusion that somebody broke something about ten days ago.
> >
> > None of these animals provided gdb backtraces; but we do have a built-in
> > trace from several, and they all look like pgoutput.so is trying to
> > list_free() garbage, somewhere inside a relcache invalidation/rebuild
> > scenario:
> >
>
> Yeah, this is right, and here is some initial analysis. It seems to be
> failing in below code:
> rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..}
>
> This list can have elements only in 'streaming' mode (need to enable
> 'streaming' with Create Subscription command) whereas none of the
> tests in 010_truncate.pl is using 'streaming', so this list should be
> empty (NULL). The two different assertion failures shown in BF reports
> in list_free code are as below:
> Assert(list->length > 0);
> Assert(list->length <= list->max_length);
>
> It seems to me that this list is not initialized properly when it is
> not used or maybe that is true in some special circumstances because
> we initialize it in get_rel_sync_entry(). I am not sure if CCI build
> is impacting this in some way.


I have also analyzed this but did not find any reason why the
streamed_txns list should be anything other than NULL. The only thing
is that we are initializing entry->streamed_txns to NULL, and
list_free() checks "if (list == NIL)" and returns. However, IMHO that
should not be an issue because NIL is defined as (List *) NULL. I am
doing further testing and investigation.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Mon, Sep 14, 2020 at 1:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Sep 14, 2020 at 8:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Yeah, this is right, and here is some initial analysis. It seems to be
> > failing in below code:
> > rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..}
> >
> > This list can have elements only in 'streaming' mode (need to enable
> > 'streaming' with Create Subscription command) whereas none of the
> > tests in 010_truncate.pl is using 'streaming', so this list should be
> > empty (NULL). The two different assertion failures shown in BF reports
> > in list_free code are as below:
> > Assert(list->length > 0);
> > Assert(list->length <= list->max_length);
> >
> > It seems to me that this list is not initialized properly when it is
> > not used or maybe that is true in some special circumstances because
> > we initialize it in get_rel_sync_entry(). I am not sure if CCI build
> > is impacting this in some way.
>
>
> I have also analyzed this but did not find any reason why the
> streamed_txns list should be anything other than NULL. The only thing
> is that we are initializing entry->streamed_txns to NULL, and
> list_free() checks "if (list == NIL)" and returns. However, IMHO that
> should not be an issue because NIL is defined as (List *) NULL.
>

Yeah, that is not the issue but it is better to initialize it with NIL
for the sake of consistency. The basic issue here was that we were trying
to open/lock the relation(s) before initializing this list. Now, when
we process the invalidations during open relation, we try to access
this list in rel_sync_cache_relation_cb and that leads to assertion
failure. I have reproduced the exact scenario of 010_truncate.pl via
debugger. Basically, the backend on publisher has sent the
invalidation after truncating the relation 'tab1' and while processing
the truncate message if WALSender receives that message exactly after
creating the RelSyncEntry for 'tab1', the Assertion shown in BF can be
reproduced.

The attached patch will fix the issue. What do you think?

-- 
With Regards,
Amit Kapila.

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Sep 14, 2020 at 4:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Sep 14, 2020 at 1:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Sep 14, 2020 at 8:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Yeah, this is right, and here is some initial analysis. It seems to be
> > > failing in below code:
> > > rel_sync_cache_relation_cb(){ ...list_free(entry->streamed_txns);..}
> > >
> > > This list can have elements only in 'streaming' mode (need to enable
> > > 'streaming' with Create Subscription command) whereas none of the
> > > tests in 010_truncate.pl is using 'streaming', so this list should be
> > > empty (NULL). The two different assertion failures shown in BF reports
> > > in list_free code are as below:
> > > Assert(list->length > 0);
> > > Assert(list->length <= list->max_length);
> > >
> > > It seems to me that this list is not initialized properly when it is
> > > not used or maybe that is true in some special circumstances because
> > > we initialize it in get_rel_sync_entry(). I am not sure if CCI build
> > > is impacting this in some way.
> >
> >
> > Even I have analyzed this but did not find any reason why the
> > streamed_txns list should be anything other than NULL.  The only thing
> > is we are initializing the entry->streamed_txns to NULL and the list
> > free is checking  "if (list == NIL)" then return. However IMHO, that
> > should not be an issue because NIL is defined as (List*) NULL.
> >
>
> Yeah, that is not the issue but it is better to initialize it with NIL
> for the sake of consistency. The basic issue here was we were trying
> to open/lock the relation(s) before initializing this list. Now, when
> we process the invalidations during open relation, we try to access
> this list in rel_sync_cache_relation_cb and that leads to assertion
> failure. I have reproduced the exact scenario of 010_truncate.pl via
> debugger. Basically, the backend on publisher has sent the
> invalidation after truncating the relation 'tab1' and while processing
> the truncate message if WALSender receives that message exactly after
> creating the RelSyncEntry for 'tab1', the Assertion shown in BF can be
> reproduced.

Yeah, this is an issue and I am also able to reproduce it manually
using gdb.  Basically, I inserted some data into the publication table
and then stopped in get_rel_sync_entry after creating the entry and
before calling GetRelationPublications.  Meanwhile, I truncated the
table, and that hit the same issue you pointed out here.

> The attached patch will fix the issue. What do you think?

The patch looks good to me and fixes the reported issue.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Tom Lane
Дата:
Amit Kapila <amit.kapila16@gmail.com> writes:
> The attached patch will fix the issue. What do you think?

I think it'd be cleaner to separate the initialization of a new entry from
validation altogether, along the lines of

    /* Find cached function info, creating if not found */
    oldctx = MemoryContextSwitchTo(CacheMemoryContext);
    entry = (RelationSyncEntry *) hash_search(RelationSyncCache,
                                              (void *) &relid,
                                              HASH_ENTER, &found);
    MemoryContextSwitchTo(oldctx);
    Assert(entry != NULL);

    if (!found)
    {
        /* immediately make a new entry valid enough to satisfy callbacks */
        entry->schema_sent = false;
        entry->streamed_txns = NIL;
        entry->replicate_valid = false;
        /* are there any other fields we should clear here for safety??? */
    }

    /* Fill it in if not valid */
    if (!entry->replicate_valid)
    {
        List       *pubids = GetRelationPublications(relid);
        ...

BTW, unless someone has changed the behavior of dynahash when I
wasn't looking, those MemoryContextSwitchTos shown above are useless.
Also, why does the comment refer to a "function" entry?

            regards, tom lane



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Sep 14, 2020 at 9:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > The attached patch will fix the issue. What do you think?
>
> I think it'd be cleaner to separate the initialization of a new entry from
> validation altogether, along the lines of
>
>     /* Find cached function info, creating if not found */
>     oldctx = MemoryContextSwitchTo(CacheMemoryContext);
>     entry = (RelationSyncEntry *) hash_search(RelationSyncCache,
>                                               (void *) &relid,
>                                               HASH_ENTER, &found);
>     MemoryContextSwitchTo(oldctx);
>     Assert(entry != NULL);
>
>     if (!found)
>     {
>         /* immediately make a new entry valid enough to satisfy callbacks */
>         entry->schema_sent = false;
>         entry->streamed_txns = NIL;
>         entry->replicate_valid = false;
>         /* are there any other fields we should clear here for safety??? */
>     }
>

If we want to separate validation then we need to initialize other
fields like 'pubactions' and 'publish_as_relid' as well. I think it
will be better to arrange it the way you are suggesting, so I will
change it along with the other fields that require initialization.

>     /* Fill it in if not valid */
>     if (!entry->replicate_valid)
>     {
>         List       *pubids = GetRelationPublications(relid);
>         ...
>
> BTW, unless someone has changed the behavior of dynahash when I
> wasn't looking, those MemoryContextSwitchTos shown above are useless.
>

As far as I can see they are useless in this case but I think they
might be required in case the user provides its own allocator function
(using HASH_ALLOC). So, we can probably remove those from here?

> Also, why does the comment refer to a "function" entry?
>

It should be "relation" instead. I'll take care of changing this as well.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Tom Lane
Дата:
Amit Kapila <amit.kapila16@gmail.com> writes:
> On Mon, Sep 14, 2020 at 9:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> BTW, unless someone has changed the behavior of dynahash when I
>> wasn't looking, those MemoryContextSwitchTos shown above are useless.

> As far as I can see they are useless in this case but I think they
> might be required in case the user provides its own allocator function
> (using HASH_ALLOC). So, we can probably remove those from here?

You could imagine writing a HASH_ALLOC allocator whose behavior
varies depending on CurrentMemoryContext, but it seems like a
pretty foolish/fragile way to do it.  In any case I can think of,
the hash table lives in one specific context and you really
really do not want parts of it spread across other contexts.
dynahash.c is not going to look kindly on pieces of what it
is managing disappearing from under it.

(To be clear, objects that the hash entries contain pointers to
are a different question.  But the hash entries themselves have
to have exactly the same lifespan as the hash table.)
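
For illustration, the supported way to tie a dynahash table to a specific
context is the HASH_CONTEXT flag rather than switching CurrentMemoryContext
around the calls; a minimal sketch (MyCacheEntry is a hypothetical entry
type):

    HASHCTL     ctl;
    HTAB       *htab;

    memset(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(Oid);
    ctl.entrysize = sizeof(MyCacheEntry);   /* hypothetical entry struct */
    ctl.hcxt = CacheMemoryContext;          /* context the table should live under */

    htab = hash_create("my cache", 128, &ctl,
                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

With HASH_CONTEXT, the table's storage is allocated under ctl.hcxt, so its
lifespan follows that context and no MemoryContextSwitchTo() is needed
around hash_create() or hash_search().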

            regards, tom lane



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, Sep 15, 2020 at 8:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > On Mon, Sep 14, 2020 at 9:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> BTW, unless someone has changed the behavior of dynahash when I
> >> wasn't looking, those MemoryContextSwitchTos shown above are useless.
>
> > As far as I can see they are useless in this case but I think they
> > might be required in case the user provides its own allocator function
> > (using HASH_ALLOC). So, we can probably remove those from here?
>
> You could imagine writing a HASH_ALLOC allocator whose behavior
> varies depending on CurrentMemoryContext, but it seems like a
> pretty foolish/fragile way to do it.  In any case I can think of,
> the hash table lives in one specific context and you really
> really do not want parts of it spread across other contexts.
> dynahash.c is not going to look kindly on pieces of what it
> is managing disappearing from under it.
>

I agree that doesn't make sense. I have addressed all the comments
discussed above in the attached patch.

-- 
With Regards,
Amit Kapila.

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, Sep 15, 2020 at 10:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 15, 2020 at 8:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > As far as I can see they are useless in this case but I think they
> > > might be required in case the user provides its own allocator function
> > > (using HASH_ALLOC). So, we can probably remove those from here?
> >
> > You could imagine writing a HASH_ALLOC allocator whose behavior
> > varies depending on CurrentMemoryContext, but it seems like a
> > pretty foolish/fragile way to do it.  In any case I can think of,
> > the hash table lives in one specific context and you really
> > really do not want parts of it spread across other contexts.
> > dynahash.c is not going to look kindly on pieces of what it
> > is managing disappearing from under it.
> >
>
> I agree that doesn't make sense. I have fixed all the comments
> discussed in the attached patch.
>

Pushed.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Noah Misch
Дата:
On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:
> Thanks, I have pushed the last patch. Let's wait for a day or so to
> see the buildfarm reports

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
failed the new 015_stream.pl test with the subscriber looping like this:

2020-09-08 11:22:49.848 UTC [13959252:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
2020-09-08 11:22:54.045 UTC [13959252:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
2020-09-08 11:22:54.055 UTC [7602182:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
2020-09-08 11:22:54.101 UTC [31785284:4] LOG:  background worker "logical replication worker" (PID 13959252) exited with exit code 1
2020-09-08 11:23:01.142 UTC [7602182:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
...

What happened there?



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote:
>
> On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:
> > Thanks, I have pushed the last patch. Let's wait for a day or so to
> > see the buildfarm reports
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
> failed the new 015_stream.pl test with the subscriber looping like this:
>

I will look into this.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote:
>
> On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:
> > Thanks, I have pushed the last patch. Let's wait for a day or so to
> > see the buildfarm reports
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
> failed the new 015_stream.pl test with the subscriber looping like this:
>
> 2020-09-08 11:22:49.848 UTC [13959252:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
> 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> 2020-09-08 11:22:54.055 UTC [7602182:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
> 2020-09-08 11:22:54.101 UTC [31785284:4] LOG:  background worker "logical replication worker" (PID 13959252) exited with exit code 1
> 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> ...
>
> What happened there?
>

What is going on here is that the expected streaming file is missing.
Normally, the first time we send a stream of changes (some percentage
of the transaction's changes) we create the streaming file, and then in
subsequent streams we keep appending to that file the changes we
receive from the publisher; on commit, we read that file and apply all
the changes.
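
For illustration, the apply-side behaviour described above can be sketched
roughly as follows; the helper names and the plain FILE-based spooling are
purely illustrative (the real code uses BufFile/SharedFileSet in worker.c):

    #include <stdio.h>
    #include <stdint.h>

    static FILE *changes_file;      /* stand-in for the per-transaction spool file */

    /* First stream for an xid creates the spool file; later streams append. */
    static void
    on_stream_start(uint32_t xid, int first_segment)
    {
        char        path[64];

        snprintf(path, sizeof(path), "%u.changes", xid);
        changes_file = fopen(path, first_segment ? "wb" : "ab");
    }

    /* Each change received within a stream is written to the spool file. */
    static void
    on_stream_change(const char *change, size_t len)
    {
        fwrite(change, 1, len, changes_file);
    }

    /* On commit, the spooled changes are read back, applied, and discarded. */
    static void
    on_stream_commit(void)
    {
        fclose(changes_file);
        /* ... reopen for reading, apply every change, then remove the file ... */
    }

The error in the log above corresponds to the subscriber expecting the spool
file to already exist when it does not, which is what the two failure
reasons described next would produce.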

The above kind of error can happen for the following reasons: (a) the
first time we sent the stream we created the file, but it got removed
before the second stream reached the subscriber; (b) the publisher
never sent the indication that it is the first stream, so the
subscriber directly tries to open the file, thinking it is already
there.

Now, the publisher and subscriber logs don't directly indicate either
of the above problems, but I have some observations.

The subscriber log indicates that a new apply worker gets started
before the old apply worker exits due to the error. We delete the
streaming-related temporary files on proc_exit, so one possibility
could have been that the old apply worker removed a streaming file the
new apply worker had just created, but that is not possible because we
always include the procid in the path of these temp files.

The other thing I observed in the code is that we can mark a
transaction as streamed (via ReorderBufferTruncateTxn) if, the first
time we try to stream it, it has no changes. That would lead to
symptom (b), because the second time, when there are more changes, we
would stream them as though it were not the first stream. However,
this shouldn't happen because we never pick a transaction to stream
that has no changes. I can try to fix the code so that we don't mark a
transaction as streamed unless we have streamed at least one change,
but I don't see how that relates to this particular test failure.

I am not sure why this failure has not recurred since it occurred a few
months back; it's probably a timing issue. I have fixed a few timing
issues related to this feature in the last month or so, but I cannot
come up with a theory for whether any of those would have fixed this
problem.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote:
> >
> > On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:
> > > Thanks, I have pushed the last patch. Let's wait for a day or so to
> > > see the buildfarm reports
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
> > failed the new 015_stream.pl test with the subscriber looping like this:
> >
> > 2020-09-08 11:22:49.848 UTC [13959252:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
> > 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> > 2020-09-08 11:22:54.055 UTC [7602182:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
> > 2020-09-08 11:22:54.101 UTC [31785284:4] LOG:  background worker "logical replication worker" (PID 13959252) exited with exit code 1
> > 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> > ...
> >
> > What happened there?
> >
>
> What is going on here is that the expected streaming file is missing.
> Normally, the first time we send a stream of changes (some percentage
> of transaction changes) we create the streaming file, and then in
> respective streams we just keep on writing in that file the changes we
> receive from the publisher, and on commit, we read that file and apply
> all the changes.
>
> The above kind of error can happen due to the following reasons: (a)
> the first time we sent the stream and created the file and that got
> removed before the second stream reached the subscriber. (b) from the
> publisher-side, we never sent the indication that it is the first
> stream and the subscriber directly tries to open the file thinking it
> is already there.
>
> Now, the publisher and subscriber log doesn't directly indicate any of
> the above problems but I have some observations.
>
> The subscriber log indicates that before the apply worker exits due to
> an error the new apply worker gets started. We delete the
> streaming-related temporary files on proc_exit, so one possibility
> could have been that the new apply worker has created the streaming
> file which the old apply worker has removed but that is not possible
> because we always create these temp-files by having procid in the
> path.

Yeah, and I have tried to test along this line: basically, after the
streaming had started I set binary=on.  Then, using gdb, I made the
worker wait before it deletes the temp file; meanwhile, the new worker
started and everything worked properly, as expected.

> The other thing I observed in the code is that we can mark the
> transaction as streamed (via ReorderBufferTruncateTxn) if we try to
> stream a transaction that has no changes the first time we try to
> stream the transaction. This would lead to symptom (b) because the
> second-time when there are more changes we would stream the changes as
> it is not the first time. However, this shouldn't happen because we
> never pick-up a transaction to stream which has no changes. I can try
> to fix the code here such that we don't mark the transaction as
> streamed unless we have streamed at least one change but I don't see
> how it is related to this particular test failure.

Yeah, this can be improved, but as you mentioned, we never select an
empty transaction for streaming, so this case should not occur.  I will
perform some testing/review around this and report back.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Tue, Dec 1, 2020 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > What is going on here is that the expected streaming file is missing.
> > Normally, the first time we send a stream of changes (some percentage
> > of transaction changes) we create the streaming file, and then in
> > respective streams we just keep on writing in that file the changes we
> > receive from the publisher, and on commit, we read that file and apply
> > all the changes.
> >
> > The above kind of error can happen due to the following reasons: (a)
> > the first time we sent the stream and created the file and that got
> > removed before the second stream reached the subscriber. (b) from the
> > publisher-side, we never sent the indication that it is the first
> > stream and the subscriber directly tries to open the file thinking it
> > is already there.
> >
> > Now, the publisher and subscriber log doesn't directly indicate any of
> > the above problems but I have some observations.
> >
> > The subscriber log indicates that before the apply worker exits due to
> > an error the new apply worker gets started. We delete the
> > streaming-related temporary files on proc_exit, so one possibility
> > could have been that the new apply worker has created the streaming
> > file which the old apply worker has removed but that is not possible
> > because we always create these temp-files by having procid in the
> > path.
>
> Yeah, and I have tried to test on this line, basically, after the
> streaming has started I have set the binary=on.  Now using gdb I have
> made the worker wait before it deletes the temp file and meanwhile the
> new worker started and it worked properly as expected.
>
> > The other thing I observed in the code is that we can mark the
> > transaction as streamed (via ReorderBufferTruncateTxn) if we try to
> > stream a transaction that has no changes the first time we try to
> > stream the transaction. This would lead to symptom (b) because the
> > second-time when there are more changes we would stream the changes as
> > it is not the first time. However, this shouldn't happen because we
> > never pick-up a transaction to stream which has no changes. I can try
> > to fix the code here such that we don't mark the transaction as
> > streamed unless we have streamed at least one change but I don't see
> > how it is related to this particular test failure.
>
> Yeah, this can be improved but as you mentioned that we never select
> an empty transaction for streaming so this case should not occur.  I
> will perform some testing/review around this and report.
>

Thinking further about this point, I think the message seen on the
subscriber [1] won't occur if we missed the first stream. This is
because we always look up the fileset in the stream hash table
(xidhash), and it won't be there if we directly send the second
stream; that would have led to a different kind of problem (probably a
crash). So this symptom seems to be due to reason (a) mentioned above,
unless we are missing something else. Now, I am not sure how the file
can be removed while the corresponding entry in the hash table
(xidhash) is still present. The only reasons that come to mind are
that some other process cleaned the pgsql_tmp directory thinking these
temporary files are not required, or that someone removed it manually;
neither of those seems plausible.

[1] - ERROR: could not open temporary file "16393-510.changes.0" from
BufFile "16393-510.changes": No such file or directory

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Dilip Kumar
Дата:
On Tue, Dec 1, 2020 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote:
> > >
> > > On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:
> > > > Thanks, I have pushed the last patch. Let's wait for a day or so to
> > > > see the buildfarm reports
> > >
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
> > > failed the new 015_stream.pl test with the subscriber looping like this:
> > >
> > > 2020-09-08 11:22:49.848 UTC [13959252:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
> > > 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> > > 2020-09-08 11:22:54.055 UTC [7602182:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
> > > 2020-09-08 11:22:54.101 UTC [31785284:4] LOG:  background worker "logical replication worker" (PID 13959252) exited with exit code 1
> > > 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> > > ...
> > >
> > > What happened there?
> > >
> >
> > What is going on here is that the expected streaming file is missing.
> > Normally, the first time we send a stream of changes (some percentage
> > of transaction changes) we create the streaming file, and then in
> > respective streams we just keep on writing in that file the changes we
> > receive from the publisher, and on commit, we read that file and apply
> > all the changes.
> >
> > The above kind of error can happen due to the following reasons: (a)
> > the first time we sent the stream and created the file and that got
> > removed before the second stream reached the subscriber. (b) from the
> > publisher-side, we never sent the indication that it is the first
> > stream and the subscriber directly tries to open the file thinking it
> > is already there.
> >
> > Now, the publisher and subscriber log doesn't directly indicate any of
> > the above problems but I have some observations.
> >
> > The subscriber log indicates that before the apply worker exits due to
> > an error the new apply worker gets started. We delete the
> > streaming-related temporary files on proc_exit, so one possibility
> > could have been that the new apply worker has created the streaming
> > file which the old apply worker has removed but that is not possible
> > because we always create these temp-files by having procid in the
> > path.
>
> Yeah, and I have tried to test on this line, basically, after the
> streaming has started I have set the binary=on.  Now using gdb I have
> made the worker wait before it deletes the temp file and meanwhile the
> new worker started and it worked properly as expected.
>
> > The other thing I observed in the code is that we can mark the
> > transaction as streamed (via ReorderBufferTruncateTxn) if we try to
> > stream a transaction that has no changes the first time we try to
> > stream the transaction. This would lead to symptom (b) because the
> > second-time when there are more changes we would stream the changes as
> > it is not the first time. However, this shouldn't happen because we
> > never pick-up a transaction to stream which has no changes. I can try
> > to fix the code here such that we don't mark the transaction as
> > streamed unless we have streamed at least one change but I don't see
> > how it is related to this particular test failure.
>
> Yeah, this can be improved but as you mentioned that we never select
> an empty transaction for streaming so this case should not occur.  I
> will perform some testing/review around this and report.

I have executed "make check" in the loop with only this file.  I have
repeated it 5000 times but no failure, I am wondering shall we try to
execute in the same machine in a loop where it failed once?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Wed, Dec 2, 2020 at 1:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Dec 1, 2020 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote:
> > > >
> > > > On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:
> > > > > Thanks, I have pushed the last patch. Let's wait for a day or so to
> > > > > see the buildfarm reports
> > > >
> > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
> > > > failed the new 015_stream.pl test with the subscriber looping like this:
> > > >
> > > > 2020-09-08 11:22:49.848 UTC [13959252:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
> > > > 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> > > > 2020-09-08 11:22:54.055 UTC [7602182:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
> > > > 2020-09-08 11:22:54.101 UTC [31785284:4] LOG:  background worker "logical replication worker" (PID 13959252) exited with exit code 1
> > > > 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> > > > ...
> > > >
> > > > What happened there?
> > > >
> > >
> > > What is going on here is that the expected streaming file is missing.
> > > Normally, the first time we send a stream of changes (some percentage
> > > of transaction changes) we create the streaming file, and then in
> > > respective streams we just keep on writing in that file the changes we
> > > receive from the publisher, and on commit, we read that file and apply
> > > all the changes.
> > >
> > > The above kind of error can happen due to the following reasons: (a)
> > > the first time we sent the stream and created the file and that got
> > > removed before the second stream reached the subscriber. (b) from the
> > > publisher-side, we never sent the indication that it is the first
> > > stream and the subscriber directly tries to open the file thinking it
> > > is already there.
> > >
>
> I have executed "make check" in the loop with only this file.  I have
> repeated it 5000 times but no failure, I am wondering shall we try to
> execute in the same machine in a loop where it failed once?
>

Yes, that might help. Noah, would it be possible for you to try that
out, and if it fails, to get a stack trace of the subscriber?
If we are able to reproduce it then we can add elogs in the functions
SharedFileSetInit, BufFileCreateShared, BufFileOpenShared, and
SharedFileSetDeleteAll to print the paths, to see if we are sometimes
unintentionally removing some files. I have checked the code and there
don't appear to be any such problems, but I might be missing
something.

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Noah Misch
Дата:
On Wed, Dec 02, 2020 at 01:50:25PM +0530, Amit Kapila wrote:
> On Wed, Dec 2, 2020 at 1:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <noah@leadboat.com> wrote:
> > > > > On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:
> > > > > > Thanks, I have pushed the last patch. Let's wait for a day or so to
> > > > > > see the buildfarm reports
> > > > >
> > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
> > > > > failed the new 015_stream.pl test with the subscriber looping like this:
> > > > >
> > > > > 2020-09-08 11:22:49.848 UTC [13959252:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
> > > > > 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> > > > > 2020-09-08 11:22:54.055 UTC [7602182:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
> > > > > 2020-09-08 11:22:54.101 UTC [31785284:4] LOG:  background worker "logical replication worker" (PID 13959252) exited with exit code 1
> > > > > 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> > > > > ...

> > > > The above kind of error can happen due to the following reasons: (a)
> > > > the first time we sent the stream and created the file and that got
> > > > removed before the second stream reached the subscriber. (b) from the
> > > > publisher-side, we never sent the indication that it is the first
> > > > stream and the subscriber directly tries to open the file thinking it
> > > > is already there.

Further testing showed it was a file location problem, not a deletion problem.
The worker tried to open
base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset/16393-510.changes.0, but these
were the files actually existing:

[nm@power-aix 0:2 2020-12-08T13:56:35 64gcc 0]$ ls -la $(find src/test/subscription/tmp_check -name '*sharedfileset*')
src/test/subscription/tmp_check/t_015_stream_subscriber_data/pgdata/base/pgsql_tmp/pgsql_tmp9896408.0.sharedfileset:
total 408
drwx------    2 nm       usr             256 Dec 08 03:20 .
drwx------    4 nm       usr             256 Dec 08 03:20 ..
-rw-------    1 nm       usr          207806 Dec 08 03:20 16393-510.changes.0

src/test/subscription/tmp_check/t_015_stream_subscriber_data/pgdata/base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset:
total 0
drwx------    2 nm       usr             256 Dec 08 03:20 .
drwx------    4 nm       usr             256 Dec 08 03:20 ..
-rw-------    1 nm       usr               0 Dec 08 03:20 16393-511.changes.0

> > I have executed "make check" in the loop with only this file.  I have
> > repeated it 5000 times but no failure, I am wondering shall we try to
> > execute in the same machine in a loop where it failed once?
> 
> Yes, that might help. Noah, would it be possible for you to try that

The problem is xidhash using strcmp() to compare keys; it needs memcmp().  For
this to matter, xidhash must contain more than one element.  Existing tests
rarely exercise the multi-element scenario.  Under heavy load, on this system,
the test publisher can have two active transactions at once, in which case it
does exercise multi-element xidhash.  (The publisher is sensitive to timing,
but the subscriber is not; once WAL contains interleaved records of two XIDs,
the subscriber fails every time.)  This would be much harder to reproduce on a
little-endian system, where strcmp(&xid, &xid_plus_one)!=0.  On big-endian,
every small XID has zero in the first octet; they all look like empty strings.
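
A tiny standalone demonstration of that point (illustrative only, outside
of dynahash):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int
    main(void)
    {
        uint32_t    xid_a = 510;
        uint32_t    xid_b = 511;

        /*
         * Treating the 4-byte XIDs as C strings is effectively what the
         * string comparator does when HASH_BLOBS is omitted.  On big-endian
         * both values begin with a zero byte, so they compare equal as empty
         * strings; memcmp() distinguishes them on any architecture.
         */
        printf("strcmp: %d\n", strcmp((const char *) &xid_a, (const char *) &xid_b));
        printf("memcmp: %d\n", memcmp(&xid_a, &xid_b, sizeof(uint32_t)));
        return 0;
    }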

The attached patch has the one-line fix and some test suite changes that make
this reproduce frequently on any big-endian system.  I'm currently planning to
drop the test suite changes from the commit, but I could keep them if folks
like them.  (They'd need more comments and timeout handling.)
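
The attachment itself is not reproduced here, but the kind of one-line fix
being described is simply to tell dynahash that the TransactionId key is
raw binary rather than a C string, roughly along these lines (a sketch;
the entry type and context names are approximate, not necessarily the
committed code):

    HASHCTL     hash_ctl;

    memset(&hash_ctl, 0, sizeof(hash_ctl));
    hash_ctl.keysize = sizeof(TransactionId);
    hash_ctl.entrysize = sizeof(StreamXidHash);     /* approximate entry type */
    hash_ctl.hcxt = ApplyContext;                   /* approximate context */

    /* HASH_BLOBS is the missing flag: keys are hashed/compared as bytes */
    xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
                          HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);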

Вложения

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

От
Amit Kapila
Дата:
On Wed, Dec 9, 2020 at 2:56 PM Noah Misch <noah@leadboat.com> wrote:
>
> Further testing showed it was a file location problem, not a deletion problem.
> The worker tried to open
> base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset/16393-510.changes.0, but these
> were the files actually existing:
>
> [nm@power-aix 0:2 2020-12-08T13:56:35 64gcc 0]$ ls -la $(find src/test/subscription/tmp_check -name '*sharedfileset*')

> src/test/subscription/tmp_check/t_015_stream_subscriber_data/pgdata/base/pgsql_tmp/pgsql_tmp9896408.0.sharedfileset:
> total 408
> drwx------    2 nm       usr             256 Dec 08 03:20 .
> drwx------    4 nm       usr             256 Dec 08 03:20 ..
> -rw-------    1 nm       usr          207806 Dec 08 03:20 16393-510.changes.0
>
> src/test/subscription/tmp_check/t_015_stream_subscriber_data/pgdata/base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset:
> total 0
> drwx------    2 nm       usr             256 Dec 08 03:20 .
> drwx------    4 nm       usr             256 Dec 08 03:20 ..
> -rw-------    1 nm       usr               0 Dec 08 03:20 16393-511.changes.0
>
> > > I have executed "make check" in the loop with only this file.  I have
> > > repeated it 5000 times but no failure, I am wondering shall we try to
> > > execute in the same machine in a loop where it failed once?
> >
> > Yes, that might help. Noah, would it be possible for you to try that
>
> The problem is xidhash using strcmp() to compare keys; it needs memcmp().  For
> this to matter, xidhash must contain more than one element.  Existing tests
> rarely exercise the multi-element scenario.  Under heavy load, on this system,
> the test publisher can have two active transactions at once, in which case it
> does exercise multi-element xidhash.  (The publisher is sensitive to timing,
> but the subscriber is not; once WAL contains interleaved records of two XIDs,
> the subscriber fails every time.)  This would be much harder to reproduce on a
> little-endian system, where strcmp(&xid, &xid_plus_one)!=0.  On big-endian,
> every small XID has zero in the first octet; they all look like empty strings.
>

Your analysis is correct.

> The attached patch has the one-line fix and some test suite changes that make
> this reproduce frequently on any big-endian system.  I'm currently planning to
> drop the test suite changes from the commit, but I could keep them if folks
> like them.  (They'd need more comments and timeout handling.)
>

I think it is better to keep this test, as it always exercises
multiple streams on the subscriber.

Thanks for working on this.

-- 
With Regards,
Amit Kapila.



Amit Kapila <amit.kapila16@gmail.com> writes:
> On Wed, Dec 9, 2020 at 2:56 PM Noah Misch <noah@leadboat.com> wrote:
>> The problem is xidhash using strcmp() to compare keys; it needs memcmp().

> Your analysis is correct.

Sorry for not having noticed this thread before.  Noah's fix is
clearly correct, and I have no objection to the added test case.
But what jumps out at me here is that this sort of error seems way
too easy to make, and evidently way too hard to detect.  What can we
do to make it more obvious if one has incorrectly used or omitted
HASH_BLOBS?  Both directions of error might easily escape notice on
little-endian hardware.

I thought of a few ideas, all of which have drawbacks:

1. Invert the sense of the flag, ie HASH_BLOBS becomes the default.
This seems to just move the problem somewhere else, besides which
it'd require touching an awful lot of callers, and would silently
break third-party callers.

2. Don't allow a default: invent a new HASH_STRING flag, and
require that hash_create() calls specify exactly one of HASH_BLOBS,
HASH_STRING, or HASH_FUNCTION.  This doesn't completely fix the
hazard of mindless-copy-and-paste, but I think it might make it
a little more obvious.  Still requires touching a lot of calls.

3. Add some sort of heuristic restriction on keysize.  A keysize
that's only 4 or 8 bytes almost certainly is not a string.
This doesn't give us much traction for larger keysizes, though.

4. Disallow empty string keys, ie something like "Assert(s_len > 0)"
in string_hash().  I think we could get away with that given that
SQL disallows empty identifiers.  However, it would only help to
catch one direction of error (omitting HASH_BLOBS), and it would
only help on big-endian hardware, which is getting harder to find.
Still, we could hope that the buildfarm would detect errors.

There might be some more options.  Also, some of these ideas
could be applied in combination.

A quick count of grep hits suggest that the large majority of
existing hash_create() calls use HASH_BLOBS, and there might be
only order-of-ten calls that would need to be touched if we
required an explicit HASH_STRING flag.  So option #2 is seeming
kind of attractive.  Maybe that together with an assertion that
string keys have to exceed 8 or 16 bytes would be enough protection.
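
For illustration, the calling convention option #2 implies would look
roughly like this (sketch only; the entry structs are hypothetical):

    HASHCTL     ctl;
    HTAB       *blob_tab;
    HTAB       *string_tab;

    /* Fixed-size binary key (e.g. an Oid): must say HASH_BLOBS. */
    memset(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(Oid);
    ctl.entrysize = sizeof(MyBlobEntry);        /* hypothetical */
    blob_tab = hash_create("blob-keyed table", 128, &ctl,
                           HASH_ELEM | HASH_BLOBS);

    /* NUL-terminated string key: would now have to say HASH_STRINGS explicitly. */
    memset(&ctl, 0, sizeof(ctl));
    ctl.keysize = NAMEDATALEN;
    ctl.entrysize = sizeof(MyStringEntry);      /* hypothetical */
    string_tab = hash_create("string-keyed table", 128, &ctl,
                             HASH_ELEM | HASH_STRINGS);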

Also, this census now suggests to me that the opposite problem
(copy-and-paste HASH_BLOBS when you meant string keys) might be
a real hazard, since so many of the existing prototypes that you
might copy have HASH_BLOBS.  I'm not sure if there's much to be
done for this case though.  A small saving grace is that it seems
relatively likely that you'd notice a functional problem pretty
quickly with this type of mistake, since lookups would tend to
fail due to trailing garbage after your lookup string.

A different angle we could think about is that the name "HASH_BLOBS"
is kind of un-obvious.  Maybe we should deprecate that spelling in
favor of something like "HASH_BINARY".

Thoughts?

            regards, tom lane



On Sun, Dec 13, 2020 at 11:49:31AM -0500, Tom Lane wrote:
> But what jumps out at me here is that this sort of error seems way
> too easy to make, and evidently way too hard to detect.  What can we
> do to make it more obvious if one has incorrectly used or omitted
> HASH_BLOBS?  Both directions of error might easily escape notice on
> little-endian hardware.
> 
> I thought of a few ideas, all of which have drawbacks:
> 
> 1. Invert the sense of the flag, ie HASH_BLOBS becomes the default.
> This seems to just move the problem somewhere else, besides which
> it'd require touching an awful lot of callers, and would silently
> break third-party callers.
> 
> 2. Don't allow a default: invent a new HASH_STRING flag, and
> require that hash_create() calls specify exactly one of HASH_BLOBS,
> HASH_STRING, or HASH_FUNCTION.  This doesn't completely fix the
> hazard of mindless-copy-and-paste, but I think it might make it
> a little more obvious.  Still requires touching a lot of calls.

I like (2), for making the bug harder and for greppability.  Probably
pluralize it to HASH_STRINGS, for the parallel with HASH_BLOBS.

> 3. Add some sort of heuristic restriction on keysize.  A keysize
> that's only 4 or 8 bytes almost certainly is not a string.
> This doesn't give us much traction for larger keysizes, though.
> 
> 4. Disallow empty string keys, ie something like "Assert(s_len > 0)"
> in string_hash().  I think we could get away with that given that
> SQL disallows empty identifiers.  However, it would only help to
> catch one direction of error (omitting HASH_BLOBS), and it would
> only help on big-endian hardware, which is getting harder to find.
> Still, we could hope that the buildfarm would detect errors.

It's nontrivial to confirm that the empty-string key can't happen for a given
hash table.  (In contrast, what (3) asserts on is usually a compile-time
constant.)  I would stop short of adding (4), though it could be okay.

> A quick count of grep hits suggest that the large majority of
> existing hash_create() calls use HASH_BLOBS, and there might be
> only order-of-ten calls that would need to be touched if we
> required an explicit HASH_STRING flag.  So option #2 is seeming
> kind of attractive.  Maybe that together with an assertion that
> string keys have to exceed 8 or 16 bytes would be enough protection.

Agreed.  I expect (2) gives most of the benefit.  Requiring 8-byte capacity
should be harmless, and most architectures can zero 8 bytes in one
instruction.  Requiring more bytes trades specificity for sensitivity.

> A different angle we could think about is that the name "HASH_BLOBS"
> is kind of un-obvious.  Maybe we should deprecate that spelling in
> favor of something like "HASH_BINARY".

With (2) in place, I wouldn't worry about renaming HASH_BLOBS.  It's hard to
confuse with HASH_STRINGS or HASH_FUNCTION.  If anything, HASH_BLOBS conveys
something more specific.  HASH_FUNCTION cases see binary data, but that data
has structure that promotes it out of "blob" status.



On 2020-12-13 17:49, Tom Lane wrote:
> 2. Don't allow a default: invent a new HASH_STRING flag, and
> require that hash_create() calls specify exactly one of HASH_BLOBS,
> HASH_STRING, or HASH_FUNCTION.  This doesn't completely fix the
> hazard of mindless-copy-and-paste, but I think it might make it
> a little more obvious.  Still requires touching a lot of calls.

I think this sounds best; we should also expand the documentation of
these flags a bit.

-- 
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/



On Mon, Dec 14, 2020 at 1:36 AM Noah Misch <noah@leadboat.com> wrote:
>
> On Sun, Dec 13, 2020 at 11:49:31AM -0500, Tom Lane wrote:
> > But what jumps out at me here is that this sort of error seems way
> > too easy to make, and evidently way too hard to detect.  What can we
> > do to make it more obvious if one has incorrectly used or omitted
> > HASH_BLOBS?  Both directions of error might easily escape notice on
> > little-endian hardware.
> >
> > I thought of a few ideas, all of which have drawbacks:
> >
> > 1. Invert the sense of the flag, ie HASH_BLOBS becomes the default.
> > This seems to just move the problem somewhere else, besides which
> > it'd require touching an awful lot of callers, and would silently
> > break third-party callers.
> >
> > 2. Don't allow a default: invent a new HASH_STRING flag, and
> > require that hash_create() calls specify exactly one of HASH_BLOBS,
> > HASH_STRING, or HASH_FUNCTION.  This doesn't completely fix the
> > hazard of mindless-copy-and-paste, but I think it might make it
> > a little more obvious.  Still requires touching a lot of calls.
>
> I like (2), for making the bug harder and for greppability.  Probably
> pluralize it to HASH_STRINGS, for the parallel with HASH_BLOBS.
>
> > 3. Add some sort of heuristic restriction on keysize.  A keysize
> > that's only 4 or 8 bytes almost certainly is not a string.
> > This doesn't give us much traction for larger keysizes, though.
> >
> > 4. Disallow empty string keys, ie something like "Assert(s_len > 0)"
> > in string_hash().  I think we could get away with that given that
> > SQL disallows empty identifiers.  However, it would only help to
> > catch one direction of error (omitting HASH_BLOBS), and it would
> > only help on big-endian hardware, which is getting harder to find.
> > Still, we could hope that the buildfarm would detect errors.
>
> It's nontrivial to confirm that the empty-string key can't happen for a given
> hash table.  (In contrast, what (3) asserts on is usually a compile-time
> constant.)  I would stop short of adding (4), though it could be okay.
>
> > A quick count of grep hits suggest that the large majority of
> > existing hash_create() calls use HASH_BLOBS, and there might be
> > only order-of-ten calls that would need to be touched if we
> > required an explicit HASH_STRING flag.  So option #2 is seeming
> > kind of attractive.  Maybe that together with an assertion that
> > string keys have to exceed 8 or 16 bytes would be enough protection.
>
> Agreed.  I expect (2) gives most of the benefit.  Requiring 8-byte capacity
> should be harmless, and most architectures can zero 8 bytes in one
> instruction.  Requiring more bytes trades specificity for sensitivity.
>

+1. I also think that in most cases (2) would be sufficient to avoid
such bugs. Adding a restriction on string size might annoy some
out-of-core users who are already using small strings. However, an
8-byte restriction on string size would still be okay.

-- 
With Regards,
Amit Kapila.



Noah Misch <noah@leadboat.com> writes:
> On Sun, Dec 13, 2020 at 11:49:31AM -0500, Tom Lane wrote:
>> A quick count of grep hits suggest that the large majority of
>> existing hash_create() calls use HASH_BLOBS, and there might be
>> only order-of-ten calls that would need to be touched if we
>> required an explicit HASH_STRING flag.  So option #2 is seeming
>> kind of attractive.  Maybe that together with an assertion that
>> string keys have to exceed 8 or 16 bytes would be enough protection.

> Agreed.  I expect (2) gives most of the benefit.  Requiring 8-byte capacity
> should be harmless, and most architectures can zero 8 bytes in one
> instruction.  Requiring more bytes trades specificity for sensitivity.

Attached is a proposed patch that requires HASH_STRINGS to be stated
explicitly (in the event, there are 13 callers needing that) and insists
on keysize > 8 for string keys.  In examining the now-easily-visible uses
of string keys, almost all of them are using NAMEDATALEN-sized keys, or
in a few places larger values.  Only two are smaller:

1. ShmemIndex uses SHMEM_INDEX_KEYSIZE, which is only set to 48.

2. ResetUnloggedRelationsInDbspaceDir is using OIDCHARS + 1, because
it stores relfilenode OIDs as strings.  That seems pretty damfool
to me, so I'm inclined to change it to store binary OIDs instead;
those'd be a third the size (or probably a quarter the size after
alignment padding) and likely faster to hash or compare.  But I
didn't do that here, since it's still more than 8.  (I did whack
it upside the head to the extent of not storing its temporary
hash table in CacheMemoryContext.)

So it seems to me that insisting on keysize > 8 is fine.

There are a couple of other API oddities that maybe we should think
about while we're here:

* Should we just have a blanket insistence that all callers supply
HASH_ELEM?  The default sizes that dynahash.c uses without that are
undocumented and basically useless.  We're already asserting that
in the HASH_BLOBS path, which is the majority use-case, and this
patch now asserts it for HASH_STRINGS too.

* The coding convention that the HASHCTL argument struct should be
pre-zeroed seems to have been ignored at a lot of call sites.
I added a memset call to a couple of callers that I was touching
in this patch, but I'm having second thoughts about that.  Maybe
we should just rip out all those memsets as pointless, since there's
basically no case where you'd use the memset to fill a field that
you meant to pass as zero.  The fact that hash_create() doesn't
read fields it's not told to by a flag means we should not need
the memsets to avoid uninitialized-memory reads.

            regards, tom lane

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 2dc9e44ae6..8b17fb06eb 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -2604,10 +2604,12 @@ createConnHash(void)
 {
     HASHCTL        ctl;

+    memset(&ctl, 0, sizeof(ctl));
     ctl.keysize = NAMEDATALEN;
     ctl.entrysize = sizeof(remoteConnHashEnt);

-    return hash_create("Remote Con hash", NUMCONN, &ctl, HASH_ELEM);
+    return hash_create("Remote Con hash", NUMCONN, &ctl,
+                       HASH_ELEM | HASH_STRINGS);
 }

 static void
diff --git a/contrib/tablefunc/tablefunc.c b/contrib/tablefunc/tablefunc.c
index 85986ec24a..ec7819ca77 100644
--- a/contrib/tablefunc/tablefunc.c
+++ b/contrib/tablefunc/tablefunc.c
@@ -726,7 +726,7 @@ load_categories_hash(char *cats_sql, MemoryContext per_query_ctx)
     crosstab_hash = hash_create("crosstab hash",
                                 INIT_CATS,
                                 &ctl,
-                                HASH_ELEM | HASH_CONTEXT);
+                                HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);

     /* Connect to SPI manager */
     if ((ret = SPI_connect()) < 0)
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 4b18be5b27..5ba7c2eb3c 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -414,7 +414,7 @@ InitQueryHashTable(void)
     prepared_queries = hash_create("Prepared Queries",
                                    32,
                                    &hash_ctl,
-                                   HASH_ELEM);
+                                   HASH_ELEM | HASH_STRINGS);
 }

 /*
diff --git a/src/backend/nodes/extensible.c b/src/backend/nodes/extensible.c
index ab04459c55..2fe89fd361 100644
--- a/src/backend/nodes/extensible.c
+++ b/src/backend/nodes/extensible.c
@@ -51,7 +51,8 @@ RegisterExtensibleNodeEntry(HTAB **p_htable, const char *htable_label,
         ctl.keysize = EXTNODENAME_MAX_LEN;
         ctl.entrysize = sizeof(ExtensibleNodeEntry);

-        *p_htable = hash_create(htable_label, 100, &ctl, HASH_ELEM);
+        *p_htable = hash_create(htable_label, 100, &ctl,
+                                HASH_ELEM | HASH_STRINGS);
     }

     if (strlen(extnodename) >= EXTNODENAME_MAX_LEN)
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 0c2094f766..f21ab67ae4 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -175,7 +175,9 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
         memset(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(unlogged_relation_entry);
         ctl.entrysize = sizeof(unlogged_relation_entry);
-        hash = hash_create("unlogged hash", 32, &ctl, HASH_ELEM);
+        ctl.hcxt = CurrentMemoryContext;
+        hash = hash_create("unlogged hash", 32, &ctl,
+                           HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);

         /* Scan the directory. */
         dbspace_dir = AllocateDir(dbspacedirname);
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 97716f6aef..0afd87e075 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -292,7 +292,6 @@ void
 InitShmemIndex(void)
 {
     HASHCTL        info;
-    int            hash_flags;

     /*
      * Create the shared memory shmem index.
@@ -302,13 +301,14 @@ InitShmemIndex(void)
      * initializing the ShmemIndex itself.  The special "ShmemIndex" hash
      * table name will tell ShmemInitStruct to fake it.
      */
+    memset(&info, 0, sizeof(info));
     info.keysize = SHMEM_INDEX_KEYSIZE;
     info.entrysize = sizeof(ShmemIndexEnt);
-    hash_flags = HASH_ELEM;

     ShmemIndex = ShmemInitHash("ShmemIndex",
                                SHMEM_INDEX_SIZE, SHMEM_INDEX_SIZE,
-                               &info, hash_flags);
+                               &info,
+                               HASH_ELEM | HASH_STRINGS);
 }

 /*
@@ -329,6 +329,10 @@ InitShmemIndex(void)
  * whose maximum size is certain, this should be equal to max_size; that
  * ensures that no run-time out-of-shared-memory failures can occur.
  *
+ * *infoP and hash_flags should specify at least the entry sizes and key
+ * comparison semantics (see hash_create()).  Flag bits specific to
+ * shared-memory hash tables are added here.
+ *
  * Note: before Postgres 9.0, this function returned NULL for some failure
  * cases.  Now, it always throws error instead, so callers need not check
  * for NULL.
diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c
index 12557ce3af..be0a45b55e 100644
--- a/src/backend/utils/adt/jsonfuncs.c
+++ b/src/backend/utils/adt/jsonfuncs.c
@@ -3446,7 +3446,7 @@ get_json_object_as_hash(char *json, int len, const char *funcname)
     tab = hash_create("json object hashtable",
                       100,
                       &ctl,
-                      HASH_ELEM | HASH_CONTEXT);
+                      HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);

     state = palloc0(sizeof(JHashState));
     sem = palloc0(sizeof(JsonSemAction));
@@ -3838,7 +3838,7 @@ populate_recordset_object_start(void *state)
     _state->json_hash = hash_create("json object hashtable",
                                     100,
                                     &ctl,
-                                    HASH_ELEM | HASH_CONTEXT);
+                                    HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
 }

 static void
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index ad582f99a5..87a3154c1a 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -3471,7 +3471,7 @@ set_rtable_names(deparse_namespace *dpns, List *parent_namespaces,
     names_hash = hash_create("set_rtable_names names",
                              list_length(dpns->rtable),
                              &hash_ctl,
-                             HASH_ELEM | HASH_CONTEXT);
+                             HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
     /* Preload the hash table with names appearing in parent_namespaces */
     foreach(lc, parent_namespaces)
     {
diff --git a/src/backend/utils/fmgr/dfmgr.c b/src/backend/utils/fmgr/dfmgr.c
index bd779fdaf7..e83e30defe 100644
--- a/src/backend/utils/fmgr/dfmgr.c
+++ b/src/backend/utils/fmgr/dfmgr.c
@@ -686,7 +686,7 @@ find_rendezvous_variable(const char *varName)
         rendezvousHash = hash_create("Rendezvous variable hash",
                                      16,
                                      &ctl,
-                                     HASH_ELEM);
+                                     HASH_ELEM | HASH_STRINGS);
     }

     /* Find or create the hashtable entry for this varName */
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index d14d875c93..07cae638df 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -30,11 +30,12 @@
  * dynahash.c provides support for these types of lookup keys:
  *
  * 1. Null-terminated C strings (truncated if necessary to fit in keysize),
- * compared as though by strcmp().  This is the default behavior.
+ * compared as though by strcmp().  This is selected by specifying the
+ * HASH_STRINGS flag to hash_create.
  *
  * 2. Arbitrary binary data of size keysize, compared as though by memcmp().
  * (Caller must ensure there are no undefined padding bits in the keys!)
- * This is selected by specifying HASH_BLOBS flag to hash_create.
+ * This is selected by specifying the HASH_BLOBS flag to hash_create.
  *
  * 3. More complex key behavior can be selected by specifying user-supplied
  * hashing, comparison, and/or key-copying functions.  At least a hashing
@@ -47,8 +48,8 @@
  *   locks.
  * - Shared memory hashes are allocated in a fixed size area at startup and
  *   are discoverable by name from other processes.
- * - Because entries don't need to be moved in the case of hash conflicts, has
- *   better performance for large entries
+ * - Because entries don't need to be moved in the case of hash conflicts,
+ *   dynahash has better performance for large entries.
  * - Guarantees stable pointers to entries.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
@@ -316,6 +317,12 @@ string_compare(const char *key1, const char *key2, Size keysize)
  *    *info: additional table parameters, as indicated by flags
  *    flags: bitmask indicating which parameters to take from *info
  *
+ * The flags value must include exactly one of HASH_STRINGS, HASH_BLOBS,
+ * or HASH_FUNCTION, to define the key hashing semantics (C strings,
+ * binary blobs, or custom, respectively).  Callers specifying a custom
+ * hash function will likely also want to use HASH_COMPARE, and perhaps
+ * also HASH_KEYCOPY, to control key comparison and copying.
+ *
  * Note: for a shared-memory hashtable, nelem needs to be a pretty good
  * estimate, since we can't expand the table on the fly.  But an unshared
  * hashtable can be expanded on-the-fly, so it's better for nelem to be
@@ -370,9 +377,13 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
      * Select the appropriate hash function (see comments at head of file).
      */
     if (flags & HASH_FUNCTION)
+    {
+        Assert(!(flags & (HASH_BLOBS | HASH_STRINGS)));
         hashp->hash = info->hash;
+    }
     else if (flags & HASH_BLOBS)
     {
+        Assert(!(flags & HASH_STRINGS));
         /* We can optimize hashing for common key sizes */
         Assert(flags & HASH_ELEM);
         if (info->keysize == sizeof(uint32))
@@ -381,17 +392,30 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
             hashp->hash = tag_hash;
     }
     else
-        hashp->hash = string_hash;    /* default hash function */
+    {
+        /*
+         * string_hash used to be considered the default hash method, and in a
+         * non-assert build it effectively still is.  But we now consider it
+         * an assertion error to not say HASH_STRINGS explicitly.  To help
+         * catch mistaken usage of HASH_STRINGS, we also insist on a
+         * reasonably long string length: if the keysize is only 4 or 8 bytes,
+         * it's almost certainly an integer or pointer not a string.
+         */
+        Assert(flags & HASH_STRINGS);
+        Assert(flags & HASH_ELEM);
+        Assert(info->keysize > 8);
+
+        hashp->hash = string_hash;
+    }

     /*
      * If you don't specify a match function, it defaults to string_compare if
-     * you used string_hash (either explicitly or by default) and to memcmp
-     * otherwise.
+     * you used string_hash, and to memcmp otherwise.
      *
      * Note: explicitly specifying string_hash is deprecated, because this
      * might not work for callers in loadable modules on some platforms due to
      * referencing a trampoline instead of the string_hash function proper.
-     * Just let it default, eh?
+     * Specify HASH_STRINGS instead.
      */
     if (flags & HASH_COMPARE)
         hashp->match = info->match;
diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index ec6f80ee99..a382c4219b 100644
--- a/src/backend/utils/mmgr/portalmem.c
+++ b/src/backend/utils/mmgr/portalmem.c
@@ -111,6 +111,7 @@ EnablePortalManager(void)
                                              "TopPortalContext",
                                              ALLOCSET_DEFAULT_SIZES);

+    memset(&ctl, 0, sizeof(ctl));
     ctl.keysize = MAX_PORTALNAME_LEN;
     ctl.entrysize = sizeof(PortalHashEnt);

@@ -119,7 +120,7 @@ EnablePortalManager(void)
      * create, initially
      */
     PortalHashTable = hash_create("Portal hash", PORTALS_PER_USER,
-                                  &ctl, HASH_ELEM);
+                                  &ctl, HASH_ELEM | HASH_STRINGS);
 }

 /*
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index bebf89b3c4..666ad33567 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -82,7 +82,8 @@ typedef struct HASHCTL
 #define HASH_PARTITION    0x0001    /* Hashtable is used w/partitioned locking */
 #define HASH_SEGMENT    0x0002    /* Set segment size */
 #define HASH_DIRSIZE    0x0004    /* Set directory size (initial and max) */
-#define HASH_ELEM        0x0010    /* Set keysize and entrysize */
+#define HASH_ELEM        0x0008    /* Set keysize and entrysize */
+#define HASH_STRINGS    0x0010    /* Select support functions for string keys */
 #define HASH_BLOBS        0x0020    /* Select support functions for binary keys */
 #define HASH_FUNCTION    0x0040    /* Set user defined hash function */
 #define HASH_COMPARE    0x0080    /* Set user defined comparison function */
@@ -119,7 +120,8 @@ typedef struct
  *
  * Note: It is deprecated for callers of hash_create to explicitly specify
  * string_hash, tag_hash, uint32_hash, or oid_hash.  Just set HASH_BLOBS or
- * not.  Use HASH_FUNCTION only when you want something other than those.
+ * HASH_STRINGS.  Use HASH_FUNCTION only when you want something other than
+ * one of these.
  */
 extern HTAB *hash_create(const char *tabname, long nelem,
                          HASHCTL *info, int flags);
diff --git a/src/pl/plperl/plperl.c b/src/pl/plperl/plperl.c
index 4de756455d..60f5d66264 100644
--- a/src/pl/plperl/plperl.c
+++ b/src/pl/plperl/plperl.c
@@ -586,7 +586,7 @@ select_perl_context(bool trusted)
         interp_desc->query_hash = hash_create("PL/Perl queries",
                                               32,
                                               &hash_ctl,
-                                              HASH_ELEM);
+                                              HASH_ELEM | HASH_STRINGS);
     }

     /*
diff --git a/src/timezone/pgtz.c b/src/timezone/pgtz.c
index 3f0fb51e91..5240cab022 100644
--- a/src/timezone/pgtz.c
+++ b/src/timezone/pgtz.c
@@ -211,7 +211,7 @@ init_timezone_hashtable(void)
     timezone_cache = hash_create("Timezones",
                                  4,
                                  &hash_ctl,
-                                 HASH_ELEM);
+                                 HASH_ELEM | HASH_STRINGS);
     if (!timezone_cache)
         return false;


I wrote:
> There are a couple of other API oddities that maybe we should think
> about while we're here:

> * Should we just have a blanket insistence that all callers supply
> HASH_ELEM?  The default sizes that dynahash.c uses without that are
> undocumented and basically useless.  We're already asserting that
> in the HASH_BLOBS path, which is the majority use-case, and this
> patch now asserts it for HASH_STRINGS too.

Here's a follow-up patch for that part, which also tries to respond
a bit to Heikki's complaint about skimpy documentation.  While at it,
I const-ified the HASHCTL argument, since there's no need for
hash_create to modify that.

            regards, tom lane

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 07cae638df..49f21b77bb 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -317,11 +317,20 @@ string_compare(const char *key1, const char *key2, Size keysize)
  *    *info: additional table parameters, as indicated by flags
  *    flags: bitmask indicating which parameters to take from *info
  *
- * The flags value must include exactly one of HASH_STRINGS, HASH_BLOBS,
+ * The flags value *must* include HASH_ELEM.  (Formerly, this was nominally
+ * optional, but the default keysize and entrysize values were useless.)
+ * The flags value must also include exactly one of HASH_STRINGS, HASH_BLOBS,
  * or HASH_FUNCTION, to define the key hashing semantics (C strings,
  * binary blobs, or custom, respectively).  Callers specifying a custom
  * hash function will likely also want to use HASH_COMPARE, and perhaps
  * also HASH_KEYCOPY, to control key comparison and copying.
+ * Another often-used flag is HASH_CONTEXT, to allocate the hash table
+ * under info->hcxt rather than under TopMemoryContext; the default
+ * behavior is only suitable for session-lifespan hash tables.
+ * Other flags bits are special-purpose and seldom used.
+ *
+ * Fields in *info are read only when the associated flags bit is set.
+ * It is not necessary to initialize other fields of *info.
  *
  * Note: for a shared-memory hashtable, nelem needs to be a pretty good
  * estimate, since we can't expand the table on the fly.  But an unshared
@@ -330,11 +339,19 @@ string_compare(const char *key1, const char *key2, Size keysize)
  * large nelem will penalize hash_seq_search speed without buying much.
  */
 HTAB *
-hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
+hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
     HTAB       *hashp;
     HASHHDR    *hctl;

+    /*
+     * Hash tables now allocate space for key and data, but you have to say
+     * how much space to allocate.
+     */
+    Assert(flags & HASH_ELEM);
+    Assert(info->keysize > 0);
+    Assert(info->entrysize >= info->keysize);
+
     /*
      * For shared hash tables, we have a local hash header (HTAB struct) that
      * we allocate in TopMemoryContext; all else is in shared memory.
@@ -385,7 +402,6 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
     {
         Assert(!(flags & HASH_STRINGS));
         /* We can optimize hashing for common key sizes */
-        Assert(flags & HASH_ELEM);
         if (info->keysize == sizeof(uint32))
             hashp->hash = uint32_hash;
         else
@@ -402,7 +418,6 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
          * it's almost certainly an integer or pointer not a string.
          */
         Assert(flags & HASH_STRINGS);
-        Assert(flags & HASH_ELEM);
         Assert(info->keysize > 8);

         hashp->hash = string_hash;
@@ -529,16 +544,9 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
         hctl->dsize = info->dsize;
     }

-    /*
-     * hash table now allocates space for key and data but you have to say how
-     * much space to allocate
-     */
-    if (flags & HASH_ELEM)
-    {
-        Assert(info->entrysize >= info->keysize);
-        hctl->keysize = info->keysize;
-        hctl->entrysize = info->entrysize;
-    }
+    /* remember the entry sizes, too */
+    hctl->keysize = info->keysize;
+    hctl->entrysize = info->entrysize;

     /* make local copies of heavily-used constant fields */
     hashp->keysize = hctl->keysize;
@@ -617,10 +625,6 @@ hdefault(HTAB *hashp)
     hctl->dsize = DEF_DIRSIZE;
     hctl->nsegs = 0;

-    /* rather pointless defaults for key & entry size */
-    hctl->keysize = sizeof(char *);
-    hctl->entrysize = 2 * sizeof(char *);
-
     hctl->num_partitions = 0;    /* not partitioned */

     /* table has no fixed maximum size */
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 666ad33567..c3daaae92b 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -124,7 +124,7 @@ typedef struct
  * one of these.
  */
 extern HTAB *hash_create(const char *tabname, long nelem,
-                         HASHCTL *info, int flags);
+                         const HASHCTL *info, int flags);
 extern void hash_destroy(HTAB *hashp);
 extern void hash_stats(const char *where, HTAB *hashp);
 extern void *hash_search(HTAB *hashp, const void *keyPtr, HASHACTION action,

Here's a rolled-up patch that does some further documentation work
and gets rid of the unnecessary memsets as well.
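A hedged sketch of what the converted call sites roughly look like after this change (the entry type and names are invented, not taken from the patch): no MemSet of the HASHCTL, only the flag-selected fields filled in, string keys saying HASH_STRINGS explicitly, and HASH_CONTEXT when the table should live in a specific memory context.

#include "postgres.h"
#include "utils/hsearch.h"

typedef struct DemoNameEntry
{
	char		name[NAMEDATALEN];	/* hash key; must be first */
	int			refcount;
} DemoNameEntry;

static HTAB *
create_demo_name_hash(MemoryContext cxt)
{
	HASHCTL		ctl;

	/* no memset needed: only flag-selected fields are read */
	ctl.keysize = NAMEDATALEN;
	ctl.entrysize = sizeof(DemoNameEntry);
	ctl.hcxt = cxt;

	return hash_create("demo name hash", 32, &ctl,
					   HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
}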

            regards, tom lane

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 2dc9e44ae6..651227f510 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -2607,7 +2607,8 @@ createConnHash(void)
     ctl.keysize = NAMEDATALEN;
     ctl.entrysize = sizeof(remoteConnHashEnt);

-    return hash_create("Remote Con hash", NUMCONN, &ctl, HASH_ELEM);
+    return hash_create("Remote Con hash", NUMCONN, &ctl,
+                       HASH_ELEM | HASH_STRINGS);
 }

 static void
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 70cfdb2c9d..2f00344b7f 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -567,7 +567,6 @@ pgss_shmem_startup(void)
         pgss->stats.dealloc = 0;
     }

-    memset(&info, 0, sizeof(info));
     info.keysize = sizeof(pgssHashKey);
     info.entrysize = sizeof(pgssEntry);
     pgss_hash = ShmemInitHash("pg_stat_statements hash",
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index ab3226287d..66581e5414 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -119,14 +119,11 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
     {
         HASHCTL        ctl;

-        MemSet(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(ConnCacheKey);
         ctl.entrysize = sizeof(ConnCacheEntry);
-        /* allocate ConnectionHash in the cache context */
-        ctl.hcxt = CacheMemoryContext;
         ConnectionHash = hash_create("postgres_fdw connections", 8,
                                      &ctl,
-                                     HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+                                     HASH_ELEM | HASH_BLOBS);

         /*
          * Register some callback functions that manage connection cleanup.
diff --git a/contrib/postgres_fdw/shippable.c b/contrib/postgres_fdw/shippable.c
index 3433c19712..b4766dc5ff 100644
--- a/contrib/postgres_fdw/shippable.c
+++ b/contrib/postgres_fdw/shippable.c
@@ -93,7 +93,6 @@ InitializeShippableCache(void)
     HASHCTL        ctl;

     /* Create the hash table. */
-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(ShippableCacheKey);
     ctl.entrysize = sizeof(ShippableCacheEntry);
     ShippableCacheHash =
diff --git a/contrib/tablefunc/tablefunc.c b/contrib/tablefunc/tablefunc.c
index 85986ec24a..e9a9741154 100644
--- a/contrib/tablefunc/tablefunc.c
+++ b/contrib/tablefunc/tablefunc.c
@@ -714,7 +714,6 @@ load_categories_hash(char *cats_sql, MemoryContext per_query_ctx)
     MemoryContext SPIcontext;

     /* initialize the category hash table */
-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = MAX_CATNAME_LEN;
     ctl.entrysize = sizeof(crosstab_HashEnt);
     ctl.hcxt = per_query_ctx;
@@ -726,7 +725,7 @@ load_categories_hash(char *cats_sql, MemoryContext per_query_ctx)
     crosstab_hash = hash_create("crosstab hash",
                                 INIT_CATS,
                                 &ctl,
-                                HASH_ELEM | HASH_CONTEXT);
+                                HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);

     /* Connect to SPI manager */
     if ((ret = SPI_connect()) < 0)
diff --git a/src/backend/access/gist/gistbuildbuffers.c b/src/backend/access/gist/gistbuildbuffers.c
index 4ad67c88b4..217c199a14 100644
--- a/src/backend/access/gist/gistbuildbuffers.c
+++ b/src/backend/access/gist/gistbuildbuffers.c
@@ -76,7 +76,6 @@ gistInitBuildBuffers(int pagesPerBuffer, int levelStep, int maxLevel)
      * nodeBuffersTab hash is association between index blocks and it's
      * buffers.
      */
-    memset(&hashCtl, 0, sizeof(hashCtl));
     hashCtl.keysize = sizeof(BlockNumber);
     hashCtl.entrysize = sizeof(GISTNodeBuffer);
     hashCtl.hcxt = CurrentMemoryContext;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index a664ecf494..c77a189907 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -1363,7 +1363,6 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
     bool        found;

     /* Initialize hash tables used to track TIDs */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(ItemPointerData);
     hash_ctl.entrysize = sizeof(ItemPointerData);
     hash_ctl.hcxt = CurrentMemoryContext;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 39e33763df..65942cc428 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -266,7 +266,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
     state->rs_cxt = rw_cxt;

     /* Initialize hash tables used to track update chains */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(TidHashKey);
     hash_ctl.entrysize = sizeof(UnresolvedTupData);
     hash_ctl.hcxt = state->rs_cxt;
@@ -824,7 +823,6 @@ logical_begin_heap_rewrite(RewriteState state)
     state->rs_begin_lsn = GetXLogInsertRecPtr();
     state->rs_num_rewrite_mappings = 0;

-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(TransactionId);
     hash_ctl.entrysize = sizeof(RewriteMappingFile);
     hash_ctl.hcxt = state->rs_cxt;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 32a3099c1f..e0ca3859a9 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -113,7 +113,6 @@ log_invalid_page(RelFileNode node, ForkNumber forkno, BlockNumber blkno,
         /* create hash table when first needed */
         HASHCTL        ctl;

-        memset(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(xl_invalid_page_key);
         ctl.entrysize = sizeof(xl_invalid_page);

diff --git a/src/backend/catalog/pg_enum.c b/src/backend/catalog/pg_enum.c
index 6a2c6685a0..f2e7bab62a 100644
--- a/src/backend/catalog/pg_enum.c
+++ b/src/backend/catalog/pg_enum.c
@@ -188,7 +188,6 @@ init_enum_blacklist(void)
 {
     HASHCTL        hash_ctl;

-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(Oid);
     hash_ctl.entrysize = sizeof(Oid);
     hash_ctl.hcxt = TopTransactionContext;
diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index 17f37eb39f..5c3c78a0e6 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -171,7 +171,6 @@ find_all_inheritors(Oid parentrelId, LOCKMODE lockmode, List **numparents)
                *rel_numparents;
     ListCell   *l;

-    memset(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(SeenRelsEntry);
     ctl.hcxt = CurrentMemoryContext;
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index c0763c63e2..e04afd9963 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2375,7 +2375,6 @@ AddEventToPendingNotifies(Notification *n)
         ListCell   *l;

         /* Create the hash table */
-        MemSet(&hash_ctl, 0, sizeof(hash_ctl));
         hash_ctl.keysize = sizeof(Notification *);
         hash_ctl.entrysize = sizeof(NotificationHash);
         hash_ctl.hash = notification_hash;
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 4b18be5b27..89087a7be3 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -406,15 +406,13 @@ InitQueryHashTable(void)
 {
     HASHCTL        hash_ctl;

-    MemSet(&hash_ctl, 0, sizeof(hash_ctl));
-
     hash_ctl.keysize = NAMEDATALEN;
     hash_ctl.entrysize = sizeof(PreparedStatement);

     prepared_queries = hash_create("Prepared Queries",
                                    32,
                                    &hash_ctl,
-                                   HASH_ELEM);
+                                   HASH_ELEM | HASH_STRINGS);
 }

 /*
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 632b34af61..fa2eea8af2 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -1087,7 +1087,6 @@ create_seq_hashtable(void)
 {
     HASHCTL        ctl;

-    memset(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(SeqTableData);

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 86594bd056..97bfc8bd71 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -521,7 +521,6 @@ ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
     HTAB       *htab;
     int            i;

-    memset(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(SubplanResultRelHashElem);
     ctl.hcxt = CurrentMemoryContext;
diff --git a/src/backend/nodes/extensible.c b/src/backend/nodes/extensible.c
index ab04459c55..3a6cfc44d3 100644
--- a/src/backend/nodes/extensible.c
+++ b/src/backend/nodes/extensible.c
@@ -47,11 +47,11 @@ RegisterExtensibleNodeEntry(HTAB **p_htable, const char *htable_label,
     {
         HASHCTL        ctl;

-        memset(&ctl, 0, sizeof(HASHCTL));
         ctl.keysize = EXTNODENAME_MAX_LEN;
         ctl.entrysize = sizeof(ExtensibleNodeEntry);

-        *p_htable = hash_create(htable_label, 100, &ctl, HASH_ELEM);
+        *p_htable = hash_create(htable_label, 100, &ctl,
+                                HASH_ELEM | HASH_STRINGS);
     }

     if (strlen(extnodename) >= EXTNODENAME_MAX_LEN)
diff --git a/src/backend/optimizer/util/predtest.c b/src/backend/optimizer/util/predtest.c
index 0edd873dca..d6e83e5f8e 100644
--- a/src/backend/optimizer/util/predtest.c
+++ b/src/backend/optimizer/util/predtest.c
@@ -1982,7 +1982,6 @@ lookup_proof_cache(Oid pred_op, Oid clause_op, bool refute_it)
         /* First time through: initialize the hash table */
         HASHCTL        ctl;

-        MemSet(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(OprProofCacheKey);
         ctl.entrysize = sizeof(OprProofCacheEntry);
         OprProofCacheHash = hash_create("Btree proof lookup cache", 256,
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 76245c1ff3..9c9a738c80 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -400,7 +400,6 @@ build_join_rel_hash(PlannerInfo *root)
     ListCell   *l;

     /* Create the hash table */
-    MemSet(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(Relids);
     hash_ctl.entrysize = sizeof(JoinHashEntry);
     hash_ctl.hash = bitmap_hash;
diff --git a/src/backend/parser/parse_oper.c b/src/backend/parser/parse_oper.c
index 6613a3a8f8..e72d3676f1 100644
--- a/src/backend/parser/parse_oper.c
+++ b/src/backend/parser/parse_oper.c
@@ -999,7 +999,6 @@ find_oper_cache_entry(OprCacheKey *key)
         /* First time through: initialize the hash table */
         HASHCTL        ctl;

-        MemSet(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(OprCacheKey);
         ctl.entrysize = sizeof(OprCacheEntry);
         OprCacheHash = hash_create("Operator lookup cache", 256,
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index 9a292290ed..5b0a15ac0b 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -286,13 +286,13 @@ CreatePartitionDirectory(MemoryContext mcxt)
     PartitionDirectory pdir;
     HASHCTL        ctl;

-    MemSet(&ctl, 0, sizeof(HASHCTL));
+    pdir = palloc(sizeof(PartitionDirectoryData));
+    pdir->pdir_mcxt = mcxt;
+
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(PartitionDirectoryEntry);
     ctl.hcxt = mcxt;

-    pdir = palloc(sizeof(PartitionDirectoryData));
-    pdir->pdir_mcxt = mcxt;
     pdir->pdir_hash = hash_create("partition directory", 256, &ctl,
                                   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 7e28944d2f..ed127a1032 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2043,7 +2043,6 @@ do_autovacuum(void)
     pg_class_desc = CreateTupleDescCopy(RelationGetDescr(classRel));

     /* create hash table for toast <-> main relid mapping */
-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(av_relation);

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 429c8010ef..a62c6d4d0a 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1161,7 +1161,6 @@ CompactCheckpointerRequestQueue(void)
     skip_slot = palloc0(sizeof(bool) * CheckpointerShmem->num_requests);

     /* Initialize temporary hash table */
-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(CheckpointerRequest);
     ctl.entrysize = sizeof(struct CheckpointerSlotMapping);
     ctl.hcxt = CurrentMemoryContext;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7c75a25d21..6b60f293e9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1265,7 +1265,6 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     HeapTuple    tup;
     Snapshot    snapshot;

-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(Oid);
     hash_ctl.entrysize = sizeof(Oid);
     hash_ctl.hcxt = CurrentMemoryContext;
@@ -1815,7 +1814,6 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         /* First time through - initialize function stat table */
         HASHCTL        hash_ctl;

-        memset(&hash_ctl, 0, sizeof(hash_ctl));
         hash_ctl.keysize = sizeof(Oid);
         hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
         pgStatFunctions = hash_create("Function stat entries",
@@ -1975,7 +1973,6 @@ get_tabstat_entry(Oid rel_id, bool isshared)
     {
         HASHCTL        ctl;

-        memset(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(Oid);
         ctl.entrysize = sizeof(TabStatHashEntry);

@@ -4994,7 +4991,6 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
     dbentry->stat_reset_timestamp = GetCurrentTimestamp();
     dbentry->stats_timestamp = 0;

-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(Oid);
     hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
     dbentry->tables = hash_create("Per-database table",
@@ -5423,7 +5419,6 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Create the DB hashtable
      */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(Oid);
     hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
     hash_ctl.hcxt = pgStatLocalContext;
@@ -5608,7 +5603,6 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                         break;
                 }

-                memset(&hash_ctl, 0, sizeof(hash_ctl));
                 hash_ctl.keysize = sizeof(Oid);
                 hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
                 hash_ctl.hcxt = pgStatLocalContext;
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 07aa52977f..f4dbbbe2dd 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -111,7 +111,6 @@ logicalrep_relmap_init(void)
                                   ALLOCSET_DEFAULT_SIZES);

     /* Initialize the relation hash table. */
-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(LogicalRepRelId);
     ctl.entrysize = sizeof(LogicalRepRelMapEntry);
     ctl.hcxt = LogicalRepRelMapContext;
@@ -120,7 +119,6 @@ logicalrep_relmap_init(void)
                                    HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

     /* Initialize the type hash table. */
-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(LogicalRepTyp);
     ctl.hcxt = LogicalRepRelMapContext;
@@ -606,7 +604,6 @@ logicalrep_partmap_init(void)
                                   ALLOCSET_DEFAULT_SIZES);

     /* Initialize the relation hash table. */
-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);    /* partition OID */
     ctl.entrysize = sizeof(LogicalRepPartMapEntry);
     ctl.hcxt = LogicalRepPartMapContext;
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 15dc51a94d..7359fa9df2 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1619,8 +1619,6 @@ ReorderBufferBuildTupleCidHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
     if (!rbtxn_has_catalog_changes(txn) || dlist_is_empty(&txn->tuplecids))
         return;

-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-
     hash_ctl.keysize = sizeof(ReorderBufferTupleCidKey);
     hash_ctl.entrysize = sizeof(ReorderBufferTupleCidEnt);
     hash_ctl.hcxt = rb->context;
@@ -4116,7 +4114,6 @@ ReorderBufferToastInitHash(ReorderBuffer *rb, ReorderBufferTXN *txn)

     Assert(txn->toast_hash == NULL);

-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(Oid);
     hash_ctl.entrysize = sizeof(ReorderBufferToastEnt);
     hash_ctl.hcxt = rb->context;
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 1904f3471c..6259606537 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -372,7 +372,6 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
     {
         HASHCTL        ctl;

-        memset(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(Oid);
         ctl.entrysize = sizeof(struct tablesync_start_time_mapping);
         last_start_times = hash_create("Logical replication table sync worker start times",
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 9c997aed83..49d25b02d7 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -867,22 +867,18 @@ static void
 init_rel_sync_cache(MemoryContext cachectx)
 {
     HASHCTL        ctl;
-    MemoryContext old_ctxt;

     if (RelationSyncCache != NULL)
         return;

     /* Make a new hash table for the cache */
-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(RelationSyncEntry);
     ctl.hcxt = cachectx;

-    old_ctxt = MemoryContextSwitchTo(cachectx);
     RelationSyncCache = hash_create("logical replication output relation cache",
                                     128, &ctl,
                                     HASH_ELEM | HASH_CONTEXT | HASH_BLOBS);
-    (void) MemoryContextSwitchTo(old_ctxt);

     Assert(RelationSyncCache != NULL);

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ad0d1a9abc..c5e8707151 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2505,7 +2505,6 @@ InitBufferPoolAccess(void)

     memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));

-    MemSet(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(int32);
     hash_ctl.entrysize = sizeof(PrivateRefCountEntry);

diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 6ffd7b3306..cd3475e9e1 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -465,7 +465,6 @@ InitLocalBuffers(void)
     }

     /* Create the lookup hash table */
-    MemSet(&info, 0, sizeof(info));
     info.keysize = sizeof(BufferTag);
     info.entrysize = sizeof(LocalBufferLookupEnt);

diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 0c2094f766..8700f7f19a 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -30,7 +30,7 @@ static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,

 typedef struct
 {
-    char        oid[OIDCHARS + 1];
+    Oid            reloid;            /* hash key */
 } unlogged_relation_entry;

 /*
@@ -172,10 +172,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
          * need to be reset.  Otherwise, this cleanup operation would be
          * O(n^2).
          */
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(unlogged_relation_entry);
+        ctl.keysize = sizeof(Oid);
         ctl.entrysize = sizeof(unlogged_relation_entry);
-        hash = hash_create("unlogged hash", 32, &ctl, HASH_ELEM);
+        ctl.hcxt = CurrentMemoryContext;
+        hash = hash_create("unlogged relation OIDs", 32, &ctl,
+                           HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

         /* Scan the directory. */
         dbspace_dir = AllocateDir(dbspacedirname);
@@ -198,9 +199,8 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
              * Put the OID portion of the name into the hash table, if it
              * isn't already.
              */
-            memset(ent.oid, 0, sizeof(ent.oid));
-            memcpy(ent.oid, de->d_name, oidchars);
-            hash_search(hash, &ent, HASH_ENTER, NULL);
+            ent.reloid = atooid(de->d_name);
+            (void) hash_search(hash, &ent, HASH_ENTER, NULL);
         }

         /* Done with the first pass. */
@@ -224,7 +224,6 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
         {
             ForkNumber    forkNum;
             int            oidchars;
-            bool        found;
             unlogged_relation_entry ent;

             /* Skip anything that doesn't look like a relation data file. */
@@ -238,14 +237,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)

             /*
              * See whether the OID portion of the name shows up in the hash
-             * table.
+             * table.  If so, nuke it!
              */
-            memset(ent.oid, 0, sizeof(ent.oid));
-            memcpy(ent.oid, de->d_name, oidchars);
-            hash_search(hash, &ent, HASH_FIND, &found);
-
-            /* If so, nuke it! */
-            if (found)
+            ent.reloid = atooid(de->d_name);
+            if (hash_search(hash, &ent, HASH_FIND, NULL))
             {
                 snprintf(rm_path, sizeof(rm_path), "%s/%s",
                          dbspacedirname, de->d_name);
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 97716f6aef..b0fc9f160d 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -292,7 +292,6 @@ void
 InitShmemIndex(void)
 {
     HASHCTL        info;
-    int            hash_flags;

     /*
      * Create the shared memory shmem index.
@@ -304,11 +303,11 @@ InitShmemIndex(void)
      */
     info.keysize = SHMEM_INDEX_KEYSIZE;
     info.entrysize = sizeof(ShmemIndexEnt);
-    hash_flags = HASH_ELEM;

     ShmemIndex = ShmemInitHash("ShmemIndex",
                                SHMEM_INDEX_SIZE, SHMEM_INDEX_SIZE,
-                               &info, hash_flags);
+                               &info,
+                               HASH_ELEM | HASH_STRINGS);
 }

 /*
@@ -329,6 +328,11 @@ InitShmemIndex(void)
  * whose maximum size is certain, this should be equal to max_size; that
  * ensures that no run-time out-of-shared-memory failures can occur.
  *
+ * *infoP and hash_flags should specify at least the entry sizes and key
+ * comparison semantics (see hash_create()).  Flag bits and values specific
+ * to shared-memory hash tables are added here, except that callers may
+ * choose to specify HASH_PARTITION and/or HASH_FIXED_SIZE.
+ *
  * Note: before Postgres 9.0, this function returned NULL for some failure
  * cases.  Now, it always throws error instead, so callers need not check
  * for NULL.
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 52b2809dac..4ea3cf1f5c 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -81,7 +81,6 @@ InitRecoveryTransactionEnvironment(void)
      * Initialize the hash table for tracking the list of locks held by each
      * transaction.
      */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(TransactionId);
     hash_ctl.entrysize = sizeof(RecoveryLockListsEntry);
     RecoveryLockLists = hash_create("RecoveryLockLists",
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index d86566f455..53472dd21e 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -419,7 +419,6 @@ InitLocks(void)
      * Allocate hash table for LOCK structs.  This stores per-locked-object
      * information.
      */
-    MemSet(&info, 0, sizeof(info));
     info.keysize = sizeof(LOCKTAG);
     info.entrysize = sizeof(LOCK);
     info.num_partitions = NUM_LOCK_PARTITIONS;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 108e652179..26bcce9735 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -342,7 +342,6 @@ init_lwlock_stats(void)
                                              ALLOCSET_DEFAULT_SIZES);
     MemoryContextAllowInCriticalSection(lwlock_stats_cxt, true);

-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(lwlock_stats_key);
     ctl.entrysize = sizeof(lwlock_stats);
     ctl.hcxt = lwlock_stats_cxt;
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 8a365b400c..e42e131543 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1096,7 +1096,6 @@ InitPredicateLocks(void)
      * Allocate hash table for PREDICATELOCKTARGET structs.  This stores
      * per-predicate-lock-target information.
      */
-    MemSet(&info, 0, sizeof(info));
     info.keysize = sizeof(PREDICATELOCKTARGETTAG);
     info.entrysize = sizeof(PREDICATELOCKTARGET);
     info.num_partitions = NUM_PREDICATELOCK_PARTITIONS;
@@ -1129,7 +1128,6 @@ InitPredicateLocks(void)
      * Allocate hash table for PREDICATELOCK structs.  This stores per
      * xact-lock-of-a-target information.
      */
-    MemSet(&info, 0, sizeof(info));
     info.keysize = sizeof(PREDICATELOCKTAG);
     info.entrysize = sizeof(PREDICATELOCK);
     info.hash = predicatelock_hash;
@@ -1212,7 +1210,6 @@ InitPredicateLocks(void)
      * Allocate hash table for SERIALIZABLEXID structs.  This stores per-xid
      * information for serializable transactions which have accessed data.
      */
-    MemSet(&info, 0, sizeof(info));
     info.keysize = sizeof(SERIALIZABLEXIDTAG);
     info.entrysize = sizeof(SERIALIZABLEXID);

@@ -1853,7 +1850,6 @@ CreateLocalPredicateLockHash(void)

     /* Initialize the backend-local hash table of parent locks */
     Assert(LocalPredicateLockHash == NULL);
-    MemSet(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(PREDICATELOCKTARGETTAG);
     hash_ctl.entrysize = sizeof(LOCALPREDICATELOCK);
     LocalPredicateLockHash = hash_create("Local predicate lock",
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dcc09df0c7..072bdd118f 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -154,7 +154,6 @@ smgropen(RelFileNode rnode, BackendId backend)
         /* First time through: initialize the hash table */
         HASHCTL        ctl;

-        MemSet(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(RelFileNodeBackend);
         ctl.entrysize = sizeof(SMgrRelationData);
         SMgrRelationHash = hash_create("smgr relation table", 400,
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 1d635d596c..a49588f6b9 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -150,7 +150,6 @@ InitSync(void)
                                               ALLOCSET_DEFAULT_SIZES);
         MemoryContextAllowInCriticalSection(pendingOpsCxt, true);

-        MemSet(&hash_ctl, 0, sizeof(hash_ctl));
         hash_ctl.keysize = sizeof(FileTag);
         hash_ctl.entrysize = sizeof(PendingFsyncEntry);
         hash_ctl.hcxt = pendingOpsCxt;
diff --git a/src/backend/tsearch/ts_typanalyze.c b/src/backend/tsearch/ts_typanalyze.c
index 2eed0cd137..19e9611a3a 100644
--- a/src/backend/tsearch/ts_typanalyze.c
+++ b/src/backend/tsearch/ts_typanalyze.c
@@ -180,7 +180,6 @@ compute_tsvector_stats(VacAttrStats *stats,
      * worry about overflowing the initial size. Also we don't need to pay any
      * attention to locking and memory management.
      */
-    MemSet(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(LexemeHashKey);
     hash_ctl.entrysize = sizeof(TrackItem);
     hash_ctl.hash = lexeme_hash;
diff --git a/src/backend/utils/adt/array_typanalyze.c b/src/backend/utils/adt/array_typanalyze.c
index 4912cabc61..cb2a834193 100644
--- a/src/backend/utils/adt/array_typanalyze.c
+++ b/src/backend/utils/adt/array_typanalyze.c
@@ -277,7 +277,6 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
      * worry about overflowing the initial size. Also we don't need to pay any
      * attention to locking and memory management.
      */
-    MemSet(&elem_hash_ctl, 0, sizeof(elem_hash_ctl));
     elem_hash_ctl.keysize = sizeof(Datum);
     elem_hash_ctl.entrysize = sizeof(TrackItem);
     elem_hash_ctl.hash = element_hash;
@@ -289,7 +288,6 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
                                HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);

     /* hashtable for array distinct elements counts */
-    MemSet(&count_hash_ctl, 0, sizeof(count_hash_ctl));
     count_hash_ctl.keysize = sizeof(int);
     count_hash_ctl.entrysize = sizeof(DECountItem);
     count_hash_ctl.hcxt = CurrentMemoryContext;
diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c
index 12557ce3af..7a25415078 100644
--- a/src/backend/utils/adt/jsonfuncs.c
+++ b/src/backend/utils/adt/jsonfuncs.c
@@ -3439,14 +3439,13 @@ get_json_object_as_hash(char *json, int len, const char *funcname)
     JsonLexContext *lex = makeJsonLexContextCstringLen(json, len, GetDatabaseEncoding(), true);
     JsonSemAction *sem;

-    memset(&ctl, 0, sizeof(ctl));
     ctl.keysize = NAMEDATALEN;
     ctl.entrysize = sizeof(JsonHashEntry);
     ctl.hcxt = CurrentMemoryContext;
     tab = hash_create("json object hashtable",
                       100,
                       &ctl,
-                      HASH_ELEM | HASH_CONTEXT);
+                      HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);

     state = palloc0(sizeof(JHashState));
     sem = palloc0(sizeof(JsonSemAction));
@@ -3831,14 +3830,13 @@ populate_recordset_object_start(void *state)
         return;

     /* Object at level 1: set up a new hash table for this object */
-    memset(&ctl, 0, sizeof(ctl));
     ctl.keysize = NAMEDATALEN;
     ctl.entrysize = sizeof(JsonHashEntry);
     ctl.hcxt = CurrentMemoryContext;
     _state->json_hash = hash_create("json object hashtable",
                                     100,
                                     &ctl,
-                                    HASH_ELEM | HASH_CONTEXT);
+                                    HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
 }

 static void
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index b6d05ac98d..c39d67645c 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1297,7 +1297,6 @@ lookup_collation_cache(Oid collation, bool set_flags)
         /* First time through, initialize the hash table */
         HASHCTL        ctl;

-        memset(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(Oid);
         ctl.entrysize = sizeof(collation_cache_entry);
         collation_cache = hash_create("Collation cache", 100, &ctl,
diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 02b1a3868f..5ab134a853 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -2540,7 +2540,6 @@ ri_InitHashTables(void)
 {
     HASHCTL        ctl;

-    memset(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(RI_ConstraintInfo);
     ri_constraint_cache = hash_create("RI constraint cache",
@@ -2552,14 +2551,12 @@ ri_InitHashTables(void)
                                   InvalidateConstraintCacheCallBack,
                                   (Datum) 0);

-    memset(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(RI_QueryKey);
     ctl.entrysize = sizeof(RI_QueryHashEntry);
     ri_query_cache = hash_create("RI query cache",
                                  RI_INIT_QUERYHASHSIZE,
                                  &ctl, HASH_ELEM | HASH_BLOBS);

-    memset(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(RI_CompareKey);
     ctl.entrysize = sizeof(RI_CompareHashEntry);
     ri_compare_cache = hash_create("RI compare cache",
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index ad582f99a5..7d4443e807 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -3464,14 +3464,14 @@ set_rtable_names(deparse_namespace *dpns, List *parent_namespaces,
      * We use a hash table to hold known names, so that this process is O(N)
      * not O(N^2) for N names.
      */
-    MemSet(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = NAMEDATALEN;
     hash_ctl.entrysize = sizeof(NameHashEntry);
     hash_ctl.hcxt = CurrentMemoryContext;
     names_hash = hash_create("set_rtable_names names",
                              list_length(dpns->rtable),
                              &hash_ctl,
-                             HASH_ELEM | HASH_CONTEXT);
+                             HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);
+
     /* Preload the hash table with names appearing in parent_namespaces */
     foreach(lc, parent_namespaces)
     {
diff --git a/src/backend/utils/cache/attoptcache.c b/src/backend/utils/cache/attoptcache.c
index 05ac366b40..934a84e03f 100644
--- a/src/backend/utils/cache/attoptcache.c
+++ b/src/backend/utils/cache/attoptcache.c
@@ -79,7 +79,6 @@ InitializeAttoptCache(void)
     HASHCTL        ctl;

     /* Initialize the hash table. */
-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(AttoptCacheKey);
     ctl.entrysize = sizeof(AttoptCacheEntry);
     AttoptCacheHash =
diff --git a/src/backend/utils/cache/evtcache.c b/src/backend/utils/cache/evtcache.c
index 0427795395..0877bc7e0e 100644
--- a/src/backend/utils/cache/evtcache.c
+++ b/src/backend/utils/cache/evtcache.c
@@ -118,7 +118,6 @@ BuildEventTriggerCache(void)
     EventTriggerCacheState = ETCS_REBUILD_STARTED;

     /* Create new hash table. */
-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(EventTriggerEvent);
     ctl.entrysize = sizeof(EventTriggerCacheEntry);
     ctl.hcxt = EventTriggerCacheContext;
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 66393becfb..3bd5e18042 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1607,7 +1607,6 @@ LookupOpclassInfo(Oid operatorClassOid,
         /* First time through: initialize the opclass cache */
         HASHCTL        ctl;

-        MemSet(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(Oid);
         ctl.entrysize = sizeof(OpClassCacheEnt);
         OpClassCache = hash_create("Operator class cache", 64,
@@ -3775,7 +3774,6 @@ RelationCacheInitialize(void)
     /*
      * create hashtable that indexes the relcache
      */
-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(RelIdCacheEnt);
     RelationIdCache = hash_create("Relcache by OID", INITRELCACHESIZE,
diff --git a/src/backend/utils/cache/relfilenodemap.c b/src/backend/utils/cache/relfilenodemap.c
index 0dbdbff603..38e6379974 100644
--- a/src/backend/utils/cache/relfilenodemap.c
+++ b/src/backend/utils/cache/relfilenodemap.c
@@ -110,17 +110,15 @@ InitializeRelfilenodeMap(void)
     relfilenode_skey[0].sk_attno = Anum_pg_class_reltablespace;
     relfilenode_skey[1].sk_attno = Anum_pg_class_relfilenode;

-    /* Initialize the hash table. */
-    MemSet(&ctl, 0, sizeof(ctl));
-    ctl.keysize = sizeof(RelfilenodeMapKey);
-    ctl.entrysize = sizeof(RelfilenodeMapEntry);
-    ctl.hcxt = CacheMemoryContext;
-
     /*
      * Only create the RelfilenodeMapHash now, so we don't end up partially
      * initialized when fmgr_info_cxt() above ERRORs out with an out of memory
      * error.
      */
+    ctl.keysize = sizeof(RelfilenodeMapKey);
+    ctl.entrysize = sizeof(RelfilenodeMapEntry);
+    ctl.hcxt = CacheMemoryContext;
+
     RelfilenodeMapHash =
         hash_create("RelfilenodeMap cache", 64, &ctl,
                     HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
diff --git a/src/backend/utils/cache/spccache.c b/src/backend/utils/cache/spccache.c
index e0c3c1b1c1..c8387e2541 100644
--- a/src/backend/utils/cache/spccache.c
+++ b/src/backend/utils/cache/spccache.c
@@ -79,7 +79,6 @@ InitializeTableSpaceCache(void)
     HASHCTL        ctl;

     /* Initialize the hash table. */
-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(TableSpaceCacheEntry);
     TableSpaceCacheHash =
diff --git a/src/backend/utils/cache/ts_cache.c b/src/backend/utils/cache/ts_cache.c
index f9f7912cb8..a2867fac7d 100644
--- a/src/backend/utils/cache/ts_cache.c
+++ b/src/backend/utils/cache/ts_cache.c
@@ -117,7 +117,6 @@ lookup_ts_parser_cache(Oid prsId)
         /* First time through: initialize the hash table */
         HASHCTL        ctl;

-        MemSet(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(Oid);
         ctl.entrysize = sizeof(TSParserCacheEntry);
         TSParserCacheHash = hash_create("Tsearch parser cache", 4,
@@ -215,7 +214,6 @@ lookup_ts_dictionary_cache(Oid dictId)
         /* First time through: initialize the hash table */
         HASHCTL        ctl;

-        MemSet(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(Oid);
         ctl.entrysize = sizeof(TSDictionaryCacheEntry);
         TSDictionaryCacheHash = hash_create("Tsearch dictionary cache", 8,
@@ -365,7 +363,6 @@ init_ts_config_cache(void)
 {
     HASHCTL        ctl;

-    MemSet(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(Oid);
     ctl.entrysize = sizeof(TSConfigCacheEntry);
     TSConfigCacheHash = hash_create("Tsearch configuration cache", 16,
diff --git a/src/backend/utils/cache/typcache.c b/src/backend/utils/cache/typcache.c
index 5883fde367..1e331098c0 100644
--- a/src/backend/utils/cache/typcache.c
+++ b/src/backend/utils/cache/typcache.c
@@ -341,7 +341,6 @@ lookup_type_cache(Oid type_id, int flags)
         /* First time through: initialize the hash table */
         HASHCTL        ctl;

-        MemSet(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(Oid);
         ctl.entrysize = sizeof(TypeCacheEntry);
         TypeCacheHash = hash_create("Type information cache", 64,
@@ -1874,7 +1873,6 @@ assign_record_type_typmod(TupleDesc tupDesc)
         /* First time through: initialize the hash table */
         HASHCTL        ctl;

-        MemSet(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(TupleDesc);    /* just the pointer */
         ctl.entrysize = sizeof(RecordCacheEntry);
         ctl.hash = record_type_typmod_hash;
diff --git a/src/backend/utils/fmgr/dfmgr.c b/src/backend/utils/fmgr/dfmgr.c
index bd779fdaf7..adb31e109f 100644
--- a/src/backend/utils/fmgr/dfmgr.c
+++ b/src/backend/utils/fmgr/dfmgr.c
@@ -680,13 +680,12 @@ find_rendezvous_variable(const char *varName)
     {
         HASHCTL        ctl;

-        MemSet(&ctl, 0, sizeof(ctl));
         ctl.keysize = NAMEDATALEN;
         ctl.entrysize = sizeof(rendezvousHashEntry);
         rendezvousHash = hash_create("Rendezvous variable hash",
                                      16,
                                      &ctl,
-                                     HASH_ELEM);
+                                     HASH_ELEM | HASH_STRINGS);
     }

     /* Find or create the hashtable entry for this varName */
diff --git a/src/backend/utils/fmgr/fmgr.c b/src/backend/utils/fmgr/fmgr.c
index 2681b7fbc6..fa5f7ac615 100644
--- a/src/backend/utils/fmgr/fmgr.c
+++ b/src/backend/utils/fmgr/fmgr.c
@@ -565,7 +565,6 @@ record_C_func(HeapTuple procedureTuple,
     {
         HASHCTL        hash_ctl;

-        MemSet(&hash_ctl, 0, sizeof(hash_ctl));
         hash_ctl.keysize = sizeof(Oid);
         hash_ctl.entrysize = sizeof(CFuncHashTabEntry);
         CFuncHash = hash_create("CFuncHash",
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index d14d875c93..fbd849b8f7 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -30,11 +30,12 @@
  * dynahash.c provides support for these types of lookup keys:
  *
  * 1. Null-terminated C strings (truncated if necessary to fit in keysize),
- * compared as though by strcmp().  This is the default behavior.
+ * compared as though by strcmp().  This is selected by specifying the
+ * HASH_STRINGS flag to hash_create.
  *
  * 2. Arbitrary binary data of size keysize, compared as though by memcmp().
  * (Caller must ensure there are no undefined padding bits in the keys!)
- * This is selected by specifying HASH_BLOBS flag to hash_create.
+ * This is selected by specifying the HASH_BLOBS flag to hash_create.
  *
  * 3. More complex key behavior can be selected by specifying user-supplied
  * hashing, comparison, and/or key-copying functions.  At least a hashing
@@ -47,8 +48,8 @@
  *   locks.
  * - Shared memory hashes are allocated in a fixed size area at startup and
  *   are discoverable by name from other processes.
- * - Because entries don't need to be moved in the case of hash conflicts, has
- *   better performance for large entries
+ * - Because entries don't need to be moved in the case of hash conflicts,
+ *   dynahash has better performance for large entries.
  * - Guarantees stable pointers to entries.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
@@ -316,6 +317,28 @@ string_compare(const char *key1, const char *key2, Size keysize)
  *    *info: additional table parameters, as indicated by flags
  *    flags: bitmask indicating which parameters to take from *info
  *
+ * The flags value *must* include HASH_ELEM.  (Formerly, this was nominally
+ * optional, but the default keysize and entrysize values were useless.)
+ * The flags value must also include exactly one of HASH_STRINGS, HASH_BLOBS,
+ * or HASH_FUNCTION, to define the key hashing semantics (C strings,
+ * binary blobs, or custom, respectively).  Callers specifying a custom
+ * hash function will likely also want to use HASH_COMPARE, and perhaps
+ * also HASH_KEYCOPY, to control key comparison and copying.
+ * Another often-used flag is HASH_CONTEXT, to allocate the hash table
+ * under info->hcxt rather than under TopMemoryContext; the default
+ * behavior is only suitable for session-lifespan hash tables.
+ * Other flags bits are special-purpose and seldom used, except for those
+ * associated with shared-memory hash tables, for which see ShmemInitHash().
+ *
+ * Fields in *info are read only when the associated flags bit is set.
+ * It is not necessary to initialize other fields of *info.
+ * Neither tabname nor *info need persist after the hash_create() call.
+ *
+ * Note: It is deprecated for callers of hash_create() to explicitly specify
+ * string_hash, tag_hash, uint32_hash, or oid_hash.  Just set HASH_BLOBS or
+ * HASH_STRINGS.  Use HASH_FUNCTION only when you want something other than
+ * one of these.
+ *
  * Note: for a shared-memory hashtable, nelem needs to be a pretty good
  * estimate, since we can't expand the table on the fly.  But an unshared
  * hashtable can be expanded on-the-fly, so it's better for nelem to be
@@ -323,11 +346,19 @@ string_compare(const char *key1, const char *key2, Size keysize)
  * large nelem will penalize hash_seq_search speed without buying much.
  */
 HTAB *
-hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
+hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
     HTAB       *hashp;
     HASHHDR    *hctl;

+    /*
+     * Hash tables now allocate space for key and data, but you have to say
+     * how much space to allocate.
+     */
+    Assert(flags & HASH_ELEM);
+    Assert(info->keysize > 0);
+    Assert(info->entrysize >= info->keysize);
+
     /*
      * For shared hash tables, we have a local hash header (HTAB struct) that
      * we allocate in TopMemoryContext; all else is in shared memory.
@@ -370,28 +401,43 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
      * Select the appropriate hash function (see comments at head of file).
      */
     if (flags & HASH_FUNCTION)
+    {
+        Assert(!(flags & (HASH_BLOBS | HASH_STRINGS)));
         hashp->hash = info->hash;
+    }
     else if (flags & HASH_BLOBS)
     {
+        Assert(!(flags & HASH_STRINGS));
         /* We can optimize hashing for common key sizes */
-        Assert(flags & HASH_ELEM);
         if (info->keysize == sizeof(uint32))
             hashp->hash = uint32_hash;
         else
             hashp->hash = tag_hash;
     }
     else
-        hashp->hash = string_hash;    /* default hash function */
+    {
+        /*
+         * string_hash used to be considered the default hash method, and in a
+         * non-assert build it effectively still is.  But we now consider it
+         * an assertion error to not say HASH_STRINGS explicitly.  To help
+         * catch mistaken usage of HASH_STRINGS, we also insist on a
+         * reasonably long string length: if the keysize is only 4 or 8 bytes,
+         * it's almost certainly an integer or pointer not a string.
+         */
+        Assert(flags & HASH_STRINGS);
+        Assert(info->keysize > 8);
+
+        hashp->hash = string_hash;
+    }

     /*
      * If you don't specify a match function, it defaults to string_compare if
-     * you used string_hash (either explicitly or by default) and to memcmp
-     * otherwise.
+     * you used string_hash, and to memcmp otherwise.
      *
      * Note: explicitly specifying string_hash is deprecated, because this
      * might not work for callers in loadable modules on some platforms due to
      * referencing a trampoline instead of the string_hash function proper.
-     * Just let it default, eh?
+     * Specify HASH_STRINGS instead.
      */
     if (flags & HASH_COMPARE)
         hashp->match = info->match;
@@ -505,16 +551,9 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
         hctl->dsize = info->dsize;
     }

-    /*
-     * hash table now allocates space for key and data but you have to say how
-     * much space to allocate
-     */
-    if (flags & HASH_ELEM)
-    {
-        Assert(info->entrysize >= info->keysize);
-        hctl->keysize = info->keysize;
-        hctl->entrysize = info->entrysize;
-    }
+    /* remember the entry sizes, too */
+    hctl->keysize = info->keysize;
+    hctl->entrysize = info->entrysize;

     /* make local copies of heavily-used constant fields */
     hashp->keysize = hctl->keysize;
@@ -593,10 +632,6 @@ hdefault(HTAB *hashp)
     hctl->dsize = DEF_DIRSIZE;
     hctl->nsegs = 0;

-    /* rather pointless defaults for key & entry size */
-    hctl->keysize = sizeof(char *);
-    hctl->entrysize = 2 * sizeof(char *);
-
     hctl->num_partitions = 0;    /* not partitioned */

     /* table has no fixed maximum size */
diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index ec6f80ee99..283dfe2d9e 100644
--- a/src/backend/utils/mmgr/portalmem.c
+++ b/src/backend/utils/mmgr/portalmem.c
@@ -119,7 +119,7 @@ EnablePortalManager(void)
      * create, initially
      */
     PortalHashTable = hash_create("Portal hash", PORTALS_PER_USER,
-                                  &ctl, HASH_ELEM);
+                                  &ctl, HASH_ELEM | HASH_STRINGS);
 }

 /*
diff --git a/src/backend/utils/time/combocid.c b/src/backend/utils/time/combocid.c
index 4ee9ef0ffe..9626f98100 100644
--- a/src/backend/utils/time/combocid.c
+++ b/src/backend/utils/time/combocid.c
@@ -223,7 +223,6 @@ GetComboCommandId(CommandId cmin, CommandId cmax)
         sizeComboCids = CCID_ARRAY_SIZE;
         usedComboCids = 0;

-        memset(&hash_ctl, 0, sizeof(hash_ctl));
         hash_ctl.keysize = sizeof(ComboCidKeyData);
         hash_ctl.entrysize = sizeof(ComboCidEntryData);
         hash_ctl.hcxt = TopTransactionContext;
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index bebf89b3c4..13c6602217 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -64,25 +64,36 @@ typedef struct HTAB HTAB;
 /* Only those fields indicated by hash_flags need be set */
 typedef struct HASHCTL
 {
+    /* Used if HASH_PARTITION flag is set: */
     long        num_partitions; /* # partitions (must be power of 2) */
+    /* Used if HASH_SEGMENT flag is set: */
     long        ssize;            /* segment size */
+    /* Used if HASH_DIRSIZE flag is set: */
     long        dsize;            /* (initial) directory size */
     long        max_dsize;        /* limit to dsize if dir size is limited */
+    /* Used if HASH_ELEM flag is set (which is now required): */
     Size        keysize;        /* hash key length in bytes */
     Size        entrysize;        /* total user element size in bytes */
+    /* Used if HASH_FUNCTION flag is set: */
     HashValueFunc hash;            /* hash function */
+    /* Used if HASH_COMPARE flag is set: */
     HashCompareFunc match;        /* key comparison function */
+    /* Used if HASH_KEYCOPY flag is set: */
     HashCopyFunc keycopy;        /* key copying function */
+    /* Used if HASH_ALLOC flag is set: */
     HashAllocFunc alloc;        /* memory allocator */
+    /* Used if HASH_CONTEXT flag is set: */
     MemoryContext hcxt;            /* memory context to use for allocations */
+    /* Used if HASH_SHARED_MEM flag is set: */
     HASHHDR    *hctl;            /* location of header in shared mem */
 } HASHCTL;

-/* Flags to indicate which parameters are supplied */
+/* Flag bits for hash_create; most indicate which parameters are supplied */
 #define HASH_PARTITION    0x0001    /* Hashtable is used w/partitioned locking */
 #define HASH_SEGMENT    0x0002    /* Set segment size */
 #define HASH_DIRSIZE    0x0004    /* Set directory size (initial and max) */
-#define HASH_ELEM        0x0010    /* Set keysize and entrysize */
+#define HASH_ELEM        0x0008    /* Set keysize and entrysize (now required!) */
+#define HASH_STRINGS    0x0010    /* Select support functions for string keys */
 #define HASH_BLOBS        0x0020    /* Select support functions for binary keys */
 #define HASH_FUNCTION    0x0040    /* Set user defined hash function */
 #define HASH_COMPARE    0x0080    /* Set user defined comparison function */
@@ -93,7 +104,6 @@ typedef struct HASHCTL
 #define HASH_ATTACH        0x1000    /* Do not initialize hctl */
 #define HASH_FIXED_SIZE 0x2000    /* Initial size is a hard limit */

-
 /* max_dsize value to indicate expansible directory */
 #define NO_MAX_DSIZE            (-1)

@@ -116,13 +126,9 @@ typedef struct

 /*
  * prototypes for functions in dynahash.c
- *
- * Note: It is deprecated for callers of hash_create to explicitly specify
- * string_hash, tag_hash, uint32_hash, or oid_hash.  Just set HASH_BLOBS or
- * not.  Use HASH_FUNCTION only when you want something other than those.
  */
 extern HTAB *hash_create(const char *tabname, long nelem,
-                         HASHCTL *info, int flags);
+                         const HASHCTL *info, int flags);
 extern void hash_destroy(HTAB *hashp);
 extern void hash_stats(const char *where, HTAB *hashp);
 extern void *hash_search(HTAB *hashp, const void *keyPtr, HASHACTION action,
diff --git a/src/pl/plperl/plperl.c b/src/pl/plperl/plperl.c
index 4de756455d..6299adf71a 100644
--- a/src/pl/plperl/plperl.c
+++ b/src/pl/plperl/plperl.c
@@ -458,7 +458,6 @@ _PG_init(void)
     /*
      * Create hash tables.
      */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(Oid);
     hash_ctl.entrysize = sizeof(plperl_interp_desc);
     plperl_interp_hash = hash_create("PL/Perl interpreters",
@@ -466,7 +465,6 @@ _PG_init(void)
                                      &hash_ctl,
                                      HASH_ELEM | HASH_BLOBS);

-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(plperl_proc_key);
     hash_ctl.entrysize = sizeof(plperl_proc_ptr);
     plperl_proc_hash = hash_create("PL/Perl procedures",
@@ -580,13 +578,12 @@ select_perl_context(bool trusted)
     {
         HASHCTL        hash_ctl;

-        memset(&hash_ctl, 0, sizeof(hash_ctl));
         hash_ctl.keysize = NAMEDATALEN;
         hash_ctl.entrysize = sizeof(plperl_query_entry);
         interp_desc->query_hash = hash_create("PL/Perl queries",
                                               32,
                                               &hash_ctl,
-                                              HASH_ELEM);
+                                              HASH_ELEM | HASH_STRINGS);
     }

     /*
diff --git a/src/pl/plpgsql/src/pl_comp.c b/src/pl/plpgsql/src/pl_comp.c
index b610b28d70..555da952e1 100644
--- a/src/pl/plpgsql/src/pl_comp.c
+++ b/src/pl/plpgsql/src/pl_comp.c
@@ -2567,7 +2567,6 @@ plpgsql_HashTableInit(void)
     /* don't allow double-initialization */
     Assert(plpgsql_HashTable == NULL);

-    memset(&ctl, 0, sizeof(ctl));
     ctl.keysize = sizeof(PLpgSQL_func_hashkey);
     ctl.entrysize = sizeof(plpgsql_HashEnt);
     plpgsql_HashTable = hash_create("PLpgSQL function hash",
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index ccbc50fc45..112f6ab0ae 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -4058,7 +4058,6 @@ plpgsql_estate_setup(PLpgSQL_execstate *estate,
     {
         estate->simple_eval_estate = simple_eval_estate;
         /* Private cast hash just lives in function's main context */
-        memset(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(plpgsql_CastHashKey);
         ctl.entrysize = sizeof(plpgsql_CastHashEntry);
         ctl.hcxt = CurrentMemoryContext;
@@ -4077,7 +4076,6 @@ plpgsql_estate_setup(PLpgSQL_execstate *estate,
             shared_cast_context = AllocSetContextCreate(TopMemoryContext,
                                                         "PLpgSQL cast info",
                                                         ALLOCSET_DEFAULT_SIZES);
-            memset(&ctl, 0, sizeof(ctl));
             ctl.keysize = sizeof(plpgsql_CastHashKey);
             ctl.entrysize = sizeof(plpgsql_CastHashEntry);
             ctl.hcxt = shared_cast_context;
diff --git a/src/pl/plpython/plpy_plpymodule.c b/src/pl/plpython/plpy_plpymodule.c
index 7f54d093ac..0365acc95b 100644
--- a/src/pl/plpython/plpy_plpymodule.c
+++ b/src/pl/plpython/plpy_plpymodule.c
@@ -214,7 +214,6 @@ PLy_add_exceptions(PyObject *plpy)
     PLy_exc_spi_error = PLy_create_exception("plpy.SPIError", NULL, NULL,
                                              "SPIError", plpy);

-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(int);
     hash_ctl.entrysize = sizeof(PLyExceptionEntry);
     PLy_spi_exceptions = hash_create("PL/Python SPI exceptions", 256,
diff --git a/src/pl/plpython/plpy_procedure.c b/src/pl/plpython/plpy_procedure.c
index 1f05c633ef..b7c0b5cebe 100644
--- a/src/pl/plpython/plpy_procedure.c
+++ b/src/pl/plpython/plpy_procedure.c
@@ -34,7 +34,6 @@ init_procedure_caches(void)
 {
     HASHCTL        hash_ctl;

-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(PLyProcedureKey);
     hash_ctl.entrysize = sizeof(PLyProcedureEntry);
     PLy_procedure_cache = hash_create("PL/Python procedures", 32, &hash_ctl,
diff --git a/src/pl/tcl/pltcl.c b/src/pl/tcl/pltcl.c
index a3a2dc8e89..e11837559d 100644
--- a/src/pl/tcl/pltcl.c
+++ b/src/pl/tcl/pltcl.c
@@ -439,7 +439,6 @@ _PG_init(void)
     /************************************************************
      * Create the hash table for working interpreters
      ************************************************************/
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(Oid);
     hash_ctl.entrysize = sizeof(pltcl_interp_desc);
     pltcl_interp_htab = hash_create("PL/Tcl interpreters",
@@ -450,7 +449,6 @@ _PG_init(void)
     /************************************************************
      * Create the hash table for function lookup
      ************************************************************/
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
     hash_ctl.keysize = sizeof(pltcl_proc_key);
     hash_ctl.entrysize = sizeof(pltcl_proc_ptr);
     pltcl_proc_htab = hash_create("PL/Tcl functions",
diff --git a/src/timezone/pgtz.c b/src/timezone/pgtz.c
index 3f0fb51e91..4a360f5077 100644
--- a/src/timezone/pgtz.c
+++ b/src/timezone/pgtz.c
@@ -203,15 +203,13 @@ init_timezone_hashtable(void)
 {
     HASHCTL        hash_ctl;

-    MemSet(&hash_ctl, 0, sizeof(hash_ctl));
-
     hash_ctl.keysize = TZ_STRLEN_MAX + 1;
     hash_ctl.entrysize = sizeof(pg_tz_cache);

     timezone_cache = hash_create("Timezones",
                                  4,
                                  &hash_ctl,
-                                 HASH_ELEM);
+                                 HASH_ELEM | HASH_STRINGS);
     if (!timezone_cache)
         return false;
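
(For illustration only, not part of the patch above: under the new convention a
hash_create() call always passes HASH_ELEM, picks exactly one of HASH_STRINGS /
HASH_BLOBS / HASH_FUNCTION, and no longer needs to memset() the HASHCTL, since
hash_create() reads only the fields selected by the flags.  The entry type and
names below are made up for the example.)

typedef struct my_entry
{
    Oid         key;            /* hash key; must be the first field */
    int         counter;        /* payload */
} my_entry;

static HTAB *my_hash = NULL;

static void
create_my_hash(void)
{
    HASHCTL     ctl;            /* no memset needed anymore */

    ctl.keysize = sizeof(Oid);
    ctl.entrysize = sizeof(my_entry);
    ctl.hcxt = CurrentMemoryContext;

    my_hash = hash_create("my example hash",
                          128,      /* initial size hint */
                          &ctl,
                          HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
}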


On Mon, Dec 14, 2020 at 01:59:03PM -0500, Tom Lane wrote:
> * Should we just have a blanket insistence that all callers supply
> HASH_ELEM?  The default sizes that dynahash.c uses without that are
> undocumented and basically useless.

+1

> we should just rip out all those memsets as pointless, since there's
> basically no case where you'd use the memset to fill a field that
> you meant to pass as zero.  The fact that hash_create() doesn't
> read fields it's not told to by a flag means we should not need
> the memsets to avoid uninitialized-memory reads.

On Mon, Dec 14, 2020 at 06:55:20PM -0500, Tom Lane wrote:
> Here's a rolled-up patch that does some further documentation work
> and gets rid of the unnecessary memset's as well.

+1 on removing the memset() calls.  That said, it's not a big deal if more
creep in over time; it doesn't qualify as a project policy violation.

> @@ -329,6 +328,11 @@ InitShmemIndex(void)
>   * whose maximum size is certain, this should be equal to max_size; that
>   * ensures that no run-time out-of-shared-memory failures can occur.
>   *
> + * *infoP and hash_flags should specify at least the entry sizes and key

s/should/must/



Noah Misch <noah@leadboat.com> writes:
> On Mon, Dec 14, 2020 at 01:59:03PM -0500, Tom Lane wrote:
>> Here's a rolled-up patch that does some further documentation work
>> and gets rid of the unnecessary memset's as well.

> +1 on removing the memset() calls.  That said, it's not a big deal if more
> creep in over time; it doesn't qualify as a project policy violation.

Right, that part is just neatnik-ism.  Neither the calls with memset
nor the ones without are buggy.

>> + * *infoP and hash_flags should specify at least the entry sizes and key

> s/should/must/

OK; thanks for reviewing!

            regards, tom lane



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
Tom Lane has raised a complaint on pgsql-committers [1] about one of
the commits related to this work [2]. The new buildfarm member wrasse
is showing a warning:

"/export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/reorderbuffer.c",
line 2510: Warning: Likely null pointer dereference (*(curtxn+272)):
ReorderBufferProcessTXN

The warning is for this line:
curtxn->concurrent_abort = true;

Now, we can simply fix this warning by adding an if check like:
if (curtxn)
    curtxn->concurrent_abort = true;

However, on further discussion, it seems that is not sufficient here,
because the callbacks can throw the surrounding error code
(ERRCODE_TRANSACTION_ROLLBACK), under which we set the concurrent_abort
flag, for a completely different scenario. I think we need a stronger
check here to ensure that we set the concurrent_abort flag and do the
other things in that branch only when we are decoding non-committed
xacts. The idea I have is to additionally check that we are decoding a
streaming or prepared transaction (the same check as we have for
setting curtxn), or we can check whether CheckXidAlive is a valid
transaction id. What do you think?
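
To make that concrete, here is a rough sketch of the kind of stronger
check being discussed (illustration only, not an actual patch), reusing
the stream_started / rbtxn_prepared() condition under which curtxn is
set in ReorderBufferProcessTXN():

/* Sketch only -- inside the PG_CATCH() block, where errdata is the
 * ErrorData captured via CopyErrorData().  Treat
 * ERRCODE_TRANSACTION_ROLLBACK as a concurrent abort only while decoding
 * an in-progress (streamed) or prepared transaction. */
if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK &&
    (stream_started || rbtxn_prepared(txn)))
{
    /* curtxn is set under the same condition, so it cannot be NULL here */
    Assert(curtxn);
    curtxn->concurrent_abort = true;

    /* ... swallow the error and stop/finish the streamed transaction ... */
}
else
{
    /* any other error: clean up and re-throw as before */
    ReorderBufferCleanupTXN(rb, txn);
    PG_RE_THROW();
}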

[1] - https://www.postgresql.org/message-id/2752962.1619568098%40sss.pgh.pa.us
[2] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=7259736a6e5b7c7588fff9578370736a6648acbb

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Wed, Apr 28, 2021 at 11:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Tom Lane has raised a complaint on pgsql-committers [1] about one of
> the commits related to this work [2]. The new buildfarm member wrasse
> is showing a warning:
>
> "/export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/reorderbuffer.c",
> line 2510: Warning: Likely null pointer dereference (*(curtxn+272)):
> ReorderBufferProcessTXN
>
> The warning is for this line:
> curtxn->concurrent_abort = true;
>
> Now, we can simply fix this warning by adding an if check like:
> if (curtxn)
>     curtxn->concurrent_abort = true;
>
> However, on further discussion, it seems that is not sufficient here,
> because the callbacks can throw the surrounding error code
> (ERRCODE_TRANSACTION_ROLLBACK), under which we set the concurrent_abort
> flag, for a completely different scenario. I think we need a stronger
> check here to ensure that we set the concurrent_abort flag and do the
> other things in that branch only when we are decoding non-committed
> xacts.

That makes sense.

> The idea I have is to additionally check that we are decoding a
> streaming or prepared transaction (the same check as we have for
> setting curtxn), or we can check whether CheckXidAlive is a valid
> transaction id. What do you think?

I think a check based on CheckXidAlive looks good to me.  This will
protect against cases where a similar error is raised from any other
path, as you mentioned above.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Wed, Apr 28, 2021 at 11:03 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Apr 28, 2021 at 11:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > The idea I have is to additionally check that we are decoding a
> > streaming or prepared transaction (the same check as we have for
> > setting curtxn), or we can check whether CheckXidAlive is a valid
> > transaction id. What do you think?
>
> I think a check based on CheckXidAlive looks good to me.  This will
> protect against cases where a similar error is raised from any other
> path, as you mentioned above.
>

We can't use CheckXidAlive because it is reset by that time. So, I
used the other approach which led to the attached.

-- 
With Regards,
Amit Kapila.

Attachments

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Fri, Apr 30, 2021 at 3:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Apr 28, 2021 at 11:03 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Apr 28, 2021 at 11:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > > The idea I have is to additionally check that we are decoding a
> > > streaming or prepared transaction (the same check as we have for
> > > setting curtxn), or we can check whether CheckXidAlive is a valid
> > > transaction id. What do you think?
> >
> > I think a check based on CheckXidAlive looks good to me.  This will
> > protect against cases where a similar error is raised from any other
> > path, as you mentioned above.
> >
>
> We can't use CheckXidAlive because it is reset by that time.

Right.

> So, I used the other approach which led to the attached.

The patch looks fine to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Amit Kapila
Date:
On Fri, Apr 30, 2021 at 7:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > So, I used the other approach which led to the attached.
>
> The patch looks fine to me.
>

Thanks, pushed!

-- 
With Regards,
Amit Kapila.



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

From
Dilip Kumar
Date:
On Thu, May 6, 2021 at 9:01 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Apr 30, 2021 at 7:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > > So, I used the other approach which led to the attached.
> >
> > The patch looks fine to me.
> >
>
> Thanks, pushed!

Thanks!



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com